# Apple App Store Review Summarization
Text summarization is the process of condensing a piece of text while retaining its core information and meaning. It aims to generate a concise and coherent summary that captures the key points of the original text. There are generally two types of text summarization: extractive and abstractive.

1. Extractive Summarization:
   - Extractive summarization involves selecting and extracting important sentences or phrases directly from the original text to create a summary.
   - It relies on identifying significant sentences based on criteria such as importance, relevance, and frequency of occurrence.
   - Extractive summarization methods often use techniques like ranking sentences using statistical or heuristic approaches, and then selecting the top-ranked sentences for the summary.
   - While extractive summarization is relatively straightforward and computationally efficient, it may result in less coherent summaries, as it does not generate new sentences but rather extracts existing ones.

2. Abstractive Summarization:
   - Abstractive summarization involves generating *new* sentences that convey the essence of the original text in a more condensed form.
   - This method requires understanding the meaning of the text and rephrasing it in a way that captures the main ideas while potentially using different words or sentence structures.
   - Abstractive summarization often utilizes advanced natural language processing techniques, such as deep learning models like Transformer-based architectures, which have shown promising results in generating human-like summaries.
   - While abstractive summarization can produce more coherent and concise summaries compared to extractive methods, it also poses significant challenges, including maintaining coherence, preserving factual accuracy, and avoiding the generation of incorrect or misleading information.
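
To make the extractive approach concrete, here is a minimal sketch of frequency-based sentence ranking. It is not the method used later in this notebook. The helper names `split_sentences` and `extractive_summary`, the regex-based sentence splitter, and the average-word-frequency score are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (illustrative): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, num_sentences=2):
    sentences = split_sentences(text)
    # Word frequencies over the whole text act as a crude importance signal.
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the best, then restore original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


review = "The app crashes often. The app crashes when I open a book. I like pizza."
print(extractive_summary(review, num_sentences=1))
```

Every sentence in the output is copied verbatim from the input, which is why extractive summaries stay faithful to the source but can read as disjointed.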

Challenges in Text Summarization:
1. Semantic Understanding: Ensuring that the summarization algorithm accurately comprehends the meaning and context of the original text is crucial for generating informative summaries.
2. Coherence and Fluency: Generating summaries that are both coherent and fluent poses a challenge, especially in abstractive summarization, where the algorithm needs to produce human-like language.
3. Preserving Key Information: Summarization algorithms must effectively identify and retain the most relevant and important information from the original text while discarding redundant or trivial details.
4. Handling Variability: Texts can vary widely in terms of length, style, and complexity, making it challenging to develop a one-size-fits-all summarization approach that performs well across different types of texts.
5. Evaluation Metrics: Assessing the quality of summaries objectively is difficult, as there may be multiple valid ways to summarize a given text. Developing robust evaluation metrics that capture the essence, relevance, and readability of summaries remains an ongoing research challenge.
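
As an illustration of how an overlap-based metric works, the sketch below computes a simplified ROUGE-1 score from unigram overlap. The function name `rouge1` and the whitespace tokenization are assumptions for this example. The `rouge_score` package used later in this notebook tokenizes differently and can apply stemming, so its numbers will not match this sketch exactly.

```python
from collections import Counter


def rouge1(prediction, reference):
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("great book app", "great app")
print(scores)
```

Because the score depends only on shared words, two equally valid summaries with different wording can receive very different scores, which is one reason evaluation remains difficult.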

In this notebook, we fine-tune the pretrained [PEGASUS](https://huggingface.co/docs/transformers/main/model_doc/pegasus) model to summarize App Store reviews, using each review's title as its reference summary. The model is fine-tuned on 792,259 book reviews.

In [2]:
!pip install transformers[sentencepiece] --upgrade --quiet
!pip install datasets --upgrade --quiet
!pip install pyarrow==11.0.0 --quiet
!pip install nltk --quiet
!pip install rouge_score --upgrade --quiet
!pip install evaluate --upgrade --quiet


In [None]:
import pandas as pd
import evaluate  # needed below for evaluate.load("rouge")
import numpy as np
from tqdm import tqdm
from datasets import load_dataset
from google.colab import drive
from transformers import TFAutoModelForSeq2SeqLM
from huggingface_hub import notebook_login
from transformers import DataCollatorForSeq2Seq
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from transformers import create_optimizer
from transformers import pipeline
from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf
import nltk
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

## Preliminaries
The book reviews are stored on Google Drive. Below, we mount the drive and provide the paths to the book review training, validation, and test sets. In addition, we log into the Hugging Face Hub.

In [None]:
drive.mount("/content/drive")
fp_train = "drive/My Drive/appvoc/reviews/books/books_train.pkl"
fp_val = "drive/My Drive/appvoc/reviews/books/books_val.pkl"
fp_test = "drive/My Drive/appvoc/reviews/books/books_test.pkl"
notebook_login()


## App Store Book Review Corpus

### Create Hugging Face Datasets

In [None]:
data_files = {"train": fp_train, "validation": fp_val, "test": fp_test}
reviews_dataset = load_dataset("pandas", data_files=data_files)
reviews_dataset

## Sample Dataset

In [None]:
def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> Title: {example['title']}'")
        print(f"'>> Review: {example['content']}'")

In [None]:
show_samples(reviews_dataset)

## Preprocess the Data

In [None]:
max_input_length = 512
max_target_length = 64
model_checkpoint = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_checkpoint)

In [None]:
def preprocess_function(reviews):
    model_inputs = tokenizer(
        reviews["content"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        reviews["title"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
tokenized_reviews = reviews_dataset.map(preprocess_function, batched=True)

The original columns are removed from the tokenized reviews further below, just before the datasets are converted to TensorFlow datasets.

## Create Baseline

In [None]:

def five_sentence_summary(text):
    # Baseline: the first five sentences, newline-separated as ROUGE-Lsum expects.
    return "\n".join(sent_tokenize(text)[:5])

In [None]:
def evaluate_baseline(dataset, metric):
    summaries = [five_sentence_summary(text) for text in dataset["content"]]
    return metric.compute(predictions=summaries, references=dataset["title"])

In [None]:
rouge_score = evaluate.load("rouge")
score = evaluate_baseline(reviews_dataset["validation"], rouge_score)
score

## Fine-Tune Pegasus Model

In [None]:
# Load the TensorFlow model class: prepare_tf_dataset, compile, and fit below are Keras APIs.
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

### Create Data Collator
A data collator is an object that batches the data and, in some cases, performs some preprocessing. Because Pegasus is an encoder-decoder Transformer, the decoder inputs are the labels shifted right by one position, so that at each step the decoder sees only the previous ground-truth tokens and never the current or future ones. The DataCollatorForSeq2Seq collator builds these decoder inputs from the labels and dynamically pads the inputs and labels to the longest sequence in each batch.

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
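
The padding and label shifting described above can be sketched in plain Python. This is an illustration of the idea, not the collator's actual implementation. The helper names `pad_labels` and `shift_right` are hypothetical, and the token ids in the example (0 for both padding and decoder start, 1 for end of sequence) are assumptions made for the demonstration.

```python
def pad_labels(batch, label_pad_id=-100):
    # Dynamic padding: pad only to the longest sequence in this batch.
    # -100 marks positions that the loss function should ignore.
    width = max(len(seq) for seq in batch)
    return [seq + [label_pad_id] * (width - len(seq)) for seq in batch]


def shift_right(labels, decoder_start_id, pad_id, label_pad_id=-100):
    # Prepend the start token and drop the last position, so the decoder
    # input at step t is the label from step t - 1.
    shifted = [decoder_start_id] + labels[:-1]
    # -100 is only meaningful to the loss; the decoder needs a real token id.
    return [pad_id if tok == label_pad_id else tok for tok in shifted]


labels = pad_labels([[5, 6, 7, 1], [8, 1]])
decoder_inputs = [shift_right(seq, decoder_start_id=0, pad_id=0) for seq in labels]
print(labels)
print(decoder_inputs)
```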

The data collator cannot pad the original, non-tokenized columns (such as the raw title and content strings), so they must be removed from the tokenized reviews dataset.

In [None]:
tokenized_reviews = tokenized_reviews.remove_columns(
    reviews_dataset["train"].column_names
)

### Convert Datasets to TensorFlow Datasets
Before we train the Pegasus model, we need to convert the tokenized reviews dataset to tf.data.Dataset objects using the data collator.

In [None]:
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_reviews["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=8,
)
tf_val_dataset = model.prepare_tf_dataset(
    tokenized_reviews["validation"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=8,
)

### Compile the Pegasus Model

In [None]:
# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face Dataset, so its len() is already num_samples // batch_size.
num_train_epochs = 8
num_train_steps = len(tf_train_dataset) * num_train_epochs
model_name = model_checkpoint.split("/")[-1]

optimizer, schedule = create_optimizer(
    init_lr=5.6e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

model.compile(optimizer=optimizer)

# Train in mixed-precision float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")
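
The step count and learning-rate schedule in the cell above can be worked through in plain Python. This is a sketch under two assumptions: that a partial final batch is dropped, as the comment in the cell implies, and that create_optimizer with zero warmup steps decays the learning rate linearly to zero, which is worth checking against the installed transformers version. The function names and the sample count of 1,000 are hypothetical.

```python
def num_training_steps(num_samples, batch_size, num_epochs):
    # Full batches per epoch (a partial final batch is dropped), times epochs.
    return (num_samples // batch_size) * num_epochs


def linear_decay_lr(step, init_lr, total_steps):
    # Learning rate falls linearly from init_lr at step 0 to 0 at total_steps.
    remaining = max(0.0, 1.0 - step / total_steps)
    return init_lr * remaining


total = num_training_steps(num_samples=1000, batch_size=8, num_epochs=8)
for step in (0, total // 2, total):
    print(step, linear_decay_lr(step, init_lr=5.6e-5, total_steps=total))
```

If the step count passed to the schedule is too small, the learning rate reaches zero before training ends, so the count should match the number of batches actually seen.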

### Fit Pegasus Model
Finally, we fit the model and use the PushToHubCallback to save the model to the Hugging Face Hub after each epoch for inference later.

In [None]:
callback = PushToHubCallback(
    output_dir=f"{model_name}-finetuned-appvoc_books-en", tokenizer=tokenizer
)

# Reuse num_train_epochs so the epoch count matches the learning-rate schedule.
model.fit(
    tf_train_dataset,
    validation_data=tf_val_dataset,
    callbacks=[callback],
    epochs=num_train_epochs,
)

### Evaluate Pegasus Model Performance on Validation Set
We are provided the loss values from training; however, we'd like to see the ROUGE metrics we computed earlier. To get those metrics, we'll need to generate outputs from the model and convert them to strings.

Here, we'll build lists of labels and predictions for the ROUGE metric to compare. We'll also compile our generation code with XLA, TensorFlow's accelerated linear algebra compiler. XLA applies various optimizations to the model's computation graph and can substantially improve speed and memory usage.

XLA works best when there is little variation in our input shapes, because each new shape triggers a recompilation. To handle this, we'll pad our inputs to multiples of 128 and make a new dataset with the padding collator. Then, we'll apply the @tf.function(jit_compile=True) decorator to our generation function, which marks the whole function for compilation with XLA.
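
To see why padding to multiples of 128 limits shape variation, consider the small sketch below. The helper names `padded_length` and `pad_batch` are hypothetical, and a pad id of 0 is assumed for illustration. With a maximum input length of 512, every batch ends up with one of only four possible widths.

```python
def padded_length(length, multiple=128):
    # Round up to the next multiple (ceiling division without floats).
    return -(-length // multiple) * multiple


def pad_batch(batch, pad_id=0, multiple=128):
    # Pad every sequence to the rounded-up length of the longest one.
    width = padded_length(max(len(seq) for seq in batch), multiple)
    return [seq + [pad_id] * (width - len(seq)) for seq in batch]


# Every possible input length up to 512 maps to one of four padded widths.
widths = sorted({padded_length(n) for n in range(1, 513)})
print(widths)
```

Fewer distinct shapes means fewer XLA compilations, at the cost of some wasted computation on padding tokens.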



In [None]:

generation_data_collator = DataCollatorForSeq2Seq(
    tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128
)

tf_generate_dataset = model.prepare_tf_dataset(
    tokenized_reviews["validation"],
    collate_fn=generation_data_collator,
    shuffle=False,
    batch_size=8,
    drop_remainder=True,
)


@tf.function(jit_compile=True)
def generate_with_xla(batch):
    return model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        max_new_tokens=32,
    )


all_preds = []
all_labels = []
for batch, labels in tqdm(tf_generate_dataset):
    predictions = generate_with_xla(batch)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = labels.numpy()
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
    all_preds.extend(decoded_preds)
    all_labels.extend(decoded_labels)

Next, we compute the ROUGE scores.

In [None]:
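# Score every validation prediction, not just the last batch.
result = rouge_score.compute(
    predictions=all_preds, references=all_labels, use_stemmer=True
)
# The evaluate ROUGE metric returns plain F-measure floats; convert them to percentages.
result = {key: round(value * 100, 4) for key, value in result.items()}
# Wrap the dict in a list so pandas builds a one-row DataFrame from scalar values.
result_df = pd.DataFrame([result])
result_df.head()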

### Evaluate Fine-Tuned Model on Test Set

In [None]:
hub_model_id = f"j2slab/{model_name}-finetuned-appvoc_books-en"
summarizer = pipeline("summarization", model=hub_model_id)

In [None]:
def print_summary(idx):
    review = reviews_dataset["test"][idx]["content"]
    title = reviews_dataset["test"][idx]["title"]
    summary = summarizer(review)[0]["summary_text"]
    print(f"'>>> Review: {review}'")
    print(f"\n'>>> Title: {title}'")
    print(f"\n'>>> Summary: {summary}'")

In [None]:
print_summary(100)