# Text Summarization
In this notebook, we try out different text summarization models on the CNN/DailyMail dataset.

In [None]:
# Run the following command if you have not installed the package `datasets` before:
# !pip install datasets

# If you get an error in the cells below (NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.)
# Run the line below - there seems to be an issue with fsspec 2023.10.0:
!pip install fsspec==2023.9.2

In [None]:
from datasets import load_dataset

Here we load the dataset:

In [None]:
dataset = load_dataset('cnn_dailymail', '3.0.0', download_mode='force_redownload', verification_mode='no_checks')

In [None]:
print(f"Features: {dataset['train'].column_names}")

As an example we will use the following (randomly selected) text:

In [None]:
sample = dataset["train"][1]
print(f"""
Article (excerpt of 500 characters, total length: {len(sample["article"])}):
""")
print(sample["article"][:500])
print(f'\nSummary (length: {len(sample["highlights"])}):')
print(sample["highlights"])

We will collect the generated summaries of each model in a dictionary:

In [None]:
sample_text = dataset["train"][1]["article"][:2000]
summaries = {}

### Sentence Tokenization
In summarization, it is common practice to separate the sentences of a summary with newline characters. A naive approach that splits on every full stop, however, fails for strings like "U.N." or "U.S.". The function `sent_tokenize` from the Natural Language Toolkit (NLTK) uses a more sophisticated algorithm to distinguish the end of a sentence from punctuation that occurs within an abbreviation.
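
To make the failure mode concrete, the sketch below splits a string on every full stop followed by a space. It is only an illustration and not part of the notebook's pipeline; `naive_split` is a hypothetical helper introduced here.

```python
def naive_split(text):
    """Split on every '. ' and re-attach the full stop (illustration only)."""
    parts = text.split(". ")
    # split() removes the separator, so restore the full stop on all but the last part.
    return [p + "." for p in parts[:-1]] + [parts[-1]]

text = "The U.S. is a country. The U.N. is an organization."
pieces = naive_split(text)
print(pieces)
print(len(pieces))
```

The string contains two sentences, but the naive split also breaks after each abbreviation and therefore returns more than two pieces.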

In [None]:
!pip install nltk

In [None]:
import nltk
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

Here's an example of how `sent_tokenize` works:

In [None]:
string = "The U.S. is a country. The U.N. is an organization."
sent_tokenize(string)

## Testing Out Different Summarization Approaches

### Summarization Baseline
A simple baseline is to take the first three sentences of each article. This is easy to implement with `sent_tokenize`:

In [None]:
def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])

In [None]:
summaries["baseline"] = three_sentence_summary(sample_text)

### GPT-2
To use GPT-2 for text summarization, we engineer a prompt by appending "TL;DR" to the input text. "TL;DR" stands for "too long; didn't read" and is often used, e.g., in Reddit posts to introduce the short version of a longer post. The prompt thus nudges GPT-2 towards producing a short version of the input text.

In [None]:
from transformers import pipeline, set_seed

set_seed(42)
pipe = pipeline("text-generation", model="gpt2")
gpt2_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)
summaries["gpt2"] = "\n".join(
    sent_tokenize(pipe_out[0]["generated_text"][len(gpt2_query) :]))
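
By default, the `text-generation` pipeline returns the prompt together with the continuation in `generated_text`, which is why the cell above slices off the first `len(gpt2_query)` characters. The sketch below shows that slicing step on stand-in strings; no model is called, and both strings are made up for illustration.

```python
# Stand-in strings; no model is called here.
query = "Some long article text.\nTL;DR:\n"
# Mimics the pipeline output: the prompt followed by the generated continuation.
generated_text = query + "A short summary. With two sentences."

# Drop the prompt so that only the newly generated text remains.
continuation = generated_text[len(query):]
print(continuation)
```

Passing `return_full_text=False` to the pipeline call should have the same effect without manual slicing.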

### T5
T5, the Text-to-Text Transfer Transformer, formulates all tasks as text-to-text tasks. It was trained on both unsupervised data (reconstructing masked tokens) and supervised data, including summarization. Using the same prompts as in pretraining, we can thus do summarization directly. The input format to summarize a document is `summarize: <ARTICLE>` (translation, by the way, can be prompted via `translate English to German: <TEXT>`).
We do not have to prepend `summarize: ` ourselves here, because the `summarization` pipeline should add the prefix defined in the configuration of the T5 checkpoint:

In [None]:
pipe = pipeline("summarization", model="t5-small")
pipe_out = pipe(sample_text)
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

In [None]:
print(summaries["t5"])
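
When the model is called directly rather than through the `summarization` pipeline, the task prefix has to be added by hand. A minimal sketch of such a helper follows; `add_task_prefix` is a hypothetical name, not part of `transformers`.

```python
def add_task_prefix(text, prefix="summarize: "):
    """Prepend a T5-style task prefix unless it is already present."""
    return text if text.startswith(prefix) else prefix + text

print(add_task_prefix("The cat sat on the mat."))
print(add_task_prefix("That is good.", prefix="translate English to German: "))
```

The check with `startswith` keeps the helper idempotent, so applying it twice does not duplicate the prefix.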

### BART
Bidirectional and Auto-Regressive Transformers (BART) combine a bidirectional encoder as in BERT with an autoregressive decoder as in GPT. A checkpoint fine-tuned specifically on the CNN/DailyMail dataset is available from the Hugging Face Hub:

In [None]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(sample_text)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

## Comparing Summaries
Now let's look at the summaries produced by the different approaches:

In [None]:
print("GROUND TRUTH")
print(dataset["train"][1]["highlights"])
print("")

for key, val in summaries.items():
    print('****')
    print(key.upper())
    print(val)

In [None]:
# !pip install rouge_score
# !pip install sacrebleu

### BLEU Metric

In [None]:
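!pip install sacrebleu
# Note: `load_metric` is deprecated and has been removed from recent `datasets` releases.
# If the import fails, `evaluate.load("sacrebleu")` from the `evaluate` package is the replacement.
from datasets import load_metric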

bleu_metric = load_metric("sacrebleu")

In [None]:
import pandas as pd
import numpy as np

bleu_metric.add(
    prediction="the the the the the the",
    reference=["the cat is on the mat"])

results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

In [None]:
bleu_metric.add(prediction="the cat is on mat",
                reference=["the cat is on the mat"])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

In [None]:
records = []
reference = dataset["train"][1]["highlights"]

for model_name in summaries:
    # sacrebleu expects a list of references for each prediction
    bleu_metric.add(prediction=summaries[model_name], reference=[reference])
    results = bleu_metric.compute(smooth_method='floor', smooth_value=0)
    results['precisions'] = [np.round(p, 2) for p in results['precisions']]
    records.append(dict(results))

pd.DataFrame.from_records(records, index=summaries.keys())

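We see that BART is by far the best model. However, this is partly due to the fact that neither GPT-2 nor T5 has any overlap with the reference in the n-grams for n larger than 2. Since the overall BLEU score is based on the geometric mean of the n-gram precisions, a single precision of zero (with a floor smoothing value of 0, as configured here) drives the overall score of these two methods to 0.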
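
The effect of a single zero precision can be reproduced by hand. The sketch below computes clipped n-gram precisions and their geometric mean for the toy example used earlier. It is a simplified illustration under stated assumptions: it tokenizes on whitespace and leaves out the brevity penalty, so it is not the `sacrebleu` implementation, and all function names are hypothetical.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(prediction, reference, n):
    """Clipped n-gram precision of a prediction against one reference."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Clip each predicted n-gram count by its count in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return clipped / total if total else 0.0

def geometric_mean(values):
    """Geometric mean; any zero entry makes the result zero."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

prediction = "the the the the the the"
reference = "the cat is on the mat"
precisions = [modified_precision(prediction, reference, n) for n in range(1, 5)]
print(precisions)
print(geometric_mean(precisions))
```

Only two of the six predicted unigrams count as matches, because "the" occurs twice in the reference, and no higher-order n-gram matches at all, so the geometric mean collapses to zero.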

### ROUGE Metric

In [None]:
!pip install rouge_score

rouge_metric = load_metric("rouge")

In [None]:
reference = dataset["train"][1]["highlights"]
records = []
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

for model_name in summaries:
    rouge_metric.add(prediction=summaries[model_name], reference=reference)
    score = rouge_metric.compute()
    rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
    records.append(rouge_dict)
pd.DataFrame.from_records(records, index=summaries.keys())

We see that, also under ROUGE, the BART model achieves scores that are quite a bit better than those of the baseline approach. GPT-2 and T5, however, again lag quite a bit behind.
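
For intuition about what the ROUGE variants measure, the sketch below computes an n-gram-overlap F-measure (as in ROUGE-1 and ROUGE-2) and a longest-common-subsequence F-measure (as in ROUGE-L) on the toy sentences from the BLEU section. It assumes lowercased whitespace tokenization and no stemming, so its numbers need not match those of the `rouge_score` package; all function names are hypothetical.

```python
from collections import Counter

def f_measure(overlap, pred_len, ref_len):
    """Harmonic mean of precision (overlap/pred_len) and recall (overlap/ref_len)."""
    if overlap == 0:
        return 0.0
    precision = overlap / pred_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)

def rouge_n(prediction, reference, n=1):
    """F-measure over clipped n-gram matches."""
    def counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    pred, ref = counts(prediction), counts(reference)
    overlap = sum((pred & ref).values())  # multiset intersection = clipped matches
    return f_measure(overlap, sum(pred.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(prediction, reference):
    """F-measure based on the longest common subsequence."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return f_measure(lcs_length(pred, ref), len(pred), len(ref))

prediction = "the cat is on mat"
reference = "the cat is on the mat"
print(rouge_n(prediction, reference, 1))
print(rouge_n(prediction, reference, 2))
print(rouge_l(prediction, reference))
```

Unlike BLEU, these scores combine precision and recall, so a summary is not rewarded merely for being short. The `rougeLsum` variant is computed sentence by sentence, with sentences separated by newlines, which is why the summaries above were joined with `"\n"`.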