<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/06-summarization/summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Summarization

Text summarization is a
difficult task for neural language models, including transformers. Despite these challenges,
text summarization offers the prospect for domain experts to significantly
speed up their workflows and is used by enterprises to condense internal knowledge,
summarize contracts, automatically generate content for social media releases,
and more.

Summarization is a classic
sequence-to-sequence (seq2seq) task with an input text and a target text.

##Setup

In [None]:
!pip -q install transformers
!pip -q install datasets
!pip install sentencepiece

In [1]:
from transformers import pipeline, set_seed
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import DataCollatorForSeq2Seq
from transformers import TrainingArguments, Trainer

from datasets import load_dataset, load_metric

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import torch

import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm

In [None]:
nltk.download("punkt")

##CNN/DailyMail Dataset

The CNN/DailyMail dataset consists of around 300,000 pairs of news articles and
their corresponding summaries, composed from the bullet points that CNN and the
DailyMail attach to their articles.

An important aspect of the dataset is that the summaries are abstractive and not extractive, which means that they consist of new
sentences instead of simple excerpts.

We’ll use version 3.0.0, which is a nonanonymized version set up for summarization.

In [None]:
dataset = load_dataset("cnn_dailymail", version="3.0.0")

In [3]:
print(f"Features: {dataset['train'].column_names}")

Features: ['article', 'highlights', 'id']


Let’s look at an excerpt from an article:

In [5]:
sample = dataset["train"][1]

print(f"""Article (excerpt of 500 characters, total length: {len(sample['article'])})""")
print(sample["article"])

print(f"\nSummary (length: {len(sample['highlights'])})")
print(sample["highlights"])

Article (excerpt of 500 characters, total length: 3192)
(CNN) -- Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men's 4x100m relay. The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. The U.S finished second in 37.56 seconds with Canada taking the bronze after Britain were disqualified for a faulty handover. The 26-year-old Bolt has now collected eight gold medals at world championships, equaling the record held by American trio Carl Lewis, Michael Johnson and Allyson Felix, not to mention the small matter of six Olympic titles. The relay triumph followed individual successes in the 100 and 200 meters in the Russian capital. "I'm proud of myself and I'll continue to work to dominate for as long as possible," Bolt said, having previously expressed his intention to carry 

We see that the articles can be very long compared to the target summary; in this particular
case the difference is 17-fold.

Long articles pose a challenge to most transformer
models since the context size is usually limited to 1,000 tokens or so, which is
equivalent to a few paragraphs of text. The standard, yet crude way to deal with this
for summarization is to simply truncate the texts beyond the model’s context size.

##Text Summarization Pipelines

Let’s see how a few of the most popular transformer models for summarization perform
by first looking qualitatively at the outputs for the preceding example.

In [4]:
sample_text = dataset["train"][1]["article"][:2000]  # restrict the input text to 2,000 characters

# We'll collect the generated summaries of each model in a dictionary
summaries = {}

Let's differentiate the end of a sentence
from punctuation that occurs in abbreviations

In [5]:
string = "The U.S. are a country. The U.N. is an organization."
sent_tokenize(string)

['The U.S. are a country.', 'The U.N. is an organization.']

###Summarization Baseline

A common baseline for summarizing news articles is to simply take the first three
sentences of the article.

In [6]:
def three_sentence_summary(text):
  return "\n".join(sent_tokenize(text)[:3])

In [7]:
summaries["baseline"] = three_sentence_summary(sample_text)

###GPT-2

One of GPT-2’s surprising features is that we can also use it to generate summaries
by simply appending `TL;DR` at the end of the input text. 

The expression `TL;DR` (too long; didn’t read) is often used on platforms like Reddit to indicate a short version
of a long post.

In [8]:
set_seed(42)

pipe = pipeline("text-generation", model="gpt2")

gpt2_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)
summaries["gpt2"] = "\n".join(sent_tokenize(pipe_out[0]["generated_text"][len(gpt2_query):]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


###T5

Next let’s try the T5 transformer.

The T5 checkpoints are trained on a mixture of unsupervised data (to
reconstruct masked words) and supervised data for several tasks, including summarization.
These checkpoints can thus be directly used to perform summarization
without fine-tuning by using the same prompts used during pretraining.

In this
framework, the input format for the model to summarize a document is "summarize:`<ARTICLE>`", and for translation it looks like "translate English to German:`<TEXT>`".

In [9]:
pipe = pipeline("summarization", model="t5-large")

pipe_out = pipe(sample_text)
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

###BART

BART also uses an encoder-decoder architecture and is trained to reconstruct corrupted inputs. It combines the pretraining schemes of BERT and GPT-2.

In [10]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn")

pipe_out = pipe(sample_text)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

###PEGASUS

Like BART, PEGASUS is an encoder-decoder transformer.its pretraining objective is to predict masked sentences in multisentence texts.

The
authors argue that the closer the pretraining objective is to the downstream task, the
more effective it is. 

With the aim of finding a pretraining objective that is closer to
summarization than general language modeling, they automatically identified, in a
very large corpus, sentences containing most of the content of their surrounding
paragraphs (using summarization evaluation metrics as a heuristic for content
overlap) and pretrained the PEGASUS model to reconstruct these sentences, thereby
obtaining a state-of-the-art model for text summarization.



In [11]:
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")

pipe_out = pipe(sample_text)
summaries["pegasus"] = pipe_out[0]["summary_text"].replace(" .<n>", ".\n")

### Comparing Different Summaries

Now that we have generated summaries with four different models, let’s compare the results.

Keep in mind that one model has not been trained on the dataset at all
(GPT-2), one model has been fine-tuned on this task among others (T5), and two
models have exclusively been fine-tuned on this task (BART and PEGASUS).

Let’s
have a look at the summaries these models have generated:

In [13]:
print("GROUND TRUTH")
print(dataset["train"][1]["highlights"])
print("")

for model_name in summaries:
  print(model_name.upper())
  print("-----------------------------------")
  print(summaries[model_name])
  print("")

GROUND TRUTH
Usain Bolt wins third gold of world championship .
Anchors Jamaica to 4x100m relay victory .
Eighth gold at the championships for Bolt .
Jamaica double up in women's 4x100m relay .

BASELINE
-----------------------------------
(CNN) -- Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men's 4x100m relay.
The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds.
The U.S finished second in 37.56 seconds with Canada taking the bronze after Britain were disqualified for a faulty handover.

GPT2
-----------------------------------
Usain Bolt triumphed after a difficult final, but then he got into trouble when Canada made him try to block.
It's the last match he's held all of last year with the men, who were later picked in the third and fourth rounds of the men's 100m an

The first thing we notice by looking at the model outputs is that the summary generated
by GPT-2 is quite different from the others. Instead of giving a summary of the
text, it summarizes the characters.

Comparing the other three model summaries against the ground truth, we see that
there is remarkable overlap, with PEGASUS’s output bearing the most striking
resemblance.

Now that we have inspected a few models, let’s try to decide which one we would use
in a production setting.

However, this
is not a systematic way of determining the best model! Ideally, we would define a metric, measure it for all models on some benchmark dataset, and choose the one
with the best performance.

But how do you define a metric for text generation?

The
standard metrics that we’ve seen, like accuracy, recall, and precision, are not easy to
apply to this task.

##Measuring the Quality of Generated Text