<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/06-summarization/summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Summarization

Text summarization is a difficult task for neural language models, including
transformers. Despite this difficulty, summarization can significantly speed up
the workflows of domain experts, and enterprises use it to condense internal
knowledge, summarize contracts, automatically generate content for social media
releases, and more.

Summarization is a classic
sequence-to-sequence (seq2seq) task with an input text and a target text.

##Setup

In [None]:
!pip -q install transformers
!pip -q install datasets

In [2]:
from transformers import pipeline, set_seed
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import DataCollatorForSeq2Seq
from transformers import TrainingArguments, Trainer

from datasets import load_dataset, load_metric

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import torch

import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm

##CNN/DailyMail Dataset

The CNN/DailyMail dataset consists of around 300,000 pairs of news articles and
their corresponding summaries, composed from the bullet points that CNN and the
DailyMail attach to their articles.

An important aspect of the dataset is that the summaries are abstractive and not extractive, which means that they consist of new
sentences instead of simple excerpts.
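One rough way to see the difference between extractive and abstractive summaries is to measure how many of a summary's n-grams never appear in the source article. The sketch below is only an illustration: the helper names `ngrams` and `novel_ngram_fraction` are hypothetical, whitespace splitting stands in for a real tokenizer, and the toy strings are invented rather than taken from the dataset.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_fraction(article, summary, n=2):
    """Fraction of summary n-grams that never occur in the article.

    Values near 0 suggest an extractive summary (copied phrases);
    values near 1 suggest an abstractive one (newly written phrases).
    """
    article_ngrams = ngrams(article.lower().split(), n)
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - article_ngrams
    return len(novel) / len(summary_ngrams)


# Toy strings invented for illustration; not taken from CNN/DailyMail.
article = "the cat sat on the mat while the dog slept by the door"
extractive = "the cat sat on the mat"
abstractive = "a cat rested as a dog napped"

print(novel_ngram_fraction(article, extractive))
print(novel_ngram_fraction(article, abstractive))
```

Once the dataset is loaded, the same function could be applied to an article and its `highlights` field to get a feel for how much of the reference summary is newly written.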

We’ll use version 3.0.0, which is a nonanonymized version set up for summarization.

In [None]:
# Pass the configuration name ("3.0.0") as the second positional argument.
dataset = load_dataset("cnn_dailymail", "3.0.0")

In [5]:
print(f"Features: {dataset['train'].column_names}")

Features: ['article', 'highlights', 'id']


Let’s look at an excerpt from an article:

In [9]:
sample = dataset["train"][1]

print(f"""Article (excerpt of 500 characters, total length: {len(sample['article'])})""")
# Slice the article so the printed text matches the 500-character label.
print(sample["article"][:500])

print(f"\nSummary (length: {len(sample['highlights'])})")
print(sample["highlights"])

Article (excerpt of 500 characters, total length: 3192)
(CNN) -- Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men's 4x100m relay. The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. The U.S finished second in 37.56 seconds with Canada taking the bronze after Britain were disqualified for a faulty handover. The 26-year-old Bolt has n

We see that the articles can be very long compared to the target summary; in this particular
case the difference is 17-fold.
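The ratio quoted above is simply the character length of the article divided by the character length of its summary. A minimal sketch of that arithmetic follows; `length_ratio` is a hypothetical helper and the toy strings are placeholders, not dataset content.

```python
def length_ratio(article: str, summary: str) -> float:
    """Return how many times longer the article is than the summary, in characters."""
    if not summary:
        raise ValueError("summary must not be empty")
    return len(article) / len(summary)


# Placeholder strings: 500 characters versus 50 characters.
toy_article = "word " * 100
toy_summary = "word " * 10

print(f"{length_ratio(toy_article, toy_summary):.1f}x")
```

In the notebook, the same helper could be called as `length_ratio(sample["article"], sample["highlights"])`.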

Long articles pose a challenge to most transformer
models since the context size is usually limited to 1,000 tokens or so, which is
equivalent to a few paragraphs of text. The standard, yet crude, way to deal with this
for summarization is simply to truncate any text that exceeds the model’s context size,
discarding everything past the cutoff.
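To make the truncation step concrete, here is a minimal sketch that keeps only the first `max_tokens` tokens of a text and reports how many were dropped. It assumes whitespace-separated words as a stand-in for the subword tokens a real model uses, and `truncate_tokens` is a hypothetical helper, not part of any library.

```python
def truncate_tokens(text, max_tokens):
    """Keep the first max_tokens whitespace tokens and count the dropped ones."""
    tokens = text.split()
    kept = tokens[:max_tokens]
    dropped = len(tokens) - len(kept)
    return " ".join(kept), dropped


# Synthetic 1,500-token text, longer than an assumed 1,000-token context.
long_text = " ".join(f"w{i}" for i in range(1500))
short_text, dropped = truncate_tokens(long_text, max_tokens=1000)

print(len(short_text.split()), dropped)
```

With a Hugging Face tokenizer, the equivalent is usually requested at tokenization time, for example `tokenizer(text, max_length=1024, truncation=True)`, where the value of `max_length` depends on the model.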

##Text Summarization Pipelines