# Text summarization with T5 on XSum

We are going to fine-tune the [T5 model, implemented by HuggingFace](https://huggingface.co/t5-small), for text summarization on the [Extreme Summarization (XSum)](https://huggingface.co/datasets/xsum) dataset.
The dataset consists of news articles paired with their corresponding one-sentence summaries.

We will use the following model sizes available from HuggingFace:

| Variant                                     |   Parameters    |
|:-------------------------------------------:|----------------:|
| [T5-small](https://huggingface.co/t5-small) |    60,506,624   | 
| [T5-large](https://huggingface.co/t5-large) |   737,668,096   | 
| [T5-3b](https://huggingface.co/t5-3b)       | 2,851,598,336   | 


More info:
* This notebook is based on the script [run_summarization_no_trainer.py](https://github.com/huggingface/transformers/blob/v4.12.5/examples/pytorch/summarization/run_summarization_no_trainer.py) from HuggingFace
* [T5 on HuggingFace docs](https://huggingface.co/transformers/model_doc/t5.html)

In [1]:
import os
import datasets
from datasets import load_dataset, load_metric
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq
from torch.utils.data import DataLoader


In [2]:
from datasets.utils import disable_progress_bar
from datasets import disable_caching


disable_progress_bar()
disable_caching()

## The data

In [3]:
hf_dataset = load_dataset('xsum')

Using custom data configuration default
Reusing dataset xsum (/users/class424/.cache/huggingface/datasets/xsum/default/1.2.0/32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934)


In [4]:
hf_dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

In [5]:
sample = 188948

In [6]:
hf_dataset['train']['id'][sample]

'15575668'

In [7]:
hf_dataset['train']['summary'][sample]

'BB King was hailed as one of the greatest blues musicians of all time.'

In [8]:
hf_dataset['train']['document'][sample]

'His vibrato style of playing influenced a generation of rock and blues guitarists, including Eric Clapton, Mike Bloomfield and Stevie Ray Vaughan.\nRolling Stone magazine once ranked BB King in third place in its list of the 100 greatest guitarists of all time, just below Jimi Hendrix and Duane Allman.\nHis output crossed musical barriers, from jazz and blues to mainstream pop.\nHe was born Riley B King in Indianola, Mississippi, on 16 September 1925. His parents were sharecroppers and, as a young boy, he helped them work in the fields.\nThe family struggled. "When you live in a house that you can always peek out of and see what kind of day it is," King later said, "you\'re not doing so well."\nThe sound of his co-workers hollering the blues was his first introduction to the style of music that he was to help take from a purely black American audience into the mainstream.\nHe bought his first guitar when he was barely a teenager so he could play at church services. In 1947 he moved to

## The tokenizer

In [9]:
hf_model = 't5-small'
t5_cache = os.path.join(os.getcwd(), 'cache')

tokenizer = AutoTokenizer.from_pretrained(
    hf_model,
    use_fast=True,
    cache_dir=os.path.join(t5_cache, f'{hf_model}_tokenizer')
)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [10]:
encoded_text = tokenizer("What's up tokenizer!",
                         max_length=1024,
                         padding=False,
                         truncation=True)

In [11]:
encoded_text

{'input_ids': [363, 31, 7, 95, 14145, 8585, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

 * `attention_mask` marks real tokens with 1 and padding with 0. Nothing is padded here (`padding=False`), so every entry is 1.
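
To make the mask concrete, here is a minimal pure-Python sketch of padding two sequences by hand. `pad_batch` is a hypothetical helper written for illustration, not part of `transformers`, and the pad id of 0 is an assumption that matches T5's `<pad>` token.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to the longest one and build the matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # pad ids on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = text, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The first sequence is the encoding shown above; the second is a made-up short one.
toy_batch = pad_batch([[363, 31, 7, 95, 14145, 8585, 55, 1], [363, 1]])
print(toy_batch["attention_mask"][1])  # [1, 1, 0, 0, 0, 0, 0, 0]
```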

In [12]:
tokenizer.batch_decode(encoded_text['input_ids'])

['What', "'", 's', 'up', 'token', 'izer', '!', '</s>']

In [13]:
with tokenizer.as_target_tokenizer():
    encoded_text = tokenizer("What's up tokenizer!", max_length=1024,
                             padding=False, truncation=True)

In [14]:
encoded_text

{'input_ids': [363, 31, 7, 95, 14145, 8585, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

## Tokenizing the data

In [15]:
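def preprocess_function(examples):
    inputs = examples['document']
    targets = examples['summary']
    # T5 is a text-to-text model: the task is selected with a text prefix
    inputs = [f'summarize: {inp}' for inp in inputs]

    # Tokenize the articles without padding; padding is done per batch by the collator
    model_inputs = tokenizer(inputs, max_length=1024,
                             padding=False, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128,
                           padding=False, truncation=True)

    # The token ids of the summaries are the labels the model is trained to produce
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs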
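
The effect of `truncation=True` together with `max_length` can be sketched in plain Python. This is only an illustration under an assumption: that the tokenizer reserves the last position for the end-of-sequence token `</s>` (id 1 in the outputs above) when it cuts a sequence. `truncate_with_eos` is a hypothetical helper, not a `transformers` function.

```python
def truncate_with_eos(token_ids, max_length, eos_id=1):
    """Keep at most max_length ids in total, with the last slot reserved for EOS.

    Assumption: token_ids does not yet contain the EOS token.
    """
    return token_ids[: max_length - 1] + [eos_id]


ids = list(range(100, 110))  # ten made-up token ids

print(truncate_with_eos(ids, max_length=4))          # [100, 101, 102, 1]
print(len(truncate_with_eos(ids, max_length=1024)))  # 11: short inputs are not cut
```

Under this assumption, articles longer than 1024 tokens and summaries longer than 128 tokens lose their tail, while shorter ones pass through unchanged.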

In [16]:
%%time
processed_datasets = hf_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=hf_dataset["train"].column_names,
    desc="Running tokenizer on dataset",
    num_proc=12
)

CPU times: user 978 ms, sys: 328 ms, total: 1.31 s
Wall time: 34.2 s


In [17]:
processed_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 11334
    })
})

In [18]:
# For training Sequence to Sequence models, we need a special kind of data collator,
# which will not only pad the inputs to the maximum length in the batch,
# but also the labels.
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    label_pad_token_id=tokenizer.pad_token_id
)

per_device_train_batch_size = 128

train_dataset = processed_datasets["train"]

train_dataloader = DataLoader(
    train_dataset,
    shuffle=False,
    collate_fn=data_collator,
    batch_size=per_device_train_batch_size
)
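
What the collator does to a batch can be approximated with a short pure-Python sketch. `collate_seq2seq` is a hypothetical stand-in for illustration, not the implementation of `DataCollatorForSeq2Seq`, and the toy features are made up. Inputs and labels are padded independently, each to the longest sequence of its kind in the batch. Note that `DataCollatorForSeq2Seq` uses `label_pad_token_id=-100` by default, a value that PyTorch's cross-entropy loss ignores; here the pad token id is passed instead, which keeps the labels decodable for inspection.

```python
def collate_seq2seq(features, pad_id=0, label_pad_id=0):
    """Pad inputs and labels separately to the longest sequence in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_in)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_in)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_lab)
    return batch


# Two made-up tokenized examples of different lengths
toy_features = [
    {"input_ids": [21603, 10, 37, 1], "attention_mask": [1, 1, 1, 1], "labels": [1566, 348, 1]},
    {"input_ids": [21603, 10, 1], "attention_mask": [1, 1, 1], "labels": [1566, 1]},
]

print(collate_seq2seq(toy_features)["labels"])
# [[1566, 348, 1], [1566, 1, 0]]
print(collate_seq2seq(toy_features, label_pad_id=-100)["labels"])
# [[1566, 348, 1], [1566, 1, -100]]
```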

In [19]:
# Iterate over a few batches and keep the last one fetched (index 16) for inspection
for step, batch in enumerate(train_dataloader):
    if step > 15:
        break

In [20]:
type(batch)

transformers.tokenization_utils_base.BatchEncoding

In [21]:
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

In [22]:
batch['input_ids'].shape

torch.Size([128, 1024])

In [23]:
batch['input_ids'][0]

tensor([21603,    10,    37,  ...,     0,     0,     0])

In [24]:
batch['attention_mask']  # indicates what's text and what's padding

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])

In [25]:
batch['attention_mask'][0]

tensor([1, 1, 1,  ..., 0, 0, 0])

In [26]:
tokenizer.decode(batch['input_ids'][0][batch['attention_mask'][0]==1])

'summarize: The 39-year-old former world number one is now ranked 96 in the world and without a PGA Tour title since 2012. But his fifth birdie gave him a four-under 67 and took him to 10 under alongside Canadian Graham DeLaet. Ian Poulter enhanced his quest for a PGA Tour card with a 68 to earn a share of third place, two strokes back. The 41-year-old is playing the penultimate event of his 10-tournament medical exemption and will secure his card with 12th place or better. Find out how to get into golf with our special guide. He had five birdies to reach eight under at the Harbour Town links. Donald, four times a runner-up in the tournament, had three consecutive birdies on the front nine and a superb bunker shot to two feet from a precarious plugged lie helped him to save par at the 17th. He then produced a delicate lofted chip from the right of the 18th fairway that checked and trickled into the cup. "I\'ve always felt like I pitch the ball really well round here," said Donald. "The

In [27]:
batch['labels'][0]

tensor([ 1566,   348, 12020,  7459,  6591,  3138,    16,    21,     3,     9,
         5963,    23,    15,    44,     8,   804,  6356,    12,   698,     8,
        22653,   991,    44,     8,   391,  7645, 11523,    44, 22003,  3642,
           16,  1013,  5089,     5,     1,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0])

In [28]:
tokenizer.batch_decode(batch['labels'])[0]

'Englishman Luke Donald chipped in for a birdie at the final hole to share the halfway lead at the RBC Heritage at Hilton Head in South Carolina.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>'
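
The `</s>` and `<pad>` markers in the decoded string are special tokens. `batch_decode` accepts `skip_special_tokens=True` to drop them. The idea can be sketched without the tokenizer by filtering the ids directly. `strip_special` is a hypothetical helper, and it only handles the two ids visible in the outputs above (0 for `<pad>`, 1 for `</s>`), whereas the tokenizer knows about other special tokens as well.

```python
PAD_ID, EOS_ID = 0, 1  # ids of <pad> and </s>, as seen in the outputs above


def strip_special(label_ids, special_ids=(PAD_ID, EOS_ID)):
    """Drop padding and end-of-sequence ids before decoding."""
    return [tok for tok in label_ids if tok not in special_ids]


# Tail of the label tensor shown above
tail = [1013, 5089, 5, 1, 0, 0, 0]
print(strip_special(tail))  # [1013, 5089, 5]
```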