<a href="https://colab.research.google.com/github/saurograndi/nlp-t5-summarizer/blob/main/T5-cnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarization with T5

In [None]:
!pip install sentencepiece
!pip install transformers
!pip install datasets
!pip install nltk
!pip install rouge_score
!pip install evaluate



In [None]:
from transformers import AutoTokenizer

model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
from google.colab import userdata
from huggingface_hub import login

login(token=userdata.get('token'))

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
import datasets

cnn = datasets.load_dataset("cnn_dailymail", "3.0.0")
cnn

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [None]:
def get_samples(dataset, num_samples=100, seed=42):
    """Draw a small, reproducible subset of each split."""
    # The held-out splits get num_samples / 6.7 examples, about 15% of the training subset.
    num_held_out = int(num_samples / 6.7)
    train_sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    valid_sample = dataset["validation"].shuffle(seed=seed).select(range(num_held_out))
    test_sample = dataset["test"].shuffle(seed=seed).select(range(num_held_out))
    return datasets.DatasetDict({"train": train_sample, "valid": valid_sample, "test": test_sample})
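
The split sizes that `get_samples` produces follow from simple arithmetic. The sketch below only illustrates that calculation; `split_sizes` is a hypothetical helper that is not part of the notebook.

```python
def split_sizes(num_samples, ratio=6.7):
    """Return the (train, valid, test) sizes that get_samples would select."""
    # int() truncates toward zero, so 1000 / 6.7 = 149.25... becomes 149.
    held_out = int(num_samples / ratio)
    return num_samples, held_out, held_out


print(split_sizes(1000))  # (1000, 149, 149)
```

With `num_samples=1000` this gives 149 validation and 149 test examples.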

In [None]:
cnn = get_samples(cnn, num_samples=1000)
cnn

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 1000
    })
    valid: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 149
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 149
    })
})

In [None]:
def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> article: {example['article']}'")
        print(f"\n'>> highlights: {example['highlights']}'")


show_samples(cnn, num_samples=1)


'>> article: By . Nick Enoch . Star trails sweep over the Giant’s Causeway in Northern Ireland, dust clouds are moulded into colossal arrangements by cosmic radiation thousands of light years away and a bright meteor races across the night sky passing over Indonesia’s smoke-spewing Mount Bromo. These are just some of the incredible photos which have been shortlisted in the 2014 Astronomy Photographer of the Year competition. The contest, run by the Royal Observatory Greenwich in association with BBC Sky at Night Magazine, is now in its sixth year - and a record number of entries from more than 2,500 enthusiastic amateurs and professional photographers have poured in from around the world. Centre of the Heart Nebula by Ivan Eder, Hungary. Situated 7,500 light years away in the W-shaped constellation of Cassiopeia, the Heart Nebula is a vast region of glowing gas, energised by a cluster of young stars at its centre. The image depicts the central region, where dust clouds are being erode

## Preprocessing

In [None]:
prefix = "summarize: "

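def preprocess_function(examples):
    # T5 is a text-to-text model, so the task is signalled with a text prefix.
    inputs = [prefix + doc for doc in examples["article"]]
    # Articles are truncated to 512 tokens.
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    # Reference summaries are tokenized as targets and truncated to 128 tokens.
    labels = tokenizer(text_target=examples["highlights"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs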
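
To make the shape of the preprocessing output concrete, here is a minimal sketch that mirrors the steps above with a toy whitespace tokenizer. `toy_tokenize` and `toy_preprocess` are hypothetical stand-ins: the real T5 tokenizer produces subword ids and appends an end-of-sequence token, which this sketch does not do.

```python
PREFIX = "summarize: "


def toy_tokenize(text, max_length):
    """Hypothetical stand-in for the tokenizer: split on whitespace, then truncate."""
    return text.split()[:max_length]


def toy_preprocess(examples, max_input_len=8, max_target_len=4):
    # Prepend the task prefix to every article, as preprocess_function does.
    inputs = [PREFIX + doc for doc in examples["article"]]
    model_inputs = {"input_ids": [toy_tokenize(t, max_input_len) for t in inputs]}
    # Targets are tokenized separately, with their own (shorter) length limit.
    model_inputs["labels"] = [
        toy_tokenize(t, max_target_len) for t in examples["highlights"]
    ]
    return model_inputs


batch = {
    "article": ["the quick brown fox jumps over the lazy dog near the river bank"],
    "highlights": ["a fox jumps over a dog"],
}
out = toy_preprocess(batch)
print(out["input_ids"][0])  # the prefix counts toward the input length limit
print(out["labels"][0])
```

Note that the prefix consumes part of the input budget, so slightly fewer article tokens survive truncation.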

In [None]:
tokenized_cnn = cnn.map(preprocess_function, batched=True)

In [None]:
tokenized_cnn = tokenized_cnn.remove_columns(['id'])

In [None]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
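
The collator pads each batch dynamically to the length of its longest example. The sketch below illustrates the idea in plain Python under a few assumptions: `pad_batch` is a hypothetical helper, the token ids are made up, and the real `DataCollatorForSeq2Seq` does more than this (for instance, it can also prepare decoder inputs when a model is passed).

```python
PAD_TOKEN_ID = 0     # T5's pad token id
LABEL_PAD_ID = -100  # label value that the loss ignores


def pad_batch(features):
    """Pad input_ids and labels to the longest sequence in this batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_in - n_in))
        # The attention mask is 1 for real tokens and 0 for padding.
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        # Labels are padded with -100 rather than the pad token id.
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch


features = [
    {"input_ids": [21, 7, 1], "labels": [5, 1]},
    {"input_ids": [21, 7, 9, 13, 1], "labels": [5, 8, 6, 1]},
]
padded = pad_batch(features)
print(padded["input_ids"][0])  # [21, 7, 1, 0, 0]
print(padded["labels"][0])     # [5, 1, -100, -100]
```

Padding per batch rather than to a global maximum keeps short batches small.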

## Evaluation

In [None]:
import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # -100 marks padded label positions; swap it for the pad token so the labels can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Average generated length in tokens, not counting padding.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
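
As a rough intuition for what the ROUGE scores measure, the sketch below computes a simplified ROUGE-1 F1 by hand. It is not the `rouge_score` implementation used above: `rouge1_f1` is a hypothetical helper that splits on whitespace and applies no stemming, so its numbers will generally differ from those reported by `evaluate`.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each reference token can be matched at most once.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # 0.8333
```

Five of the six unigrams match in each direction, so precision, recall, and F1 are all 5/6.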

## Training

In [None]:
from transformers import AdamWeightDecay

# Adam with decoupled weight decay.
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

The T5 model was already loaded above with TFAutoModelForSeq2SeqLM. Convert the tokenized datasets to the tf.data.Dataset format with prepare_tf_dataset():

In [None]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_cnn["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_cnn["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Transformers models all have a default task-relevant loss function, so there is no need to specify one when compiling the model.

In [None]:
import tensorflow as tf

model.compile(optimizer=optimizer)  # No loss argument!
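
For a sequence-to-sequence model, the default loss is essentially a token-level cross-entropy that skips label positions set to -100. The sketch below illustrates that idea in plain Python. It is an assumption-laden illustration rather than the library's code: `masked_cross_entropy` is a hypothetical helper, and the logits are made-up numbers over a three-token vocabulary.

```python
import math

IGNORE_INDEX = -100  # label value used for padded target positions


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target token ids, with IGNORE_INDEX marking padding
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # padded position: contributes nothing to the loss
        # Numerically stable log of the softmax normalizer.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / count if count else 0.0


logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 1.0, 1.0]]
labels = [0, 1, IGNORE_INDEX]
loss = masked_cross_entropy(logits, labels)
print(round(loss, 4))
```

The third position is masked, so changing its logits leaves the loss unchanged. This is why the data collator pads labels with -100.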

In [None]:
from transformers.keras_callbacks import KerasMetricCallback

# With predict_with_generate=True, text is generated for each sample in the evaluation set,
# so compute_metrics scores the generated summaries.
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set, predict_with_generate=True)

callbacks = [metric_callback]

In [None]:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=callbacks)

Epoch 1/3



Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x789d1c1a3010>

## Inference

In [None]:
text = "summarize: Questa è una giornata per recuperare un terreno che non abbiamo saputo calpestare nel modo giusto”. Gianni Cuperlo presenta così la lunga giornata del convegno “La parola Pace – L’utopia che deve farsi realtà”, promosso a Milano dalla sua associazione Promessa democratica. Sul palco si alternano decine di dirigenti del Pd, fino alla segretaria Elly Schlein. E l’evento, oltre ai validi contributi sul Medio Oriente (tra cui quelli di Lucia Annunziata e Domenico Quirico), suona soprattutto come un netto cambio di atteggiamento del partito nei confronti della guerra in Ucraina. Persino un mea culpa, a giudicare dai toni di alcuni contributi. Non solo dall’ala più a sinistra, ma pure da Base Riformista, la corrente guidata da Lorenzo Guerini, seduto in prima fila per tutta la mattinata."

In [None]:
inputs = tokenizer(text, return_tensors="tf").input_ids
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)
tokenizer.decode(outputs[0], skip_special_tokens=True)

'Gianni Cuperlo presenta cos la lunga giornata del convegno “La parola Pace – L’utopia che deve farsi realtà” Milano dalla sua associazione Promessa democratica.'
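
With `do_sample=False` and the default single beam, `generate` performs greedy decoding: at each step it appends the highest-scoring next token until an end-of-sequence token appears or `max_new_tokens` is reached. The sketch below illustrates that loop on a toy lookup table. `TOY_SCORES` and `greedy_decode` are hypothetical, and unlike this toy the real model conditions on the encoded input and on all previously generated tokens.

```python
EOS = "</s>"

# Hypothetical next-token scores keyed by the previous token only.
TOY_SCORES = {
    "<pad>": {"the": 0.6, "a": 0.3, EOS: 0.1},
    "the": {"summary": 0.7, "the": 0.2, EOS: 0.1},
    "summary": {EOS: 0.8, "the": 0.2},
}


def greedy_decode(start_token, max_new_tokens):
    """Append the highest-scoring next token until EOS or the length limit."""
    tokens = [start_token]
    for _ in range(max_new_tokens):
        scores = TOY_SCORES[tokens[-1]]
        next_token = max(scores, key=scores.get)
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token


# T5 uses the pad token as the decoder start token.
print(greedy_decode("<pad>", max_new_tokens=100))  # ['the', 'summary']
```

Greedy decoding is deterministic, so running the inference cell again on the same input gives the same summary.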