<a href="https://colab.research.google.com/github/saurograndi/nlp-t5-summarizer/blob/main/mT5-cnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarization with multilingual T5 (mT5)

In [1]:
!pip install sentencepiece
!pip install transformers
!pip install datasets
!pip install nltk
!pip install rouge_score
!pip install evaluate



In [2]:
from transformers import AutoTokenizer

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [3]:
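import os

from huggingface_hub import login

# Read the access token from the environment (assumed here to be exported as HF_TOKEN)
# instead of hard-coding a secret in the notebook.
login(token=os.environ["HF_TOKEN"])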

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [4]:
import datasets

cnn = datasets.load_dataset("cnn_dailymail", "3.0.0")
cnn

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [5]:
def get_samples(dataset, num_samples=100, seed=42):
    train_sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    valid_sample = dataset["validation"].shuffle(seed=seed).select(range(int(num_samples/6.7)))
    test_sample = dataset["test"].shuffle(seed=seed).select(range(int(num_samples/6.7)))
    return datasets.DatasetDict({"train":train_sample, "valid":valid_sample, "test":test_sample})

In [6]:
cnn_small = get_samples(cnn, num_samples=1000)
cnn_small

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 1000
    })
    valid: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 149
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 149
    })
})
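
The validation and test sizes above follow from `int(num_samples / 6.7)` in `get_samples`, which truncates rather than rounds. A small sketch of that arithmetic in plain Python; `split_sizes` is a hypothetical helper used only for illustration and is not part of the notebook:

```python
def split_sizes(num_samples, ratio=6.7):
    """Return (train, valid, test) sizes the way get_samples computes them."""
    held_out = int(num_samples / ratio)  # int() truncates toward zero
    return num_samples, held_out, held_out


print(split_sizes(1000))  # (1000, 149, 149)
print(split_sizes(100))   # (100, 14, 14)
```

With the default `num_samples=100`, the held-out splits would contain only 14 examples each, which is worth keeping in mind when reading validation losses from very small runs.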

In [7]:
def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> article: {example['article']}'")
        print(f"\n'>> highlights: {example['highlights']}'")


show_samples(cnn_small, num_samples=1)


'>> article: By . Nick Enoch . Star trails sweep over the Giant’s Causeway in Northern Ireland, dust clouds are moulded into colossal arrangements by cosmic radiation thousands of light years away and a bright meteor races across the night sky passing over Indonesia’s smoke-spewing Mount Bromo. These are just some of the incredible photos which have been shortlisted in the 2014 Astronomy Photographer of the Year competition. The contest, run by the Royal Observatory Greenwich in association with BBC Sky at Night Magazine, is now in its sixth year - and a record number of entries from more than 2,500 enthusiastic amateurs and professional photographers have poured in from around the world. Centre of the Heart Nebula by Ivan Eder, Hungary. Situated 7,500 light years away in the W-shaped constellation of Cassiopeia, the Heart Nebula is a vast region of glowing gas, energised by a cluster of young stars at its centre. The image depicts the central region, where dust clouds are being erode

In [8]:
#max_input_length = 69632
#max_target_length = 5120
max_input_length = 512
max_target_length = 64

def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["article"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["highlights"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
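
To make the effect of `max_length` with `truncation=True` concrete, here is a rough sketch on plain lists of token ids. It assumes the tokenizer reserves the last position for the end-of-sequence token (id 1 for mT5) when it truncates; `truncate_with_eos` and the example ids are hypothetical stand-ins, not the tokenizer's actual implementation:

```python
EOS_ID = 1  # assumed end-of-sequence id for the mT5 vocabulary


def truncate_with_eos(token_ids, max_length):
    """Keep at most max_length ids, reserving the final slot for EOS."""
    kept = token_ids[: max_length - 1]
    return kept + [EOS_ID]


short = [5, 6]                 # fits: only EOS is appended
long = list(range(10, 20))     # ten ids: cut down to max_length
print(truncate_with_eos(short, max_length=4))  # [5, 6, 1]
print(truncate_with_eos(long, max_length=4))   # [10, 11, 12, 1]
```

In the notebook, articles longer than 512 tokens lose their tail in the same way, and reference summaries are capped at 64 tokens.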

In [9]:
tokenized_datasets_small = cnn_small.map(preprocess_function, batched=True)


In [10]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFMT5ForConditionalGeneration.

All the layers of TFMT5ForConditionalGeneration were initialized from the model checkpoint at google/mt5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMT5ForConditionalGeneration for predictions without further training.


In [11]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
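
The collator pads each batch dynamically to its longest example. The sketch below imitates that behaviour on plain Python lists, assuming the defaults as I understand them: inputs are padded with the tokenizer's pad id (0 for mT5) and labels with -100 so that padded positions are ignored by the loss. `pad_batch` is a hypothetical helper, and it leaves out the decoder input ids that the real collator also prepares when given a model:

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad input_ids and labels to the longest sequence in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = len(f["input_ids"])
        n_lab = len(f["labels"])
        # Inputs: pad with the pad token and mask the padded positions out.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        # Labels: pad with -100 so the loss skips these positions.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (max_lab - n_lab))
    return batch


features = [
    {"input_ids": [5, 6, 7, 1], "labels": [8, 1]},
    {"input_ids": [9, 1], "labels": [10, 11, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[5, 6, 7, 1], [9, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
print(batch["labels"])          # [[8, 1, -100], [10, 11, 1]]
```

Padding per batch rather than to a fixed 512 tokens keeps short batches small, which is why the tokenization step above truncates but does not pad.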

In [12]:
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets_small["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=8,
)
tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_datasets_small["valid"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=8,
)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [13]:
from transformers import create_optimizer
import tensorflow as tf

# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face Dataset, so its len() is already num_samples // batch_size.
num_train_epochs = 8
num_train_steps = len(tf_train_dataset) * num_train_epochs
model_name = model_checkpoint.split("/")[-1]

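# Defined for reference only: it is not passed to compile() below, so the model
# falls back to its built-in loss.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)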


optimizer, schedule = create_optimizer(
    init_lr=5.6e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

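model.compile(optimizer=optimizer)

# Request mixed-precision float16. As far as Keras documents it, a dtype policy applies to
# layers created after this call, so to affect this model it would need to be set before
# from_pretrained() in cell 10.
tf.keras.mixed_precision.set_global_policy("mixed_float16")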
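
The step count described in the comments above works out to 125 batches per epoch (1000 training examples at batch size 8) times 8 epochs, or 1000 optimizer steps. The sketch below reproduces that arithmetic and a linear decay from the initial learning rate to zero, which is what I understand `create_optimizer` to use by default when `num_warmup_steps=0`; `linear_decay_lr` is a hypothetical stand-in, not the library's schedule class:

```python
num_samples = 1000
batch_size = 8
num_train_epochs = 8

steps_per_epoch = num_samples // batch_size          # 125 batches per epoch
num_train_steps = steps_per_epoch * num_train_epochs  # 1000 steps in total


def linear_decay_lr(step, init_lr=5.6e-5, total_steps=num_train_steps):
    """Learning rate at `step` for a linear decay from init_lr down to 0."""
    step = min(step, total_steps)  # hold at 0 once training is finished
    return init_lr * (1 - step / total_steps)


print(steps_per_epoch, num_train_steps)  # 125 1000
for step in (0, 500, 1000):
    print(step, linear_decay_lr(step))
```

Because the schedule is tied to `num_train_steps`, training for a different number of epochs than `num_train_epochs` would end the run with the learning rate either still above zero or pinned at zero early.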

In [14]:
model.fit(
    tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_train_epochs, verbose=2
)

Epoch 1/8
125/125 - 78s - loss: 13.8164 - val_loss: 6.5159 - 78s/epoch - 623ms/step
Epoch 2/8
125/125 - 31s - loss: 8.2684 - val_loss: 4.2656 - 31s/epoch - 245ms/step
Epoch 3/8
125/125 - 32s - loss: 6.9345 - val_loss: 3.6426 - 32s/epoch - 252ms/step
Epoch 4/8
125/125 - 31s - loss: 6.2786 - val_loss: 3.3991 - 31s/epoch - 251ms/step
Epoch 5/8
125/125 - 31s - loss: 5.9997 - val_loss: 3.2925 - 31s/epoch - 246ms/step
Epoch 6/8
125/125 - 31s - loss: 5.8351 - val_loss: 3.2355 - 31s/epoch - 246ms/step
Epoch 7/8
125/125 - 31s - loss: 5.7012 - val_loss: 3.2027 - 31s/epoch - 252ms/step
Epoch 8/8
125/125 - 31s - loss: 5.6631 - val_loss: 3.1923 - 31s/epoch - 247ms/step


<keras.src.callbacks.History at 0x79a06027fbe0>