<a href="https://colab.research.google.com/github/saurograndi/nlp-t5-summarizer/blob/main/mT5-cnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
!pip install sentencepiece
!pip install transformers
!pip install datasets
!pip install nltk
!pip install rouge_score
!pip install evaluate



In [1]:
from transformers import AutoTokenizer

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [2]:
import os
from huggingface_hub import login

# Read the access token from the environment rather than hard-coding a secret in the notebook.
login(token=os.environ["HF_TOKEN"])

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
import datasets

cnn = datasets.load_dataset("cnn_dailymail", "3.0.0")
cnn

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [4]:
def get_samples(dataset, num_samples=100, seed=42):
    # Draw small, reproducible subsets. num_samples only controls the training split;
    # the validation and test subsets are fixed at 15 examples each.
    train_sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    valid_sample = dataset["validation"].shuffle(seed=seed).select(range(15))
    test_sample = dataset["test"].shuffle(seed=seed).select(range(15))
    return datasets.DatasetDict(
        {"train": train_sample, "valid": valid_sample, "test": test_sample}
    )

In [5]:
cnn_small = get_samples(cnn)
cnn_small

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 100
    })
    valid: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 15
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 15
    })
})

In [6]:
def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> article: {example['article']}'")
        print(f"\n'>> highlights: {example['highlights']}'")


show_samples(cnn_small, num_samples=1)


'>> article: The shocking moment an out-of-control car veered into closed lanes in a Boston tunnel and narrowly missed a state police cruiser has been caught on tape. The driver, Jerry Lin, 24, can be seen in the footage released by authorities, crossing several lanes and then suddenly swerving, just missing the parked cruiser that had its emergency lights flashing. 'He was apparently confused with his GPS device. It was a rental car, and he was confused about trying to get the car back to where it belonged,' state police Captain Thomas Reney said of Lin. Scroll down for video . Close call: The shocking moment an out-of-control car, right, veered into closed lanes in a Boston tunnel and narrowly missed a state police cruiser, left, has been caught on tape . 'He was paying more attention to the GPS than the lane closures.' The . motorist has been cited for negligent driving for the October 7 . incident, as GPS isn't covered by the ban on texting, video and Internet . devices. He was al

In [7]:
#max_input_length = 69632
#max_target_length = 5120
max_input_length = 512
max_target_length = 64

def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["article"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["highlights"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
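
To make the effect of `max_length` together with `truncation=True` concrete, here is a minimal pure-Python sketch that operates on plain lists of token ids. It is not the tokenizer's actual implementation: `truncate_ids` is a hypothetical helper, the token ids are made up, and the sketch assumes that the end-of-sequence token (id 1 for T5-style tokenizers) counts toward `max_length`.

```python
EOS_ID = 1  # assumed end-of-sequence id for T5-style tokenizers


def truncate_ids(token_ids, max_length, eos_id=EOS_ID):
    """Hypothetical helper: cap a sequence at max_length ids, EOS included."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    # Reserve one slot for the EOS token, then append it.
    return list(token_ids[: max_length - 1]) + [eos_id]


long_article = list(range(100, 700))   # 600 made-up token ids
short_summary = list(range(100, 120))  # 20 made-up token ids

truncated_article = truncate_ids(long_article, max_length=512)
truncated_summary = truncate_ids(short_summary, max_length=64)

print(len(truncated_article))  # 512
print(len(truncated_summary))  # 21
```

Anything past the limit is simply discarded, so with `max_input_length = 512` the model never sees the tail of a long article.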

In [8]:
tokenized_datasets_small = cnn_small.map(preprocess_function, batched=True)


In [9]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFMT5ForConditionalGeneration.

All the layers of TFMT5ForConditionalGeneration were initialized from the model checkpoint at google/mt5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMT5ForConditionalGeneration for predictions without further training.


In [10]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
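
The collator pads every batch to the length of its longest member, and to my understanding pads the labels with `-100` so that padded positions are ignored by the loss. The following pure-Python sketch illustrates that idea only. `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption based on T5-style tokenizers.

```python
PAD_TOKEN_ID = 0     # assumed pad id for T5-style tokenizers
LABEL_PAD_ID = -100  # value conventionally ignored by the loss


def pad_batch(features, pad_token_id=PAD_TOKEN_ID, label_pad_id=LABEL_PAD_ID):
    """Hypothetical helper: pad inputs and labels to the longest in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        n_label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_id] * n_label_pad)
    return batch


features = [
    {"input_ids": [259, 1042, 1], "labels": [877, 1]},
    {"input_ids": [259, 1], "labels": [877, 5521, 304, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[259, 1042, 1], [259, 1, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
print(batch["labels"])          # [[877, 1, -100, -100], [877, 5521, 304, 1]]
```

Padding per batch rather than to a global maximum keeps batches of short examples small.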

In [11]:
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets_small["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=8,
)
tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_datasets_small["valid"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=8,
)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
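
As a quick sanity check on the batch counts, the arithmetic can be sketched in plain Python. It assumes that the shuffled training dataset drops its final incomplete batch while the unshuffled validation dataset keeps it. That assumption is consistent with the `12/12` steps per epoch in the training log further down, but it is not verified here. `batches_per_epoch` is a hypothetical helper.

```python
import math


def batches_per_epoch(num_samples, batch_size, drop_remainder):
    """Hypothetical helper: number of batches one pass over the data yields."""
    if drop_remainder:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


# Assumed: the incomplete final batch is dropped for training, kept for validation.
train_batches = batches_per_epoch(100, 8, drop_remainder=True)
valid_batches = batches_per_epoch(15, 8, drop_remainder=False)

num_epochs = 8  # the number of epochs used for training in this notebook
total_steps = train_batches * num_epochs

print(train_batches)  # 12
print(valid_batches)  # 2
print(total_steps)    # 96
```

Under these assumptions, 4 of the 100 training examples are skipped in each epoch, and the optimizer takes 96 steps in total.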


In [12]:
from transformers import create_optimizer
import tensorflow as tf

# The number of training steps is the number of batches per epoch multiplied by the number of epochs.
# tf_train_dataset is a batched tf.data.Dataset, not the original Hugging Face Dataset,
# so len(tf_train_dataset) is already the number of batches per epoch.
num_train_epochs = 8
num_train_steps = len(tf_train_dataset) * num_train_epochs
model_name = model_checkpoint.split("/")[-1]

optimizer, schedule = create_optimizer(
    init_lr=5.6e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

# No loss is passed to compile(): the model computes its own loss from the "labels" column.
model.compile(optimizer=optimizer)

# Request mixed-precision float16. Note that the global policy only applies to layers
# created after this call, and the model above has already been built.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
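
With `num_warmup_steps=0`, the schedule returned by `create_optimizer` is, as far as I understand the library defaults, a linear decay from `init_lr` down to zero over `num_train_steps`. The sketch below reproduces that shape in plain Python under that assumption. `linear_decay_lr` is a hypothetical helper, and 96 steps corresponds to 12 batches per epoch over 8 epochs.

```python
def linear_decay_lr(step, init_lr=5.6e-5, num_train_steps=96, end_lr=0.0):
    """Hypothetical helper: linearly interpolate from init_lr to end_lr."""
    step = min(max(step, 0), num_train_steps)  # clamp to the schedule range
    fraction_done = step / num_train_steps
    return (init_lr - end_lr) * (1.0 - fraction_done) + end_lr


# Learning rate at the start, a quarter of the way in, halfway, and at the end.
for step in (0, 24, 48, 96):
    print(step, linear_decay_lr(step))
```

The learning rate starts at `5.6e-5`, is roughly half that at step 48, and reaches zero at the final step, so the last epochs make progressively smaller updates.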

In [13]:
model.fit(
    tf_train_dataset,
    validation_data=tf_eval_dataset,
    epochs=num_train_epochs,  # keep in sync with num_train_steps used by the schedule
    verbose=2,
)

Epoch 1/8
12/12 - 46s - loss: 21.3843 - val_loss: 10.3862 - 46s/epoch - 4s/step
Epoch 2/8
12/12 - 3s - loss: 16.8341 - val_loss: 8.6107 - 3s/epoch - 251ms/step
Epoch 3/8
12/12 - 3s - loss: 15.4971 - val_loss: 7.7014 - 3s/epoch - 274ms/step
Epoch 4/8
12/12 - 3s - loss: 13.6581 - val_loss: 7.3737 - 3s/epoch - 254ms/step
Epoch 5/8
12/12 - 3s - loss: 13.7722 - val_loss: 7.1429 - 3s/epoch - 252ms/step
Epoch 6/8
12/12 - 3s - loss: 13.8146 - val_loss: 7.1056 - 3s/epoch - 260ms/step
Epoch 7/8
12/12 - 3s - loss: 13.1552 - val_loss: 7.0214 - 3s/epoch - 240ms/step
Epoch 8/8
12/12 - 3s - loss: 12.8195 - val_loss: 6.9329 - 3s/epoch - 241ms/step


<keras.src.callbacks.History at 0x78ced020d4e0>