# Summarization Using T5 Model

## This notebook outlines the concepts behind fine-tuning a summarization model using a T5 variant

In [None]:
!pip install git+https://github.com/huggingface/accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/accelerate
  Cloning https://github.com/huggingface/accelerate to /tmp/pip-req-build-auiwkndu
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate /tmp/pip-req-build-auiwkndu
  Resolved https://github.com/huggingface/accelerate to commit eba6eb79dc2ab652cd8b44b37165a4852768a8ac
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Import Necessary Libraries

In [None]:
! pip install -q datasets transformers rouge-score nltk

In [None]:
import torch
torch.cuda.empty_cache()
from datasets import load_dataset, load_metric
import pandas as pd
import numpy as np
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

## Loading the dataset

Dataset: https://www.kaggle.com/datasets/sunnysai12345/news-summary <br>
Use the `news_summary_more.csv` file from this dataset.

In [None]:
data = pd.read_csv('data/news_summary_more.csv',nrows=10000)

In [None]:
# Split the data into train and test sets
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

# Split the train set further into train and validation sets
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)
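
Applying a 20% hold-out twice leaves 64% of the rows for training, 16% for validation and 20% for testing. The sketch below reproduces that arithmetic in plain Python, assuming the held-out share is rounded up and the remainder goes to the other split; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_remaining, n_held_out) for a fractional hold-out.

    Assumption: the held-out count is rounded up and the rest is kept.
    """
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

# First split: 10,000 rows into train+validation and test
n_train_val, n_test = split_sizes(10_000, 0.2)
# Second split: train+validation into train and validation
n_train, n_val = split_sizes(n_train_val, 0.2)

print(n_train, n_val, n_test)
```

These are the row counts to expect in each split once the data is wrapped in `Dataset` objects.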

In [None]:
# Create a Dataset object for each split
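# `from_pandas` is the dedicated constructor for DataFrames;
# `preserve_index=False` drops the shuffled pandas index
train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
val_dataset = Dataset.from_pandas(val_df, preserve_index=False)
test_dataset = Dataset.from_pandas(test_df, preserve_index=False)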

The three splits are combined into a single `DatasetDict`. To access an actual element, you need to select a split first, then give an index (for example `data['train'][0]`):

In [None]:
# Create a DatasetDict object with the splits
data = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['headlines', 'text'],
        num_rows: 6400
    })
    validation: Dataset({
        features: ['headlines', 'text'],
        num_rows: 1600
    })
    test: Dataset({
        features: ['headlines', 'text'],
        num_rows: 2000
    })
})

To get a sense of what the data looks like, the following function shows a few examples picked at random from the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
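
The loop in `show_random_elements` keeps drawing indices until it has `num_examples` distinct ones. As a minimal sketch, the standard library's `random.sample` gives the same guarantee of distinct picks in a single call. `pick_indices` and its `seed` parameter are hypothetical names introduced only for this illustration.

```python
import random

def pick_indices(dataset_len, num_examples=5, seed=None):
    """Return `num_examples` distinct indices in [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(range(dataset_len), num_examples)

picks = pick_indices(6400, num_examples=5, seed=0)
print(picks)
```

The returned list can be used to index a `Dataset` in the same way as `picks` in the function above.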

In [None]:
show_random_elements(data["train"])

| | headlines | text |
|---|---|---|
| 0 | Indian documentary on menstruation shortlisted for Oscars | A documentary by Guneet Monga called 'Period. End of Sentence.', which is based on women in India fighting against stigma of menstruation, has been shortlisted by Oscars in their Documentary Short Subject category. The film also revolves around Arunachalam Muruganantham who invented an easy-to-operate machine that makes low-cost sanitary napkins and inspired the Bollywood film 'Padman'. |
| 1 | CSK coach Fleming backs Dhoni to play World Cup 2019 | Former Indian captain MS Dhoni's coach from Chennai Super Kings, Stephen Fleming, has backed the Indian wicketkeeper to play in the World Cup 2019 as his strength is "immeasurable". "He needs to have the confidence to go and play like that in the ODIs and I think the big stage is something he is looking forward to," Fleming added. |
| 2 | Don't play around with the law, SC warns Karti Chidambaram | The Supreme Court today allowed former Finance Minister P Chidambaram's son Karti Chidambaram to travel abroad but warned him, "You can go wherever you want...but don't play around with the law." The court also directed Karti, an accused in the Aircel-Maxis case, to deposit ₹10 crore as surety and warned him of action in case of "an iota of non-cooperation." |
| 3 | Varun Chakravarthy is a long-term investment: Preity Zinta | Speaking about Tamil Nadu spinner Varun Chakravarthy, who was bought by KXIP for ₹8.4 crore, KXIP co-owner Preity Zinta said, "Varun is an underexposed mystery bowler... Varun is a long-term investment for us." "I feel that with guidance from coach Mike Hesson, he (Varun) will be able to hone his capabilities and contribute to team's success," the Bollywood actor said. |
| 4 | Not alliance's opinion: Akhilesh on 'Rahul-for-PM' remark | Commenting on DMK President MK Stalin's proposal to name Congress President Rahul Gandhi as Prime Ministerial candidate, Samajwadi Party chief Akhilesh Yadav said one's opinion isn't the entire alliance's opinion. The process of formation of opposition alliances and naming candidates will keep going on, Yadav added. He further said, "Everyone is unhappy with BJP and wants it to go." n |


Generated summaries are scored with ROUGE. The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [None]:
metric = load_metric("rouge")
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"\n"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu
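
To make the "n-gram based scoring" in the description above concrete, here is a deliberately simplified sketch of a ROUGE-1 F-measure in plain Python. It assumes lowercased whitespace tokenization and no stemming, so its values will not match the `rouge` metric exactly; `rouge1_f` is a hypothetical helper written for this illustration.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Simplified ROUGE-1 F-measure: unigram overlap between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each unigram counts at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the cat sat", "the cat sat on the mat")
print(round(score, 4))
```

In this toy pair all three predicted words occur in the reference (precision 1.0) but only half of the six reference words are covered (recall 0.5), which is why the F-measure sits between the two.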

In [None]:
data['train'][0]

{'headlines': "Sushant to star in Bhandarkar's 'Inspector Ghalib': Reports",
 'text': 'Sushant Singh Rajput will be starring in Madhur Bhandarkar\'s next film based on sand mafias titled \'Inspector Ghalib\', as per reports. The story, which is inspired by real-life events, is based in Uttar Pradesh and the major part of the film will be shot there, reports suggested. "\'Inspector Ghalib\' is the story of a cop," stated reports.'}

## Importing Pretrained Model and Tokenizer

In [67]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
model_checkpoint = "t5-base"
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [66]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


## Preprocessing the data

In [None]:
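max_input_length = 1024
max_target_length = 128
# Task prefix prepended to every article. Note: the T5 checkpoints were
# pretrained with "summarize: " (colon and space); the value below is the
# one used for the outputs recorded in this notebook.
prefix = 'summarize'

def preprocess_function(examples):
    # Prepend the task prefix to each article
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the headlines as targets (`text_target` replaces the
    # deprecated `as_target_tokenizer()` context manager)
    labels = tokenizer(text_target=examples["headlines"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs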

In [None]:
preprocess_function(data['train'][:2])



{'input_ids': [[21603, 134, 8489, 288, 16738, 13509, 2562, 56, 36, 3, 22236, 16, 5428, 10666, 272, 2894, 291, 4031, 31, 7, 416, 814, 3, 390, 30, 3, 7, 232, 954, 89, 23, 9, 7, 3, 10920, 3, 31, 1570, 5628, 127, 350, 1024, 6856, 31, 6, 38, 399, 2279, 5, 37, 733, 6, 84, 19, 3555, 57, 490, 18, 4597, 984, 6, 19, 3, 390, 16, 31251, 22660, 11, 8, 779, 294, 13, 8, 814, 56, 36, 2538, 132, 6, 2279, 5259, 5, 96, 31, 1570, 5628, 127, 350, 1024, 6856, 31, 19, 8, 733, 13, 3, 9, 7326, 976, 4568, 2279, 5, 1], [21603, 13035, 343, 3084, 115, 107, 17815, 16332, 1599, 6, 581, 4068, 3, 9, 495, 47, 5132, 16, 1718, 21, 3, 17211, 24, 24084, 15, 7, 45, 112, 3797, 3, 26867, 16, 1010, 17, 14277, 6, 65, 118, 7020, 15794, 5, 16332, 1599, 47, 7020, 15794, 30, 3, 9, 525, 6235, 13, 3, 1439, 2, 15660, 20202, 96, 20829, 861, 924, 11992, 808, 8, 2728, 45, 140, 233, 19055, 887, 113, 3, 342, 82, 3797, 31, 7, 24084, 15, 7, 1891, 3879, 12, 520, 7, 976, 8, 20622, 141, 7760, 383, 3, 9, 452, 13980, 5, 1]], 'attention_mask': [[1
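
Since `preprocess_function` builds each input as `prefix + doc`, any separator between the task prefix and the article has to be part of `prefix` itself. The sketch below shows the difference using plain string concatenation; `build_input` is a hypothetical helper, and `"summarize: "` is the prefix form used in the T5 summarization examples.

```python
def build_input(prefix, doc):
    """Concatenate the task prefix and the document, as preprocess_function does."""
    return prefix + doc

doc = "Sushant Singh Rajput will be starring in Madhur Bhandarkar's next film"

without_separator = build_input("summarize", doc)
with_separator = build_input("summarize: ", doc)

print(without_separator[:25])  # prefix runs straight into the first word
print(with_separator[:25])     # prefix is separated by a colon and a space
```

With the bare `"summarize"` prefix the first word of the article is glued to the prefix, so the tokenizer sees a different string from the one the pretrained checkpoint expects for this task.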

In [None]:
tokenized_datasets = data.map(preprocess_function, batched=True)


## Setting Up Arguments of Model for Fine Tuning

In [68]:
batch_size = 4
args = Seq2SeqTrainingArguments(
    "test-summarization",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
)

In [69]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
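
The tokenized examples have different lengths, so the collator pads each batch to its longest sequence. The following is a rough pure-Python sketch of that idea, not the library's implementation: inputs are padded with the pad token id (0 for T5) and labels with -100, the default `label_pad_token_id`, so padded label positions are ignored by the loss. `pad_batch` and the toy ids are illustrative assumptions.

```python
PAD_TOKEN_ID = 0      # T5 pad token id
LABEL_PAD_ID = -100   # default label padding value, ignored by the loss

def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Toy token ids, for illustration only
input_ids = [[21603, 8489, 1], [21603, 13035, 343, 3084, 1]]
labels = [[100, 200, 1], [300, 1]]

padded_inputs = pad_batch(input_ids, PAD_TOKEN_ID)
padded_labels = pad_batch(labels, LABEL_PAD_ID)
# 1 marks real tokens, 0 marks padding
attention_mask = [[int(tok != PAD_TOKEN_ID) for tok in seq] for seq in padded_inputs]

print(padded_inputs)
print(padded_labels)
print(attention_mask)
```

Padding per batch rather than to a fixed global length keeps short batches small, which is the reason no `padding` argument is passed to the tokenizer during preprocessing.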

## Define Compute Metrics

In [70]:
import nltk
import numpy as np
nltk.download('punkt')
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
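
Two steps in `compute_metrics` are easy to check on a toy batch: replacing the -100 label padding with the pad token id so the labels can be decoded, and counting non-pad tokens to get the generated length. The sketch below uses made-up ids and assumes a pad token id of 0, as in T5.

```python
import numpy as np

PAD_TOKEN_ID = 0  # assumption: T5's pad token id

# Toy label batch, padded with -100 by the data collator
labels = np.array([[100, 200, 1, -100, -100],
                   [300, 1, -100, -100, -100]])
# Swap -100 for the pad id so the tokenizer could decode these rows
decodable_labels = np.where(labels != -100, labels, PAD_TOKEN_ID)

# Toy generated ids; T5 generations start with the pad id as decoder start token
predictions = np.array([[0, 55, 66, 1, 0],
                        [0, 77, 1, 0, 0]])
# Length of each prediction = number of non-pad tokens
prediction_lens = [int(np.count_nonzero(pred != PAD_TOKEN_ID)) for pred in predictions]

print(decodable_labels.tolist())
print(prediction_lens, float(np.mean(prediction_lens)))
```

The mean of `prediction_lens` is what is reported as `gen_len` in the metrics.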

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Initialize `Seq2SeqTrainer` for training the model on a custom dataset:

In [71]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We can now fine-tune our model by calling the `train` method:

In [72]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 1.55 | 1.35399 | 51.271 | 27.4381 | 46.405 | 46.4887 | 16.4506 |


TrainOutput(global_step=1600, training_loss=1.697555570602417, metrics={'train_runtime': 624.9148, 'train_samples_per_second': 10.241, 'train_steps_per_second': 2.56, 'total_flos': 792839386091520.0, 'train_loss': 1.697555570602417, 'epoch': 1.0})
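
The figures in `TrainOutput` can be cross-checked against the configuration: 6,400 training examples, a per-device batch size of 4 and one epoch. The sketch below assumes a single device and no gradient accumulation, which matches the arguments shown earlier as far as they are specified.

```python
import math

num_train_examples = 6400
batch_size = 4
num_epochs = 1
train_runtime = 624.9148  # seconds, taken from TrainOutput

# One optimizer step per batch (single device, no gradient accumulation assumed)
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps, round(samples_per_second, 3), round(steps_per_second, 2))
```

The step count and both throughput values agree with `global_step`, `train_samples_per_second` and `train_steps_per_second` reported above.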

## Generate a Summary for a Sample Text

In [73]:
sample_text = "Two years ago, swing and seam proved to be India’s bugbears in the World Test Championship (WTC) final against New Zealand. The four-pronged pace attack of Tim Southee, Trent Boult, Neil Wagner and Kyle Jamieson had left the Indians on the mat. In a couple of days time, the Rohit Sharma-led team will again be up against the likes of Pat Cummins and Mitchell Starc, who nonetheless are a notch better than their trans-Tasman rivals. The age-old manual of how to bat in England is simple. Wait. But the Indians don’t do waiting, all that well. Especially when the ball swings around. The swing breaks their soul and about the time the ball starts to curve away or shape in, Indian batsman start to freeze. The balance starts to go topsy-turvy, the bat follows and there they break the thumb rule of batsmanship: Never play away from your body. It will be familiar territory for the Indian batsmen with the trajectory of the swinging ball winking at them. Coming right after the IPL, the lack of preparation might hinder India’s chance to end their ICC title drought."
encoded_input = tokenizer(sample_text, truncation=True, padding=True, max_length=512, return_tensors="pt")


In [74]:
device = 'cpu'

In [75]:
# Move the model to the same device as the input tensors
model = model.to(device)

In [76]:
input_ids = encoded_input["input_ids"].to(device)
attention_mask = encoded_input["attention_mask"].to(device)

In [77]:
# Generate the summary
output = model.generate(input_ids=input_ids, attention_mask=attention_mask)
generated_summary = tokenizer.decode(output[0], skip_special_tokens=True)




In [78]:
generated_summary

'Indians don’t do waiting, all that well: Never play away from your body'