# Summarization Using Pegasus Model

## This notebook outlines the concepts behind fine-tuning a summarization model using a Pegasus model

In [1]:
!pip install git+https://github.com/huggingface/accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/accelerate
  Cloning https://github.com/huggingface/accelerate to /tmp/pip-req-build-v42amkqu
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate /tmp/pip-req-build-v42amkqu
  Resolved https://github.com/huggingface/accelerate to commit eba6eb79dc2ab652cd8b44b37165a4852768a8ac
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: accelerate
  Building wheel for accelerate (pyproject.toml) ... done
  Created wheel for accelerate: filename=accelerate-0.20.0.dev0-py3-none-any.whl size=226907 sha256=64cdef954477ca458f497a5a90488bbd2113cedfffeb40a140a26cb4585f2c7f
  Stored in directory: /tmp/pip-ephem-wheel-cache-o791abxu/wheels/f6/c7/9d/

## Import Necessary Libraries

In [2]:
! pip install -q datasets transformers rouge-score nltk

In [3]:
import torch
torch.cuda.empty_cache()
from datasets import load_dataset, load_metric
import pandas as pd
import numpy as np
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

## Loading the dataset

Link: https://www.kaggle.com/datasets/sunnysai12345/news-summary <br>
Use the `news_summary_more.csv` file from this dataset.

In [4]:
data = pd.read_csv('data/news_summary_more.csv',nrows=10000)

In [5]:
# Split the data into train and test sets
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

# Split the train set further into train and validation sets
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)

In [6]:
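# Create a Dataset object for each split.
# from_pandas is the documented constructor for DataFrames;
# preserve_index=False keeps the shuffled pandas index out of the features.
train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
val_dataset = Dataset.from_pandas(val_df, preserve_index=False)
test_dataset = Dataset.from_pandas(test_df, preserve_index=False)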

The three splits are then combined into a single `DatasetDict`. To access an actual element, you need to select a split first, then give an index (for example `data['train'][0]`):

In [7]:
# Create a DatasetDict object with the splits
data = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})

In [8]:
data

DatasetDict({
    train: Dataset({
        features: ['headlines', 'text'],
        num_rows: 6400
    })
    validation: Dataset({
        features: ['headlines', 'text'],
        num_rows: 1600
    })
    test: Dataset({
        features: ['headlines', 'text'],
        num_rows: 2000
    })
})
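
The row counts above follow from applying a 20% hold-out twice. The sketch below reproduces that arithmetic in plain Python. It assumes the test portion is rounded up, which is my understanding of how `train_test_split` treats a fractional `test_size`. The helper name `split_sizes` is hypothetical and not part of scikit-learn.

```python
from fractions import Fraction
from math import ceil


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the test portion up (assumed behavior)."""
    n_test = ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


# Use an exact fraction for 0.2 so the arithmetic has no float rounding
one_fifth = Fraction(1, 5)

# First split: 10,000 rows into train+validation and test
n_train_val, n_test = split_sizes(10_000, one_fifth)

# Second split: the remaining rows into train and validation
n_train, n_val = split_sizes(n_train_val, one_fifth)

print(n_train, n_val, n_test)
```

The effective proportions are therefore 64% train, 16% validation, and 20% test rather than 80/20 overall.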

### To get a sense of what the data looks like, the following function shows a few examples picked at random from the dataset.

In [9]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [10]:
show_random_elements(data["train"])

| | headlines | text |
|---|---|---|
| 0 | Earth to be closest to the Sun for 2019 on January 3 | The Earth will be closest to the Sun at a point in its orbit called perihelion for the year 2019 on January 3, 10:50 am IST. However, the Northern Hemisphere witnesses winter as it is tilted away from the Sun. During perihelion, the Earth lies around 147 million kilometres away from the Sun. |
| 1 | Starc, Healy 1st couple to win top player awards at ICC World Cups | Australian cricketers Alyssa Healy and Mitchell Starc have become the first married pair to win top player awards at ICC World Cups. Alyssa was named player of the tournament in the recently concluded 2018 Women's World T20 for her 225 runs. Meanwhile, Mitchell Starc had won player of the tournament in 2015 men's World Cup for taking 22 wickets. |
| 2 | I still feel he can hit the ball in the stands: Ganguly on Dhoni | Commenting on ex-captain MS Dhoni, who was recently dropped from the T20I squad to give youngsters a chance, ex-captain Sourav Ganguly said that just like everyone else he has to perform. "I wish him all the luck because we want champions to go on a high. I still feel he can hit the ball in the stands," Ganguly added. |
| 3 | UP university student who joined militancy returns home | A Kashmiri student, who went missing from a private university in Uttar Pradesh and joined the militancy, has returned home to Srinagar on Sunday. Immediately after his return, he was taken into police custody. Earlier, an audio had surfaced in which he announced his joining Islamic State in Jammu and Kashmir and pledging allegiance to ISIS chief Abu Bakr al-Baghdadi. |
| 4 | Government to make gold hallmarking mandatory soon: Paswan | Union Minister Ram Vilas Paswan on Thursday said that the government is planning to soon make hallmarking mandatory for gold jewellery sold in the country. The hallmarking of gold is a purity certification of the precious metal. Under the Consumer Affairs Ministry, the Bureau of Indian Standards (BIS) is the administrative authority of hallmarking. |
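
The retry loop in `show_random_elements` exists only to avoid picking the same index twice. As a minimal sketch, the same effect can be had with `random.sample`, which draws without replacement. The helper name `pick_unique_indices` and the `seed` parameter are hypothetical additions for illustration.

```python
import random


def pick_unique_indices(dataset_len, num_examples=5, seed=None):
    """Return `num_examples` distinct indices in the range [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a local generator keeps the global random state untouched
    return rng.sample(range(dataset_len), num_examples)


# 6400 is the size of the train split shown earlier
picks = pick_unique_indices(6400, num_examples=5, seed=0)
print(picks)
```

The resulting list could be passed to `dataset[picks]` in the same way as the original `picks` list.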


We evaluate the generated summaries with ROUGE. The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [11]:
metric = load_metric("rouge")
metric


Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"\n"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu

In [12]:
data['train'][0]

{'headlines': "Sushant to star in Bhandarkar's 'Inspector Ghalib': Reports",
 'text': 'Sushant Singh Rajput will be starring in Madhur Bhandarkar\'s next film based on sand mafias titled \'Inspector Ghalib\', as per reports. The story, which is inspired by real-life events, is based in Uttar Pradesh and the major part of the film will be shot there, reports suggested. "\'Inspector Ghalib\' is the story of a cop," stated reports.'}

In [13]:
model_checkpoint = "google/pegasus-xsum"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


## Preprocessing the data

In [14]:
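max_input_length = 1024
max_target_length = 128
# Task prefixes are a T5 convention; Pegasus checkpoints do not need one.
# The value is kept here because the outputs below were produced with it.
prefix = 'summarize'

def preprocess_function(examples):
    # Prepend the prefix (no separator) to every article, then tokenize and truncate
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the headlines as targets. `text_target` replaces the
    # deprecated `as_target_tokenizer()` context manager.
    labels = tokenizer(text_target=examples["headlines"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs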

In [15]:
preprocess_function(data['train'][:2])



{'input_ids': [[24710, 20159, 59206, 8949, 73524, 138, 129, 11692, 115, 68025, 551, 88689, 10310, 131, 116, 352, 896, 451, 124, 3391, 59803, 116, 6486, 1034, 87460, 38521, 11656, 131, 108, 130, 446, 1574, 107, 139, 584, 108, 162, 117, 2261, 141, 440, 121, 4527, 702, 108, 117, 451, 115, 27652, 12118, 111, 109, 698, 297, 113, 109, 896, 138, 129, 1785, 186, 108, 1574, 3498, 107, 198, 131, 87460, 38521, 11656, 131, 117, 109, 584, 113, 114, 17934, 745, 3163, 1574, 107, 1], [24710, 51208, 2687, 4037, 1271, 86269, 596, 25451, 108, 464, 2901, 114, 437, 140, 3252, 115, 1307, 118, 10431, 120, 49114, 135, 169, 2741, 19556, 25950, 108, 148, 174, 4571, 10168, 107, 596, 25451, 140, 4571, 10168, 124, 114, 510, 4517, 113, 110, 105, 4363, 55778, 198, 19564, 667, 2097, 6242, 635, 109, 2299, 135, 213, 401, 2420, 652, 170, 134, 326, 161, 2741, 131, 116, 49114, 1422, 2755, 112, 7929, 745, 109, 11872, 196, 4620, 333, 114, 481, 8418, 107, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [16]:
tokenized_datasets = data.map(preprocess_function, batched=True)


## Loading the Pretrained Model

In [17]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)


## Setting Up Arguments of Model for Fine Tuning

In [18]:
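batch_size = 4
args = Seq2SeqTrainingArguments(
    "test-summarization",            # output directory for checkpoints
    evaluation_strategy="epoch",     # evaluate at the end of every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,              # keep at most 3 checkpoints on disk
    num_train_epochs=1,
    predict_with_generate=True,      # call generate() during evaluation so ROUGE can be computed
    fp16=True,                       # mixed precision; requires a CUDA GPU
)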

In [19]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
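
Headlines in a batch have different lengths, so the collator pads the `labels` to a common length. By default `DataCollatorForSeq2Seq` pads labels with `-100`, a value the PyTorch cross-entropy loss ignores. The sketch below imitates that padding in plain Python. It is an illustration, not the library's implementation, and the token ids and `pad_token_id=0` are made-up values.

```python
LABEL_PAD_ID = -100  # ignored by PyTorch's cross-entropy loss by default


def pad_labels(label_batch, pad_value=LABEL_PAD_ID):
    """Right-pad every label sequence to the length of the longest one."""
    max_len = max(len(labels) for labels in label_batch)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_batch]


def restore_pad_tokens(padded_batch, pad_token_id):
    """Swap -100 back to a real token id so the sequences can be decoded."""
    return [
        [pad_token_id if token == LABEL_PAD_ID else token for token in labels]
        for labels in padded_batch
    ]


# Two hypothetical tokenized headlines of different lengths
batch = [[5, 6, 1], [7, 8, 9, 10, 1]]

padded = pad_labels(batch)
restored = restore_pad_tokens(padded, pad_token_id=0)  # pad id 0 is an assumption

print(padded)
print(restored)
```

Because `-100` is not a valid vocabulary id, any code that decodes labels back to text has to perform the second step first.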

## Define Compute Metrics

In [20]:
import nltk
import numpy as np
nltk.download('punkt')
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
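
To make the `fmeasure` values returned by `compute_metrics` more concrete, here is a simplified sketch of ROUGE-1 as unigram overlap. It assumes lowercase whitespace tokenization and no stemming, so its numbers will not match the `rouge` metric exactly. The function name `rouge1_f` and the two sentences are hypothetical.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Counter intersection keeps, per token, the smaller of the two counts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)  # share of predicted tokens that match
    recall = overlap / len(ref_tokens)      # share of reference tokens that are covered
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("the cat is on the mat", "the cat sat on the mat")
print(round(score * 100, 4))
```

Five of the six tokens overlap in each direction here, so precision, recall, and the F-measure all equal 5/6. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead.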


## Initialize `Seq2SeqTrainer` to train the model on the custom dataset

In [21]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We can now fine-tune our model by calling the `train` method:

In [22]:
trainer.train()

You're using a PegasusTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 1.6446 | 1.415789 | 55.7423 | 31.8951 | 50.7161 | 50.8015 | 15.3731 |


TrainOutput(global_step=1600, training_loss=1.8189406776428223, metrics={'train_runtime': 1162.2403, 'train_samples_per_second': 5.507, 'train_steps_per_second': 1.377, 'total_flos': 1532804436885504.0, 'train_loss': 1.8189406776428223, 'epoch': 1.0})
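
The step count and throughput figures in `TrainOutput` can be checked by hand. The sketch below assumes a single device and no gradient accumulation, which matches the arguments shown earlier as far as they are visible. All numbers are copied from the outputs above.

```python
import math

num_train_examples = 6400    # rows in the train split
batch_size = 4               # per_device_train_batch_size
num_epochs = 1               # num_train_epochs
train_runtime_s = 1162.2403  # 'train_runtime' reported by the trainer

# One optimizer step per batch (assumes one device, no gradient accumulation)
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = round(num_train_examples * num_epochs / train_runtime_s, 3)
steps_per_second = round(total_steps / train_runtime_s, 3)

print(total_steps, samples_per_second, steps_per_second)
```

These values agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output above.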

## Generate a Summary for a Sample Text

In [23]:
sample_text = "Two years ago, swing and seam proved to be India’s bugbears in the World Test Championship (WTC) final against New Zealand. The four-pronged pace attack of Tim Southee, Trent Boult, Neil Wagner and Kyle Jamieson had left the Indians on the mat. In a couple of days time, the Rohit Sharma-led team will again be up against the likes of Pat Cummins and Mitchell Starc, who nonetheless are a notch better than their trans-Tasman rivals. The age-old manual of how to bat in England is simple. Wait. But the Indians don’t do waiting, all that well. Especially when the ball swings around. The swing breaks their soul and about the time the ball starts to curve away or shape in, Indian batsman start to freeze. The balance starts to go topsy-turvy, the bat follows and there they break the thumb rule of batsmanship: Never play away from your body. It will be familiar territory for the Indian batsmen with the trajectory of the swinging ball winking at them. Coming right after the IPL, the lack of preparation might hinder India’s chance to end their ICC title drought."
encoded_input = tokenizer(sample_text, truncation=True, padding=True, max_length=512, return_tensors="pt")


In [24]:
device = 'cpu'

In [25]:
# Move the model to the same device as the input tensors
model = model.to(device)

In [26]:
input_ids = encoded_input["input_ids"].to(device)
attention_mask = encoded_input["attention_mask"].to(device)

In [27]:
# Generate the summary
output = model.generate(input_ids=input_ids, attention_mask=attention_mask)
generated_summary = tokenizer.decode(output[0], skip_special_tokens=True)




In [28]:
generated_summary

'Swing and seam: India to face Australia in ICC World Cup final'