# Summarization Using the Pegasus Model


In [2]:
!pip install git+https://github.com/huggingface/accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/accelerate
  Cloning https://github.com/huggingface/accelerate to /tmp/pip-req-build-5eay5df3
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate /tmp/pip-req-build-5eay5df3
  Resolved https://github.com/huggingface/accelerate to commit 62357f218f72cce88b8e086cc372b15c119b590b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Import Necessary Libraries

In [3]:
! pip install -q datasets transformers rouge-score nltk

In [4]:
import numpy as np
import pandas as pd
import torch
from datasets import Dataset, DatasetDict, load_dataset, load_metric
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Release cached GPU memory left over from earlier runs
torch.cuda.empty_cache()

## Loading the dataset

Dataset: https://www.kaggle.com/datasets/sunnysai12345/news-summary <br>
Use the `news_summary_more.csv` file from this dataset.

In [5]:
data = pd.read_csv('/content/sample_data/news_summary_more.csv',nrows=10000)

In [6]:
# Split the data into train and test sets
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

# Split the train set further into train and validation sets
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)
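
The two successive 80/20 splits determine how many rows land in each subset. The arithmetic can be sketched as follows; `split_sizes` is a hypothetical helper, and it assumes the held-out share is rounded up with the remainder going to training.

```python
import math

def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_kept, n_held_out) for a fractional test_size."""
    # Assumption: the held-out share is rounded up, the rest is kept.
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

n_total = 10_000                                   # nrows passed to read_csv
n_train_full, n_test = split_sizes(n_total, 0.2)   # first split: train+val vs. test
n_train, n_val = split_sizes(n_train_full, 0.2)    # second split: train vs. validation
print(n_train, n_val, n_test)                      # 6400 1600 2000
```

These sizes correspond to 64%, 16% and 20% of the loaded rows.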

In [7]:
# Create a Dataset object for each split
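# from_pandas is the constructor intended for DataFrames; drop the pandas index column
train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
val_dataset = Dataset.from_pandas(val_df, preserve_index=False)
test_dataset = Dataset.from_pandas(test_df, preserve_index=False)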

Combine the splits into a `DatasetDict`. To access an individual example later, select a split first and then give an index, as in `data['train'][0]`.

In [8]:
# Create a DatasetDict object with the splits
data = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})

In [9]:
data

DatasetDict({
    train: Dataset({
        features: ['headlines', 'text'],
        num_rows: 6400
    })
    validation: Dataset({
        features: ['headlines', 'text'],
        num_rows: 1600
    })
    test: Dataset({
        features: ['headlines', 'text'],
        num_rows: 2000
    })
})

### To get a sense of what the data looks like, the following function shows a few randomly picked examples from the dataset.

In [10]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
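
The retry loop in `show_random_elements` draws indices until it has `num_examples` distinct ones. A shorter way to get the same effect, sketched here with only the standard library, is `random.sample`, which samples without replacement. The helper name `pick_distinct_indices` and the `seed` parameter are illustrative and not part of the notebook.

```python
import random

def pick_distinct_indices(n_items: int, num_examples: int = 5, seed=None) -> list[int]:
    """Pick num_examples distinct row indices from range(n_items)."""
    if num_examples > n_items:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a seed makes the selection reproducible
    # sample() draws without replacement, so no duplicate check is needed
    return rng.sample(range(n_items), num_examples)

picks = pick_distinct_indices(6400, num_examples=5, seed=0)
print(picks)
```

The resulting list could be passed to `dataset[picks]` exactly as in the function above.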

In [11]:
show_random_elements(data["train"])

| | headlines | text |
| --- | --- | --- |
| 0 | Over 5,000 people killed in Philippine President's drug war | The death toll in the Philippines due to the war on drugs initiated by President Rodrigo Duterte has risen above 5,000. The authorities said that at least 5,050 people have lost their lives since the drug war began after Duterte became the President in 2016. Duterte is being investigated for allegedly committing crimes against humanity in the anti-drugs war. |
| 1 | I will break your head: AIUDF chief threatens journalist | All India United Democratic Front (AIUDF) chief Badruddin Ajmal on Wednesday hurled abuses at a journalist and threatened to break his head upon being asked if he will ally with Congress or BJP in future. "Go dogs, for how much money have you been bought by BJP?" Ajmal said. He also grabbed a mike and tried to hit the journalist. |
| 2 | Social media a brutal place: Bigg Boss 11 winner Shilpa quits Twitter | Television reality show Bigg Boss 11's winner Shilpa Shinde deleted her Twitter account and said, "Social media is a brutal place. My fans are extremely possessive about me." "When there are negative comments about me or people troll me, my feed's flooded by fan messages," she said. "I'm least bothered by haters...but my fans go reckless about it," she added. |
| 3 | Priyanka's appointment is Congress admitting Rahul's failure: BJP | After Priyanka Gandhi Vadra was appointed as Congress General Secretary for Uttar Pradesh East, BJP's Sambit Patra said, "Congress has basically publicly announced that Rahul Gandhi has failed and needs crutches from within the family." "All appointments are from one family. And this is the fundamental difference...In Congress, the family is party. In BJP, the party is family," Patra added. |
| 4 | Shashi Tharoor introduces bill to regulate online gaming | Congress leader Shashi Tharoor introduced a private member's bill in the Lok Sabha that seeks to regulate online gaming. The bill aims to legitimise online games of skill and allow online gaming websites to apply for licenses to earn revenue. Tharoor said a regulatory framework is required to check the flow of black money and curb related illegal activities. |


Summaries are scored with ROUGE, loaded through `load_metric`. The returned object is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [12]:
metric = load_metric("rouge")
metric


Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"\n"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu
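
To make the `rouge{n}` description above concrete, here is a minimal sketch of a ROUGE-1 style score computed by hand. It is a simplification and not the metric's actual implementation: it assumes lowercased whitespace tokens and skips the stemming and aggregation options listed in the usage text. The function name `rouge1` is illustrative.

```python
from collections import Counter

def rouge1(prediction: str, reference: str) -> dict[str, float]:
    """Unigram-overlap precision, recall and F-measure (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        fmeasure = 0.0
    else:
        fmeasure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}

scores = rouge1("the cat sat on the mat", "the cat is on the mat")
print(round(scores["fmeasure"], 4))  # 0.8333
```

Five of the six words overlap in both directions, so precision, recall and F-measure all equal 5/6.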

In [13]:
data['train'][0]

{'headlines': "Sushant to star in Bhandarkar's 'Inspector Ghalib': Reports",
 'text': 'Sushant Singh Rajput will be starring in Madhur Bhandarkar\'s next film based on sand mafias titled \'Inspector Ghalib\', as per reports. The story, which is inspired by real-life events, is based in Uttar Pradesh and the major part of the film will be shot there, reports suggested. "\'Inspector Ghalib\' is the story of a cop," stated reports.'}

In [14]:
model_checkpoint = "google/bigbird-pegasus-large-arxiv"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

## Preprocessing the data

In [15]:
max_input_length = 1024
max_target_length = 128
# T5-style task prefix; note that it is concatenated without a separator.
prefix = 'summarize'

def preprocess_function(examples):
    # Prepend the prefix to every article and tokenize, truncating long inputs
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the headlines as targets (text_target replaces the deprecated
    # as_target_tokenizer context manager)
    labels = tokenizer(text_target=examples["headlines"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [16]:
preprocess_function(data['train'][:3])



{'input_ids': [[24710, 20159, 59206, 8949, 73524, 138, 129, 11692, 115, 68025, 551, 88689, 10310, 131, 116, 352, 896, 451, 124, 3391, 59803, 116, 6486, 1034, 87460, 38521, 11656, 131, 108, 130, 446, 1574, 107, 139, 584, 108, 162, 117, 2261, 141, 440, 121, 4527, 702, 108, 117, 451, 115, 27652, 12118, 111, 109, 698, 297, 113, 109, 896, 138, 129, 1785, 186, 108, 1574, 3498, 107, 198, 131, 87460, 38521, 11656, 131, 117, 109, 584, 113, 114, 17934, 745, 3163, 1574, 107, 1], [24710, 51208, 2687, 4037, 1271, 86269, 596, 25451, 108, 464, 2901, 114, 437, 140, 3252, 115, 1307, 118, 10431, 120, 49114, 135, 169, 2741, 19556, 25950, 108, 148, 174, 4571, 10168, 107, 596, 25451, 140, 4571, 10168, 124, 114, 510, 4517, 113, 110, 105, 4363, 55778, 198, 19564, 667, 2097, 6242, 635, 109, 2299, 135, 213, 401, 2420, 652, 170, 134, 326, 161, 2741, 131, 116, 49114, 1422, 2755, 112, 7929, 745, 109, 11872, 196, 4620, 333, 114, 481, 8418, 107, 1], [24710, 15450, 1284, 12649, 8991, 11420, 1973, 429, 124, 1408, 257
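
The shape of what `preprocess_function` returns can be illustrated without downloading a tokenizer. The sketch below swaps in a hypothetical whitespace tokenizer (`toy_tokenize`) that returns words instead of integer IDs; only the batching, truncation and `labels` structure mirror the real function.

```python
def toy_tokenize(text: str, max_length: int) -> list[str]:
    """Hypothetical stand-in for the real tokenizer: whitespace split plus truncation."""
    return text.split()[:max_length]

def toy_preprocess(examples: dict, max_input_length: int = 8, max_target_length: int = 4) -> dict:
    # One token list per article, truncated to max_input_length
    model_inputs = {
        "input_ids": [toy_tokenize(doc, max_input_length) for doc in examples["text"]]
    }
    # One token list per headline, stored under "labels"
    model_inputs["labels"] = [
        toy_tokenize(headline, max_target_length) for headline in examples["headlines"]
    ]
    return model_inputs

batch = {"text": ["a b c d e f g h i j"], "headlines": ["x y z"]}
out = toy_preprocess(batch)
print(out["input_ids"][0])  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
print(out["labels"][0])     # ['x', 'y', 'z']
```

The ten-word article is cut to eight tokens, while the three-word headline fits within its limit and is kept whole.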

In [17]:
tokenized_datasets = data.map(preprocess_function, batched=True)

## Importing Pretrained Model and Tokenizer

In [18]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

## Setting Up Arguments of Model for Fine Tuning

In [19]:
batch_size = 5
args = Seq2SeqTrainingArguments(
    "test-summarization",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
)
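
With these arguments the length of the run can be estimated ahead of time. The sketch below assumes a single device and no gradient accumulation (neither is set explicitly in the notebook) and uses the 6400-row training split from earlier.

```python
import math

n_train_rows = 6400            # size of the training split
per_device_batch_size = 5      # batch_size above
n_devices = 1                  # assumption: one GPU
gradient_accumulation = 1      # assumption: not overridden in the arguments
num_train_epochs = 1

# Rows consumed per optimizer step
effective_batch = per_device_batch_size * n_devices * gradient_accumulation
# A final partial batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(n_train_rows / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
print(total_steps)             # 1280
```

This agrees with the `global_step=1280` reported in the training output.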

In [20]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
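
The collator pads each batch only up to its longest member rather than to a fixed length. The plain-Python sketch below shows that idea; it is not the library's implementation. It assumes right padding, a pad token ID of 0, and `-100` as the label padding value so that padded label positions are ignored by the loss. `pad_batch` is an illustrative name.

```python
def pad_batch(features: list[dict], pad_token_id: int = 0, label_pad_id: int = -100) -> dict:
    """Pad input_ids and labels to the longest sequence in this batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_label - len(f["labels"])))
    return batch

features = [
    {"input_ids": [11, 12, 13, 1], "labels": [21, 1]},
    {"input_ids": [14, 1], "labels": [22, 23, 24, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # [[11, 12, 13, 1], [14, 1, 0, 0]]
print(batch["labels"])     # [[21, 1, -100, -100], [22, 23, 24, 1]]
```

Padding per batch keeps short headline-sized examples from being stretched to the 1024-token input limit.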

## Define Compute Metrics

In [21]:
import nltk
import numpy as np
nltk.download('punkt')
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
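
Two steps in `compute_metrics` are easy to misread: swapping `-100` back to the pad ID before decoding, and measuring generated length by counting non-pad tokens. Here is a plain-Python equivalent on toy data, assuming a pad ID of 0; the helper names are illustrative.

```python
def restore_pad_ids(label_rows: list[list[int]], pad_token_id: int, ignore_index: int = -100) -> list[list[int]]:
    """Replace the loss-masking value with the pad ID so the rows can be decoded."""
    return [
        [pad_token_id if tok == ignore_index else tok for tok in row]
        for row in label_rows
    ]

def mean_generated_length(prediction_rows: list[list[int]], pad_token_id: int) -> float:
    """Average number of non-pad tokens per prediction."""
    lengths = [sum(tok != pad_token_id for tok in row) for row in prediction_rows]
    return sum(lengths) / len(lengths)

PAD = 0  # assumption: pad token ID used for this toy example
labels = [[5, 6, 1, -100, -100], [7, 1, -100, -100, -100]]
predictions = [[0, 5, 6, 1, 0], [0, 7, 1, 0, 0]]

print(restore_pad_ids(labels, PAD))             # [[5, 6, 1, 0, 0], [7, 1, 0, 0, 0]]
print(mean_generated_length(predictions, PAD))  # 2.5
```

Decoding with `skip_special_tokens=True` then drops the restored pad tokens from the reference text.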


## Initialize `Seq2SeqTrainer` to Train the Model on the Custom Dataset

In [22]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We can now fine-tune the model by calling the `train` method:

In [23]:
trainer.train()

You're using a PegasusTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Attention type 'block_sparse' is not possible if sequence_length: 83 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 5.3837 | 5.001834 | 31.4017 | 11.5746 | 28.0482 | 28.0452 | 15.2106 |


TrainOutput(global_step=1280, training_loss=5.714573097229004, metrics={'train_runtime': 971.7115, 'train_samples_per_second': 6.586, 'train_steps_per_second': 1.317, 'total_flos': 1547769812705280.0, 'train_loss': 5.714573097229004, 'epoch': 1.0})
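
The attention warning in the training log spells out a minimum sequence length for block-sparse attention. Reproducing that arithmetic with the values quoted in the message shows why an 83-token batch falls back to full attention; the variable names here are illustrative.

```python
block_size = 64          # config.block_size from the warning
num_random_blocks = 3    # config.num_random_blocks from the warning

global_tokens = 2 * block_size                    # num global tokens
sliding_tokens = 3 * block_size                   # min. num sliding tokens
random_tokens = num_random_blocks * block_size    # random attention blocks
buffer_tokens = num_random_blocks * block_size    # additional buffer

threshold = global_tokens + sliding_tokens + random_tokens + buffer_tokens
print(threshold)                                  # 704

sequence_length = 83                              # length reported in the warning
uses_block_sparse = sequence_length > threshold
print(uses_block_sparse)                          # False
```

Since the news articles here are short, batches of this length run with `original_full` attention.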

## Generate a Summary for a Sample Text

In [24]:
sample_text = "Amazon-owned video platform Twitch streamer 'JesseDStreams' fell asleep for about three hours while live streaming on the platform and woke up to over 200 viewers and multiple money donations. He was filming in the platform's 'Just Chatting' category and fell asleep a few hours into streaming. A clip from the video has since generated over 2 million views on Twitch."
encoded_input = tokenizer(sample_text, truncation=True, padding=True, max_length=512, return_tensors="pt")


In [25]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [26]:
# Move the model to the same device as the input tensors
model = model.to(device)

In [27]:
input_ids = encoded_input["input_ids"].to(device)
attention_mask = encoded_input["attention_mask"].to(device)

In [28]:
# Generate the summary
output = model.generate(input_ids=input_ids, attention_mask=attention_mask)
generated_summary = tokenizer.decode(output[0], skip_special_tokens=True)




In [29]:
generated_summary

'years ago, swing and seam proved to be bugbears in the final against.'