# Summarization Using T5 Model


In [2]:
!pip install git+https://github.com/huggingface/accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/accelerate
  Cloning https://github.com/huggingface/accelerate to /tmp/pip-req-build-5eay5df3
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate /tmp/pip-req-build-5eay5df3
  Resolved https://github.com/huggingface/accelerate to commit 62357f218f72cce88b8e086cc372b15c119b590b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [32]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 17.2 MB/s eta 0:00:00
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


## Import Necessary Libraries

In [3]:
! pip install -q datasets transformers rouge-score nltk

In [34]:
import numpy as np
import pandas as pd
import torch
from datasets import Dataset, DatasetDict, load_dataset, load_metric
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Release cached GPU memory left over from earlier runs in this session
torch.cuda.empty_cache()

## Loading the dataset

Dataset: https://www.kaggle.com/datasets/sunnysai12345/news-summary <br>
Use the `news_summary_more.csv` file from this dataset.

In [5]:
# Load only the first 10,000 rows of the file
data = pd.read_csv('/content/sample_data/news_summary_more.csv', nrows=10000)

In [6]:
# Split the data into train and test sets
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

# Split the train set further into train and validation sets
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)
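
As a quick sanity check on the two-stage split, the resulting sizes can be worked out by hand. The sketch below assumes that `train_test_split` rounds the test share up and gives the remainder to the training set; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
def split_sizes(n_rows, test_percent=20):
    """Return (n_train, n_test), rounding the test share up (assumed behaviour)."""
    n_test = -(-n_rows * test_percent // 100)  # integer ceiling division
    return n_rows - n_test, n_test

n_rows = 10000                               # rows read from the CSV
n_train_val, n_test = split_sizes(n_rows)    # first split: train+validation vs. test
n_train, n_val = split_sizes(n_train_val)    # second split: train vs. validation

print(n_train, n_val, n_test)
```

With 10,000 rows this works out to a 64/16/20 split across train, validation, and test.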

In [7]:
# Create a Dataset object for each split
train_dataset = Dataset.from_dict(train_df)
val_dataset = Dataset.from_dict(val_df)
test_dataset = Dataset.from_dict(test_df)

Combine the three splits into a single `DatasetDict`. To access an actual element afterwards, select a split first, then give an index (for example, `data['train'][0]`):

In [8]:
# Create a DatasetDict object with the splits
data = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})

In [9]:
data

DatasetDict({
    train: Dataset({
        features: ['headlines', 'text'],
        num_rows: 6400
    })
    validation: Dataset({
        features: ['headlines', 'text'],
        num_rows: 1600
    })
    test: Dataset({
        features: ['headlines', 'text'],
        num_rows: 2000
    })
})

### To get a sense of what the data looks like, the following function shows a few examples picked at random from the dataset.

In [10]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
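
The loop that keeps redrawing until it has enough distinct indices can also be written with the standard library's `random.sample`, which samples without replacement. A minimal sketch, where `pick_unique_indices` is a hypothetical helper name and the dataset length is passed in as a plain integer:

```python
import random

def pick_unique_indices(dataset_len, num_examples=5):
    """Draw `num_examples` distinct row indices from range(dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    return random.sample(range(dataset_len), num_examples)

picks = pick_unique_indices(6400, num_examples=5)
print(picks)
```

The returned list could then be used in place of `picks` when indexing the dataset.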

In [11]:
show_random_elements(data["train"])

| | headlines | text |
|---|---|---|
| 0 | Over 5,000 people killed in Philippine President's drug war | The death toll in the Philippines due to the war on drugs initiated by President Rodrigo Duterte has risen above 5,000. The authorities said that at least 5,050 people have lost their lives since the drug war began after Duterte became the President in 2016. Duterte is being investigated for allegedly committing crimes against humanity in the anti-drugs war. |
| 1 | I will break your head: AIUDF chief threatens journalist | All India United Democratic Front (AIUDF) chief Badruddin Ajmal on Wednesday hurled abuses at a journalist and threatened to break his head upon being asked if he will ally with Congress or BJP in future. "Go dogs, for how much money have you been bought by BJP?" Ajmal said. He also grabbed a mike and tried to hit the journalist. |
| 2 | Social media a brutal place: Bigg Boss 11 winner Shilpa quits Twitter | Television reality show Bigg Boss 11's winner Shilpa Shinde deleted her Twitter account and said, "Social media is a brutal place. My fans are extremely possessive about me." "When there are negative comments about me or people troll me, my feed's flooded by fan messages," she said. "I'm least bothered by haters...but my fans go reckless about it," she added. |
| 3 | Priyanka's appointment is Congress admitting Rahul's failure: BJP | After Priyanka Gandhi Vadra was appointed as Congress General Secretary for Uttar Pradesh East, BJP's Sambit Patra said, "Congress has basically publicly announced that Rahul Gandhi has failed and needs crutches from within the family." "All appointments are from one family. And this is the fundamental difference...In Congress, the family is party. In BJP, the party is family," Patra added. |
| 4 | Shashi Tharoor introduces bill to regulate online gaming | Congress leader Shashi Tharoor introduced a private member's bill in the Lok Sabha that seeks to regulate online gaming. The bill aims to legitimise online games of skill and allow online gaming websites to apply for licenses to earn revenue. Tharoor said a regulatory framework is required to check the flow of black money and curb related illegal activities. |


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [12]:
metric = load_metric("rouge")
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"\n"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu
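
To make the scores easier to interpret, here is a simplified sketch of what ROUGE-1 measures: unigram overlap between a prediction and a reference, reported as precision, recall, and F-measure. It assumes whitespace tokenization and lowercasing only, and skips the stemming and aggregation options listed above, so its numbers will not match the library's exactly. `rouge1_sketch` is a hypothetical helper.

```python
from collections import Counter

def rouge1_sketch(prediction, reference):
    """Simplified ROUGE-1: clipped unigram overlap -> (precision, recall, f1)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = rouge1_sketch("the cat sat", "the cat sat on the mat")
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 1.0 0.5 0.6667
```

Every predicted word appears in the reference (precision 1.0), but only half of the reference words are covered (recall 0.5), and the F-measure balances the two.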

In [13]:
data['train'][0]

{'headlines': "Sushant to star in Bhandarkar's 'Inspector Ghalib': Reports",
 'text': 'Sushant Singh Rajput will be starring in Madhur Bhandarkar\'s next film based on sand mafias titled \'Inspector Ghalib\', as per reports. The story, which is inspired by real-life events, is based in Uttar Pradesh and the major part of the film will be shot there, reports suggested. "\'Inspector Ghalib\' is the story of a cop," stated reports.'}

In [35]:
model_name = "JulesBelveze/t5-small-headline-generator"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

## Preprocessing the data

In [36]:
max_input_length = 1024
max_target_length = 128
# No separator follows the prefix, so it is joined directly to the first word of each article
prefix = 'summarize'

def preprocess_function(examples):
    # Prepend the task prefix to each article and tokenize the result
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the headlines, which serve as the target summaries
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["headlines"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
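
One detail worth checking is how the prefix joins the article text. The original T5 checkpoints were trained with the task prefix `summarize: `, including the colon and trailing space, whereas the cell above concatenates `summarize` directly onto the first word. Whether the colon form matters for this particular fine-tuned checkpoint is not established here; the sketch below only shows the difference in the resulting strings. The shortened article text is illustrative.

```python
docs = ["Sushant Singh Rajput will be starring in Madhur Bhandarkar's next film."]

fused = ["summarize" + doc for doc in docs]        # what the cell above produces
separated = ["summarize: " + doc for doc in docs]  # conventional T5 form

print(fused[0][:20])
print(separated[0][:20])
```

If the prefix is changed, the same prefix should also be prepended to any text passed to the model at inference time.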

In [37]:
preprocess_function(data['train'][:3])



{'input_ids': [[21603, 134, 8489, 288, 16738, 13509, 2562, 56, 36, 3, 22236, 16, 5428, 10666, 272, 2894, 291, 4031, 31, 7, 416, 814, 3, 390, 30, 3, 7, 232, 954, 89, 23, 9, 7, 3, 10920, 3, 31, 1570, 5628, 127, 350, 1024, 6856, 31, 6, 38, 399, 2279, 5, 37, 733, 6, 84, 19, 3555, 57, 490, 18, 4597, 984, 6, 19, 3, 390, 16, 31251, 22660, 11, 8, 779, 294, 13, 8, 814, 56, 36, 2538, 132, 6, 2279, 5259, 5, 96, 31, 1570, 5628, 127, 350, 1024, 6856, 31, 19, 8, 733, 13, 3, 9, 7326, 976, 4568, 2279, 5, 1], [21603, 13035, 343, 3084, 115, 107, 17815, 16332, 1599, 6, 581, 4068, 3, 9, 495, 47, 5132, 16, 1718, 21, 3, 17211, 24, 24084, 15, 7, 45, 112, 3797, 3, 26867, 16, 1010, 17, 14277, 6, 65, 118, 7020, 15794, 5, 16332, 1599, 47, 7020, 15794, 30, 3, 9, 525, 6235, 13, 3, 1439, 2, 15660, 20202, 96, 20829, 861, 924, 11992, 808, 8, 2728, 45, 140, 233, 19055, 887, 113, 3, 342, 82, 3797, 31, 7, 24084, 15, 7, 1891, 3879, 12, 520, 7, 976, 8, 20622, 141, 7760, 383, 3, 9, 452, 13980, 5, 1], [21603, 427, 291, 120,

In [38]:
tokenized_datasets = data.map(preprocess_function, batched=True)

Map:   0%|          | 0/6400 [00:00<?, ? examples/s]

Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

## Importing Training Utilities and Reloading the Pretrained Model

In [46]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

## Setting Up Training Arguments for Fine-Tuning

In [47]:
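batch_size = 5
args = Seq2SeqTrainingArguments(
    "test-summarization",            # output directory for checkpoints
    evaluation_strategy="epoch",     # evaluate on the validation split after each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,              # keep at most three checkpoints on disk
    num_train_epochs=1,
    predict_with_generate=True,      # generate text during evaluation so ROUGE can be computed
    fp16=True,                       # mixed-precision training
)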

In [48]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
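
Because no padding was applied during tokenization, the collator pads each batch to its longest sequence at training time. For labels, `DataCollatorForSeq2Seq` pads with `-100` by default so that padded positions are ignored by the loss. The sketch below imitates only that label-padding step on plain lists; `pad_labels` is a hypothetical helper and the token ids are made up.

```python
def pad_labels(batch_labels, pad_value=-100):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(labels) for labels in batch_labels)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in batch_labels]

batch = [[101, 7, 1], [55, 1], [9, 9, 9, 9, 1]]  # made-up token ids
padded = pad_labels(batch)
for row in padded:
    print(row)
```

This is why `compute_metrics` has to swap `-100` back to a real token id before decoding the labels.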

## Define Compute Metrics

In [49]:
import nltk
import numpy as np
nltk.download('punkt')
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Keep the mid F-measure of each ROUGE type, scaled to a percentage
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
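
Two steps in `compute_metrics` are easy to misread: restoring the `-100` placeholders to the pad token before decoding, and counting non-pad tokens to get the generated length. The sketch below reproduces both on a tiny made-up batch, using plain lists instead of NumPy arrays. The pad id of `0` is an assumption for illustration; in the notebook it comes from `tokenizer.pad_token_id`.

```python
PAD_ID = 0  # assumed pad token id, for illustration only

labels = [[37, 12, 1, -100, -100], [8, 1, -100, -100, -100]]
predictions = [[0, 37, 12, 1, 0], [0, 8, 9, 4, 1]]

# List-based equivalent of np.where(labels != -100, labels, pad_id)
restored = [[tok if tok != -100 else PAD_ID for tok in row] for row in labels]

# Generated length = number of non-pad tokens in each prediction
prediction_lens = [sum(tok != PAD_ID for tok in row) for row in predictions]
gen_len = sum(prediction_lens) / len(prediction_lens)

print(restored)
print(prediction_lens, gen_len)
```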


## Initialize `Seq2SeqTrainer` for training model on custom dataset:

In [50]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We can now fine-tune our model by just calling the `train` method:

In [51]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 2.1108 | 1.820104 | 47.3598 | 24.1236 | 41.8825 | 41.8988 | 16.415 |


TrainOutput(global_step=1280, training_loss=2.1778551578521728, metrics={'train_runtime': 263.1849, 'train_samples_per_second': 24.318, 'train_steps_per_second': 4.864, 'total_flos': 179325243555840.0, 'train_loss': 2.1778551578521728, 'epoch': 1.0})
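
The numbers in `TrainOutput` can be cross-checked against the configuration. Assuming a single device and no gradient accumulation (neither is set explicitly above), the step count is the number of training rows divided by the batch size, rounded up, times the number of epochs.

```python
import math

train_rows = 6400         # size of the training split
batch_size = 5            # per_device_train_batch_size
num_epochs = 1
train_runtime = 263.1849  # seconds, from TrainOutput

steps = math.ceil(train_rows / batch_size) * num_epochs
samples_per_second = train_rows * num_epochs / train_runtime
steps_per_second = steps / train_runtime

print(steps, round(samples_per_second, 3), round(steps_per_second, 3))
```

Under those assumptions the computed step count equals the reported `global_step`, and the throughput figures agree with `train_samples_per_second` and `train_steps_per_second` to three decimals.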

## Generate a Summary for a Sample Text

In [65]:
sample_text = """Delhi Capitals’ head coach Ricky Ponting during a press conference in Delhi on Friday. | Photo Credit: PTI

Ricky Ponting knows a thing or two about cricket and spotlight and how together, the two can either be a recipe for unprecedented success or unmitigated disaster, depending on how one handles them.

In India, in particular, the pressure to manage both is a lot more than anywhere else and the IPL is at the pinnacle of fan attention. “Well it is a lot different in our country than it is here. The big thing about the IPL is seeing so many younger players getting an opportunity that they are not ready for. And I don’t mean the sport per se. They are ready for the cricket side of it but there are a lot of guys not ready, yet, for the many other things that come with cricket. There wasn’t as much spotlight on me back as a young player as on some of the young Indian players today,” Ponting admitted."""
# Note: no task prefix is prepended here, unlike in preprocess_function
encoded_input = tokenizer(sample_text, truncation=True, padding=True, max_length=512, return_tensors="pt")


In [66]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [67]:
# Move the model to the same device as the input tensors
model = model.to(device)

In [68]:
input_ids = encoded_input["input_ids"].to(device)
attention_mask = encoded_input["attention_mask"].to(device)

In [69]:
# Generate the summary
output = model.generate(input_ids=input_ids, attention_mask=attention_mask)
generated_summary = tokenizer.decode(output[0], skip_special_tokens=True)


In [70]:
generated_summary

'IPL is a big thing about cricket and spotlight, says PTI Ricky Ponting'