# FINE TUNING OF GOOGLE'S PEGASUS LARGE MODEL FOR NEWS SUMMARIZATION

## Introduction
The goal of this notebook is to generate coherent summaries from news articles.
To perform this task, we started from a model known as PEGASUS and developed by Google.
Quoting Google's research blog on PEGASUS (you can read more [here](https://blog.research.google/2020/06/pegasus-state-of-art-model-for.html)):
> «In PEGASUS pre-training, some whole sentences are removed from the documents and the model is tasked with retrieving them. An example input for pre-training is a document with missing sentences, while the output consists of the missing sentences concatenated together. [...] A challenging task like that encourages the model to learn about language and general facts about the world, as well as how to distill information taken from throughout a document in order to generate output that closely resembles the fine-tuning summarization task.»

In other words, PEGASUS is not pre-trained on summarization itself but on a closely related task (generating the sentences that were removed from a document), which already gives the model reasonable summarization ability before any fine-tuning.
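To make the pre-training objective described in the quote more concrete, here is a minimal toy sketch of how one (input, target) pair could be built. It is only an illustration: the sentence splitter is naive, the sentences to remove are picked by hand rather than by the selection strategy used in the real pre-training, and the `MASK` string and the helper name `make_gap_sentence_example` are hypothetical choices made for this example.

```python
import re

MASK = "<mask_1>"  # placeholder string, chosen for illustration only


def make_gap_sentence_example(document, masked_indices):
    """Build a toy (input, target) pair by removing whole sentences."""
    # Naive split: whitespace that follows '.', '!' or '?' (an assumption).
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    masked = set(masked_indices)
    # Input: the document with each removed sentence replaced by MASK.
    source = " ".join(MASK if i in masked else s for i, s in enumerate(sentences))
    # Target: the removed sentences concatenated together.
    target = " ".join(s for i, s in enumerate(sentences) if i in masked)
    return source, target


doc = "Pegasus is a model. It was built for summarization. It masks whole sentences."
source, target = make_gap_sentence_example(doc, masked_indices=[1])
print(source)  # Pegasus is a model. <mask_1> It masks whole sentences.
print(target)  # It was built for summarization.
```

The target looks much like a short summary of the input, which is why this objective transfers well to summarization.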

In our tests with the large model, however, its summaries were often not accurate enough for this specific task. In some cases the summary missed the core of the article, and the model did not adapt well to short outputs: if the requested summary length is "too short" for the model, it often truncates sentences.

So we decided to fine-tune the model on a very small part of the well-known CNN/DailyMail dataset ([more information here](https://huggingface.co/datasets/cnn_dailymail)).

The next notebook cells will show the implementation.


## Install and import libraries
The following libraries are required; the cell below installs and imports them.
*   Standard libraries such as Pandas and NumPy
*   Transformers
*   PyTorch
*   Datasets
*   Rouge
*   SentencePiece (needed by the PEGASUS tokenizer)



In [1]:
!pip install datasets transformers transformers[torch] sentencepiece rouge
import pandas as pd
import numpy as np
from transformers import PegasusForConditionalGeneration, PegasusTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
import torch
from rouge import Rouge

Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting transformers
  Downloading transformers-4.35.1-py3-none-any.whl (7.9 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Down

## The CNN/DailyMail dataset and the CustomDataset Class
Here the dataset is loaded and its structure is shown.
The dataset is already split into training, validation, and test sets.
We are interested in:
* The "article" column, containing the full article
* The "highlights" column, containing the summary, which will act as the label

In [2]:
dataset = load_dataset("cnn_dailymail", "3.0.0")
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

### A CustomDataset class is created that will represent the Dataset object containing encodings (articles) and decodings (labels)


In [3]:
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels['input_ids'][idx])
        return item
    def __len__(self):
        return len(self.labels['input_ids'])

## Pre-trained model and tokenizer
**Now, we can import the original model and its corresponding tokenizer.**

In [4]:
model_name = 'google/pegasus-large'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
tokenizer = PegasusTokenizer.from_pretrained(model_name)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



## Tokenization and dataset creation
We create a function to tokenize the dataset. We use the PEGASUS tokenizer to tokenize the articles and their summaries, with a max_length of 1024 tokens for the article (input) and 128 for the summary (output). Longer sequences are truncated, and shorter ones are padded to the longest sequence in the set. Then, we can use the CustomDataset class to create the final dataset.
As mentioned in the introduction, we will not use the entire dataset but only a very small portion of it: the first 2000 training samples and the first 100 validation samples. According to the Google research blog article, as few as 1000 samples can be sufficient to achieve good results when fine-tuning for summarization.


In [5]:
def tokenize_data(texts, labels):
  encodings = tokenizer(texts, max_length=1024, truncation=True, padding=True)
  decodings = tokenizer(labels, max_length=128, truncation=True, padding=True)
  dataset_tokenized = CustomDataset(encodings, decodings)
  return dataset_tokenized
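
Note that `tokenize_data` pads the label sequences as well, so padding positions may contribute to the training loss. A common variant, which is not what this notebook does and was not used for the results reported below, replaces the padding IDs in the labels with -100, the value that PyTorch's cross-entropy loss ignores by default. A minimal pure-Python sketch of that masking step follows; the helper name `mask_padding` is hypothetical, and `pad_token_id=0` is an assumption for the toy data (in practice it would be read from `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(label_ids, pad_token_id):
    """Replace padding IDs in every label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


# Toy padded label batch; the token IDs are made up for illustration.
padded_labels = [[52, 17, 1, 0, 0], [8, 9, 10, 11, 1]]
masked_labels = mask_padding(padded_labels, pad_token_id=0)
print(masked_labels)  # [[52, 17, 1, -100, -100], [8, 9, 10, 11, 1]]
```

If this variant were adopted, the masking would be applied to `decodings['input_ids']` before building the `CustomDataset`.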

In [6]:
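# Slice the rows first so that only the selected samples are materialized as Python lists
train_subset = dataset['train'][:2000]
valid_subset = dataset['validation'][:100]
train_texts, train_labels = train_subset['article'], train_subset['highlights']
valid_texts, valid_labels = valid_subset['article'], valid_subset['highlights']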

train_dataset = tokenize_data(train_texts, train_labels)
val_dataset = tokenize_data(valid_texts, valid_labels)

## Finetune the model
Now we can configure the fine-tuning process through the Trainer API. We tried several training runs with different parameters. Here we report one configuration that takes little time to run on a GPU (about 30 minutes in the run below), is easily reproducible, and still gives good results.

In [7]:
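freeze_encoder = False  # set to True to keep the encoder weights fixed during training

if freeze_encoder:
    # Disable gradient updates for every encoder parameter
    for param in model.model.encoder.parameters():
        param.requires_grad = False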

training_args = TrainingArguments(
    output_dir="./results",          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=1,   # batch size per device during training, can increase if memory allows
    per_device_eval_batch_size=1,    # batch size for evaluation, can increase if memory allows
    save_steps=1000,                 # number of update steps between checkpoint saves
    save_total_limit=5,              # maximum number of checkpoints to keep; older ones are deleted
    evaluation_strategy='steps',     # evaluation strategy to adopt during training
    eval_steps=1000,                 # number of update steps between evaluations
    warmup_steps=300,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    tokenizer=tokenizer,
)

In [8]:
trainer.train()

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 1000 | 1.4359 | 1.043615 |
| 2000 | 1.1763 | 1.022425 |


TrainOutput(global_step=2000, training_loss=2.3039921875, metrics={'train_runtime': 1833.1182, 'train_samples_per_second': 1.091, 'train_steps_per_second': 1.091, 'total_flos': 5778928828416000.0, 'train_loss': 2.3039921875, 'epoch': 1.0})
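
The `global_step=2000` above follows directly from the configuration: 2000 training samples, batch size 1, one epoch. The sketch below reproduces that arithmetic and illustrates how 300 warmup steps shape the learning rate. It assumes a single device, no gradient accumulation, a linear warmup followed by linear decay, and a base learning rate of 5e-5. The last two are, as far as we know, the Trainer defaults, but they are not set explicitly in this notebook, so treat them as assumptions. The function names are hypothetical.

```python
import math


def total_training_steps(num_samples, batch_size, epochs):
    """Number of optimizer updates (single device, no gradient accumulation)."""
    return math.ceil(num_samples / batch_size) * epochs


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total = total_training_steps(num_samples=2000, batch_size=1, epochs=1)
print(total)  # 2000, matching global_step in the training output

for step in (0, 150, 300, 1000, 2000):
    lr = linear_warmup_lr(step, base_lr=5e-5, warmup_steps=300, total_steps=total)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

With `save_steps=1000` and `eval_steps=1000`, this run therefore evaluates and saves a checkpoint twice, at steps 1000 and 2000.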

## Test finetuned model on test dataset
We can easily import the last checkpoint of our training and test the finetuned model on the test set.

In [9]:
model_tuned = PegasusForConditionalGeneration.from_pretrained("results/checkpoint-2000")
tokenizer_tuned = PegasusTokenizer.from_pretrained("results/checkpoint-2000")

model_base = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")
tokenizer_base = PegasusTokenizer.from_pretrained("google/pegasus-large")

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


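Below, we generate summaries for the first 50 articles of the test set with both the base model and the fine-tuned model.

P.S. There are several parameters for text generation. We tried a few configurations and then chose this one, but it can be edited at will to obtain different results. Note that the base model is run with its default generation settings, while the fine-tuned model uses max_length=128, early_stopping=True, and length_penalty=2.0.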
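
The `length_penalty` and `early_stopping` arguments only matter when generation uses beam search; whether it does depends on `num_beams` in the checkpoint's generation config, which is not shown in this notebook. As far as we understand the Transformers beam scorer, a finished hypothesis is ranked by the sum of its token log-probabilities divided by `length ** length_penalty`. The sketch below applies that formula to two hypothetical hypotheses (all numbers are made up for illustration) to show why a value of 2.0 tends to favor longer outputs.

```python
def beam_score(sum_log_probs, length, length_penalty):
    """Length-normalized score: sum of log-probs divided by length ** penalty."""
    return sum_log_probs / (length ** length_penalty)


def preferred(short, long, length_penalty):
    """Return which of two (sum_log_probs, length) hypotheses scores higher."""
    short_score = beam_score(*short, length_penalty)
    long_score = beam_score(*long, length_penalty)
    if short_score == long_score:
        return "tie"
    return "long" if long_score > short_score else "short"


# Two hypothetical hypotheses with the same average token log-probability (-0.5).
short_hyp = (-5.0, 10)   # (sum of log-probs, length in tokens)
long_hyp = (-20.0, 40)

for penalty in (0.0, 1.0, 2.0):
    print(f"length_penalty={penalty}: {preferred(short_hyp, long_hyp, penalty)}")
```

Because log-probabilities are negative, dividing by a larger power of the length moves the score of a long hypothesis closer to zero, so higher penalties favor longer summaries.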

In [10]:
test_dataset_article = dataset["test"]["article"][:50]
test_dataset_label = dataset["test"]["highlights"][:50]

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

model_base.to(device)
model_tuned.to(device)

predictions_base = []
predictions_tuned = []

for article in test_dataset_article:
  summary_base = model_base.generate(tokenizer_base(article, truncation=True,
                                                    return_tensors="pt").input_ids.to(device))
  output_base=tokenizer_base.decode(summary_base[0], skip_special_tokens=True)
  predictions_base.append(output_base)

  summary_tuned = model_tuned.generate(tokenizer_tuned(article, truncation=True, return_tensors="pt").input_ids.to(device),
                             max_length=128, early_stopping=True, length_penalty=2.0)
  output_tuned=tokenizer_tuned.decode(summary_tuned[0], skip_special_tokens=True)
  predictions_tuned.append(output_tuned)


cuda


In [11]:
print("Summary of the tuned model: ")
print(predictions_tuned[10])
print("Summary of the base model: ")
print(predictions_base[10])
print("Label summary: ")
print(test_dataset_label[10])

Summary of the tuned model: 
Yahya Rashid, 19, charged with terror offenses after he was arrested as he returned to Britain from Turkey. He's been charged with engaging in conduct in preparation of acts of terrorism and assisting others to commit acts of terrorism.
Summary of the base model: 
London (CNN)A 19-year-old man was charged Wednesday with terror offenses after he was arrested as he returned to Britain from Turkey, London's Metropolitan Police said.
Label summary: 
London's Metropolitan Police say the man was arrested at Luton airport after landing on a flight from Istanbul .
He's been charged with terror offenses allegedly committed since the start of November .


## Calculate Rouge metrics
Finally, we calculate the ROUGE metrics (F-scores) for both the base model and the fine-tuned model on the 50 test articles used above. The fine-tuned model improves on all three metrics.

In [12]:
rouge = Rouge()
scores_base = rouge.get_scores(predictions_base, test_dataset_label, avg=True)
scores_tuned = rouge.get_scores(predictions_tuned, test_dataset_label, avg=True)

print("Rouge base")
print(f"ROUGE-1: {scores_base['rouge-1']['f']:.4f}")
print(f"ROUGE-2: {scores_base['rouge-2']['f']:.4f}")
print(f"ROUGE-L: {scores_base['rouge-l']['f']:.4f}")
print("Rouge tuned")
print(f"ROUGE-1: {scores_tuned['rouge-1']['f']:.4f}")
print(f"ROUGE-2: {scores_tuned['rouge-2']['f']:.4f}")
print(f"ROUGE-L: {scores_tuned['rouge-l']['f']:.4f}")

Rouge base
ROUGE-1: 0.2246
ROUGE-2: 0.0727
ROUGE-L: 0.2021
Rouge tuned
ROUGE-1: 0.3066
ROUGE-2: 0.1279
ROUGE-L: 0.2756
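
To clarify what these scores measure, here is a minimal sketch of ROUGE-N using the textbook definition (clipped n-gram overlap, with precision, recall, and F1). It uses plain whitespace tokenization and is not the implementation of the `rouge` package used above, whose numbers may differ because of its own tokenization and counting choices. The example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Textbook ROUGE-N: precision, recall and F1 over clipped n-gram matches."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # each n-gram counted at most min(pred, ref) times
    precision = overlap / max(1, sum(pred.values()))
    recall = overlap / max(1, sum(ref.values()))
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}


prediction = "the man was charged with terror offenses"
reference = "the man was arrested and charged with terror offenses"
print(rouge_n(prediction, reference, n=1))  # p = 1.0, r = 7/9, f = 0.875
print(rouge_n(prediction, reference, n=2))  # p = 5/6, r = 5/8, f = 5/7
```

ROUGE-L follows the same precision/recall idea but is based on the longest common subsequence instead of fixed-size n-grams.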


## Try the model yourself on your own news or on a sample from the test set
In this section you can test the fine-tuned model on a text of your choice or on an item from the test set. To switch between the two options, change the input passed to the tokenizer in the generation calls (`input_text` or `test_dataset_article`).
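
Since the same tokenize, generate, and decode steps are repeated for each model, they could be wrapped in a small helper so that switching the input text is a one-argument change. The function below is a hypothetical convenience wrapper, not part of the original notebook; it only relies on the calls already used in the cells of this notebook.

```python
def summarize(text, model, tokenizer, device="cpu", **generate_kwargs):
    """Tokenize `text`, generate a summary with `model`, and decode it to a string."""
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    output_ids = model.generate(inputs.input_ids.to(device), **generate_kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage with the objects defined in this notebook (not run here):
# summarize(input_text, model_tuned, tokenizer_tuned, device,
#           max_length=128, early_stopping=True, length_penalty=2.0)
# summarize(input_text, model_base, tokenizer_base, device)
```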

In [14]:
input_text = """ Insert your article. """
# or
test_dataset_article = dataset["test"]["article"][1]  # index 1: the second article in the test set
test_dataset_label = dataset["test"]["highlights"][1]  # the matching reference summary

print("------------------Label-----------------------")
print(test_dataset_label)
print("-----------------------------------------")
print("------------------Generated summary from finetuning model-----------------------")
summary_tuned = model_tuned.generate(tokenizer_tuned(test_dataset_article, truncation=True, return_tensors="pt").input_ids.to(device),
                              max_length=128, early_stopping=True, length_penalty=2.0)
print(tokenizer_tuned.decode(summary_tuned[0], skip_special_tokens=True))
print("-----------------------------------------")
print("------------------Generated summary from base model-----------------------")
summary_base = model_base.generate(tokenizer_base(test_dataset_article, truncation=True, return_tensors="pt").input_ids.to(device))
print(tokenizer_base.decode(summary_base[0], skip_special_tokens=True))
print("-----------------------------------------")

------------------Label-----------------------
Theia, a bully breed mix, was apparently hit by a car, whacked with a hammer and buried in a field .
"She's a true miracle dog and she deserves a good life," says Sara Mellado, who is looking for a home for Theia .
-----------------------------------------
------------------Generated summary from finetuning model-----------------------
Theia is only one year old but the dog's brush with death did not leave her unscathed. She suffered a dislocated jaw, leg injuries and a caved-in sinus cavity. She is in desperate need of extensive medical procedures to fix her nasal damage.
-----------------------------------------
------------------Generated summary from base model-----------------------
That's according to Washington State University, where the dog -- a friendly white-and-black bully breed mix now named Theia -- has been receiving care at the Veterinary Teaching Hospital. The veterinary hospital's Good Samaritan Fund committee awarded som