# FINE TUNING OF GOOGLE'S PEGASUS LARGE MODEL FOR NEWS SUMMARIZATION

## Introduction
The goal of this notebook is to generate coherent summaries from news articles.
To perform this task, we start from PEGASUS, a model developed by Google.
Quoting Google's research blog on PEGASUS (you can read more [here](https://blog.research.google/2020/06/pegasus-state-of-art-model-for.html)):
> «In PEGASUS pre-training, some whole sentences are removed from the documents and the model is tasked with retrieving them. An example input for pre-training is a document with missing sentences, while the output consists of the missing sentences concatenated together. [...] A challenging task like that encourages the model to learn about language and general facts about the world, as well as how to distill information taken from throughout a document in order to generate output that closely resembles the fine-tuning summarization task.»

In other words, PEGASUS is not pre-trained on summarization itself, but on a closely related task, which already gives the model good performance when generating summaries.
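
The pre-training task described in the quote can be illustrated with a toy sketch. This is only an illustration under assumptions: the helper name make_gap_sentence_example and the placeholder string in MASK are hypothetical, and the sentences to remove are passed by hand, whereas the real pre-training selects them automatically.

```python
# Toy sketch of a gap-sentence pre-training example.
# MASK and make_gap_sentence_example are hypothetical names used for illustration.
MASK = "<mask_1>"  # assumed placeholder for a removed sentence


def make_gap_sentence_example(sentences, masked_indices):
    """Return (input_text, target_text) for one document."""
    masked = set(masked_indices)
    # In the input, every selected sentence is replaced by the placeholder.
    input_text = " ".join(
        MASK if i in masked else sentence for i, sentence in enumerate(sentences)
    )
    # The target is the removed sentences concatenated in document order.
    target_text = " ".join(
        sentence for i, sentence in enumerate(sentences) if i in masked
    )
    return input_text, target_text


document = [
    "The council met on Monday.",
    "Members discussed the new budget.",
    "The vote is due next week.",
]
source, target = make_gap_sentence_example(document, masked_indices=[1])
print(source)
print(target)
```

The model sees the text in source and has to generate the text in target, which is structurally the same as reading an article and writing its summary.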

After some testing with the large model, however, it became evident that it is often not accurate on this specific task. In some tests it failed to produce a summary that captures the core of the article, and it did not adapt well to producing brief summaries: when the requested summary length is "too short" for the model, it often truncates sentences.

So we decided to fine-tune the model on a very small part of the well-known XSum dataset ([more information here](https://huggingface.co/datasets/xsum)).

The next notebook cells will show the implementation.


## Install and import libraries
The libraries we need are installed and imported in the following cell:
*   Standard libraries such as Pandas and NumPy
*   Transformers
*   PyTorch
*   Datasets
*   Rouge



In [None]:
!pip install datasets transformers transformers[torch] sentencepiece rouge --quiet
import pandas as pd
from transformers import PegasusForConditionalGeneration, PegasusTokenizer, Trainer, TrainingArguments, Adafactor, get_linear_schedule_with_warmup
from datasets import load_dataset, DatasetDict, ClassLabel
import torch
from rouge import Rouge
import random
from tqdm import tqdm
from IPython.display import display, HTML

## The XSum dataset and the CustomDataset Class
Here the dataset is loaded and some random samples are shown.
The dataset is already split into training, test, and validation sets.
We are interested in:
* The "document" column, containing the full article
* The "summary" column, containing the reference summary, which will act as the label

In [None]:
dataset = load_dataset("xsum")
dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

### **Here are some random examples of what the dataset rows look like**

In [None]:
def show_random_elements(dataset, num_examples=5):
    # Sample distinct row indices without replacement
    # (raises ValueError if num_examples exceeds the dataset size).
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(dataset["train"])

Unnamed: 0,document,summary,id
0,"The incident took place at Gordons Chemists on Broad Street just before 17:00 on Friday.\nThe men, aged 19 and 26, are due to appear at Dunfermline Sheriff Court on Monday.\nBoxes of prescription medicine were stolen during the robbery.",Two men have been charged in connection with an armed robbery in Cowdenbeath.,39548137
1,But just how much attention have you been paying?\nWhy was new West Ham winger Sofiane Feghouli suspended by Valencia last season? Why did newly appointed Chelsea manager Antonio Conte want to speak to Leicester boss Claudio Ranieri?\nTest your knowledge in BBC Sport's quiz:,A new Premier League season is about to start after a summer of managerial changes and record transfers.,37027702
2,"The Superhero Series allows disabled and non-disabled people to take part in triathlons - either solo, or in a team.\n""As a person with a disability looking to do sport for fun I've found it a real struggle,"" Warner said.\n""I know what a positive impact sport can have on a person's life - I believe everyone should have the chance.""\nThe former T35 sprinter - who has cerebral palsy - became the first disabled person to enter the London Triathlon back in 1998.\n""Even as an elite athlete I found it hard to keep up and had to ask other participants for help with my wetsuit and to get my bike down from the bike rack,"" she continued.\n""I've also taken part in fun runs where the roads have reopened and organisers have started clearing up before I've had a chance to finish.""\nThe Superhero Series is described as being ""dedicated to the everyday superhero - the UK's 12 million people with disabilities - and their friends and families"".\nWorld Championship medallist Warner added: ""The idea is simple: to create fun, gutsy events where people with disabilities call the shots and don't have to worry about cut-off times or equipment restrictions.\n""If you need flippers or floats in the water, or want to use your powered wheelchair, we make it possible.\n""In fact, as far as I'm concerned, anything goes.""\nEntrants can choose to do the whole triathlon or just one or two stages as part of a relay with disabled and non-disabled family and friends.\nThere are three triathlon distances to choose from and all disabled participants are invited to bring along a free ""sidekick"" to assist them in completing the course.\n""We've gone all out to try to think of everything we can to ensure everyone can be a superhero for the day,"" Warner said.\nThere is also the chance to compete as part of a celebrity team alongside the likes of Rio cycling and athletics Paralympic medallist Kadeena Cox or Channel 4 TV presenter Sophie Morgan.\nThe 20 ""celebrity captains"" for the event 
will choose two athletes each to make up a team, with entrants asked to submit reasons they should be selected.",British Paralympian Sophia Warner has launched a mass-participation event aimed at encouraging those with disabilities to get into sport.,37988918
3,"A pilot scheme in England that allows parents to claim free childcare for three and four-year-olds has seen take-up rates of over 80%.\nNurseries say rising staff costs and inflation will force many providers to close.\nThe government says it is investing a ""record £6bn in childcare"".\nThe Conservatives' promise to double the amount of free childcare that parents in England can claim has proved very popular in pilot areas.\nNursery providers are concerned that the money being offered by the government to provide the service will not cover the rising costs of looking after the children.\nIn Harrogate, 90% of providers have told the BBC they will not offer 30 hours free childcare in 2017 unless the funding proposals are changed.\n""We'll go bust under the current proposals,"" says Josy Thompson, a nursery owner.\nProviders in the North Yorkshire town say the cost of providing the care is £4.50 an hour per child, but that the government subsidy will only cover £3.40 of that.\nEight areas of England have been trialling the 30-hour scheme since September, with the majority of councils involved reporting take-up rates of over 80%.\n""It's a great idea and some parents have seen their weekly childcare bills cut from £425 to £85 a week."", says Vanessa Warn, who owns three nurseries in York - one of the pilot areas.\nNurseries say introducing a living wage for their staff and a general rise in inflation will mean the proposed £6bn a year in funding for the scheme will not keep pace with their rising costs.\n""The government want us to deliver a champagne nursery service for lemonade prices,"" added Ms Warn.\nThe government says it has provided an extra £300m a year to boost the average hourly rate paid to providers across England to £4.88 per hour from £4.56.\nAs part of the scheme, nurseries are allowed to charge parents for some ""additional services"" like food and special events.\n""It's great that so many parents want to take advantage of this scheme but the 
government has got to listen to people on the frontline"" says Clare Schofield from the National Day Nurseries Association\nThe government hopes that by offering parents 30 hours of free childcare more adults will return to the workforce or look to increase their hours.\nThe Office for National Statistics estimates that there are 1.9 million people in England who choose not to work because they are instead looking after their family.\nA Department for Education spokesperson said: ""We've had huge demand from local areas to take part in delivering our 30-hour offer a year early and we are working with the eight areas that were chosen to get the delivery of our offer right so we can hit the ground running in September 2017.""","Nursery providers are warning high demand for a new free 30-hour childcare scheme and rising costs will mean many nurseries will ""go bust"".",37745435
4,"White, 24, who joined the Dragons from Livingston in May, is in hospital because of problems with his bloodstream.\n""They're trying to get to the bottom of it but it more or less paralysed him for a while because he couldn't move,"" Wrexham manager Gary Mills said.\n""He's got to go on a course of antibiotics for eight weeks without any exercise or anything.""\nMills added: ""We wish him well and hopefully we can get him back as quickly as possible.""\nWhite's illness could force Mills to sign a striker ahead of Wrexham's National League season opener against Dover Athletic on 6 August.\nMills has further added to his squad with the signing of former Northampton Town and Cambridge United goalkeeper Chris Dunn.\nDunn has signed a one-year contract and will also be the club's goalkeeping coach.\nThe 28-year-old has been training with the Dragons during pre-season and will provide competition for Shwan Jalal, who joined from Macclesfield Town in June.\n""That gives me two top keepers in this league,"" Mills told BBC Wales Sport.",Wrexham will be without striker Jordan White for two months.,36867630


### A CustomDataset class wraps the tokenized articles (encodings) and summaries (labels) in a PyTorch Dataset


In [None]:
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels['input_ids'][idx])
        return item
    def __len__(self):
        return len(self.labels['input_ids'])

## Pre-trained model and tokenizer
**Now, we can import the original model and its corresponding tokenizer.**

In [None]:
model_name = 'google/pegasus-large'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
tokenizer = PegasusTokenizer.from_pretrained(model_name)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Some examples of how the tokenizer works

In [None]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8087, 108, 136, 156, 5577, 147, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8087, 108, 136, 156, 5577, 147, 1], [182, 117, 372, 5577, 107, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

## Tokenization and dataset creation
We create a function to tokenize the dataset. It uses the PEGASUS tokenizer to tokenize the articles and the summaries, with a max_length of 1024 tokens for the article (input) and 128 for the summary (output). Then, we use the CustomDataset class to create the final dataset.
As mentioned in the introduction, we do not use the entire dataset but only a very small portion of it: 1000 training samples and 100 validation samples. According to the Google research blog article, as few as 1000 examples are enough to achieve good results when fine-tuning PEGASUS for summarization.


In [None]:
def tokenize_data(texts, labels):
  encodings = tokenizer(texts, max_length=1024, truncation=True, padding=True)
  decodings = tokenizer(text_target=labels, max_length=128, truncation=True, padding=True)
  dataset_tokenized = CustomDataset(encodings, decodings)
  return dataset_tokenized
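
Because the summaries are padded to a common length, the label sequences contain padding ids. A common refinement, which this notebook does not apply, is to replace those ids with -100 so that the loss ignores them. The sketch below works on plain lists; the helper name mask_label_padding is hypothetical, and pad_token_id=0 is an assumption that should be read from tokenizer.pad_token_id in practice.

```python
# Hypothetical helper: hide padding positions in the labels from the loss.
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_label_padding(label_ids, pad_token_id=0):
    """Replace padding ids in every label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


# Two label sequences padded to the same length (the first ends with two pads).
padded_labels = [
    [182, 117, 372, 1, 0, 0],
    [8087, 108, 136, 156, 5577, 1],
]
masked = mask_label_padding(padded_labels)
print(masked)
```

Under these assumptions, the helper could be applied to decodings["input_ids"] before the CustomDataset is built.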

In [None]:
train_texts, train_labels = dataset['train']['document'][:1000], dataset['train']['summary'][:1000]
valid_texts, valid_labels = dataset['validation']['document'][:100], dataset['validation']['summary'][:100]

train_dataset = tokenize_data(train_texts, train_labels)
val_dataset = tokenize_data(valid_texts, valid_labels)

## Finetune the model
Now we can configure the fine-tuning process through the Trainer API. We tried several training runs with different parameters. Here we report one configuration that takes very little time to run on a GPU, is easily reproducible, and still gives good results.

In [None]:
freeze_encoder=False

if freeze_encoder:
    for param in model.model.encoder.parameters():
        param.requires_grad = False

training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=15,
    learning_rate=5e-5,
    optim="adafactor",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    warmup_steps=50,
    weight_decay=0.01,
    max_grad_norm=1.0,
    evaluation_strategy='epoch',
    save_strategy="epoch",
    save_total_limit=2,
    gradient_accumulation_steps=16,
    fp16=False,
    logging_dir='./logs',
    logging_steps=1,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)
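
The batch-related arguments above interact: with a per-device batch size of 1 and 16 gradient accumulation steps, each optimizer update sees 16 samples. The arithmetic below is a rough sketch that assumes a single GPU (num_devices is an assumption, not taken from the notebook); the Trainer's exact rounding of partial accumulation steps may differ between versions.

```python
# Values taken from the TrainingArguments and the tokenization cell.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_train_samples = 1000
num_devices = 1  # assumption: training runs on a single GPU

# Number of samples that contribute to one optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Approximate number of optimizer updates in one pass over the training data.
updates_per_epoch = num_train_samples / effective_batch_size

# The checkpoint loaded later in the notebook is named checkpoint-375.
checkpoint_step = 375
epochs_at_checkpoint = checkpoint_step / updates_per_epoch

print(effective_batch_size, updates_per_epoch, epochs_at_checkpoint)
```

Under these assumptions, step 375 corresponds to about 6 epochs, which agrees with the last epoch shown in the training log.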

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss
0,7.1792,5.119861
2,4.9512,3.652002
4,1.5391,1.039047
6,0.7005,0.990224


## Test finetuned model on test dataset
We can easily import the last checkpoint of our training and test the finetuned model on the test set.

In [None]:
model_tuned = PegasusForConditionalGeneration.from_pretrained("results/checkpoint-375")
tokenizer_tuned = PegasusTokenizer.from_pretrained("results/checkpoint-375")

model_base = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")
tokenizer_base = PegasusTokenizer.from_pretrained("google/pegasus-large")

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some summaries obtained from the model on the test set are shown.

P.S. There are several parameters for text generation. We tried a few configurations and then chose this one, but it can be edited at will to obtain different results.
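
One of these parameters, length_penalty, is easy to misread. The sketch below assumes that a finished beam is scored as its summed log-probability divided by its length raised to length_penalty, which is how the Transformers documentation describes the parameter. The candidate numbers are invented for illustration only.

```python
def beam_score(sum_logprob, length, length_penalty):
    """Length-normalized score of a finished beam (higher is better)."""
    return sum_logprob / (length ** length_penalty)


# Invented candidates: name -> (summed log-probability, length in tokens).
candidates = {
    "short": (-4.0, 8),
    "long": (-7.2, 16),
}


def best_candidate(length_penalty):
    """Return the name of the candidate with the highest normalized score."""
    return max(
        candidates,
        key=lambda name: beam_score(*candidates[name], length_penalty),
    )


for penalty in (1.0, 0.8):
    print(penalty, best_candidate(penalty))
```

In this toy case the longer candidate wins with a penalty of 1.0, while the shorter one wins with the value 0.8 used in this notebook, so lowering the penalty nudges generation toward shorter summaries.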

In [None]:
test_dataset_articles = dataset["test"]["document"][:150]
test_dataset_labels = dataset["test"]["summary"][:150]

device = "cuda" if torch.cuda.is_available() else "cpu"

batch_size = 4
article_batches = [test_dataset_articles[i:i+batch_size] for i in range(0, len(test_dataset_articles), batch_size)]

predictions_base = []
model_base.to(device)

for batch in tqdm(article_batches, desc="Generating summaries from base model: ", colour="green"):
    base_inputs = tokenizer_base(batch, truncation=True, padding=True, max_length=1024, return_tensors="pt")
    base_inputs = {key: val.to(device) for key, val in base_inputs.items()}
    base_summary = model_base.generate(**base_inputs, max_length=128, early_stopping=True, length_penalty=0.8)
    base_outputs = [tokenizer_base.decode(summary, skip_special_tokens=True) for summary in base_summary]
    predictions_base.extend(base_outputs)

predictions_tuned = []
model_tuned.to(device)

for batch in tqdm(article_batches, desc="Generating summaries from tuned model: ", colour="green"):
    tuned_inputs = tokenizer_tuned(batch, truncation=True, padding=True, max_length=1024, return_tensors="pt")
    tuned_inputs = {key: val.to(device) for key, val in tuned_inputs.items()}
    tuned_summary = model_tuned.generate(**tuned_inputs, max_length=128, early_stopping=True, length_penalty=0.8)
    tuned_outputs = [tokenizer_tuned.decode(summary, skip_special_tokens=True) for summary in tuned_summary]
    predictions_tuned.extend(tuned_outputs)

Generating summaries from base model: 100%|██████████| 38/38 [03:35<00:00,  5.66s/it]
Generating summaries from tuned model: 100%|██████████| 38/38 [01:23<00:00,  2.20s/it]


In [None]:
for i in tqdm(range(10, 40), colour="green"):
  print("\n Article: ")
  print(test_dataset_articles[i])
  print("\n Summary of the tuned model: ")
  print(predictions_tuned[i])
  print("\n Summary of the base model: ")
  print(predictions_base[i])
  print("\n Label summary: ")
  print(test_dataset_labels[i])
  print("\n----------------------------------------------------------------------------")

100%|██████████| 30/30 [00:00<00:00, 736.44it/s]


 Article: 
The move is in response to an £8m cut in the subsidy received from the Department of Employment and Learning (DEL).
The cut in undergraduate places will come into effect from September 2015.
Job losses will be among both academic and non-academic staff and Queen's says no compulsory redundancies should be required.
There are currently around 17,000 full-time undergraduate and postgraduate students at the university, and around 3,800 staff.
Queen's has a current intake of around 4,500 undergraduates per year.
The university aims to reduce the number of student places by 1,010 over the next three years.
The BBC understands that there are no immediate plans to close departments or courses, but that the cuts in funding may put some departments and courses at risk.
The Education Minister Stephen Farry said he recognised that some students might now choose to study in other areas of the UK because of the cuts facing Northern Ireland's universities.
"Some people will now be forced




## Calculate Rouge metrics
Finally, we calculate the ROUGE metrics (ROUGE-1, ROUGE-2, and ROUGE-L) for both the base model and the fine-tuned model. The fine-tuned model improves on all three.
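
To make the reported numbers easier to read, here is a simplified sketch of what ROUGE-1 measures: the unigram overlap between a generated summary and the reference, expressed as precision, recall, and F1. The function rouge_1 is a hypothetical helper that uses whitespace tokenization and clipped counts; the rouge package used in this notebook may tokenize and count differently, so its scores will not necessarily match.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Simplified ROUGE-1: unigram overlap precision, recall, and F1."""
    candidate_counts = Counter(candidate.lower().split())
    reference_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((candidate_counts & reference_counts).values())
    precision = overlap / max(sum(candidate_counts.values()), 1)
    recall = overlap / max(sum(reference_counts.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}


scores = rouge_1("the cat sat", "the cat sat on the mat")
print(scores)
```

Here every word of the candidate appears in the reference (precision 1.0), but the candidate covers only half of the reference words (recall 0.5). ROUGE-2 applies the same idea to pairs of consecutive words, and ROUGE-L to the longest common subsequence.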

In [None]:
rouge = Rouge()
scores_base = rouge.get_scores(predictions_base, test_dataset_labels, avg=True)
scores_tuned = rouge.get_scores(predictions_tuned, test_dataset_labels, avg=True)

def print_rouge(title, scores):
    # Print F1, precision, and recall for ROUGE-1, ROUGE-2, and ROUGE-L.
    print(title)
    for metric in ("rouge-1", "rouge-2", "rouge-l"):
        for label, key in (("F1", "f"), ("Precision", "p"), ("Recall", "r")):
            print(f"{metric.upper()} - {label}: {scores[metric][key]:.4f}")

print_rouge("Rouge base", scores_base)
print_rouge("Rouge tuned", scores_tuned)

Rouge base
ROUGE-1 - F1: 0.1519
ROUGE-1 - Precision: 0.1220
ROUGE-1 - Recall: 0.2504
ROUGE-2 - F1: 0.0184
ROUGE-2 - Precision: 0.0140
ROUGE-2 - Recall: 0.0358
ROUGE-L - F1: 0.1224
ROUGE-L - Precision: 0.0979
ROUGE-L - Recall: 0.2040
Rouge tuned
ROUGE-1 - F1: 0.3910
ROUGE-1 - Precision: 0.4202
ROUGE-1 - Recall: 0.3775
ROUGE-2 - F1: 0.1734
ROUGE-2 - Precision: 0.1886
ROUGE-2 - Recall: 0.1665
ROUGE-L - F1: 0.3263
ROUGE-L - Precision: 0.3513
ROUGE-L - Recall: 0.3145


## Try the model yourself on your own news article or on a sample from the test set
In this section you can test the fine-tuned model on an article of your choice or on an item from the test set. To switch between the two options, change the text passed to the tokenizer in the generate calls (your_article or test_dataset_article).

In [None]:
your_article = """
A 1,200lb-man forced to move from his financially-troubled nursing home was hoisted from the building by a crane and driven to his new residence on a flatbed truck. Robert Butler, 43, was transported inside a shipping container from Bannister House in Providence, Rhode Island to the Eleanor Slater Hospital in Cranston. He was accompanied by his medical team. The operation took almost seven hours on Sunday and involved the Providence and Cranston fire departments, Lifespan, the Hospital Association of Rhode Island and Bay Crane Northeast. Scroll down for video . Robert Butler was transported inside a shipping container from Bannister House in Providence, Rhode Island to the Eleanor Slater Hospital in Cranston on Sunday after his nursing home shut down amid financial difficulties. In this 2009 image, he weighed 900lbs. He has since gained another 300lbs . The Rhode Island health department acquired a crane from Bay Crane Northeast to help with their operation . Mr Butler was moved to his new hospital resident in a shipping container alongside his medical team to monitor his health . According to Target 12, the complex operation began with firefighters widening the door of Butler's room, then building a special ramp to move Mr Butler on to a deck. He was then shifted into the shipping container, complete with medical equipment, and lifted with a crane down on to a flatbed truck. Michael Raia of the Rhode Island Executive Office of Health & Human Services said planning for the move began weeks ago. Mr Raia said the new facility is better equipped to help Butler. In 2006, Mr Butler, who is on permanent disability, told local news channels that his bed was broken  at the nursing home and he was not receiving proper care. At the time, he weighed 900lbs and was trying to find a doctor to perform a gastric bypass and a way to cover the cost of the medical bills. He told WPRI that his weight problem was linked to his depression and that he was addicted to food. 
The crane was used to hoist Mr Butler in a medically-equipped shipping container from the nursing home on to a flatbed truck on Sunday afternoon . Mr Butler weighs around 1,200lb after a decade of battling his weight which he linked to depression . Mr Butler said in an interview in 2006 (pictured) that he was desperate for a gastric bypass but was unable to find a doctor to perform the surgery. At the time he weight 900lb .
"""
# or use an item from the test set
test_dataset_article = dataset["test"]["document"][22]  # article at index 22 of the test set
test_dataset_label = dataset["test"]["summary"][22]  # reference summary at index 22

print("------------------Article-----------------------")
print(your_article)
print("-----------------------------------------")
print("------------------Generated summary from finetuning model-----------------------")
summary_tuned = model_tuned.generate(tokenizer_tuned(your_article, truncation=True, return_tensors="pt").input_ids.to(device),
                              max_length=128, early_stopping=True, length_penalty=0.8)
print(tokenizer_tuned.decode(summary_tuned[0], skip_special_tokens=True))
print("-----------------------------------------")
print("------------------Generated summary from base model-----------------------")
summary_base = model_base.generate(tokenizer_base(your_article, truncation=True, return_tensors="pt").input_ids.to(device),
                                  max_length=128, early_stopping=True, length_penalty=0.8)
print(tokenizer_base.decode(summary_base[0], skip_special_tokens=True))
print("-----------------------------------------")

------------------Article-----------------------

A 1,200lb-man forced to move from his financially-troubled nursing home was hoisted from the building by a crane and driven to his new residence on a flatbed truck. Robert Butler, 43, was transported inside a shipping container from Bannister House in Providence, Rhode Island to the Eleanor Slater Hospital in Cranston. He was accompanied by his medical team. The operation took almost seven hours on Sunday and involved the Providence and Cranston fire departments, Lifespan, the Hospital Association of Rhode Island and Bay Crane Northeast. Scroll down for video . Robert Butler was transported inside a shipping container from Bannister House in Providence, Rhode Island to the Eleanor Slater Hospital in Cranston on Sunday after his nursing home shut down amid financial difficulties. In this 2009 image, he weighed 900lbs. He has since gained another 300lbs . The Rhode Island health department acquired a crane from Bay Crane Northeast to help