## Pegasus Large Model for summarization

### Install libraries

In [1]:
!pip install --upgrade transformers
!pip install datasets
!pip install rouge_score
!pip install rouge
!pip install sentencepiece

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
[K     |████████████████████████████████| 4.0 MB 4.5 MB/s 
[?25hCollecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
[K     |████████████████████████████████| 6.5 MB 48.5 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 36.4 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 55.6 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
[K     |████████████████████████████████| 77 kB 5.6 MB/s 
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Fo

### Load dataset

In [3]:
from google.colab import drive
import pandas as pd
from datasets import load_dataset, load_metric, Dataset


drive.mount('/content/drive')
path = "/content/drive/MyDrive/NN/amazon_review_dataset_processed.csv"
df = pd.read_csv(path)
amazon = Dataset.from_pandas(df)
amazon.shape

(11848, 3)

### Import necessary libraries

In [4]:
import transformers
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
from datasets import load_dataset, load_metric
import torch
import numpy as np
import torch

### Import Pegasus Large Model 

In [5]:
model_name = 'google/pegasus-large'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = transformers.PegasusTokenizer.from_pretrained(model_name)

model = transformers.PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

Downloading:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.02k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.12G [00:00<?, ?B/s]

### Data Preprocessing pipeline

In [6]:
max_source_length = 512
max_target_length = 175

def preprocess_function(reviews):
  input_sequences = reviews['reviewText']
  inputs = [sequence for sequence in input_sequences]
  model_inputs = tokenizer(inputs, max_length=max_source_length, truncation=True, padding = True)

  #output_sequences = reviews['summary']
  #summaries = [output_sequences[i][0] for i in range(len(output_sequences))]
  summaries = reviews['summary']
  labels = tokenizer(summaries, max_length=max_target_length, truncation=True, padding = True)

  model_inputs['labels'] = labels['input_ids']
  return model_inputs

In [7]:
tokenized_amazon = amazon.map(preprocess_function, batched=True)


  0%|          | 0/12 [00:00<?, ?ba/s]

### Train test split

In [8]:

NotTest_Test = tokenized_amazon.train_test_split(test_size=0.1, seed=42)
NotTest = NotTest_Test["train"]
test = NotTest_Test["test"]

Train_Val = NotTest.train_test_split(test_size=0.1, seed=42)
train = Train_Val["train"]
val = Train_Val["test"]

print(train.shape, val.shape, test.shape)

(9596, 6) (1067, 6) (1185, 6)


In [9]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

### Fine-tuning the model

In [10]:
training_args = Seq2SeqTrainingArguments(
    output_dir = "./results",
    evaluation_strategy = 'epoch',
    learning_rate = 2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = train,
    eval_dataset = val,
    tokenizer = tokenizer,
    data_collator = data_collator
)

trainer.train()

Using amp half precision backend
The following columns in the training set  don't have a corresponding argument in `PegasusForConditionalGeneration.forward` and have been ignored: summary, Unnamed: 0, reviewText. If summary, Unnamed: 0, reviewText are not expected by `PegasusForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 9596
  Num Epochs = 3
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 28788


Epoch,Training Loss,Validation Loss
1,1.7978,1.659862
2,1.5797,1.627413
3,1.6249,1.620811


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1500/special_toke

TrainOutput(global_step=28788, training_loss=1.8637155329036568, metrics={'train_runtime': 13921.2637, 'train_samples_per_second': 2.068, 'train_steps_per_second': 2.068, 'total_flos': 4.159095077810995e+16, 'train_loss': 1.8637155329036568, 'epoch': 3.0})

### Saving the fine-tuned model

In [11]:
trainer.save_model("./finetunedModelLarge")

Saving model checkpoint to ./finetunedModelLarge
Configuration saved in ./finetunedModelLarge/config.json
Model weights saved in ./finetunedModelLarge/pytorch_model.bin
tokenizer config file saved in ./finetunedModelLarge/tokenizer_config.json
Special tokens file saved in ./finetunedModelLarge/special_tokens_map.json


### Loading the fine-tuned Model

In [12]:
finetuned = AutoModelForSeq2SeqLM.from_pretrained("./finetunedModelLarge")

loading configuration file ./finetunedModelLarge/config.json
Model config PegasusConfig {
  "_name_or_path": "./finetunedModelLarge",
  "activation_dropout": 0.1,
  "activation_function": "relu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "PegasusForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 16,
  "decoder_start_token_id": 0,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 16,
  "eos_token_id": 1,
  "extra_pos_embeddings": 1,
  "force_bos_token_to_be_generated": false,
  "forced_eos_token_id": 1,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": tru

### Generating summaries

In [25]:
all_outputs = []
x = torch.Tensor(test['input_ids']).long()


In [26]:
x_first = x[:20]
outputs = finetuned.generate(x_first, max_length=25, min_length=5, num_beams = 2, repetition_penalty = 2.5, early_stopping=True)
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
all_outputs.append(decoded_outputs)

In [29]:
x_second = x[20:40]
outputs = finetuned.generate(x_second, max_length=25, min_length=5, num_beams = 2, repetition_penalty = 2.5, early_stopping=True)
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
all_outputs.append(decoded_outputs)

In [30]:
x_third = x[40:60]
outputs = finetuned.generate(x_third, max_length=25, min_length=5, num_beams = 2, repetition_penalty = 2.5, early_stopping=True)
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
all_outputs.append(decoded_outputs)

In [31]:
x_fourth = x[60:80]
outputs = finetuned.generate(x_fourth, max_length=25, min_length=5, num_beams = 2, repetition_penalty = 2.5, early_stopping=True)
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
all_outputs.append(decoded_outputs)

In [32]:
x_fifth = x[80:100]
outputs = finetuned.generate(x_fifth, max_length=25, min_length=5, num_beams = 2, repetition_penalty = 2.5, early_stopping=True)
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
all_outputs.append(decoded_outputs)

In [33]:
x_sixth = x[100:120]
outputs = finetuned.generate(x_sixth, max_length=25, min_length=5, num_beams = 2, repetition_penalty = 2.5, early_stopping=True)
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
all_outputs.append(decoded_outputs)

In [34]:
x_seventh = x[120:140]
outputs = finetuned.generate(x_seventh, max_length=25, min_length=5, num_beams = 2, repetition_penalty = 2.5, early_stopping=True)
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
all_outputs.append(decoded_outputs)

In [35]:
x_eight = x[140:160]
outputs = finetuned.generate(x_eight, max_length=25, min_length=5, num_beams = 2, repetition_penalty = 2.5, early_stopping=True)
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
all_outputs.append(decoded_outputs)

In [36]:
x_ninth = x[160:180]
outputs = finetuned.generate(x_ninth, max_length=25, min_length=5, num_beams = 2, repetition_penalty = 2.5, early_stopping=True)
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
all_outputs.append(decoded_outputs)

In [37]:
x_tenth = x[180:200]
outputs = finetuned.generate(x_tenth, max_length=25, min_length=5, num_beams = 2, repetition_penalty = 2.5, early_stopping=True)
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
all_outputs.append(decoded_outputs)

### Evaluation usinng Rouge Scores

In [38]:
metric = load_metric('rouge')
flattened_outputs = np.array(all_outputs).flatten()
values = metric.compute(predictions=test['summary'][:200], references = flattened_outputs, use_stemmer=True)
values

Downloading builder script:   0%|          | 0.00/2.16k [00:00<?, ?B/s]

{'rouge1': AggregateScore(low=Score(precision=0.1877720547329778, recall=0.18528026584429938, fmeasure=0.18073063562737102), mid=Score(precision=0.21628017160428165, recall=0.2143904592784237, fmeasure=0.20824760536855041), high=Score(precision=0.24577747672541603, recall=0.2443517436006116, fmeasure=0.23637194277222048)),
 'rouge2': AggregateScore(low=Score(precision=0.06868136928431044, recall=0.07013399439612673, fmeasure=0.06703398324609174), mid=Score(precision=0.09424022178801589, recall=0.09460668212506446, fmeasure=0.0905470085984775), high=Score(precision=0.12306484142899582, recall=0.12186393920621857, fmeasure=0.11711754526039134)),
 'rougeL': AggregateScore(low=Score(precision=0.16650477727842758, recall=0.16655897088109262, fmeasure=0.16218436151956495), mid=Score(precision=0.1933409698961171, recall=0.19358862470733607, fmeasure=0.18714703866573495), high=Score(precision=0.22225511786652477, recall=0.22138861488206504, fmeasure=0.21404071446582373)),
 'rougeLsum': Aggrega

### Printing the generated summaries

In [39]:
print(flattened_outputs)

['decent write super annoy heroine hot sex totally abrupt end'
 'cpr provide robust solution unwanted call one thing would like see improve'
 'theres katherine garbera eve gaddy whiskey river book 1'
 "'s first great record-one best ever hear ''"
 "blu-ray dvd combo set  heaven's lose property angeloid clockwork '' release feb. 2013."
 '6d new approach full frame dslr probably forward-thinking production'
 "inuyasha final act '' one special title call total package"
 "... milk human kindness... '' salty dog procol harum"
 'eye-fi sdhc 8gb sdhc wi-fi card absolutely useless'
 'nice case ipad air feature rechargeable bluetooth keyboard removable case nicely design quality stitch pu leather excellent interior microfiber line'
 'dirty sexy saint carly phillips erika wilde'
 'great game children/teens cognitively challenge people uncreative lazy parent'
 'astak ultrafast battery charger despite show amazon verify customer product'
 "... count bless... '' frank capra's 1954 holiday classic"


### Printing the actual dataset summaries

In [40]:
print(test['summary'][:200])

["decent write problem annoy heroine 's abrupt end might want pass one", 'well ... far good us use verizon fios single family home dect phone', 'shes determine go away present ryder let go easily ryder ford addi', '4.5 star underrate performance surprise delight throughout -minor vocal issue prevent full 5 star', 'entertain film fan watch previous series full fan-service', 'canon 6d bring modern tech classic shoot technique way keep photographer involve', "eight year hiatus original anime series `` inuyasha '' back first half final act", "`` salty dog '' allow procol harum sail uncharted waters-the salvo edition sound extremely good", 'eye.fi software steal account passwords review vote eye-fi shill read fast', 'excellent case lot nice feature auto sleep/wake function price right', 'author carly phillips erika wilde proud wonderful debut new series really great', 'great fun intelligent well-read creative type much lazy parent people hat english light class', 'astak ultrafast battery ch