## Pegasus CNN-Dailymail model for summarization

### Install libraries

In [None]:
!pip install --upgrade transformers
!pip install datasets
!pip install rouge_score
!pip install rouge
!pip install sentencepiece

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml


### Load dataset

In [None]:
from google.colab import drive
from datasets import Dataset
import pandas as pd

# Mount Google Drive so the preprocessed review CSV is readable from Colab
drive.mount('/content/drive')
path = "/content/drive/MyDrive/NN/amazon_review_dataset_processed.csv"

# Read the CSV into a DataFrame and wrap it as a Hugging Face Dataset
df = pd.read_csv(path)
amazon = Dataset.from_pandas(df)
amazon.shape

(11848, 3)

### Import necessary libraries

In [None]:
import numpy as np
import torch
import transformers
from datasets import load_metric
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
from transformers import PegasusForConditionalGeneration, PegasusTokenizer


### Importing the model

In [None]:
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-cnn_dailymail").to(device)
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-cnn_dailymail")


### Data preprocessing pipeline

In [None]:
max_source_length = 512
max_target_length = 175

def preprocess_function(reviews):
  # Tokenize the review texts, truncating to max_source_length and padding to the longest in the batch
  model_inputs = tokenizer(reviews['reviewText'], max_length=max_source_length, truncation=True, padding=True)

  # Tokenize the target summaries the same way; padded label positions hold the
  # pad token id rather than -100, so the training loss also counts them
  labels = tokenizer(reviews['summary'], max_length=max_target_length, truncation=True, padding=True)

  model_inputs['labels'] = labels['input_ids']
  return model_inputs
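
Because the labels are padded with the tokenizer's pad token id, the loss is also computed on the padded positions. A common remedy is to replace those ids with -100, the value that PyTorch's cross-entropy loss ignores by default. The following is a minimal sketch of that replacement on plain Python lists. `mask_label_padding` is a hypothetical helper that is not part of this notebook, and the default pad id of 0 is an assumption about the Pegasus tokenizer.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default


def mask_label_padding(label_ids, pad_token_id=0):
    """Replace pad token ids with IGNORE_INDEX in a batch of label sequences.

    Hypothetical helper; pad_token_id=0 is an assumed default.
    """
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


# Toy batch: two label sequences padded to length 5 with pad id 0
padded_labels = [[452, 1207, 1, 0, 0], [88, 1, 0, 0, 0]]
masked_labels = mask_label_padding(padded_labels)
print(masked_labels)
```

Inside `preprocess_function`, such a helper would presumably be applied to `labels['input_ids']` with `tokenizer.pad_token_id` before assigning `model_inputs['labels']`. Doing so changes the training loss, so the numbers reported in this notebook would not carry over.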

In [None]:
tokenized_amazon = amazon.map(preprocess_function, batched=True)



### Train-test split

In [None]:
NotTest_Test = tokenized_amazon.train_test_split(test_size=0.1 ,seed = 42)
NotTest = NotTest_Test["train"]
test = NotTest_Test["test"]

Train_Val = NotTest.train_test_split(test_size=0.1 , seed = 42)
train = Train_Val["train"]
val = Train_Val["test"]

print(train.shape, val.shape, test.shape)

(9596, 6) (1067, 6) (1185, 6)
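
The three row counts follow from applying a 10% split twice to the 11848 rows. The sketch below reproduces the arithmetic under the assumption that the test share is rounded up to a whole row; `split_sizes` is a hypothetical helper used only for illustration, not a function of the `datasets` library.

```python
import math


def split_sizes(n_rows, test_fraction):
    """Return (train_rows, test_rows), assuming the test share is rounded up."""
    test_rows = math.ceil(n_rows * test_fraction)
    return n_rows - test_rows, test_rows


# First split: hold out 10% of all rows as the test set
not_test_rows, test_rows = split_sizes(11848, 0.1)
# Second split: hold out 10% of the remainder as the validation set
train_rows, val_rows = split_sizes(not_test_rows, 0.1)
print(train_rows, val_rows, test_rows)  # 9596 1067 1185
```

The final proportions are therefore roughly 81% train, 9% validation and 10% test.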


In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

### Fine-tuning the model

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir = "./results",
    evaluation_strategy = 'epoch',
    learning_rate = 2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = train,
    eval_dataset = val,
    tokenizer = tokenizer,
    data_collator = data_collator
)

trainer.train()

Using amp half precision backend
The following columns in the training set  don't have a corresponding argument in `PegasusForConditionalGeneration.forward` and have been ignored: Unnamed: 0, reviewText, summary. If Unnamed: 0, reviewText, summary are not expected by `PegasusForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 9596
  Num Epochs = 3
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 28788


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 1.8937 | 1.732756 |
| 2 | 1.6511 | 1.688964 |
| 3 | 1.6927 | 1.680549 |



TrainOutput(global_step=28788, training_loss=1.9701001325116747, metrics={'train_runtime': 13653.8127, 'train_samples_per_second': 2.108, 'train_steps_per_second': 2.108, 'total_flos': 4.159095077810995e+16, 'train_loss': 1.9701001325116747, 'epoch': 3.0})
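
The step count and throughput in this log can be checked by hand. The sketch below recomputes them from the values printed above, assuming a single device, which is consistent with the reported total train batch size of 1.

```python
import math

# Values copied from the training log above
num_examples = 9596
num_epochs = 3
per_device_batch_size = 1
gradient_accumulation_steps = 1
train_runtime_seconds = 13653.8127

# One optimizer step consumes batch_size * accumulation_steps examples (single device assumed)
examples_per_step = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / examples_per_step)
total_steps = steps_per_epoch * num_epochs
steps_per_second = total_steps / train_runtime_seconds
runtime_hours = train_runtime_seconds / 3600

print(total_steps)                 # 28788
print(round(steps_per_second, 3))  # 2.108
print(round(runtime_hours, 2))     # 3.79
```

With a batch size of 1, steps per second and samples per second coincide, which is why the log reports 2.108 for both.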

### Saving the fine-tuned model

In [None]:
trainer.save_model("./finetunedModelCnn")

Saving model checkpoint to ./finetunedModelCnn
Configuration saved in ./finetunedModelCnn/config.json
Model weights saved in ./finetunedModelCnn/pytorch_model.bin
tokenizer config file saved in ./finetunedModelCnn/tokenizer_config.json
Special tokens file saved in ./finetunedModelCnn/special_tokens_map.json


### Loading the fine-tuned model

In [None]:
finetuned = AutoModelForSeq2SeqLM.from_pretrained("./finetunedModelCnn")

loading configuration file ./finetunedModelCnn/config.json
Model config PegasusConfig {
  "_name_or_path": "./finetunedModelCnn",
  "activation_dropout": 0.1,
  "activation_function": "relu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "PegasusForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 16,
  "decoder_start_token_id": 0,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 16,
  "eos_token_id": 1,
  "extra_pos_embeddings": 1,
  "forced_eos_token_id": 1,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "len

### Generating summaries

In [None]:
all_outputs = []
x = torch.tensor(test['input_ids'], dtype=torch.long)

In [None]:
# Generate summaries for the first 200 test reviews in batches of 20
batch_size = 20
for start in range(0, 200, batch_size):
    batch = x[start:start + batch_size]
    outputs = finetuned.generate(batch, max_length=25, min_length=5, num_beams=2, repetition_penalty=2.5, early_stopping=True)
    decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    all_outputs.append(decoded_outputs)

### Evaluation using Rouge scores

In [None]:
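metric = load_metric('rouge')
flattened_outputs = np.array(all_outputs).flatten()
# The generated summaries are the predictions and the dataset summaries are the references;
# passing them the other way round exchanges precision and recall
values = metric.compute(predictions=flattened_outputs, references=test['summary'][:200], use_stemmer=True)
values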


{'rouge1': AggregateScore(low=Score(precision=0.17150138835944756, recall=0.1795804679410717, fmeasure=0.17052959291414388), mid=Score(precision=0.20015964663990995, recall=0.20856194728688932, fmeasure=0.19835831101658968), high=Score(precision=0.23181019591433705, recall=0.24007987527608282, fmeasure=0.22895065033762882)),
 'rouge2': AggregateScore(low=Score(precision=0.06313107913328499, recall=0.06669814013600775, fmeasure=0.06347914914097093), mid=Score(precision=0.08861251983310806, recall=0.09085042000483179, fmeasure=0.08752917833837603), high=Score(precision=0.11940198628985395, recall=0.11928219582704873, fmeasure=0.11700989114021065)),
 'rougeL': AggregateScore(low=Score(precision=0.15176119511239974, recall=0.15927862152628497, fmeasure=0.15185021753896372), mid=Score(precision=0.17855548813501612, recall=0.18627360368604573, fmeasure=0.17696370772469944), high=Score(precision=0.20673292966761914, recall=0.21383902209653963, fmeasure=0.204193101459072)),
 'rougeLsum': Aggre
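
To make the precision, recall and F-measure fields concrete, here is a simplified ROUGE-1 computation on the first generated summary and the first actual summary printed later in this notebook. It is a sketch only: it splits on whitespace and applies no stemming, so it is not the `rouge_score` implementation behind `load_metric('rouge')`, and `rouge1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 using whitespace tokens (simplified)."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


generated = "decent write super annoy heroine hot sex totally abrupt end"
actual = "decent write problem annoy heroine 's abrupt end might want pass one"

precision, recall, f1 = rouge1(generated, actual)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.6 0.5 0.545

# Exchanging the two arguments exchanges precision and recall; F1 is unchanged
swapped = rouge1(actual, generated)
print(round(swapped[0], 3), round(swapped[1], 3), round(swapped[2], 3))  # 0.5 0.6 0.545
```

Six of the ten generated tokens appear in the twelve-token reference, which gives the precision of 0.6 and the recall of 0.5.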

### Printing the generated summaries

In [None]:
print(flattened_outputs)

['decent write super annoy heroine hot sex totally abrupt end'
 'great device.... good call registry government agency allow add phone number'
 'theres katherine garbera eve gaddy whiskey river book 1 theres sexy attorney ryder for'
 "'s one best record ever make-a great collection austrian music master suitner"
 "heaven's lose property angeloid clockwork '' funimation blu-ray combo set"
 '6d innovative feature rich full frame price range half cost cameras picture quality'
 "blu-ray  inuyasha final act '' base last 21 volumes manga series"
 "... milk human kindness... '' procol harum"
 'eye-fi card shill try promote product negative review vote trash bin company'
 'nicely design case ipad air feature rechargeable bluetooth keyboard removable case'
 'dirty oh sexy new series two terrific author carly phillips erika wilde'
 'fun educational game require bite mental energy game intend fun'
 'astak ultrafast battery charger despite show amazon verify customer product'
 "frank capra's  best

### Printing the actual summaries

In [None]:
print(test['summary'][:200])

["decent write problem annoy heroine 's abrupt end might want pass one", 'well ... far good us use verizon fios single family home dect phone', 'shes determine go away present ryder let go easily ryder ford addi', '4.5 star underrate performance surprise delight throughout -minor vocal issue prevent full 5 star', 'entertain film fan watch previous series full fan-service', 'canon 6d bring modern tech classic shoot technique way keep photographer involve', "eight year hiatus original anime series `` inuyasha '' back first half final act", "`` salty dog '' allow procol harum sail uncharted waters-the salvo edition sound extremely good", 'eye.fi software steal account passwords review vote eye-fi shill read fast', 'excellent case lot nice feature auto sleep/wake function price right', 'author carly phillips erika wilde proud wonderful debut new series really great', 'great fun intelligent well-read creative type much lazy parent people hat english light class', 'astak ultrafast battery ch