## Abstractive Summarization

In this notebook we load a pretrained BART network with more than 200M parameters. A network of this size would be impossible for me to train from scratch.

The checkpoint was already trained for summarization (the `cnn` in `sshleifer/distilbart-cnn-12-3` points to CNN/DailyMail rather than XSum, the extreme summarization dataset), and our objective is to fine-tune it. Unfortunately, the fine-tuning part does not work properly. I'd guess for one of these two reasons:

- the model is so big that even fine-tuning it requires more computing power than we have access to;
- it was trained with a totally different loss from the one used by Hugging Face, so fine-tuning does not work.

In [None]:
%%capture
!pip install transformers
!pip install datasets
!pip install rouge_score

In [None]:
from tqdm import tqdm
import gc

import torch, torch.nn as nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

from datasets import load_dataset, load_metric
from transformers import BartForConditionalGeneration, BartTokenizer

In [None]:
encoder_max_length = 400
decoder_max_length = 75
batch_size         = 8
model_name         = 'sshleifer/distilbart-cnn-12-3'

In [None]:
%%capture

# just to download stuff at the beginning, hiding the logs after
BartForConditionalGeneration.from_pretrained(model_name)
load_dataset("cnn_dailymail", "3.0.0")
BartTokenizer.from_pretrained(model_name)

In [None]:
"""
Uses https://huggingface.co/datasets/viewer/?dataset=cnn_dailymail and a tokenizer
to turn articles and highlights into the token-ID tensors fed to the network
"""

class CNNDataset(Dataset):
  def __init__(self, mode='train', n_articles=10000):
    super().__init__()
    raw_data  = load_dataset("cnn_dailymail", "3.0.0", split=f"{mode}[:{n_articles}]")  # download dataset from huggingface
    self.tokenizer = BartTokenizer.from_pretrained(model_name)                          # load tokenizer to turn text into token IDs
    self.data = self.preprocess_data(raw_data)                                          # tokenize articles and highlights

  def preprocess_data(self, raw_data):
    pad_id = self.tokenizer.pad_token_id
    data = []
    for example in raw_data:
      # tokenize the article (encoder input), padded/truncated to a fixed length
      article = self.tokenizer(example['article'], padding="max_length", truncation=True, max_length=encoder_max_length, return_tensors="pt")
      # tokenize the reference summary (decoder side)
      with self.tokenizer.as_target_tokenizer():
        summary = self.tokenizer(example['highlights'], padding="max_length", truncation=True, max_length=decoder_max_length, return_tensors="pt")
      # labels: summary token IDs with <pad> replaced by -100, which the loss ignores
      labels = summary.input_ids[0].clone()
      labels[labels == pad_id] = -100
      data.append( {
          'input_ids':article.input_ids[0],           # article token IDs
          'decoder_input_ids':summary.input_ids[0],   # summary token IDs, as tokenized (not shifted right)
          'labels':labels,                            # summary token IDs (<pad> excluded from loss)
          'attention_mask':article.attention_mask[0],
          'decoder_attention_mask':summary.attention_mask[0]
      } )
    return data
  
  def __len__(self):
    return len(self.data)
  
  def __getitem__(self, idx):
    return self.data[idx]
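
The dataset above stores the same summary token IDs under both `decoder_input_ids` and, with padding masked, `labels`. For teacher forcing, a sequence-to-sequence decoder is usually fed the labels shifted one position to the right, so that position t sees the tokens before t and predicts token t. As far as I know, `BartForConditionalGeneration` builds this shifted input from `labels` on its own when `decoder_input_ids` is not passed, which is worth checking against the installed `transformers` version. The sketch below shows the masking and the shift on plain Python lists. The helper names are hypothetical, and the token IDs (`pad=1`, `bos=0`, `eos=2`, decoder start equal to `eos`) are assumed BART-style values used only for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy by default


def mask_padding(token_ids, pad_token_id):
    """Replace padding positions with IGNORE_INDEX so they do not count in the loss."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in token_ids]


def shift_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs: start token first, then the labels delayed by one step."""
    shifted = [decoder_start_token_id] + labels[:-1]
    # masked label positions become ordinary padding again on the input side
    return [pad_token_id if t == IGNORE_INDEX else t for t in shifted]


# Assumed BART-style special token IDs (illustrative, not read from the tokenizer)
PAD, BOS, EOS = 1, 0, 2

summary_ids = [BOS, 100, 200, EOS, PAD, PAD]   # a padded, tokenized summary
labels = mask_padding(summary_ids, PAD)
decoder_inputs = shift_right(labels, PAD, decoder_start_token_id=EOS)

print(labels)          # [0, 100, 200, 2, -100, -100]
print(decoder_inputs)  # [2, 0, 100, 200, 2, 1]
```

Compared with `decoder_inputs`, the unshifted `summary_ids` already contain at position t the token the model is asked to predict at position t, which is not the alignment a pretrained decoder was trained on.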


In [None]:
"""
Instantiate model, dataloader & optimizer
"""

# pretrained summarization checkpoint (see model_name), moved to the GPU
model = BartForConditionalGeneration.from_pretrained(model_name).cuda()

# data
train_data = CNNDataset(mode='train', n_articles=1000)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, pin_memory=True)

# optimizer
params = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "LayerNorm.weight"])],
        "weight_decay": 1e-8,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in ["bias", "LayerNorm.weight"])],
        "weight_decay": 0.0,
    },
]
optim = torch.optim.AdamW(params, lr=3e-8)

Reusing dataset cnn_dailymail (/root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234)
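
The optimizer cell splits parameters into a group with weight decay and a group without, by matching substrings of the parameter names. The sketch below reproduces that name filter on a few hand-written names so its effect can be inspected. The names are assumptions written in the style of BART modules, not read from the model, so the real list from `model.named_parameters()` should be checked before drawing conclusions. The helper name and the second pattern tuple are hypothetical.

```python
def split_decay_groups(param_names, no_decay=("bias", "LayerNorm.weight")):
    """Split names into (decayed, not decayed) with the same substring test as the cell above."""
    decayed, skipped = [], []
    for name in param_names:
        if any(pattern in name for pattern in no_decay):
            skipped.append(name)
        else:
            decayed.append(name)
    return decayed, skipped


# Hypothetical parameter names in the style of BART modules (assumed for illustration)
names = [
    "model.encoder.layers.0.self_attn.q_proj.weight",
    "model.encoder.layers.0.self_attn.q_proj.bias",
    "model.encoder.layers.0.self_attn_layer_norm.weight",
    "model.encoder.layers.0.final_layer_norm.bias",
]

# With the patterns used in the notebook, the layer-norm weight still gets decayed,
# because "LayerNorm.weight" is not a substring of "self_attn_layer_norm.weight".
decayed, skipped = split_decay_groups(names)
print(decayed)

# An alternative, assumed pattern list that matches snake_case layer-norm names
decayed_alt, skipped_alt = split_decay_groups(
    names, no_decay=("bias", "layer_norm.weight", "layernorm_embedding.weight")
)
print(decayed_alt)
```

With a weight decay of `1e-8` the difference between the two groupings is negligible in practice; the filter matters more if the decay is raised to a more typical value.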


## Finetuning not working

I think the model was pretrained with a particular loss, which is why fine-tuning does not work properly (or we simply lack the computing power).
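
Before settling on either explanation, it may help to estimate how far the weights can move at all with the settings used here. Adam-style optimizers normalize the gradient, so each parameter changes by roughly the learning rate per step. With `lr=3e-8` and 125 steps (1000 articles, batch size 8) that is on the order of `125 * 3e-8 = 3.75e-6` per parameter. The sketch below checks this arithmetic with a plain-Python Adam on a single scalar. It omits weight decay, the helper name and the toy objective are hypothetical, and the comparison value `3e-5` is only an assumed, commonly used fine-tuning learning rate.

```python
import math


def adam_total_movement(grad_fn, w0, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run plain Adam (no weight decay) on one scalar and return how far it moved."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return abs(w - w0)


def grad(w):
    """Gradient of the toy objective (w - 1)^2."""
    return 2.0 * (w - 1.0)


moved_small = adam_total_movement(grad, w0=0.0, lr=3e-8, steps=125)
moved_typical = adam_total_movement(grad, w0=0.0, lr=3e-5, steps=125)

print(f"lr=3e-8: moved {moved_small:.3e}")    # close to 125 * 3e-8 = 3.75e-06
print(f"lr=3e-5: moved {moved_typical:.3e}")  # roughly 1000 times larger
```

If the real model behaves similarly, one epoch at `lr=3e-8` leaves the weights almost exactly where they started, which would make the loss look flat regardless of the loss function or the available compute.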

In [None]:
# finetuning not working

model.train()
for epoch in range(1):
  tot_loss = []
  for data in tqdm(train_loader, postfix=True):

    # move to GPU
    for k in data:
      data[k] = data[k].cuda()

    # forward
    loss = model(**data)[0]

    # backward
    loss.backward()
    optim.step()
    optim.zero_grad()
    tot_loss.append(loss.item())

    # print the running mean loss, scaled by 1000, every 100 batches
    if len(tot_loss) % 100 == 0:
      print(f'loss e:{epoch} = {1000*sum(tot_loss)/len(tot_loss):.5f}')
  print(f'FINAL loss e:{epoch} = {1000*sum(tot_loss)/len(tot_loss):.5f}')
  tot_loss=[]

 80%|████████  | 100/125 [03:34<00:53,  2.16s/itTrue]

loss e:0 = 9500.44662


100%|██████████| 125/125 [04:28<00:00,  2.15s/itTrue]

FINAL loss e:0 = 9493.32954
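
The log prints the running mean loss multiplied by 1000, so 9500 corresponds to about 9.5 per token, assuming the value returned by the model is the usual mean token-level cross-entropy that skips labels equal to -100. For scale, a model guessing uniformly over a 50,000-token vocabulary (a round number assumed for illustration, not the exact BART vocabulary size) would score ln(50000), about 10.8. The sketch below implements that loss on plain Python lists to make the comparison concrete; the function name is hypothetical.

```python
import math


def cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded position: contributes nothing to the loss
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))  # stable log-sum-exp
        total += log_norm - row[label]
        count += 1
    return total / count


VOCAB = 50_000  # assumed round vocabulary size, for illustration only

# Two positions with uniform logits; the second one is padding and is skipped
uniform_loss = cross_entropy([[0.0] * VOCAB, [0.0] * VOCAB], [7, -100])
print(f"uniform baseline: {uniform_loss:.2f}")    # about 10.82, i.e. ln(50000)

observed = 9500.44662 / 1000                      # value from the training log
print(f"observed loss:    {observed:.2f}")
print(f"perplexity:       {math.exp(observed):.0f}")
```

A loss this close to the uniform baseline is unusual for a checkpoint that already produces reasonable summaries, so it may point to how the decoder inputs and labels are aligned rather than to the loss function itself.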





In [None]:
test_data = CNNDataset(mode='test', n_articles=1000)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, shuffle=True, pin_memory=True)

predicted, real = [], []

model.eval()
for batch in test_loader:

  # generate summary token IDs with BART (beam search)
  summary_ids = model.generate(
    batch['input_ids'].cuda(),
    num_beams=4,
    length_penalty=2.0,
    no_repeat_ngram_size=3
  )

  # decode token IDs back into text
  for sum_ids, tgt_ids in zip(summary_ids.cpu(), batch['decoder_input_ids']):
    predicted_summary = test_data.tokenizer.decode(sum_ids, skip_special_tokens=True)
    real_summary      = test_data.tokenizer.decode(tgt_ids, skip_special_tokens=True)

    predicted.append(predicted_summary)
    real.append(real_summary)

rouge = load_metric('rouge')
rouge.compute(predictions=predicted, references=real)

Reusing dataset cnn_dailymail (/root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234)
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)



{'rouge1': AggregateScore(low=Score(precision=0.2654222360550931, recall=0.41124030404098394, fmeasure=0.3163846305063779), mid=Score(precision=0.2728709818003307, recall=0.4211998226263163, fmeasure=0.3242209863000375), high=Score(precision=0.2796333294746038, recall=0.43031998802158794, fmeasure=0.33140917185333973)),
 'rouge2': AggregateScore(low=Score(precision=0.10641092098232009, recall=0.1667457545762694, fmeasure=0.12731087993487006), mid=Score(precision=0.1131658947297715, recall=0.17601675065083458, fmeasure=0.13469246299019255), high=Score(precision=0.12015714724502735, recall=0.18563892651058667, fmeasure=0.1424238803959585)),
 'rougeL': AggregateScore(low=Score(precision=0.18915586289745528, recall=0.29602743160142897, fmeasure=0.2263854091598036), mid=Score(precision=0.19613087356038905, recall=0.30535075925856336, fmeasure=0.23380691450104535), high=Score(precision=0.2034271587078356, recall=0.3150622048308093, fmeasure=0.2416123002715498)),
 'rougeLsum': AggregateScore(
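
To read these numbers: ROUGE-1 counts overlapping unigrams between the generated and the reference summary. Precision divides the overlap by the length of the generated text, recall divides it by the length of the reference. The sketch below is a simplified re-implementation for a single pair, assuming lowercase alphanumeric tokenization. It leaves out the stemming option and the bootstrap aggregation behind the low/mid/high values that the `rouge_score` package reports, so its numbers will not match the library exactly.

```python
import re
from collections import Counter


def rouge1(prediction, reference):
    """Return (precision, recall, F1) of clipped unigram overlap for one pair."""
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count per token (clipped overlap)
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


reference = "the cat sat on the mat"
prediction = "the cat sat on the mat near the door today"

p, r, f = rouge1(prediction, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")  # precision=0.60 recall=1.00 f1=0.75
```

In the toy pair the prediction covers the whole reference but adds extra words, so recall exceeds precision. The same pattern in the scores above (recall around 0.42, precision around 0.27 for ROUGE-1) would be consistent with generated summaries that are longer than the reference highlights.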


In [None]:
## print some examples

for i in range(10,12):
  print(f'REAL SUMMARY:\n{real[i]}\n\nGENERATED:\n{predicted[i]}\n\n', '_'*100)

REAL SUMMARY:
U.S. military doesn't have further information to evaluate the Iraqi media reports.
Al-Douri's body arrives in Baghdad where DNA samples are taken.
Izzat Ibrahim al-Douri was the highest-ranking member of Iraqi President Saddam Hussein's regime to evade capture.

GENERATED:
 Izzat Ibrahim al-Douri was the "King of Clubs" in a deck of playing cards used by U.S. troops to identify the most-wanted regime officials. He was killed in an operation by Iraqi security forces and Shia militia members in the Hamrin Mountains.

 ____________________________________________________________________________________________________
REAL SUMMARY:
Cory Booker: The unfortunate reality is that the United States leads the world in incarceration, not education.
At the same time, we are losing the increasingly important race to educate our citizens.

GENERATED:
 The United States is home to 25% of the world's prison population. Instead of empowering the next generation of American artists, scie