# T5 Baseline

The initial exploration uses T5-small as the pre-trained model and fine-tunes it on the ICSI dataset. Once the pipeline works, we will expand the dataset and add a validation set for hyperparameter tuning.

1. Library Loading
2. Dataset Loading
3. Dataset Transformation
4. Training and Test Splitting
5. Fine Tuning
6. Checkpoint Saving
7. Evaluation



## Library Loading

In [None]:
!pip install transformers -q
#!curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
#!python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev



In [None]:
!pip install datasets
!pip install rouge_score
!pip install nlp


Collecting nlp
  Downloading https://files.pythonhosted.org/packages/09/e3/bcdc59f3434b224040c1047769c47b82705feca2b89ebbc28311e3764782/nlp-0.4.0-py3-none-any.whl (1.7MB)
Installing collected packages: nlp
Successfully installed nlp-0.4.0


In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing the T5 modules from huggingface/transformers
# T5ForConditionalGeneration is specific for sequence-to-sequence
from transformers import T5Tokenizer, T5ForConditionalGeneration

from datasets import load_metric
import nlp


In [None]:
# Check which GPU we have access to. This output is from the Google Colab version.
!nvidia-smi

Mon Nov  9 13:17:27 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P8    11W /  70W |     10MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Set up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

## Data Loading

The transformed dataset is loaded from Google Drive.

Because the dataset has only 499 examples, we use a single split: 80% for training and 20% held out for evaluation (called the test set in the code below).

This portion uses the dataset that pairs each extractive summary with its abstractive summary.

In [None]:
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
torch.backends.cudnn.deterministic = True

train_size = 0.8

In [None]:
from google.colab import drive
drive.mount('/content/drive')

#/content/drive/My Drive/W266/data/ICSI_extrac_abstrac_512token.csv

Mounted at /content/drive


In [None]:
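df = pd.read_csv('/content/drive/My Drive/W266/data/ICSI_extrac_abstrac_512token.csv', encoding='latin-1')
# Keep only rows that have an extractive summary, and only the two columns we need.
df = df[df['extractive'].notna()][['abstractive', 'extractive']]
# Prepend T5's pre-defined "summarize: " task prefix.
# NOTE: the prefix is added to the abstractive (target) column here; T5 normally expects it on the source text.
# NOTE: rows whose abstractive summary is NaN are kept (see the output below) and later become the string "nan".
df.abstractive = 'summarize: ' + df.abstractive
print(df.head())
print(len(df))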

                                         abstractive                                         extractive
0                                                NaN  So you 're essentially defining a lattice .  T...
2  summarize: On the one hand, a bespoke XML stru...  I mean , we I sort of already have developed a...
3  summarize: Two main options were discussed as ...  We should look at ATLAS ,  the NIST thing ,  T...
4  summarize: XML standards offer libraries that ...  Um , you would have another structure lower do...
5                                                NaN  I I don't see any way that file formats are go...
499


In [None]:
train_dataset=df.sample(frac=train_size,random_state = SEED)
test_dataset=df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print("FULL Dataset: {}".format(df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

FULL Dataset: (499, 2)
TRAIN Dataset: (399, 2)
TEST Dataset: (100, 2)
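
The 399/100 sizes follow from rounding 80% of 499 rows. The sketch below reproduces that arithmetic in plain Python. It is only an illustration of the idea: `split_indices` is a hypothetical helper, and it uses Python's `random` module rather than pandas' sampler, so the selected rows will not match those of `df.sample`.

```python
import random

def split_indices(n_rows, train_frac=0.8, seed=42):
    """Shuffle row indices with a fixed seed and cut them into train/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Round the requested fraction to a whole number of rows.
    n_train = round(train_frac * n_rows)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, test_idx = split_indices(499)
print(len(train_idx), len(test_idx))  # 399 100
```

Fixing the seed makes the split repeatable, which matters when results from different runs are compared on the same held-out rows.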


## Dataset Transformation

Tokenize the source and target texts, pad or truncate them to a fixed length, and build the attention masks so that every example becomes a set of fixed-size tensors.

Tunable hyperparameters:

* MAX_LEN
* SUMMARY_LEN
* TRAIN_BATCH_SIZE
* TEST_BATCH_SIZE
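
To make the role of `MAX_LEN` and `SUMMARY_LEN` concrete, here is a minimal sketch of fixed-length padding, truncation and attention masking on plain Python lists. `pad_or_truncate` is a hypothetical helper, not the tokenizer's API, and the token ids are made up. It assumes pad id 0 (T5's pad token id) and, unlike the real tokenizer, does not re-append the end-of-sequence token after truncation.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, mask), both of length max_len.

    mask is 1 for real tokens and 0 for padding positions.
    """
    ids = list(token_ids[:max_len])          # truncate if too long
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)               # 0 when the input was truncated
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# A short sequence is padded on the right.
print(pad_or_truncate([37, 1782, 1], 5))
# A long sequence is cut to max_len.
print(pad_or_truncate([5, 6, 7, 8, 9, 1], 4))
```

Because every example ends up with the same length, a batch can be stacked into one tensor, and the mask tells the model which positions to ignore.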


In [None]:
# most code from https://colab.research.google.com/drive/1ypT7oCjtBOTSMJv7J5_1vO7hDYSD_-oU?authuser=2#scrollTo=932p8NhxeNw4

class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = summ_len
        self.abstractive = self.data.abstractive
        self.extractive = self.data.extractive

    def __len__(self):
        return len(self.abstractive)

    def __getitem__(self, index):
        extractive = str(self.extractive[index])
        extractive = ' '.join(extractive.split())

        abstractive = str(self.abstractive[index])
        abstractive = ' '.join(abstractive.split())

        source = self.tokenizer.batch_encode_plus([extractive], max_length= self.source_len, pad_to_max_length=True,return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([abstractive], max_length= self.summ_len, pad_to_max_length=True,return_tensors='pt')
        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()

        return {
            'source_ids': source_ids.to(dtype=torch.long), 
            'source_mask': source_mask.to(dtype=torch.long), 
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }

In [None]:
### Training Dataset and Test Dataset 

# train_dataset (399, 2)
# test_dataset (100, 2)

MAX_LEN = 512
SUMMARY_LEN= 150

# Note: only the t5-small model is used here.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
train_set = CustomDataset(train_dataset, tokenizer, MAX_LEN, SUMMARY_LEN)
test_set = CustomDataset(test_dataset, tokenizer, MAX_LEN, SUMMARY_LEN)


In [None]:
# double checking the result size, only for one point
# https://stackoverflow.com/questions/43627405/understanding-getitem-method
print(train_set[0]['source_ids'].shape)
print(train_set[0]['source_mask'].shape)
print(train_set[0]['target_ids'].shape)
print(train_set[0]['target_ids_y'].shape)

Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


torch.Size([512])
torch.Size([512])
torch.Size([150])
torch.Size([150])




In [None]:
# https://deeplizard.com/learn/video/kWVgvsejXsE#:~:text=The%20num_workers%20attribute%20tells%20the,sequentially%20inside%20the%20main%20process
# num_workers is left at its default of 0, so batches are loaded sequentially inside the main process.
# Each batch is tokenized on demand (in CustomDataset.__getitem__) when the loader requests it.

TRAIN_BATCH_SIZE = 4 
TEST_BATCH_SIZE = 2 

train_params = {
  'batch_size': TRAIN_BATCH_SIZE,
  'shuffle': True,
  'num_workers': 0
  }

test_params = {
  'batch_size': TEST_BATCH_SIZE,
  'shuffle': False,
  'num_workers': 0
  }

training_loader = DataLoader(train_set, **train_params)
test_loader = DataLoader(test_set, **test_params)

## Fine Tuning

Here we use the pre-trained t5-small model directly. The training loop prints the loss every 500 steps, and the checkpoint is saved once, at the end of the notebook.

Tunable parameters:
* T5ForConditionalGeneration or T5
* epoch - train, test
* optimizer - LEARNING_RATE, Adam
* output: num_beams, length_penalty,early_stopping
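
Batch size and dataset size together determine how many optimizer steps an epoch contains, and therefore how often the `step % 500` logging condition in the training loop fires. The sketch below works this out for the sizes used in this notebook. Both helper names are hypothetical, and the calculation assumes the `DataLoader` keeps the last partial batch (its default behaviour).

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(n_examples / batch_size)

def logged_steps(n_steps, log_every=500):
    """Step indices at which `step % log_every == 0` is true."""
    return [step for step in range(n_steps) if step % log_every == 0]

train_steps = steps_per_epoch(399, 4)   # TRAIN_BATCH_SIZE = 4
test_steps = steps_per_epoch(100, 2)    # TEST_BATCH_SIZE = 2
print(train_steps, test_steps)          # 100 50
print(logged_steps(train_steps))        # [0]
```

With only 100 training steps per epoch, the loss is printed once per epoch, at step 0. A smaller logging interval would give a finer view of the loss curve.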




### Training

Training starts from the pre-trained t5-small model, leaves its layer structure unchanged, and fine-tunes the parameters on our dataset.

In [None]:
losslist = []  # per-step training loss, stored as Python floats

def train(epoch, tokenizer, model, device, loader, optimizer):
  # put the model into train mode
  model.train()
  # iterate over the batches of the training DataLoader
  for step, data in enumerate(loader, 0):
      y = data['target_ids'].to(device, dtype = torch.long)
      # https://discuss.pytorch.org/t/contigious-vs-non-contigious-tensor/30107/2
      # decoder inputs are the target without its last token
      y_ids = y[:, :-1].contiguous()
      # labels are the target shifted left by one; padding is set to -100 so the loss ignores it
      lm_labels = y[:, 1:].clone().detach()
      lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
      ids = data['source_ids'].to(device, dtype = torch.long)
      mask = data['source_mask'].to(device, dtype = torch.long)

      # `labels` replaces the deprecated `lm_labels` keyword argument
      outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
      loss = outputs[0]
      # store a float rather than the tensor so the autograd graph is not kept alive
      losslist.append(loss.item())
      if step % 500 == 0:
          print(f'Epoch: {epoch}, Loss:  {loss.item()}')
      
      # https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
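
The target handling inside `train` can be illustrated on a single sequence without PyTorch. The sketch below mirrors the two slices and the padding mask on a plain list. `shift_targets` is a hypothetical helper and the token ids are made up; it assumes T5's pad id 0 and end-of-sequence id 1, and uses -100 because that is the default index ignored by PyTorch's cross-entropy loss.

```python
PAD_ID = 0            # T5 pad token id
IGNORE_INDEX = -100   # positions with this label do not contribute to the loss

def shift_targets(target_ids, pad_id=PAD_ID):
    """Split one padded target sequence into decoder inputs and labels."""
    decoder_input = target_ids[:-1]            # same as y[:, :-1]
    labels = [IGNORE_INDEX if tok == pad_id else tok
              for tok in target_ids[1:]]       # same as y[:, 1:] with pads masked
    return decoder_input, labels

dec_in, labels = shift_targets([21, 22, 23, 1, 0, 0])
print(dec_in)   # [21, 22, 23, 1, 0]
print(labels)   # [22, 23, 1, -100, -100]
```

At each position the decoder sees the tokens up to index `i` and is scored on predicting token `i + 1`; padded positions are excluded from the loss.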

In [None]:
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model = model.to(device)


In [None]:
# pretrained model shape
model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseReluDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Dro

In [None]:
# optimizer 
# https://pytorch.org/docs/stable/optim.html
LEARNING_RATE = 0.01
optimizer = torch.optim.Adam(params = model.parameters(), lr=LEARNING_RATE)

In [None]:
# training epoch
TRAIN_EPOCHS = 2

for epoch in range(TRAIN_EPOCHS):
  train(epoch, tokenizer, model, device, training_loader, optimizer)




Epoch: 0, Loss:  5.473861217498779
Epoch: 1, Loss:  0.000524244096595794


### Test and Evaluation

In [None]:
# https://towardsdatascience.com/fine-tuning-a-t5-transformer-for-any-summarization-task-82334c64c81

def test(epoch, tokenizer, model, device, loader):
  #https://stackoverflow.com/questions/60018578/what-does-model-eval-do-in-pytorch
  model.eval()
  predictions = []
  actuals = []
  #rouge_metric = load_metric('rouge') 
  # https://datascience.stackexchange.com/questions/32651/what-is-the-use-of-torch-no-grad-in-pytorch
  with torch.no_grad():

    for step, data in enumerate(loader, 0):

      y = data['target_ids'].to(device, dtype = torch.long)
      ids = data['source_ids'].to(device, dtype = torch.long)
      mask = data['source_mask'].to(device, dtype = torch.long)

      # generate summaries with beam search
      generated_ids = model.generate(
          input_ids = ids,
          attention_mask = mask, 
          max_length=150, 
          num_beams=2,
          repetition_penalty=2.5, 
          length_penalty=1.0, 
          early_stopping=True
          )
      # decode generated ids and reference target ids back to text
      preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
      target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]
      if step % 100 == 0:
          print(f'Completed {step}')
      predictions.extend(preds)
      actuals.extend(target)
      #print(preds)
      #print(target)
      #rouge_metric.add(preds, target)
      
    #rouge_results = rouge_metric.compute(rouge_types=["rouge2"]) 
  return predictions, actuals

In [None]:
# Save the generated and reference summaries as <MODEL_NAME>.csv
MODEL_NAME = "T5CG_01"

# https://github.com/huggingface/datasets/issues/216
# TEST epoch
TEST_EPOCHS = 1
for epoch in range(TEST_EPOCHS):
    predictions, actuals = test(epoch, tokenizer, model, device, test_loader)
    # rouge_dict = {k: round(v.mid.fmeasure * 100, 4) for k, v in rouge_results.items()}
    # print(rouge_dict)
    # NOTE: `actuals` holds the decoded target_ids, i.e. the reference abstractive summaries,
    # even though the column is named Original_Extractive_Summary.
    final_df = pd.DataFrame({'Generated_Abstractive_Summary':predictions,
                             'Original_Extractive_Summary':actuals})
    final_df.to_csv('/content/drive/My Drive/W266/results/'+MODEL_NAME + '.csv')
    print('done testing')




Completed 0
done testing


In [None]:
# ROUGE Evaluation
# https://github.com/huggingface/datasets/issues/216
rouge = nlp.load_metric('rouge')
for actual, pred in zip(final_df.Original_Extractive_Summary,final_df.Generated_Abstractive_Summary):
  rouge.add(pred, actual)
score = rouge.compute(rouge_types=["rouge2"])


In [None]:
score

{'rouge2': AggregateScore(low=Score(precision=0.031859375000000016, recall=0.009151836904401259, fmeasure=0.013314036024703584), mid=Score(precision=0.046875, recall=0.014032077148693088, fmeasure=0.01953519606831055), high=Score(precision=0.06375, recall=0.019930392459915725, fmeasure=0.027099227637351116))}
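
For intuition about what the ROUGE-2 numbers measure, the following is a simplified sketch of bigram-overlap precision, recall and F-measure for a single prediction/reference pair. It is not the `rouge_score` implementation: the `bigrams` and `rouge2` helpers are hypothetical, tokenization is a plain lowercase whitespace split, and the aggregation into `low`, `mid` and `high` scores is not reproduced.

```python
from collections import Counter

def bigrams(text):
    """Count the adjacent word pairs in a lowercase, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2(prediction, reference):
    """Return (precision, recall, f-measure) of the bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

p, r, f = rouge2("the cat sat on the mat", "the cat lay on the mat")
print(round(p, 3), round(r, 3), round(f, 3))  # 0.6 0.6 0.6
```

Three of the five bigrams in each sentence are shared, so precision and recall are both 0.6. Read this way, a mid F-measure of about 0.0195 means that very few generated bigrams also occur in the reference summaries.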

#### Checkpoint 

The model is saved as a checkpoint to Google Drive together with the related tunable parameters.

Remember to set CP_NAME (which defaults to MODEL_NAME) to a new name before each run, so that an earlier `.pt` file is not overwritten.
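
One way to avoid overwriting an earlier checkpoint is to derive the next free name from the files already present. The sketch below is a possible approach and not part of the original notebook: `next_checkpoint_name` is a hypothetical helper, and it assumes checkpoints are named with a fixed prefix plus a number, as in `T5CG_01.pt`.

```python
import re
import tempfile
from pathlib import Path

def next_checkpoint_name(directory, prefix="T5CG_"):
    """Return the prefix plus the next unused two-digit index, e.g. 'T5CG_02'."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)\.pt$")
    used = []
    for path in Path(directory).glob(prefix + "*.pt"):
        match = pattern.match(path.name)
        if match:
            used.append(int(match.group(1)))
    # Start at 01 when no numbered checkpoint exists yet.
    return f"{prefix}{max(used, default=0) + 1:02d}"

# Demonstration in a temporary directory standing in for the checkpoint folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "T5CG_01.pt").touch()
    print(next_checkpoint_name(tmp))  # T5CG_02
```

The returned string could be assigned to `MODEL_NAME` before the evaluation and checkpoint cells are run.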

In [None]:
# https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html
# Checkpoint Saving
CP_NAME = MODEL_NAME

CP_TRAIN_EPOCHS = TRAIN_EPOCHS
CP_TEST_EPOCHS = TEST_EPOCHS
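# Reuse the values defined earlier so the saved metadata matches the actual run.
CP_LEARNING_RATE = LEARNING_RATE
CP_PATH = "/content/drive/My Drive/W266/checkpoints/" + CP_NAME + ".pt"
CP_MAX_LEN = MAX_LEN
CP_SUMMARY_LEN = SUMMARY_LEN
CP_TRAIN_BATCH_SIZE = TRAIN_BATCH_SIZE
CP_TEST_BATCH_SIZE = TEST_BATCH_SIZE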
CP_MODEL = 'T5ForConditionalGeneration'
CP_OPTIMIZER_OPTION = 'Adam'
CP_LOSSLIST = losslist
CP_TEST_OPTIONS = {
    "num_beams":          2,
    "repetition_penalty": 2.5, 
    "length_penalty":     1.0, 
    "early_stopping":     True
}


torch.save({
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'train_epoch': CP_TRAIN_EPOCHS,
            'test_epoch': CP_TEST_EPOCHS,
            'learning_rate': CP_LEARNING_RATE,
            'max_source_length':CP_MAX_LEN,
            'max_target_length':CP_SUMMARY_LEN,
            'train_batch_size':CP_TRAIN_BATCH_SIZE,
            'test_batch_size':CP_TEST_BATCH_SIZE,
            'model_option':CP_MODEL,
            'optimizer_option':CP_OPTIMIZER_OPTION,
            'losslist': CP_LOSSLIST,
            'test_option': CP_TEST_OPTIONS
            }, CP_PATH)

In [None]:
# Checkpoint loading: uncomment the lines below to restore a saved run.
# checkpoint = torch.load(CP_PATH)
# model.load_state_dict(checkpoint['model_state_dict'])
# optimizer.load_state_dict(checkpoint['optimizer_state_dict'])



# train_epoch = checkpoint['train_epoch']
# test_epoch = checkpoint['test_epoch']
# losslist = checkpoint['losslist']
# learning_rate = checkpoint['learning_rate']
# max_source_length = checkpoint['max_source_length']
# max_target_length = checkpoint['max_target_length']
# train_batch_size = checkpoint['train_batch_size']
# test_batch_size = checkpoint['test_batch_size']
# optimizer_option = checkpoint['optimizer_option']
# test_option = checkpoint['test_option']