<a href="https://colab.research.google.com/github/sanazbahargam/Fine_Tuning_T5_for_Summary_Generation/blob/main/Fine_Tuning_T5_for_Summary_Generation_with_PyTorch_Lightning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine Tuning T5 for Summary Generation, with PyTorch Lightning




[My blog posts](https://sanazbahargam.github.io/year-archive/)

# Resources:
*   Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, [paper](https://arxiv.org/abs/1910.10683) 
*   [T5 Implementation on PyTorch](https://github.com/huggingface/transformers/blob/455c6390938a5c737fa63e78396cedae41e4e87e/src/transformers/modeling_t5.py) by HuggingFace
*  [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)
*  [Optuna](https://optuna.org/): An open source hyperparameter optimization framework to automate hyperparameter search
* [ROUGE Score](https://pypi.org/project/rouge-score/)




# T5 Overview
T5 was introduced in the paper [_Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer_](https://arxiv.org/abs/1910.10683). In that paper, the authors give a comprehensive picture of how they pre-trained a standard text-to-text Transformer model on a large text corpus, achieving state-of-the-art results on many NLP tasks after fine-tuning.

They pre-trained T5 on a mixture of supervised and unsupervised tasks, with the majority of the data coming from an unlabeled dataset they developed called [C4](https://www.tensorflow.org/datasets/catalog/c4). C4 is based on a massive scrape of the web produced by [Common Crawl](https://commoncrawl.org). Loosely speaking, pre-training on C4 ideally gives T5 an understanding of natural language in addition to general world knowledge.


##  A Shared Text-To-Text Framework

With T5, the authors propose reframing all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. This text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). T5 can even be applied to regression tasks by training it to predict the string representation of a number instead of the number itself ([source](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)).

<img src="https://1.bp.blogspot.com/-o4oiOExxq1s/Xk26XPC3haI/AAAAAAAAFU8/NBlvOWB84L0PTYy9TzZBaLf6fwPGJTR0QCLcBGAsYHQ/s1600/image3.gif" width="700" height="300" />

<font color="grey">Diagram of our text-to-text framework. Every task we consider uses text as input to the model, which is trained to generate some target text. This allows us to use the same model, loss function, and hyperparameters across our diverse set of tasks including translation (green), linguistic acceptability (red), sentence similarity (yellow), and **document summarization (blue)**. </font> 
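
To make the framing concrete, here is a minimal sketch of how examples from different tasks can be cast into plain input strings. The prefixes follow the ones shown in the T5 paper. The function names `to_text_to_text` and `regression_target` and the task keys are hypothetical, chosen only for this illustration, and rounding regression targets to the nearest 0.2 mirrors how the paper describes handling STS-B.

```python
def to_text_to_text(task, **fields):
    """Cast one example into a single input string (task keys are hypothetical)."""
    if task == "summarization":
        return "summarize: " + fields["document"]
    if task == "translation_en_de":
        return "translate English to German: " + fields["sentence"]
    if task == "cola":
        return "cola sentence: " + fields["sentence"]
    if task == "stsb":
        return "stsb sentence1: {} sentence2: {}".format(
            fields["sentence1"], fields["sentence2"])
    raise ValueError(f"unknown task: {task}")


def regression_target(score):
    """Turn a similarity score into a string target, rounded to the nearest 0.2."""
    return f"{round(score * 5) / 5:.1f}"


print(to_text_to_text("summarization", document="T5 is a text-to-text model."))
print(to_text_to_text("stsb", sentence1="A man is singing.", sentence2="A man sings."))
print(regression_target(3.79))
```

The fine-tuning code in this notebook passes the article text to the model as is. Prepending `summarize: ` would be an optional variation, not something the notebook does.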

## Installation
Install the required packages. Here is a brief description of each package:
*  Optuna: An open source hyperparameter optimization framework to automate hyperparameter search
*  pytorch_lightning: An open-source Python library providing a lightweight PyTorch wrapper for high-performance AI research; it aims to scale your models, not the boilerplate.
*  Transformers: Provides thousands of pretrained models for a variety of tasks. Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow.
  

In [None]:
!pip uninstall tensorflow-tensorboard -q
!pip install --upgrade tensorflow -q
!pip install optuna -q
!pip install pytorch_lightning -q
!pip install rouge-score -q
!pip install transformers -q

# Code for TPU packages install
# !curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
# !python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev


Importing stock libraries

In [None]:
import argparse
from argparse import ArgumentParser
from os.path import join, isfile
from os import listdir
import optuna
from optuna.integration import PyTorchLightningPruningCallback
import pandas as pd
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger
from rouge_score import rouge_scorer
import shutil
import torch
from torch.utils.data import TensorDataset, random_split
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import get_linear_schedule_with_warmup, AdamW
# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

Let's see the GPU we get from Colab


In [None]:
# Checking out the GPU we have access to. This output is from the Google Colab version.
!nvidia-smi

Mon Oct  5 14:54:20 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8     7W /  75W |     10MiB /  7611MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
class MetricsCallback(pl.Callback):
    def __init__(self):
        super().__init__()
        self.metrics = []

    def on_validation_end(self, trainer, pl_module):
        self.metrics.append(trainer.callback_metrics)

In [None]:
class T5Finetuner(pl.LightningModule):

    def __init__(self, args, df):
        super().__init__()
        self.save_hyperparameters()
        self.args = args
        self.model = T5ForConditionalGeneration.from_pretrained(self.args.model)
        self.tokenizer = T5Tokenizer.from_pretrained(self.args.model)
        self.data = df
        self.scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    def encode_text(self, context, text):
        # collapse runs of whitespace in the article and in the reference summary
        ctext = ' '.join(str(context).split())
        text = ' '.join(str(text).split())  # summarized text
        source = self.tokenizer.batch_encode_plus([ctext],
                                                  max_length=self.args.source_len,
                                                  truncation=True,
                                                  padding='max_length',
                                                  return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([text],
                                                  max_length=self.args.summ_len,
                                                  truncation=True,
                                                  padding='max_length',
                                                  return_tensors='pt')
        y = target['input_ids']
        # the decoder input is the target without its last token,
        # the labels are the target without its first token
        target_id = y[:, :-1].contiguous()
        target_label = y[:, 1:].clone().detach()
        # padding positions get the label -100 so that the loss ignores them
        # (with an empty `text`, almost every label is therefore ignored)
        target_label[y[:, 1:] == self.tokenizer.pad_token_id] = -100
        return source['input_ids'], source['attention_mask'], target_id, target_label
    
    def prepare_data(self):
        source_ids, source_masks, target_ids, target_labels = [], [], [], [] 
        for _, row in self.data.iterrows():
            source_id, source_mask, target_id, target_label = self.encode_text(row.ctext, row.text)
            source_ids.append(source_id)
            source_masks.append(source_mask)
            target_ids.append(target_id)
            target_labels.append(target_label)

        # Convert the lists into tensors
        source_ids = torch.cat(source_ids, dim=0)
        source_masks = torch.cat(source_masks, dim=0)
        target_ids = torch.cat(target_ids, dim=0)
        target_labels = torch.cat(target_labels, dim=0)
        # splitting the data to train, validation, and test
        data = TensorDataset(source_ids, source_masks, target_ids, target_labels)
        train_size, val_size = int(0.8 * len(data)), int(0.1 * len(data))
        test_size = len(data) - (train_size + val_size)
        self.train_dat, self.val_dat, self.test_dat = \
            random_split(data, [train_size, val_size, test_size])
    
    def forward(self, batch, batch_idx):
        source_ids, source_mask, target_ids, target_labels = batch[:4]
        return self.model(input_ids = source_ids, attention_mask = source_mask, 
                          decoder_input_ids=target_ids, labels=target_labels)
        
    def training_step(self, batch, batch_idx):
        loss = self(batch, batch_idx)[0]
        return {'loss': loss, 'log': {'train_loss': loss}}

    def validation_step(self, batch, batch_idx):
        loss = self(batch, batch_idx)[0]
        return {'loss': loss}

    def validation_epoch_end(self, outputs):
        loss = sum([o['loss'] for o in outputs]) / len(outputs)
        out = {'val_loss': loss}
        return {**out, 'log': out}

    def test_step(self, batch, batch_idx):
        loss = self(batch, batch_idx)[0]
        return {'loss': loss}

    def test_epoch_end(self, outputs):
        loss = sum([o['loss'] for o in outputs]) / len(outputs)
        out = {'test_loss': loss}
        return {**out, 'log': out}
    
    def train_dataloader(self):
        return DataLoader(self.train_dat, batch_size=self.args.bs,
                          num_workers=4, sampler=RandomSampler(self.train_dat))

    def val_dataloader(self):
        return DataLoader(self.val_dat, batch_size=self.args.bs, num_workers=4,
                          sampler=SequentialSampler(self.val_dat))

    def test_dataloader(self):
        return DataLoader(self.test_dat, batch_size=self.args.bs, num_workers=4,
                          sampler=SequentialSampler(self.test_dat))    

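    def configure_optimizers(self):
        optimizer = AdamW(self.model.parameters(), lr=self.args.lr, eps=1e-4)
        # the scheduler counts optimizer steps (batches), not individual samples
        steps_per_epoch = -(-len(self.train_dat) // self.args.bs)  # ceiling division
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=0,
            num_training_steps=self.args.max_epochs * steps_per_epoch)
        # 'interval': 'step' asks Lightning to advance the schedule after every batch
        return {'optimizer': optimizer,
                'lr_scheduler': {'scheduler': scheduler, 'interval': 'step'}}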
    
    def generate_summary(self, ctext, summ_len=150, text='', beam_search=2, repetition_penalty=2.5):
        source_id, source_mask, target_id, target_label = self.encode_text(ctext, text)
        self.model.eval()
        with torch.no_grad():
            generated_ids = self.model.generate(
                input_ids=source_id,
                attention_mask=source_mask,
                max_length=summ_len,
                # no `truncation` argument here: it is a tokenizer option,
                # and encode_text has already truncated the input
                num_beams=beam_search,
                repetition_penalty=repetition_penalty,
                length_penalty=1.0,
                early_stopping=True
                )
            prediction = [self.tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
        if len(text) > 0:
            target = [self.tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in target_id]
            scores = self.scorer.score(target[0], prediction[0])
            return prediction, scores
        else:
            return prediction
        

    def save_core_model(self):
        store_path = join(self.args.output, self.args.name, 'core')
        self.model.save_pretrained(store_path)
        self.tokenizer.save_pretrained(store_path)
        
    @staticmethod
    def add_model_specific_args(parent_parser):
        p = ArgumentParser(parents=[parent_parser], add_help=False)
        p.add_argument('-m', '--model', type=str, default='t5-base',
                       help='name of the model or the path pointing to it')
        p.add_argument('--bs', '--batch_size', type=int, default=2)
        p.add_argument('--source_len', type=int, default=512)
        p.add_argument('--summ_len', type=int, default=150)
        return p
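
The target handling in `encode_text` can be hard to picture from tensor slicing alone. The sketch below reproduces the same shift-and-mask step on a plain Python list. The token ids are made up, the helper name `shift_and_mask` is hypothetical, and it assumes T5's convention that id 0 is `<pad>` and id 1 is `</s>`.

```python
PAD_ID = 0            # assumption: T5's tokenizer uses id 0 for <pad>
IGNORE_INDEX = -100   # labels with this value are skipped by PyTorch's cross-entropy loss


def shift_and_mask(target_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Split a padded target into decoder inputs and loss labels."""
    # decoder input: everything except the last token
    decoder_input_ids = target_ids[:-1]
    # labels: everything except the first token, with padding masked out
    labels = [ignore_index if tok == pad_id else tok for tok in target_ids[1:]]
    return decoder_input_ids, labels


# made-up ids: three content tokens, </s> (id 1), then two padding tokens
padded_target = [37, 1712, 19, 1, 0, 0]
decoder_input_ids, labels = shift_and_mask(padded_target)
print(decoder_input_ids)  # [37, 1712, 19, 1, 0]
print(labels)             # [1712, 19, 1, -100, -100]
```

At each position the decoder receives one token and is scored on predicting the next one, and the padded tail contributes nothing to the loss.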

In [None]:
def parse_arguments():
    p = ArgumentParser()
    p.add_argument('-p', '--path', type=str,  
                   default='/content/gdrive/My Drive/Colab Notebooks/data/text_summarization_t5/news_summary.csv',
                  help='path to the data file')
    p.add_argument('-o', '--output', type=str, default='/tmp/tpu-template',
                  help='path to the output directory for storing the model')
    p.add_argument('-n', '--name', type=str, default='t5-base',
                  help='this name will be used on tensorboard for the model')
    p.add_argument('-t', '--trials', type=int, default=1,
                  help='number of trials for hyperparameter search')
    p.add_argument('--seed', type=int, default=0, help='randomization seed')
    p = T5Finetuner.add_model_specific_args(p)
    p = pl.Trainer.add_argparse_args(p)
    args,_ = p.parse_known_args()
    args.max_epochs = 2
    return args

def optuna_objective(trial, args):
    # sampling the hyperparameters
    args.lr = trial.suggest_categorical("lr", [1e-6, 5e-6, 1e-5, 5e-5, 1e-4])
    # setting up the right callbacks
    cp_callback = pl.callbacks.ModelCheckpoint(
        join(args.output, args.name, f"trial_{trial.number}", "{epoch}"),
        monitor="val_loss", mode="min")
    pr_callback = PyTorchLightningPruningCallback(trial, monitor="val_loss")
    metrics_callback = MetricsCallback()
    df = pd.read_csv(args.path, engine='python')
    summarizer = T5Finetuner(args, df)         # loading the model
    trainer = pl.Trainer.from_argparse_args(      # loading the trainer
        args, gpus=(1 if torch.cuda.is_available() else 0),
        default_root_dir=args.output, gradient_clip_val=1.0,
        checkpoint_callback=cp_callback, callbacks=[metrics_callback],
        early_stop_callback=pr_callback, num_sanity_val_steps=-1,
        # select TensorBoard or Wandb logger
        logger=TensorBoardLogger(join(args.output, 'logs'), name=args.name, version=f'trial_{trial.number}')
        )
  
    trainer.fit(summarizer)                       # fitting the model
    trainer.test(summarizer)                      # testing the model
    return min([x['val_loss'].item() for x in metrics_callback.metrics])
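
The learning rate sampled by Optuna is only the starting value: `get_linear_schedule_with_warmup` multiplies it by a factor that ramps up during warmup and then decays linearly to zero. Below is a rough sketch of that factor, written as plain Python rather than the library's own code. `linear_schedule_factor` and `optimizer_steps` are hypothetical helper names, and the dataset size in the usage lines is invented.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps, assuming one step per batch."""
    return epochs * math.ceil(num_examples / batch_size)


def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


total = optimizer_steps(num_examples=1000, batch_size=2, epochs=2)  # invented dataset size
for step in (0, total // 2, total):
    print(step, 5e-6 * linear_schedule_factor(step, total))
```

The schedule is defined in optimizer steps, so the total passed to it should count batches rather than examples.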

In [None]:
def main():
    import glob
    import os
    from google.colab import drive

    drive.mount('/content/gdrive')
    # Setting up the device for GPU usage
    from torch import cuda
    device = 'cuda' if cuda.is_available() else 'cpu'

    # Preparing for TPU usage: if you have access to a TPU, uncomment these lines
    # import torch_xla
    # import torch_xla.core.xla_model as xm
    # device = xm.xla_device() 

    args = parse_arguments()                  # parsing the input arguments
    shutil.rmtree(join(args.output, args.name), ignore_errors=True)
    shutil.rmtree(join(args.output, 'logs', args.name), ignore_errors=True)
    pl.seed_everything(args.seed)             # making it reproducible

    # creating a study for hyperparameter search
    pruner = optuna.pruners.MedianPruner()
    # the objective returns the lowest validation loss, so the study has to minimize it
    study = optuna.create_study(direction="minimize", pruner=pruner)
    study.optimize(lambda x: optuna_objective(x, args), n_trials=args.trials)
    # Loading the best model and saving the core T5 model inside it
    best_trial_number = study.best_trial.number
    path = join(args.output, args.name, f"trial_{best_trial_number}")
    model_file = [f for f in listdir(path) if isfile(join(path, f))][0]
    t5model = T5Finetuner.load_from_checkpoint(join(path, model_file))
    t5model.save_core_model()

    print("\n Let's test the model on a wikipedia page:")
    prediction, scores = t5model.generate_summary('''Avram Noam Chomsky (born December 7, 1928) is an American linguist, philosopher, cognitive scientist, historian, social critic, and political activist. Sometimes called "the father of modern linguistics", Chomsky is also a major figure in analytic philosophy, and is one of the founders of the field of cognitive science. He is Laureate Professor of Linguistics at the University of Arizona and Institute Professor Emeritus at the Massachusetts Institute of Technology (MIT), and is the author of more than 100 books on topics such as linguistics, war, politics, and mass media. Ideologically, he aligns with anarcho-syndicalism and libertarian socialism. Born to Ashkenazi Jewish immigrants in Philadelphia, Chomsky developed an early interest in anarchism from alternative bookstores in New York City. He studied at the University of Pennsylvania. During his postgraduate work in the Harvard Society of Fellows, Chomsky developed the theory of transformational grammar for which he earned his doctorate in 1955. That year he began teaching at MIT, and in 1957 emerged as a significant figure in linguistics with his landmark work Syntactic Structures, which played a major role in remodeling the study of language. From 1958 to 1959 Chomsky was a National Science Foundation fellow at the Institute for Advanced Study. He created or co-created the universal grammar theory, the generative grammar theory, the Chomsky hierarchy, and the minimalist program. Chomsky also played a pivotal role in the decline of linguistic behaviorism, and was particularly critical of the work of B. F. Skinner. An outspoken opponent of U.S. involvement in the Vietnam War, which he saw as an act of American imperialism, in 1967 Chomsky rose to national attention for his anti-war essay "The Responsibility of Intellectuals". Associated with the New Left, he was arrested multiple times for his activism and placed on President Richard Nixon's Enemies List. 
While expanding his work in linguistics over subsequent decades, he also became involved in the linguistics wars. In collaboration with Edward S. Herman, Chomsky later articulated the propaganda model of media criticism in Manufacturing Consent and worked to expose the Indonesian occupation of East Timor. His defense of freedom of speech, including Holocaust denial, generated significant controversy in the Faurisson affair of the 1980s. Since retiring from MIT, he has continued his vocal political activism, including opposing the 2003 invasion of Iraq and supporting the Occupy movement. Chomsky began teaching at the University of Arizona in 2017.''',
                            summ_len=90,
                            text= '''Avram Noam Chomsky was born on December 7, 1928, in the East Oak Lane neighborhood of Philadelphia, Pennsylvania. His parents, Ze'ev "William" Chomsky and Elsie Simonofsky, were Jewish immigrants. William had fled the Russian Empire in 1913 to escape conscription and worked in Baltimore sweatshops and Hebrew elementary schools before attending university''',
    )
    print('Generated summary:\n', prediction, '\nROUGE score:\n', scores)
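
`main` attaches a `MedianPruner` to the study. As a rough illustration of the median stopping rule behind it, and not Optuna's actual code, the hypothetical helper below decides whether a trial whose objective is being minimized should stop early. The default of five start-up trials is meant to mirror Optuna's documented default; treat it, and the loss values in the usage lines, as assumptions.

```python
from statistics import median


def should_prune(current_best, previous_at_step, n_startup_trials=5):
    """Simplified median stopping rule for a loss that is being minimized.

    current_best     -- best (lowest) intermediate value of the running trial
    previous_at_step -- intermediate values of completed trials at the same step
    """
    if len(previous_at_step) < n_startup_trials:
        # too few completed trials to compare against: never prune
        return False
    return current_best > median(previous_at_step)


earlier_val_losses = [1.90, 1.80, 2.00, 1.85, 1.95]   # invented values
print(should_prune(2.10, earlier_val_losses))  # True: worse than the median of 1.90
print(should_prune(1.70, earlier_val_losses))  # False: better than the median
print(should_prune(2.10, [1.90]))              # False: still in the start-up phase
```

With the default of `--trials 1` used here, there are no earlier trials to compare against, so under this rule nothing would be pruned.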


In [None]:
if __name__ == '__main__':
    main()

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


[I 2020-10-05 16:25:08,783] A new study created in memory with name: no-name-86cbc9cb-c6a4-461b-8cbc-c9f1282ada57
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]

Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given


  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 




Please also save or load the state of the optimzer when saving or loading the scheduler.




Saving latest checkpoint..






Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given




--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_loss': tensor(1.8394, device='cuda:0')}
--------------------------------------------------------------------------------



[I 2020-10-05 17:31:10,780] Trial 0 finished with value: 1.8311350345611572 and parameters: {'lr': 5e-06}. Best is trial 0 with value: 1.8311350345611572.



 Let's test the model on a wikipedia page:
Generated summary:
 ['Avram Noam Chomsky (born December 7, 1928) is an American linguist, philosopher, cognitive scientist, historian, social critic, and political activist. Born to Ashkenazi Jewish immigrants in Philadelphia, Chomsky developed an early interest in anarchism from alternative bookstores in New York City. He is the author of more than 100 books on topics such as linguistics, war, politics'] 
ROUGE score:
 {'rouge1': Score(precision=0.3333333333333333, recall=0.3584905660377358, fmeasure=0.34545454545454546), 'rouge2': Score(precision=0.08928571428571429, recall=0.09615384615384616, fmeasure=0.0925925925925926), 'rougeL': Score(precision=0.21052631578947367, recall=0.22641509433962265, fmeasure=0.21818181818181817)}
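
To read the scores above, it helps to see what ROUGE-1 measures: the overlap of single words between the reference summary and the generated one. Here is a simplified sketch. Unlike the `rouge_score` package it only lowercases and splits on whitespace and skips stemming, so its numbers will not match the package exactly. The function name `rouge1` and the example sentences are made up.

```python
from collections import Counter


def rouge1(target, prediction):
    """Unigram-overlap precision, recall and F-measure (simplified ROUGE-1)."""
    ref = Counter(target.lower().split())
    hyp = Counter(prediction.lower().split())
    # each word counts as matched at most as often as it appears in both texts
    overlap = sum((ref & hyp).values())
    precision = overlap / max(1, sum(hyp.values()))
    recall = overlap / max(1, sum(ref.values()))
    if precision + recall == 0:
        return precision, recall, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


print(rouge1("the cat sat on the mat", "the cat lay on a mat"))
```

Precision is the share of generated words that also occur in the reference, and recall is the share of reference words that the summary recovers. ROUGE-2 applies the same idea to word pairs, and ROUGE-L to the longest common subsequence.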


If you are running this code on Google Colab, depending on your input size and batch size, you may get

<font color="red">RuntimeError: cuda runtime error : out of memory.</font> 

In that case, free the CUDA memory, reduce the batch size, and run the code again. The following code prints the used and available memory, frees the CUDA memory, and prints the used and available memory again.







In [None]:
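!pip install py3nvml -q 
import gc
import torch
from py3nvml.py3nvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def get_cuda_memory_info():
  # PyTorch's view of the device (values in bytes, not printed below)
  t = torch.cuda.get_device_properties(0).total_memory
  c = torch.cuda.memory_reserved(0)  # formerly torch.cuda.memory_cached
  a = torch.cuda.memory_allocated(0)
  f = c-a  # free inside cache
  # the driver's view of the device, queried through NVML
  nvmlInit()
  h = nvmlDeviceGetHandleByIndex(0)
  info = nvmlDeviceGetMemoryInfo(h)
  print(f'\ntotal    : {info.total/1000000} * 10^6')
  print(f'free     : {info.free/1000000} * 10^6')
  print(f'used     : {info.used/1000000} * 10^6')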



In [None]:
get_cuda_memory_info()
gc.collect() 
torch.cuda.empty_cache()
get_cuda_memory_info()


torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved




total    : 7981.694976 * 10^6
free     : 4925.095936 * 10^6
used     : 3056.59904 * 10^6

total    : 7981.694976 * 10^6
free     : 7223.574528 * 10^6
used     : 758.120448 * 10^6
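
Once the memory is freed, the batch size still has to come down. The parser in `T5Finetuner.add_model_specific_args` registers it as `--bs` with the alias `--batch_size`, and argparse stores the value under the first long option, which is why the rest of the code reads `args.bs`. A small stand-alone sketch of that behaviour (the helper `parse_batch_size` is hypothetical):

```python
from argparse import ArgumentParser


def parse_batch_size(argv):
    """Parse the batch size the same way the notebook's parser declares it."""
    p = ArgumentParser()
    p.add_argument('--bs', '--batch_size', type=int, default=2)
    # parse_known_args ignores options it does not recognise
    args, _ = p.parse_known_args(argv)
    return args.bs


print(parse_batch_size([]))                     # 2, the default
print(parse_batch_size(['--batch_size', '1']))  # 1
print(parse_batch_size(['--bs', '1']))          # 1
```

Inside a notebook, where there is no command line to edit, assigning `args.bs = 1` right after `parse_arguments()` would be the equivalent.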
