<a href="https://colab.research.google.com/github/sehgalsakshi/Text-Summarization-and-Headline-Generation-Using-T5/blob/main/HeadingGeneration_T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuned T5 for Abstractive Headline Generation

T5 is a text-to-text transformer: the encoder receives a text sequence and the decoder outputs a text sequence.
Unlike BERT, T5 models every task in the same way.
For example, BERT has to be fine-tuned differently for different tasks, whereas T5 is fine-tuned the same way for all tasks; only the task name has to be given as a prefix in the input.
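
As a rough illustration of the text-to-text framing, the sketch below builds model inputs for two tasks by prepending a task prefix. The helper name `make_t5_input` and the example sentences are hypothetical; only the `summarize` prefix is actually used in this notebook.

```python
def make_t5_input(task_prefix: str, text: str) -> str:
    """Prepend a task prefix so a single model can tell tasks apart."""
    # Collapse runs of whitespace into single spaces before adding the prefix.
    cleaned = " ".join(text.split())
    return f"{task_prefix}: {cleaned}"


examples = [
    make_t5_input("summarize", "The administration withdrew the circular on Tuesday."),
    make_t5_input("translate English to German", "That is good."),
]
for line in examples:
    print(line)
```

Both inputs go through the same model and the same training loop; only the prefix changes.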

**Why T5 for Summarization?**

T5 performs abstractive rather than extractive summarization.

**Extractive summarization** means **identifying important sections** of the text and reproducing them verbatim, so the summary is a subset of the sentences in the original text. **Abstractive summarization** instead **restates the important material in a new way** after interpreting and examining the text using advanced natural language techniques.
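
To make the contrast concrete, here is a minimal sketch of an extractive summarizer. The word-frequency scoring is an assumption chosen for brevity and is not a method used in this notebook. Every sentence it returns is copied verbatim from the input, whereas an abstractive model such as T5 generates new wording.

```python
import re
from collections import Counter
from typing import List


def extractive_summary(text: str, num_sentences: int = 1) -> List[str]:
    """Return the highest-scoring sentences, copied verbatim from the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text act as a crude importance signal.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Keep the selected sentences in their original order.
    return [s for s in sentences if s in ranked]


article = (
    "The city council approved the new budget. "
    "The budget increases funding for schools. "
    "Residents attended the meeting."
)
summary = extractive_summary(article, num_sentences=1)
print(summary)
```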

T5 has been pretrained on a very large dataset, so its pretrained checkpoint is sufficient for generic text summarization. If you have a domain-specific dataset, however, consider fine-tuning it.

Here we perform **Abstractive Headline Generation** using T5's summarization task.
Since T5 generates summaries, we fine-tune it to summarize within a restricted length, which yields a headline for the text.

The input is the article text column (`ctext`) and the training target is the headline column (`headlines`).

In [None]:
#install the requirements
!pip install transformers -q
!pip install wandb -q
!pip install sentencepiece



In [None]:
# Importing libraries
import os
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
import sentencepiece

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

# WandB – Import the wandb library to log the model run and all the parameters
import wandb

In [None]:
# Checking the GPU we have access to.
!nvidia-smi

Sun Dec 27 21:35:42 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8    10W /  70W |     10MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

In [None]:
!wandb login

[34m[1mwandb[0m: Currently logged in as: [33msakshisehgal[0m (use `wandb login --relogin` to force relogin)


In [None]:
# Custom dataset that reads the DataFrame and tokenizes each row, so that a
# DataLoader can pass batches to the neural network for fine-tuning the model

class HeadlinesDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = summ_len
        self.headlines = self.data.headlines
        self.ctext = self.data.ctext

    def __len__(self):
        return len(self.headlines)

    def __getitem__(self, index):
        ctext = str(self.ctext[index])
        headlines = str(self.headlines[index])

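        # Collapse runs of whitespace (newlines, tabs, repeated spaces) into single spaces
        ctext = ' '.join(ctext.split())
        headlines = ' '.join(headlines.split())

        # Pad or truncate every example to a fixed length. `padding='max_length'` replaces the
        # deprecated `pad_to_max_length=True`; `truncation=True` makes the truncation explicit.
        source = self.tokenizer.batch_encode_plus([ctext], max_length=self.source_len, padding='max_length', truncation=True, return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([headlines], max_length=self.summ_len, padding='max_length', truncation=True, return_tensors='pt')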

        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()

        return {
            'source_ids': source_ids.to(dtype=torch.long), 
            'source_mask': source_mask.to(dtype=torch.long), 
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }
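
The effect of encoding to a fixed `max_length` can be sketched in plain Python. This is an illustration only: the helper `pad_or_truncate` is hypothetical, the token ids are made up apart from T5's pad id of 0, and the real tokenizer also takes care of special tokens such as end-of-sequence, which this sketch ignores.

```python
from typing import List, Tuple

PAD_ID = 0  # T5's pad token id; the other ids below are made up for illustration


def pad_or_truncate(token_ids: List[int], max_length: int) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]   # truncate sequences that are too long
    mask = [1] * len(ids)          # 1 marks real tokens
    padding = max_length - len(ids)
    # 0 in the mask marks padding positions the model should not attend to
    return ids + [PAD_ID] * padding, mask + [0] * padding


short_ids, short_mask = pad_or_truncate([21603, 10, 37, 1], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(100, 110)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the default `DataLoader` collation can stack them into a batch without a custom collate function.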

In [None]:
# Training function, called once per epoch from main()
def train(epoch, tokenizer, model, device, loader, optimizer):
    model.train()
    for step, data in enumerate(loader):
        y = data['target_ids'].to(device, dtype=torch.long)
        # Decoder inputs are all target tokens but the last; labels are the same tokens shifted left by one
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        # Labels set to -100 are ignored by the loss, so padding does not contribute to it
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data['source_ids'].to(device, dtype=torch.long)
        mask = data['source_mask'].to(device, dtype=torch.long)

        outputs = model(input_ids=ids, attention_mask=mask, decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs[0]

        # Log the loss to WandB every 10 steps and print it every 500 steps
        if step % 10 == 0:
            wandb.log({"Training Loss": loss.item()})

        if step % 500 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
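
The slicing in `train` can be easier to follow on a single sequence. The sketch below mirrors `y[:, :-1]` and `y[:, 1:]` with plain lists; the helper name `shift_and_mask` and the headline token ids are made up, while 0 and 1 are T5's pad and end-of-sequence ids and -100 is the label value that PyTorch's cross-entropy loss skips by default.

```python
from typing import List, Tuple

PAD_ID = 0           # T5's pad token id
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def shift_and_mask(target_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Mirror y[:, :-1] and y[:, 1:] from train() for a single sequence."""
    decoder_input_ids = target_ids[:-1]
    labels = [IGNORE_INDEX if t == PAD_ID else t for t in target_ids[1:]]
    return decoder_input_ids, labels


# Made-up ids: three headline tokens, an end-of-sequence token (1), then padding.
target = [71, 72, 73, 1, 0, 0]
decoder_input_ids, labels = shift_and_mask(target)
print(decoder_input_ids)  # [71, 72, 73, 1, 0]
print(labels)             # [72, 73, 1, -100, -100]
```

At each position the decoder sees the tokens so far and is trained to predict the next one; the padded positions carry the label -100 and therefore add nothing to the loss.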

In [None]:
# Function to generate predictions on the validation set
def validate(epoch, tokenizer, model, device, loader):
    model.eval()
    predictions = []
    actuals = []
    # Gradients are not needed for generation
    with torch.no_grad():
        for step, data in enumerate(loader):
            y = data['target_ids'].to(device, dtype=torch.long)
            ids = data['source_ids'].to(device, dtype=torch.long)
            mask = data['source_mask'].to(device, dtype=torch.long)

            # Beam search with two beams and a penalty on repeated tokens
            generated_ids = model.generate(
                input_ids=ids,
                attention_mask=mask,
                max_length=150,
                num_beams=2,
                repetition_penalty=2.5,
                length_penalty=1.0,
                early_stopping=True
                )
            # Decode the generated and the reference token ids back into text
            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]
            if step % 100 == 0:
                print(f'Completed {step}')

            predictions.extend(preds)
            actuals.extend(target)
    return predictions, actuals
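
The `repetition_penalty=2.5` argument discourages tokens that have already been generated. The sketch below shows the general idea on a toy score table, assuming the commonly used rule of dividing positive scores and multiplying negative scores by the penalty. The function name and the scores are hypothetical, and this is not the library's actual implementation.

```python
from typing import Dict, Iterable


def apply_repetition_penalty(
    scores: Dict[str, float], generated: Iterable[str], penalty: float
) -> Dict[str, float]:
    """Make already-generated tokens less likely to be picked again."""
    adjusted = dict(scores)
    for token in set(generated):
        if token in adjusted:
            s = adjusted[token]
            # Positive scores shrink and negative scores grow more negative.
            adjusted[token] = s / penalty if s > 0 else s * penalty
    return adjusted


toy_scores = {"india": 2.0, "wins": -1.0, "match": 0.5}
adjusted = apply_repetition_penalty(toy_scores, generated=["india", "wins"], penalty=2.5)
print(adjusted)
```

Tokens that have not been generated yet (`match` here) keep their original score, so they become relatively more attractive.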

In [None]:
model_dir = 'model'
# exist_ok=True avoids a FileExistsError when the cell is re-run
os.makedirs(model_dir, exist_ok=True)

In [None]:
def main():
    # Initialize a new run in WandB
    wandb.init(project="t5_headlines_summarization")

    # WandB – Config is a variable that holds and saves hyperparameters and inputs
    # Defining some key variables that will be used later on in the training
    config = wandb.config          # Initialize config
    config.TRAIN_BATCH_SIZE = 2    # input batch size for training
    config.VALID_BATCH_SIZE = 2    # input batch size for validation
    config.TRAIN_EPOCHS = 2        # number of epochs to train
    config.VAL_EPOCHS = 1          # number of passes over the validation set
    config.LEARNING_RATE = 1e-4    # learning rate
    config.SEED = 42               # random seed
    config.MAX_LEN = 512           # maximum length of the tokenized input text
    config.SUMMARY_LEN = 20        # maximum length of the tokenized target; summaries are generally around 150, but headlines are much shorter

    # Set random seeds and deterministic pytorch for reproducibility
    torch.manual_seed(config.SEED) # pytorch random seed
    np.random.seed(config.SEED) # numpy random seed
    torch.backends.cudnn.deterministic = True

    # Tokenizer for encoding the text
    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    

    # Importing the raw dataset
    # Since this is a sequence generation task, we do not clean the text (e.g. remove stop words or punctuation),
    # or else the output sequence would not be grammatically correct
    df = pd.read_csv('news_summary.csv',encoding='utf-8')
    print(df.columns)
    #Using just the required columns
    df = df[['headlines','ctext']]
    df.ctext = 'summarize: ' + df.ctext
    print(df.head())

    
    # Creation of Dataset and Dataloader
    # Defining the train size. So 80% of the data will be used for training and the rest for validation. 
    train_size = 0.8
    train_dataset=df.sample(frac=train_size,random_state = config.SEED)
    val_dataset=df.drop(train_dataset.index).reset_index(drop=True)
    train_dataset = train_dataset.reset_index(drop=True)

    print("FULL Dataset: {}".format(df.shape))
    print("TRAIN Dataset: {}".format(train_dataset.shape))
    print("TEST Dataset: {}".format(val_dataset.shape))


    # Creating the Training and Validation dataset for further creation of Dataloader
    training_set = HeadlinesDataset(train_dataset, tokenizer, config.MAX_LEN, config.SUMMARY_LEN)
    val_set = HeadlinesDataset(val_dataset, tokenizer, config.MAX_LEN, config.SUMMARY_LEN)

    # Defining the parameters for creation of dataloaders
    train_params = {
        'batch_size': config.TRAIN_BATCH_SIZE,
        'shuffle': True,
        'num_workers': 0
        }

    val_params = {
        'batch_size': config.VALID_BATCH_SIZE,
        'shuffle': False,
        'num_workers': 0
        }

    # Creation of Dataloaders for training and validation. These are used below in the training and validation stages of the model.
    training_loader = DataLoader(training_set, **train_params)
    val_loader = DataLoader(val_set, **val_params)


    
    # Defining the model. We use the t5-base model with a language modeling head on top for generating the summary.
    # The model is then moved to the selected device (GPU if available, otherwise CPU).
    model = T5ForConditionalGeneration.from_pretrained("t5-base")
    model = model.to(device)

    # Defining the optimizer that will be used to tune the weights of the network in the training session. 
    optimizer = torch.optim.Adam(params =  model.parameters(), lr=config.LEARNING_RATE)

    # Log metrics with wandb
    wandb.watch(model, log="all")
    # Training loop
    print('Initiating Fine-Tuning for the model on our dataset')

    for epoch in range(config.TRAIN_EPOCHS):
        train(epoch, tokenizer, model, device, training_loader, optimizer)
    
    # Saving the whole model object (not only its state dict) so that it can be loaded again in the Flask API
    model_name = 'heading_model_cpu.pth' if device == 'cpu' else 'heading_model.pth'
    path = './'+model_dir+'/'+model_name
    torch.save(model, path)

    print('Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe')
    for epoch in range(config.VAL_EPOCHS):
        predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
        final_df = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})
        final_df.to_csv('./'+model_dir+'/predictions.csv')
    print('Output Files generated for review')    

if __name__ == '__main__':
    main()


Run summary:
Training Loss,1.14997
_step,361.0
_runtime,1431.0
_timestamp,1609106374.0


Run history:
Training Loss,█▄▅▄▃▅▇▇▅▃▅▅▄▅▃▄▄▂▅▄▃▂▂▁▃▂▂▁▃▃▅▃▁▂▂▃▂▂▆▁
_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
_runtime,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
_timestamp,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███


Index(['author', 'date', 'text', 'read_more', 'ctext'], dtype='object')
                                                text                                              ctext
0  Daman & Diu revokes mandatory Rakshabandhan in...  summarize: The Daman and Diu administration on...
1  Malaika slams user who trolled her for 'divorc...  summarize: From her special numbers to TV?appe...
2  'Virgin' now corrected to 'Unmarried' in IGIMS...  summarize: The Indira Gandhi Institute of Medi...
3  Aaj aapne pakad liya: LeT man Dujana before be...  summarize: Lashkar-e-Taiba's Kashmir commander...
4  Hotel staff to get training to spot signs of s...  summarize: Hotels in Mumbai and other Indian c...
FULL Dataset: (4514, 2)
TRAIN Dataset: (3611, 2)
TEST Dataset: (903, 2)


Some weights of the model checkpoint at t5-base were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by prov

Initiating Fine-Tuning for the model on our dataset
Epoch: 0, Loss:  8.10882568359375




Epoch: 0, Loss:  1.833003044128418
Epoch: 0, Loss:  2.382096767425537
Epoch: 0, Loss:  2.4758236408233643
Epoch: 1, Loss:  1.5662261247634888
Epoch: 1, Loss:  1.3060849905014038
Epoch: 1, Loss:  0.8978652358055115
Epoch: 1, Loss:  1.0221585035324097
Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe
Completed 0
Completed 100
Completed 200
Completed 300
Completed 400
Output Files generated for review
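
One simple way to review the generated headlines is a token-overlap score between each prediction and its reference. The sketch below computes a unigram F1, which is similar in spirit to ROUGE-1 but is not the official ROUGE implementation. The function name `unigram_f1` and the headline pairs are made up; in practice the pairs would come from the `Generated Text` and `Actual Text` columns of `predictions.csv`.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over lower-cased, whitespace-separated tokens shared by both strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up (generated, actual) headline pairs for illustration.
pairs = [
    ("council approves new budget", "city council approves budget"),
    ("rain expected tomorrow", "stocks fall sharply"),
]
scores = [unigram_f1(p, r) for p, r in pairs]
print(scores)
```

A score like this rewards shared wording only, so an abstractive headline that paraphrases the reference well can still score low; reading a sample of the outputs remains useful.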
