Install libraries
1. A transformer is a deep learning model built on self-attention, a mechanism that weights each part of the input according to its significance. It is used primarily in the field of natural language processing (NLP).
2. SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural-network-based text generation systems in which the vocabulary size is fixed before the model is trained.

In [1]:
!pip install transformers==2.9.0
!pip install sentencepiece


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==2.9.0
  Downloading transformers-2.9.0-py3-none-any.whl (635 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
  Preparing metadata (setup.py) ... done
Collecting tokenizers==0.7.0
  Downloading tokenizers-0.7.0-cp38-cp38-manylinux1_x86_64.whl (7.5 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

Import Libraries

T5 uses the SentencePiece tokenizer, which is implemented in C++ and is opaque to Python.

In [2]:
# Importing stock libraries
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

Check GPU Access

In [3]:
# Checking which GPU we have access to. This output is from the Google Colab version.
!nvidia-smi

Tue Feb 21 09:22:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P0    26W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Set GPU if possible else CPU

In [4]:
# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

# Preparing for TPU usage
# import torch_xla
# import torch_xla.core.xla_model as xm
# device = xm.xla_device()

Class of Custom Dataset to Read the Data Frame and then further load it into Dataloader

The output of the tokenizer is a dictionary containing two keys: input_ids and attention_mask. The input ids are the vocabulary indices of the tokens in a sentence. The attention mask allows sequences of different lengths to be padded and batched together: for each position it indicates whether the model should attend to the token (1) or ignore it as padding (0).
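
As a rough illustration of these two outputs, the pure-Python sketch below pads a batch of token-id lists to a common length and builds the matching attention mask. The helper name pad_and_mask and the example ids are hypothetical and not part of the transformers library; the only value taken from T5 is the padding id 0.

```python
def pad_and_mask(batch, max_length, pad_id=0):
    """Pad (or truncate) each id list to max_length and build its attention mask."""
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:max_length]                      # truncate inputs that are too long
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)    # real tokens first, then padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_and_mask([[21, 7, 1], [5, 1]], max_length=4)
print(encoded["input_ids"])       # [[21, 7, 1, 0], [5, 1, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 0, 0]]
```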

A separate, causal mask is needed inside the decoder during training to prevent the attention mechanism of a transformer from “cheating” by looking at later target tokens when predicting the current one.
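
The decoder-side masking can be pictured as a lower-triangular matrix. The following is a minimal sketch, not the code T5 uses internally; the function name causal_mask is hypothetical.

```python
def causal_mask(seq_len):
    """Row i marks the positions decoder step i may attend to (1) or not (0)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Each position can see itself and everything before it, but nothing after it.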

The dataset class calls the tokenizer's batch_encode_plus method to perform tokenization and generate the necessary outputs, namely source_ids and source_mask from the full article text (the ctext column), and target_ids and target_mask from the summary text (the text column).

In [5]:
# Creating a custom dataset for reading the dataframe and loading it into the dataloader to pass it to the neural network at a later stage for finetuning the model and to prepare it for predictions

class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = summ_len
        self.text = self.data.text
        self.ctext = self.data.ctext

    def __len__(self):
        return len(self.text)   

    def __getitem__(self, index):
        # Collapse any extra whitespace in the article and in the summary
        ctext = str(self.ctext[index])
        ctext = ' '.join(ctext.split())

        text = str(self.text[index])
        text = ' '.join(text.split())

        # return_tensors='pt' returns PyTorch tensors; pad_to_max_length=True pads every sequence to max_length
        source = self.tokenizer.batch_encode_plus([ctext], max_length=self.source_len, pad_to_max_length=True, return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([text], max_length=self.summ_len, pad_to_max_length=True, return_tensors='pt')

        # squeeze() returns a tensor with all dimensions of size 1 removed (here, the batch dimension)
        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()  # computed but not returned below

        return {
            'source_ids': source_ids.to(dtype=torch.long),
            'source_mask': source_mask.to(dtype=torch.long),  # torch.long is an alias for torch.int64
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }

Train Function

The Dataloader passes the data through this train function.

A contiguous tensor is a tensor whose elements are stored in memory in the same order as its indices, without leaving any empty space between them.
When you call contiguous() on a tensor that is already contiguous, the tensor itself is returned; otherwise it makes a copy of the tensor such that the order of its elements in memory is the same as if it had been created from scratch with the same data.

clone() creates a copy of tensor that imitates the original tensor's requires_grad field. You should use detach() when attempting to remove a tensor from a computation graph, and clone as a way to copy the tensor while still keeping the copy as a part of the computation graph it came from.
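
The behaviour of these three calls can be checked on small tensors. This is a minimal sketch that assumes PyTorch is installed; the tensor names are illustrative only.

```python
import torch

x = torch.arange(6.0).reshape(2, 3)
xt = x.t()                        # transposed view: same storage, different strides
print(xt.is_contiguous())         # False
xc = xt.contiguous()              # copies the data into a contiguous layout
print(xc.is_contiguous())         # True
print(x.contiguous().data_ptr() == x.data_ptr())  # True: already contiguous, no copy

w = torch.ones(3, requires_grad=True)
c = w.clone()                     # copy that stays in the computation graph
d = w.detach()                    # shares w's data but is cut from the graph
cd = w.clone().detach()           # independent copy outside the graph
print(c.requires_grad, d.requires_grad, cd.requires_grad)  # True False False
```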

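The language-model labels (lm_labels) are computed from target_ids, shifted one position to the left, with padding positions set to -100 so that the loss ignores them. The source ids and attention mask are also read from the batch.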
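
The label preparation can be traced by hand on plain Python lists. The sketch below mirrors the slicing used in the training function but is not the training code itself; the helper name shift_targets and the example ids are hypothetical. It assumes T5's padding id of 0 and the value -100, which PyTorch's cross-entropy loss ignores by default.

```python
PAD_ID = 0            # T5's padding token id
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss


def shift_targets(target_ids, pad_id=PAD_ID):
    """Split padded target ids into decoder inputs and loss labels."""
    decoder_input_ids = [row[:-1] for row in target_ids]   # drop the last token
    lm_labels = [
        [IGNORE_INDEX if tok == pad_id else tok for tok in row[1:]]  # drop the first token
        for row in target_ids
    ]
    return decoder_input_ids, lm_labels


y = [[8, 15, 23, 1, 0, 0]]        # one padded target sequence
dec, labels = shift_targets(y)
print(dec)     # [[8, 15, 23, 1, 0]]
print(labels)  # [[15, 23, 1, -100, -100]]
```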

In [6]:
# Creating the training function. It is called from the main function once per epoch.
# The model is put into train mode, then each batch from the training loader is passed to the network.

def train(epoch, tokenizer, model, device, loader, optimizer):
    model.train()
    for step, data in enumerate(loader, 0):
        y = data['target_ids'].to(device, dtype = torch.long)  # cuda or cpu
        y_ids = y[:, :-1].contiguous()          # decoder inputs: every target token except the last

        lm_labels = y[:, 1:].clone().detach()   # labels: the target tokens shifted left by one
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100  # padding positions are ignored by the loss
        ids = data['source_ids'].to(device, dtype = torch.long)
        mask = data['source_mask'].to(device, dtype = torch.long)

        # transformers 2.9.0 names the labels argument lm_labels
        outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, lm_labels=lm_labels)
        loss = outputs[0]

        if step % 500 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')  # print every 500 batches

        optimizer.zero_grad()  # clear the gradients accumulated during the previous step
        loss.backward()
        optimizer.step()


Generate news summaries for data unseen during training (validation step)

attention_mask is a binary mask indicating which tokens in the input should be attended to by the model. The mask has the same shape as input_ids, with 1s indicating tokens that should be attended to and 0s indicating tokens that should be ignored.

Number of beams (num_beams) is the number of candidate sequences kept at each generation step.
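
To make the idea of keeping several candidate sequences concrete, here is a toy beam search over a hand-written probability table. It is a sketch of the general technique, not the implementation behind model.generate; the function names and the probabilities are made up for illustration.

```python
import math


def beam_search(next_probs, num_beams, steps):
    """Keep the num_beams highest-scoring sequences at every step."""
    beams = [((), 0.0)]  # (sequence, sum of log-probabilities)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, p in next_probs(seq).items():
                candidates.append((seq + (tok,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]


def toy_next_probs(seq):
    """Hypothetical next-token distribution that depends on the last token."""
    if not seq:
        return {"A": 0.6, "B": 0.4}
    if seq[-1] == "A":
        return {"A": 0.55, "B": 0.45}
    return {"A": 0.9, "B": 0.1}


print(beam_search(toy_next_probs, num_beams=1, steps=2))  # ('A', 'A')
print(beam_search(toy_next_probs, num_beams=2, steps=2))  # ('B', 'A')
```

With a single beam the search commits to "A" and ends with probability 0.33; with two beams it keeps "B" alive and finds the sequence with probability 0.36.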

For the repetition penalty (repetition_penalty), a value greater than 1 makes the model less likely to repeat tokens it has already generated; the higher the value, the stronger the effect.
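
One common way to apply such a penalty is to rescale the scores of tokens that have already been generated. The sketch below follows that scheme on a plain list of logits; it is assumed to resemble what the library does rather than copied from it, and the function name and numbers are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of logits with already-generated tokens made less likely."""
    penalized = list(logits)
    for tok in set(generated_ids):
        if penalized[tok] > 0:
            penalized[tok] /= penalty   # shrink positive scores
        else:
            penalized[tok] *= penalty   # push negative scores further down
    return penalized


logits = [2.0, -1.0, 0.5, 3.0]          # one score per vocabulary id
print(apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=2.5))
# [0.8, -2.5, 0.5, 3.0]
```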

Length penalty (length_penalty) is an optional argument that controls how beam search scores sequences of different lengths. Relative to the default of 1.0, a value greater than 1 makes the model more likely to generate longer sequences.
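
The effect of the length penalty can be seen by scoring two hypothetical beams. The sketch assumes the usual normalization, in which the summed log-probability is divided by the sequence length raised to the penalty; the function name and the numbers are illustrative.

```python
def beam_score(sum_log_probs, length, length_penalty):
    """Length-normalized beam score: higher is better."""
    return sum_log_probs / (length ** length_penalty)


short = (-4.0, 4)   # (sum of log-probabilities, number of tokens)
long = (-7.0, 8)

for lp in (0.0, 1.0, 2.0):
    winner = "long" if beam_score(*long, lp) > beam_score(*short, lp) else "short"
    print(lp, winner)
# 0.0 short
# 1.0 long
# 2.0 long
```

Without normalization the shorter beam wins because it has accumulated fewer negative log-probabilities; raising the penalty shifts the preference toward the longer beam.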

Special tokens are called special because they are not derived from your input. They are added for a certain purpose and are independent of the specific input.

In [7]:
def validate(epoch, tokenizer, model, device, loader):
    model.eval()
    predictions = []
    actuals = []
    with torch.no_grad():
        for step, data in enumerate(loader, 0):  # the second argument of enumerate() is the value the index starts counting from
            y = data['target_ids'].to(device, dtype = torch.long)
            ids = data['source_ids'].to(device, dtype = torch.long)
            mask = data['source_mask'].to(device, dtype = torch.long)

            generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=150, 
                num_beams=2, 
                repetition_penalty=2.5, 
                length_penalty=1.0, 
                early_stopping=True
                )
            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]
            if step % 100 == 0:
                print(f'Completed {step}')   # print every 100 validation batches

            predictions.extend(preds)
            actuals.extend(target)
    return predictions, actuals

main() executes the complete flow

PyTorch Data Loader combines a dataset and a sampler, and provides an iterable over the given dataset. The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.

The T5ForConditionalGeneration adds a Language Model head to our T5 model. The Language Model head allows us to generate text based on the training of T5 model.

The original summaries and the generated summaries are collected into two lists and returned to the main function.

Both lists are used to create the final dataframe with two columns, Generated Text and Actual Text.

In [9]:
def main():

    TRAIN_BATCH_SIZE = 2    # batch size for training
    VALID_BATCH_SIZE = 2    # batch size for validation
    TRAIN_EPOCHS = 2        # number of epochs to train
    VAL_EPOCHS = 1          # number of passes over the validation set
    LEARNING_RATE = 1e-4    # learning rate
    SEED = 42               # random seed
    MAX_LEN = 512           # maximum number of tokens in the source text
    SUMMARY_LEN = 60        # maximum number of tokens in the target summary

    # Set random seeds and deterministic pytorch for reproducibility
    torch.manual_seed(SEED) # pytorch random seed
    np.random.seed(SEED) # numpy random seed
    torch.backends.cudnn.deterministic = True

    # Tokenizer for encoding the text
    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    

    # Importing and pre-processing the domain data
    # Selecting only the needed columns.
    # Adding the 'summarize: ' prefix in front of the text. This formats the dataset the way T5 was trained for the summarization task.
    df = pd.read_csv('news_summary.csv',encoding='latin-1')
    df = df[['text','ctext']]
    df.ctext = 'summarize: ' + df.ctext
    print(df.head())

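    # Creation of Dataset and Dataloader
    # Defining the train size. So 80% of the data will be used for training and the rest will be used for validation. 
    train_size = 0.8
    # Drop the sampled rows by their original index BEFORE resetting it;
    # resetting first would drop rows 0..N-1 instead and leak training rows into validation.
    train_dataset = df.sample(frac=train_size, random_state=SEED)
    val_dataset = df.drop(train_dataset.index).reset_index(drop=True)
    train_dataset = train_dataset.reset_index(drop=True)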

    print("FULL Dataset: {}".format(df.shape))
    print("TRAIN Dataset: {}".format(train_dataset.shape))
    print("TEST Dataset: {}".format(val_dataset.shape))


    # Creating the Training and Validation dataset for further creation of Dataloader
    training_set = CustomDataset(train_dataset, tokenizer, MAX_LEN, SUMMARY_LEN)
    val_set = CustomDataset(val_dataset, tokenizer, MAX_LEN, SUMMARY_LEN)

    # Defining the parameters for creation of dataloaders
    train_params = {
        'batch_size': TRAIN_BATCH_SIZE,
        'shuffle': True,
        'num_workers': 0
        }

    val_params = {
        'batch_size': VALID_BATCH_SIZE,
        'shuffle': False,
        'num_workers': 0
        }

    # Creation of Dataloaders for training and validation. These are used below in the training and validation stages.
    training_loader = DataLoader(training_set, **train_params)
    val_loader = DataLoader(val_set, **val_params)


    
    # Defining the model. We use the t5-base model with a language-model head on top for generating the summary.
    # The model is then sent to the device (GPU or CPU) chosen earlier.
    model = T5ForConditionalGeneration.from_pretrained("t5-base")
    model = model.to(device)

    # Defining the optimizer that will be used to tune the weights of the network in the training session. 
    optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

    # Training loop
    print('Initiating Fine-Tuning for the model on our dataset')

    for epoch in range(TRAIN_EPOCHS):
        train(epoch, tokenizer, model, device, training_loader, optimizer)


    # Validation loop: collect the predictions and actuals in a dataframe
    # and save the dataframe as predictions.csv
    print('Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe')
    for epoch in range(VAL_EPOCHS):
        predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
        final_df = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})
        final_df.to_csv('predictions.csv')
        print('Output Files generated for review')

if __name__ == '__main__':
    main()

                                                text  \
0  The Administration of Union Territory Daman an...   
1  Malaika Arora slammed an Instagram user who tr...   
2  The Indira Gandhi Institute of Medical Science...   
3  Lashkar-e-Taiba's Kashmir commander Abu Dujana...   
4  Hotels in Maharashtra will train their staff t...   

                                               ctext  
0  summarize: The Daman and Diu administration on...  
1  summarize: From her special numbers to TV?appe...  
2  summarize: The Indira Gandhi Institute of Medi...  
3  summarize: Lashkar-e-Taiba's Kashmir commander...  
4  summarize: Hotels in Mumbai and other Indian c...  
FULL Dataset: (4514, 2)
TRAIN Dataset: (3611, 2)
TEST Dataset: (903, 2)


Downloading:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/892M [00:00<?, ?B/s]

Initiating Fine-Tuning for the model on our dataset
Epoch: 0, Loss:  5.825940132141113
Epoch: 0, Loss:  2.218724250793457
Epoch: 0, Loss:  3.1664295196533203
Epoch: 0, Loss:  2.106602668762207
Epoch: 1, Loss:  1.584641695022583
Epoch: 1, Loss:  0.8339561223983765
Epoch: 1, Loss:  2.1917176246643066
Epoch: 1, Loss:  0.7319134473800659
Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe
Completed 0
Completed 100
Completed 200
Completed 300
Completed 400
Output Files generated for review


References
1. https://www.kaggle.com/datasets/sunnysai12345/news-summary?resource=download
2. https://github.com/sunnysai12345/News_Summary
3. https://wandb.ai/mukilan/T5_transformer/reports/Exploring-Google-s-T5-Text-To-Text-Transformer-Model--VmlldzoyNjkzOTE2
4. https://www.youtube.com/watch?v=91iLu6OOrwk
5. https://github.com/google/sentencepiece
6. https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90#:~:text=T5%20uses%20Sentencepiece%20tokenizer%2C%20which,and%20is%20opaque%20to%20Python.
7. https://neptune.ai/blog/hugging-face-pre-trained-models-find-the-best#:~:text=The%20output%20of%20tokenizer%20is,by%20our%20model%20or%20not.
8. http://jalammar.github.io/illustrated-transformer/
9. https://medium.com/analytics-vidhya/masking-in-transformers-self-attention-mechanism-bad3c9ec235c#:~:text=Masking%20is%20needed%20to%20prevent,a%20translating%20task%20for%20instance).
10. https://www.kaggle.com/code/eggwhites2705/transformers-summarization-t5/notebook
