# Fine tune T5 on Question Answering

## T5
- It is a text to text transfer transformer. 
- T5 alone can be used to perform different NLP taska such as Text classification, language translation, text summarization, question answering makes itself a most flexible model.

**What it does?**  
1. Convert all problems to text to text generation.
    - For example: In Language Translation, English to Italian  
    Input: I love you  
    Output: Ti amo
    
    - For example: In Text Classification,  
    Input: This product is trash.  
    Output: Negative

2. Learns to predict [MASK] words.
3. Use task specific prefixes to guide the model during fine tuning.
   For example, it adds specific token at the beginning of the input text to indicate what  task is it performing.

T5 has been shown to achieve state-of-the-art results on a wide range of NLP tasks, and it’s considered a highly sophisticated and powerful NLP model, showing a high level of versatility, fine-tuning capability, and an efficient way to transfer knowledge.

## Implementation


### 1. Installation

In [6]:
!pip install transformers
!pip install evaluate
!pip install rouge

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.1
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [7]:
import torch
import json
import torch.nn as nn
import nltk
import spacy
import string
import evaluate  # Bleu
import pandas as pd
import numpy as np
import transformers
import matplotlib.pyplot as plt
import warnings


from tqdm import tqdm
from torch.optim import Adam
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader, RandomSampler
from transformers import T5Tokenizer, T5Model, T5ForConditionalGeneration, T5TokenizerFast

warnings.filterwarnings("ignore")

2024-02-18 16:57:55.461186: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-18 16:57:55.461283: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-18 16:57:55.589790: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


### 2. Setup models

In [8]:
DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
TOKENIZER = T5TokenizerFast.from_pretrained("t5-base")
MODEL = T5ForConditionalGeneration.from_pretrained("t5-base", return_dict=True).to(DEVICE)
OPTIMIZER = Adam(MODEL.parameters(), lr=0.00001)
Q_LEN = 256   # Question Length
T_LEN = 32    # Target Length
BATCH_SIZE = 16

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

### 3. Dataset
We will be using the Stanford Question Answering Dataset (SQuAD 1.1). Download the dataset from [here](https://rajpurkar.github.io/SQuAD-explorer/).

>“The Stanford Question Answering Dataset (SQuAD 1.1) is a popular dataset for training and evaluating question-answering models. It contains more than 100,000 question-answer pairs, each consisting of a question about a passage of text and the corresponding answer. The dataset is widely used in natural language processing research, and is considered to be a benchmark for question-answering performance.”
>

Loading the data

In [9]:
!mkdir squad
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -O squad/train-v2.0.json

mkdir: cannot create directory 'squad': File exists
--2024-02-18 16:58:15--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: 'squad/train-v2.0.json'


2024-02-18 16:58:15 (230 MB/s) - 'squad/train-v2.0.json' saved [42123633/42123633]



In [10]:
!ls

squad


In [16]:
with open('./squad/train-v2.0.json') as f:
    data = json.load(f)

The below function `prepare_data` takes in data and extracts the context, questions, and answers from it, and returns the “articles” list containing the extracted context, questions, and answers.

In [17]:
def prepare_data(data):
    articles = []
    
    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                question = qa["question"]

                if not qa["is_impossible"]:
                    answer = qa["answers"][0]["text"]
                
                inputs = {"context": paragraph["context"], "question": question, "answer": answer}

            
                articles.append(inputs)

    return articles

In [18]:
data = prepare_data(data)
print(len(data))

# Create a Dataframe
data = pd.DataFrame(data)

130319


In [19]:
data = data[:50000]

In [20]:
data.head(2)

Unnamed: 0,context,question,answer
0,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,When did Beyonce start becoming popular?,in the late 1990s
1,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,What areas did Beyonce compete in when she was...,singing and dancing


We will write a custom dataset using the PyTorch framework to make our data task-ready.

In [21]:
class QA_Dataset(Dataset):
    def __init__(self, tokenizer, dataframe, q_len, t_len):
        self.tokenizer = tokenizer
        self.q_len = q_len
        self.t_len = t_len
        self.data = dataframe
        self.questions = self.data["question"]
        self.context = self.data["context"]
        self.answer = self.data['answer']
        
    def __len__(self):
        return len(self.questions)
    
    def __getitem__(self, idx):
        question = self.questions[idx]
        context = self.context[idx]
        answer = self.answer[idx]
        
        question_tokenized = self.tokenizer(question, context, max_length=self.q_len, padding="max_length",
                                                    truncation=True, pad_to_max_length=True, add_special_tokens=True)
        answer_tokenized = self.tokenizer(answer, max_length=self.t_len, padding="max_length", 
                                          truncation=True, pad_to_max_length=True, add_special_tokens=True)
        
        labels = torch.tensor(answer_tokenized["input_ids"], dtype=torch.long)
        labels[labels == 0] = -100
        
        return {
            "input_ids": torch.tensor(question_tokenized["input_ids"], dtype=torch.long),
            "attention_mask": torch.tensor(question_tokenized["attention_mask"], dtype=torch.long),
            "labels": labels,
            "decoder_attention_mask": torch.tensor(answer_tokenized["attention_mask"], dtype=torch.long)
        }

In [22]:
# Dataloader

train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

train_sampler = RandomSampler(train_data.index)
val_sampler = RandomSampler(val_data.index)

qa_dataset = QA_Dataset(TOKENIZER, data, Q_LEN, T_LEN)

train_loader = DataLoader(qa_dataset, batch_size=BATCH_SIZE, sampler=train_sampler)
val_loader = DataLoader(qa_dataset, batch_size=BATCH_SIZE, sampler=val_sampler)

### 4. Training loop

In [23]:
train_loss = 0
val_loss = 0
train_batch_count = 0
val_batch_count = 0
EPOCHS = 1

for epoch in range(EPOCHS):
    MODEL.train()
    for batch in tqdm(train_loader, desc="Training batches"):
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        labels = batch["labels"].to(DEVICE)
        decoder_attention_mask = batch["decoder_attention_mask"].to(DEVICE)

        outputs = MODEL(
                          input_ids=input_ids,
                          attention_mask=attention_mask,
                          labels=labels,
                          decoder_attention_mask=decoder_attention_mask
                        )

        OPTIMIZER.zero_grad()
        outputs.loss.backward()
        OPTIMIZER.step()
        train_loss += outputs.loss.item()
        train_batch_count += 1
    
    #Evaluation
    MODEL.eval()
    for batch in tqdm(val_loader, desc="Validation batches"):
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        labels = batch["labels"].to(DEVICE)
        decoder_attention_mask = batch["decoder_attention_mask"].to(DEVICE)

        outputs = MODEL(
                          input_ids=input_ids,
                          attention_mask=attention_mask,
                          labels=labels,
                          decoder_attention_mask=decoder_attention_mask
                        )

        OPTIMIZER.zero_grad()
        outputs.loss.backward()
        OPTIMIZER.step()
        val_loss += outputs.loss.item()
        val_batch_count += 1
        
    print(f"{epoch+1}/{2} -> Train loss: {train_loss / train_batch_count}\tValidation loss: {val_loss/val_batch_count}")

Training batches: 100%|██████████| 2500/2500 [23:14<00:00,  1.79it/s]
Validation batches: 100%|██████████| 625/625 [05:38<00:00,  1.85it/s]

1/2 -> Train loss: 1.0593451853692533	Validation loss: 0.44639317539930345





Save model and tokenizer

In [24]:
MODEL.save_pretrained("qa_t5_model")
TOKENIZER.save_pretrained("qa_t5_tokenizer")

('qa_t5_tokenizer/tokenizer_config.json',
 'qa_t5_tokenizer/special_tokens_map.json',
 'qa_t5_tokenizer/spiece.model',
 'qa_t5_tokenizer/added_tokens.json',
 'qa_t5_tokenizer/tokenizer.json')

### 5. Inference
>BLEU (Bilingual Evaluation Understudy) is a evaluation metric for machine learning models that generate text, such as machine translation and text summarization. It compares the generated text with a reference text and assigns a score between 0 and 1, where 1 is the best score, based on the overlapping n-grams (word sequences) between them. The higher the BLEU score, the more similar the generated text is to the reference text.
>
Additionally, BLEU is not the only metric to evaluate the performance of QA systems, and it’s important to use a set of metrics that are suitable for the task, such as ROUGE, METEOR, METRIC, and others.

In [81]:
import nltk
from nltk.translate.bleu_score import sentence_bleu
nltk.download('punkt')

def calculate_bleu_score(pred, ref):
    # Tokenize the sentences
    pred_tokens = nltk.word_tokenize(pred.lower())
    ref_tokens = nltk.word_tokenize(ref.lower())

    # Calculate BLEU score
    bleu_score = sentence_bleu([ref_tokens], pred_tokens)
    return bleu_score

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [82]:
def predict_answer(context, question, ref_answer=None):
    inputs = TOKENIZER(question, context, max_length=Q_LEN, padding="max_length", truncation=True, add_special_tokens=True)
    
    input_ids = torch.tensor(inputs["input_ids"], dtype=torch.long).to(DEVICE).unsqueeze(0)
    attention_mask = torch.tensor(inputs["attention_mask"], dtype=torch.long).to(DEVICE).unsqueeze(0)

    outputs = MODEL.generate(input_ids=input_ids, attention_mask=attention_mask)
  
    predicted_answer = TOKENIZER.decode(outputs.flatten(), skip_special_tokens=True)
    score = calculate_bleu_score(predicted_answer, ref_answer)
    return {
            "context": context,
            "question": question,
            "ref": ref_answer, 
            "pred": predicted_answer, 
            "score": score
        }

Now that we have our inference function, let’s test it on some examples from the same dataset as well as real cases.

In [83]:
context = data.iloc[0]["context"]
question = data.iloc[0]["question"]
answer = data.iloc[0]["answer"]

p = predict_answer(context, question, ref_answer=answer)
print("Context: \n", p['context'])
print("\n")
print("Question: \n", p['question'])
print("\n")
print("Predicted Ans: \n", p['pred'])
print("\n")
print("Actual Ans: \n", p['ref'])
print("\n")
print("BLEU Score: \n", p['score'])

Context: 
 Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".


Question: 
 When did Beyonce start becoming popular?


Predicted Ans: 
 late 1990s


Actual Ans: 
 in the late 1990s


BLEU Score: 
 0.36787944117144233


In [84]:
context = data.iloc[56]["context"]
question = data.iloc[56]["question"]
answer = data.iloc[56]["answer"]

p = predict_answer(context, question, ref_answer=answer)
print("Context: \n", p['context'])
print("\n")
print("Question: \n", p['question'])
print("\n")
print("Predicted Ans: \n", p['pred'])
print("\n")
print("Actual Ans: \n", p['ref'])
print("\n")
print("BLEU Score: \n", p['score'])

Context: 
 Beyoncé attended St. Mary's Elementary School in Fredericksburg, Texas, where she enrolled in dance classes. Her singing talent was discovered when dance instructor Darlette Johnson began humming a song and she finished it, able to hit the high-pitched notes. Beyoncé's interest in music and performing continued after winning a school talent show at age seven, singing John Lennon's "Imagine" to beat 15/16-year-olds. In fall of 1990, Beyoncé enrolled in Parker Elementary School, a music magnet school in Houston, where she would perform with the school's choir. She also attended the High School for the Performing and Visual Arts and later Alief Elsik High School. Beyoncé was also a member of the choir at St. John's United Methodist Church as a soloist for two years.


Question: 
 I which church was Beyonce  a member and soloist  in the choir?


Predicted Ans: 
 St. John's United Methodist Church


Actual Ans: 
 St. John's United Methodist Church


BLEU Score: 
 1.0


In [85]:
context = data.iloc[36]["context"]
question = data.iloc[36]["question"]
answer = data.iloc[36]["answer"]

p = predict_answer(context, question, ref_answer=answer)
print("Context: \n", p['context'])
print("\n")
print("Question: \n", p['question'])
print("\n")
print("Predicted Ans: \n", p['pred'])
print("\n")
print("Actual Ans: \n", p['ref'])
print("\n")
print("BLEU Score: \n", p['score'])

Context: 
 A self-described "modern-day feminist", Beyoncé creates songs that are often characterized by themes of love, relationships, and monogamy, as well as female sexuality and empowerment. On stage, her dynamic, highly choreographed performances have led to critics hailing her as one of the best entertainers in contemporary popular music. Throughout a career spanning 19 years, she has sold over 118 million records as a solo artist, and a further 60 million with Destiny's Child, making her one of the best-selling music artists of all time. She has won 20 Grammy Awards and is the most nominated woman in the award's history. The Recording Industry Association of America recognized her as the Top Certified Artist in America during the 2000s decade. In 2009, Billboard named her the Top Radio Songs Artist of the Decade, the Top Female Artist of the 2000s and their Artist of the Millennium in 2011. Time listed her among the 100 most influential people in the world in 2013 and 2014. Forb

### 6. Conclusion
Finally, this is just a simple example of what can be accomplished with the T5 model, and there are many other creative ways to utilize it in a variety of natural language processing tasks.