# Introduction

In this notebook, we fine-tune the pretrained GPT-2 model from the Hugging Face `transformers` library for question answering.

# Dataset & DataLoaders

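First, we need a text dataset to train on. For this we are going to use the SQuAD 1 dataset. The code below loads it with the Hugging Face `datasets` library; torchtext documents the same dataset at [https://pytorch.org/text/stable/datasets.html#squad-1-0]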

## Loading a Dataset

In [7]:
!nvidia-smi

/bin/bash: nvidia-smi: command not found


In [2]:
import torch
from torch.utils.data import Dataset
from datasets import load_dataset

# Load the SQuAD1 dataset once and pick the two splits
# (requires the Hugging Face `datasets` package: pip install datasets)
squad = load_dataset("squad")
train_dataset = squad["train"]
test_dataset = squad["validation"]

ModuleNotFoundError: No module named 'datasets'

## Iterating and Visualizing the Dataset

Before creating a custom dataset class and a DataLoader object, let's look at one example from the SQuAD dataset.

In [3]:
train_dataset[0]

NameError: name 'train_dataset' is not defined

In [4]:
example = train_dataset[0]

print(" Title: ", example['title'])
print("\n Context: ", example['context'])
print("\n Question: ", example['question'])
print("\n Answers: ", example['answers'])

NameError: name 'train_dataset' is not defined
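Each record keeps its answer as text plus a character offset (`answer_start`) into `context`. The sketch below illustrates that relationship on a made-up record shaped like a SQuAD example; the record and the helper `answer_span` are assumptions for illustration, not part of the dataset or of any library.

```python
# Hypothetical record shaped like a SQuAD 1 training example (not real data)
context = "India is a country located in South Asia. Its capital is New Delhi."
example = {
    "title": "India",
    "context": context,
    "question": "What is the capital of India?",
    "answers": {"text": ["New Delhi"], "answer_start": [context.index("New Delhi")]},
}


def answer_span(example):
    """Return (start, end) character offsets of the first answer, end exclusive."""
    text = example["answers"]["text"][0]
    start = example["answers"]["answer_start"][0]
    return start, start + len(text)


start, end = answer_span(example)
# The offsets index into the context string, not into a token sequence
print(example["context"][start:end])
```

Because the offsets count characters, they have to be converted to token indices before they can serve as training targets.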

## Creating a Custom Dataset

We are going to use the GPT-2 tokenizer for our dataset.

In the code below, `truncation=True` together with `max_length` (384 here) truncates any input sequence that is longer than `max_length`; the tokens are removed from the end of the sequence. `padding='max_length'` handles the opposite case: if the input sequence is shorter than `max_length`, it is padded with the padding token until it reaches the maximum length.
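As a rough illustration of what padding and truncating to a fixed length does, here is a minimal sketch in plain Python. The helper `pad_or_truncate` is hypothetical and only mimics the effect on a single list of token ids; it is not how the tokenizer is implemented.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Cut a token list to max_length from the right, or right-pad it with pad_id."""
    kept = token_ids[:max_length]
    attention_mask = [1] * len(kept)          # 1 marks a real token
    n_pad = max_length - len(kept)
    return kept + [pad_id] * n_pad, attention_mask + [0] * n_pad  # 0 marks padding


PAD_ID = 50256  # id of GPT-2's end-of-text token, which this notebook uses as padding

ids, mask = pad_or_truncate([2061, 466, 1450], max_length=5, pad_id=PAD_ID)
print(ids, mask)

ids_long, mask_long = pad_or_truncate(list(range(8)), max_length=5, pad_id=PAD_ID)
print(ids_long, mask_long)
```

The attention mask is what lets the model ignore the padded positions.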

The `__getitem__` method of `SquadDataset` returns the `input_ids` and `attention_mask` tensors as well as the `start_positions` and `end_positions` tensors, which represent the indices of the start and end tokens of the answer in the `input_ids` tensor.

`input_ids` holds the token ids of the input sequence (the question followed by the context).
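Since SQuAD gives the answer position as a character offset, the start and end token indices have to be derived from it. The sketch below shows the idea with a toy whitespace tokenizer that records character offsets; `tokenize_with_offsets` and `char_span_to_token_span` are hypothetical helpers, and a real subword tokenizer would split the text differently.

```python
import re


def tokenize_with_offsets(text):
    """Toy whitespace tokenizer: returns (token, char_start, char_end) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, char_start, char_end):
    """Map a character span (end exclusive) to inclusive token indices."""
    start_tok = end_tok = None
    for i, (_, tok_start, tok_end) in enumerate(offsets):
        if tok_start <= char_start < tok_end:
            start_tok = i          # token containing the first answer character
        if tok_start < char_end <= tok_end:
            end_tok = i            # token containing the last answer character
    return start_tok, end_tok


context = "India is a country located in South Asia. Its capital is New Delhi."
answer = "New Delhi"
char_start = context.index(answer)
offsets = tokenize_with_offsets(context)
start_tok, end_tok = char_span_to_token_span(offsets, char_start, char_start + len(answer))
print(char_start, start_tok, end_tok, [tok for tok, _, _ in offsets[start_tok:end_tok + 1]])
```

The character offset and the token index are different numbers, so using one in place of the other points the model at the wrong position.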

In [5]:

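# Define the dataset class
class SquadDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=384):
        self.tokenizer = tokenizer
        self.dataset = dataset
        self.max_length = max_length

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        example = self.dataset[idx]
        context = example['context']
        question = example['question']
        answer = example['answers']['text'][0]

        # Encode the question followed by the context
        encoding = self.tokenizer.encode_plus(
            question,
            context,
            add_special_tokens=True,
            return_token_type_ids=True,
            return_attention_mask=True,
            padding='max_length',
            max_length=self.max_length,
            truncation=True
        )
        input_ids = encoding['input_ids']
        attention_mask = encoding['attention_mask']

        # SQuAD stores the answer as a character offset into the context, while
        # the model needs token indices into input_ids. GPT-2 uses byte-level BPE:
        # each character of a token string stands for one UTF-8 byte, so counting
        # bytes token by token locates the answer.
        # Assumption: no special tokens are inserted and the question is not truncated.
        char_start = example['answers']['answer_start'][0]
        byte_start = len((question + context[:char_start]).encode('utf-8'))
        byte_end = byte_start + len(answer.encode('utf-8'))  # exclusive

        # -100 is the default ignore_index of nn.CrossEntropyLoss
        start_position, end_position = -100, -100
        offset = 0
        tokens = self.tokenizer.convert_ids_to_tokens(input_ids)
        for i, (token, mask) in enumerate(zip(tokens, attention_mask)):
            if mask == 0:  # reached the padding
                break
            next_offset = offset + len(token)
            if offset <= byte_start < next_offset:
                start_position = i
            if offset < byte_end <= next_offset:
                end_position = i
            offset = next_offset
        if end_position == -100:
            # the answer was truncated away, so the loss ignores this example
            start_position = -100

        # Create input tensors; each position is one integer class index per example
        inputs = {
            'input_ids': torch.tensor(input_ids, dtype=torch.long),
            'attention_mask': torch.tensor(attention_mask, dtype=torch.long),
            'token_type_ids': torch.tensor(encoding['token_type_ids'], dtype=torch.long),
            'start_positions': torch.tensor(start_position, dtype=torch.long),
            'end_positions': torch.tensor(end_position, dtype=torch.long)
        }

        return inputs, answer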

The `truncation` parameter accepts the following strategies:
- `True` or `'longest_first'`: removes tokens one at a time from the longest sequence of the pair until the total length is under the maximum length.
- `'only_first'`: truncates only the first sequence in the input, until the total is under the maximum length.
- `'only_second'`: truncates only the second sequence in the input, until the total is under the maximum length.
- `False` or `'do_not_truncate'` (the default): does not truncate the input at all, so sequences longer than the maximum length are left as they are.

You select a strategy by passing its name directly as the value of the `truncation` parameter, for example `truncation='only_second'`.
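To make the difference between the strategies concrete, here is a simplified sketch on a pair of token lists. `truncate_pair` is a hypothetical helper written for this notebook; it always drops tokens from the right and only approximates what the tokenizer does.

```python
def truncate_pair(first, second, max_length, strategy="longest_first"):
    """Simplified pair truncation: drop tokens from the right until the pair fits."""
    first, second = list(first), list(second)
    if strategy == "do_not_truncate":
        return first, second
    while len(first) + len(second) > max_length:
        if strategy == "only_first":
            target = first
        elif strategy == "only_second":
            target = second
        elif strategy == "longest_first":
            # on a tie, this sketch removes from the second sequence
            target = first if len(first) > len(second) else second
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        if not target:
            raise ValueError("sequence is empty, cannot truncate further")
        target.pop()
    return first, second


question = list(range(6))        # stand-in for 6 question tokens
context = list(range(100, 106))  # stand-in for 6 context tokens

print(truncate_pair(question, context, 8, "longest_first"))
print(truncate_pair(question, context, 8, "only_second"))
```

For question answering, `only_second` is often the natural choice, since it keeps the question intact and shortens only the context.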

------------------------
Do not set `return_overflowing_tokens` to True here: the tokenizer would then return a variable number of chunks per example, so the examples could no longer be stacked into tensors of the same size.
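The effect of keeping the overflow can be pictured as cutting a long token sequence into overlapping windows. The following is a plain-Python sketch under that assumption; `sliding_windows` and its `stride` argument are illustrative names, not the tokenizer's implementation.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length that overlap by `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
        start += max_length - stride
    return windows


short = sliding_windows(list(range(5)), max_length=8, stride=2)
long = sliding_windows(list(range(20)), max_length=8, stride=2)
print(len(short), len(long))
```

A short example produces one window and a long one produces several, which is why a dataset that returns exactly one item per example cannot use this option directly.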

## Preparing Data for Training with DataLoaders

We are going to use the GPT-2 tokenizer for our dataset.

We also need to set the padding token in our tokenizer, because GPT-2 does not define one. Otherwise padding raises a `ValueError`.

You can add a padding token in either of the following ways:
1. Add a new special token (the model's embeddings must then be resized with `model.resize_token_embeddings(len(tokenizer))`)

    `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`
2. Use the end-of-sentence token as the padding token

    `tokenizer.pad_token = tokenizer.eos_token`
    
Here, we are using the eos token as our padding token.

In [5]:
from transformers import GPT2Tokenizer
from torch.utils.data import DataLoader

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

# Create the Dataloader
train_dataloader = DataLoader(
    SquadDataset(train_dataset, tokenizer),
    batch_size=16,
    shuffle=True
)
test_dataloader = DataLoader(
    SquadDataset(test_dataset, tokenizer),
    batch_size=16,
    shuffle=False  # evaluation does not need shuffling
)


## Iterate through the DataLoader

In [6]:
for batch in train_dataloader:
    
    print(f"Our context is:\n {batch[0]['input_ids']}")
#     print(f"Our context is:\n {batch['context'][0]}")
#     print(f"Question: {batch['question'][0]}")
#     print(f"Answer: {batch['answer'][0]}")
    break

Our context is:
 tensor([[ 2061,   466,  1450,  ..., 50256, 50256, 50256],
        [ 2061,  3858,   286,  ..., 50256, 50256, 50256],
        [   41,  8101,   290,  ..., 50256, 50256, 50256],
        ...,
        [ 2061,   547,   350,  ..., 50256, 50256, 50256],
        [ 2215,   373,   262,  ..., 50256, 50256, 50256],
        [ 2437,   867,   286,  ..., 50256, 50256, 50256]])


## Get Device for Training

In [7]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

Using cuda device


# Transfer Learning

Instead of training from scratch, we are going to use a pre-trained model, `GPT2ForQuestionAnswering`, and fine-tune it on the SQuAD dataset.

In [8]:
from transformers import GPT2ForQuestionAnswering

model = GPT2ForQuestionAnswering.from_pretrained("gpt2").to(device)

print(model)

Some weights of GPT2ForQuestionAnswering were not initialized from the model checkpoint at gpt2 and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForQuestionAnswering(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (qa_outputs): Linear(in_features=768, out_features=2, bias=True)
)


# Optimizing Model Parameters


## Hyperparameters

In [9]:
learning_rate = 5e-5
epochs = 5

## Optimizer

First we need to create an optimizer, which updates the model parameters using a gradient-descent-based algorithm.
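To give a feel for what such an optimizer does at each step, here is a minimal sketch of an AdamW-style update for a single scalar parameter. `adamw_step` is a hypothetical helper following the usual AdamW update rule with decoupled weight decay; real optimizers apply the same idea to every tensor of the model.

```python
import math


def adamw_step(param, grad, state, lr=5e-5, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter; `state` holds step, m and v."""
    beta1, beta2 = betas
    state["step"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad         # running mean of gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = state["m"] / (1 - beta1 ** state["step"])            # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["step"])
    param = param * (1 - lr * weight_decay)                      # decoupled weight decay
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, state = 0.0, {"step": 0, "m": 0.0, "v": 0.0}
first = adamw_step(w, 2 * (w - 3), state, lr=0.01, weight_decay=0.0)
w = first
for _ in range(1999):
    w = adamw_step(w, 2 * (w - 3), state, lr=0.01, weight_decay=0.0)
print(first, w)
```

The first step moves the parameter by roughly the learning rate regardless of the gradient's size, which is one reason small learning rates such as `5e-5` are common when fine-tuning.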

In [10]:
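from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favour of this one

# weight_decay=0.0 matches the default of the deprecated transformers.AdamW
optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=0.0)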



## Optimization loop

This will do the training (fine-tuning) of our model.

In the GPT-2 question answering model, the objective is to predict the start and end token indices of the answer. The `input_ids` tensor contains the question followed by the context, and the `start_positions` and `end_positions` tensors contain the target start and end indices of the answer. The loss is the average of the cross-entropy over the start logits and the cross-entropy over the end logits.
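As a small worked example of this objective, the sketch below computes the cross-entropy for a start position and an end position over a four-token input and averages the two. The logits are made-up numbers and `cross_entropy` is a hypothetical pure-Python helper for a single example with an integer target.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of target_index under softmax(logits)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


# One score per input token (made-up values for a 4-token input)
start_logits = [0.1, 2.0, -1.0, 0.3]
end_logits = [-0.5, 0.2, 1.5, 0.0]
start_position, end_position = 1, 2  # integer token indices of the answer

loss_start = cross_entropy(start_logits, start_position)
loss_end = cross_entropy(end_logits, end_position)
loss = (loss_start + loss_end) / 2
print(f"{loss:.4f}")
```

With integer targets like these, each term is a negative log-probability, so the loss is always greater than or equal to zero.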

In [11]:
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

def compute_loss(batch, model):
    # each batch is (inputs, answers); only the tensors in inputs are needed here
    inputs = batch[0]
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    token_type_ids = inputs['token_type_ids'].to(device)
    start_positions = inputs['start_positions'].to(device)
    end_positions = inputs['end_positions'].to(device)

    # get outputs from model
    outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)

    # calculate loss
    loss_start = loss_fn(outputs.start_logits, start_positions)
    loss_end = loss_fn(outputs.end_logits, end_positions)
    return (loss_start + loss_end) / 2  # average loss for start and end positions


def train_loop(dataloader, model, optimizer):
    # set the model to training mode
    model.train()

    for batch in dataloader:
        optimizer.zero_grad()
        loss = compute_loss(batch, model)

        # backpropagation
        loss.backward()
        optimizer.step()


def test_loop(dataloader, model):
    # set the model to evaluation mode
    model.eval()
    val_loss = 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    with torch.no_grad():
        for batch in dataloader:
            val_loss += compute_loss(batch, model).item()

    # Print the validation loss for this epoch
    print(f"Validation Loss: {val_loss/len(dataloader)}")
    

In [12]:
import transformers
import torch.nn as nn
transformers.logging.set_verbosity_error()

for t in range(epochs):
    print(f"Epoch {t+1}\n ---------------------------")
    train_loop(train_dataloader, model, optimizer)
    test_loop(test_dataloader, model)

print("Done!")

Epoch 1
 ---------------------------
Validation Loss: -39169690.67473525
Epoch 2
 ---------------------------
Validation Loss: -84010044.72012103
Epoch 3
 ---------------------------
Validation Loss: -138141566.45083207
Epoch 4
 ---------------------------
Validation Loss: -200857133.21633887
Epoch 5
 ---------------------------
Validation Loss: -272117614.28139186
Done!


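`model` returns a `QuestionAnsweringModelOutput` object, not just a loss. When `start_positions` and `end_positions` are passed to the model, we can get the loss using the `loss` attribute on it.

This is the recommended way of obtaining the loss value when using the `transformers` library.

Note that a cross-entropy loss over integer class indices can never be negative. Negative values such as the ones logged above indicate malformed targets: float tensors filled with -100 are interpreted by `nn.CrossEntropyLoss` as class probabilities rather than as indices. The targets should hold one integer token index per example.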

## Saving Model

To use our trained model later, we can save it in a file.

In [13]:
# Save your fine-tuned model
model.save_pretrained("fine_tuned_QA")

### Loading Model

In [14]:
from transformers import GPT2ForQuestionAnswering

# Load the fine-tuned GPT-2 model (the tokenizer created earlier is reused)
model_name = "fine_tuned_QA"
model = GPT2ForQuestionAnswering.from_pretrained(model_name).to(device)

# Inference

For inference we need a question and a context that contains the answer.

We are going to use `GPT2Tokenizer` for encoding our input.

In [15]:
# Define the question and context
question = "What is the capital of India?"
context = "India is a country located in South Asia. Its capital is New Delhi."

In [16]:
# Encode the inputs
inputs = tokenizer.encode_plus(question, context, add_special_tokens=True, return_tensors="pt").to(device)

The tokenizer used during pre-training and fine-tuning must be the same in order to ensure that the tokenization scheme is consistent.

The model takes the encoded inputs and produces `start_logits` and `end_logits`, one score per input token. The position with the highest start logit gives `start_index`, and the position with the highest end logit gives `end_index`.

In [17]:
# Get the start and end logits from the model (no gradients are needed for inference)
model.eval()
with torch.no_grad():
    output = model(**inputs)

# Find the start and end indices of the answer using the logits
start_index = int(torch.argmax(output.start_logits, dim=-1))
end_index = int(torch.argmax(output.end_logits, dim=-1))

# Get the answer by decoding the selected tokens of the input
answer = tokenizer.decode(inputs["input_ids"][0][start_index:end_index+1])
print("Answer is: ", answer)

Answer is:  What
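
Taking the two argmax values independently can return an end index that lies before the start index, or a span inside the question. One common alternative is to search for the best-scoring valid span. The sketch below shows this in plain Python on made-up logits; `best_span`, `max_answer_len` and `first_allowed` are hypothetical names introduced for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30, first_allowed=0):
    """Return (start, end) maximising start_logits[start] + end_logits[end]
    subject to first_allowed <= start <= end < start + max_answer_len."""
    best, best_score = None, float("-inf")
    for s in range(first_allowed, len(start_logits)):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Made-up logits where the independent argmax gives start=2 and end=0
start_logits = [0.2, 0.1, 3.0, 0.5]
end_logits = [4.0, 0.3, 0.2, 2.5]

naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
print(naive, best_span(start_logits, end_logits))
```

Setting `first_allowed` to the number of question tokens would restrict the search to the context part of the input.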
