# Fine-tune BERT on a Custom Question Answering Dataset


## 1. Install the Hugging Face transformers library

This example uses the `transformers` [library](https://github.com/huggingface/transformers/) by Hugging Face. We'll start by installing the package.

In [1]:
!pip install transformers[torch] -q
!pip install accelerate -U

Collecting accelerate
  Obtaining dependency information for accelerate from https://files.pythonhosted.org/packages/1b/da/24a54b9205fce3bdbaad521c35944d0b0a2d292ac5ae921e484b76312b43/accelerate-0.27.2-py3-none-any.whl.metadata
  Downloading accelerate-0.27.2-py3-none-any.whl.metadata (18 kB)
Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.25.0
    Uninstalling accelerate-0.25.0:
      Successfully uninstalled accelerate-0.25.0
Successfully installed accelerate-0.27.2


In [2]:
import torch
import numpy as np
import pandas as pd
import json

from pathlib import Path
from transformers import DistilBertTokenizerFast
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer



## 2. Load the dataset

In [3]:
!mkdir squad
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -O squad/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O squad/dev-v2.0.json

--2024-02-14 15:32:52--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: 'squad/train-v2.0.json'


2024-02-14 15:32:52 (225 MB/s) - 'squad/train-v2.0.json' saved [42123633/42123633]

--2024-02-14 15:32:53--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: 'squad/dev-v2.0.json'


2024-02-14 15:32:54 (52.2 MB/s) - 'squad/dev-v2.0.json' saved [4370528/4370528]



Each split is a structured JSON file with a number of questions and answers for each passage (or context). We’ll take this apart into parallel lists of contexts, questions, and answers (note that the contexts here are repeated since there are multiple questions per context):

In [4]:
def read_squad(path):
    path = Path(path)
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    contexts = []
    questions = []
    answers = []
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers


In [13]:
train_contexts, train_questions, train_answers = read_squad('squad/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('squad/dev-v2.0.json')

In [14]:
print(train_contexts[:4])

['Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".', 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead si

In [15]:
print(train_questions[:4])

['When did Beyonce start becoming popular?', 'What areas did Beyonce compete in when she was growing up?', "When did Beyonce leave Destiny's Child and become a solo singer?", 'In what city and state did Beyonce  grow up? ']


In [16]:
print(train_answers[:4])

[{'text': 'in the late 1990s', 'answer_start': 269}, {'text': 'singing and dancing', 'answer_start': 207}, {'text': '2003', 'answer_start': 526}, {'text': 'Houston, Texas', 'answer_start': 166}]


In [17]:
len(train_questions)

86821
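This count covers only questions that have at least one answer, because the innermost loop in `read_squad` iterates over `qa['answers']`. SQuAD v2.0 also contains unanswerable questions whose `answers` list is empty, and those produce no rows. A minimal sketch on a hand-made record in the same layout (the passage, ids, and the `flatten` helper are invented for illustration):

```python
# Hand-made record in the SQuAD v2.0 layout; text and ids are invented for illustration.
toy_squad = {
    "data": [
        {
            "title": "Toy",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of France?",
                            "is_impossible": False,
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        },
                        {
                            "id": "q2",
                            "question": "Who founded Paris?",
                            "is_impossible": True,
                            "answers": [],  # unanswerable: nothing to iterate over
                        },
                    ],
                }
            ],
        }
    ]
}


def flatten(squad_dict):
    """Same traversal as read_squad, applied to an in-memory dict."""
    rows = []
    for group in squad_dict["data"]:
        for passage in group["paragraphs"]:
            for qa in passage["qas"]:
                for answer in qa["answers"]:
                    rows.append((passage["context"], qa["question"], answer))
    return rows


rows = flatten(toy_squad)
print(len(rows))   # 1: only the answerable question survives
print(rows[0][1])  # What is the capital of France?
```

A question with several reference answers contributes one row per answer, so the row count can exceed the number of distinct questions.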

In [18]:
train_contexts = train_contexts[:6000]
train_questions = train_questions[:6000]
train_answers = train_answers[:6000]

In [19]:
len(val_questions)

20302

In [20]:
val_contexts = val_contexts[:800]
val_questions = val_questions[:800]
val_answers = val_answers[:800]

The contexts and questions are plain strings. Each answer is a dictionary holding the answer text and the character index at which that text starts in the context. To train a model we need two things: (1) the tokenized context/question pairs, and (2) integer token positions marking where the answer begins and ends.

First, let’s get the character position at which the answer ends in the passage (we are given the starting position). Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.

In [21]:
def add_end_idx(answers, contexts):
    for answer, context in zip(answers, contexts):
        gold_text = answer['text']
        start_idx = answer['answer_start']
        end_idx = start_idx + len(gold_text)

        # sometimes squad answers are off by a character or two – fix this
        if context[start_idx:end_idx] == gold_text:
            answer['answer_end'] = end_idx
        elif context[start_idx-1:end_idx-1] == gold_text:
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
        elif context[start_idx-2:end_idx-2] == gold_text:
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters


In [22]:
add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)

In [23]:
train_answers[0]

{'text': 'in the late 1990s', 'answer_start': 269, 'answer_end': 286}
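The end index is exclusive, so slicing the context with `answer_start:answer_end` should reproduce the answer text. The sketch below mirrors the checks in `add_end_idx` on a made-up sentence. `align_answer` is a hypothetical helper, not part of the notebook, and unlike `add_end_idx` it returns `None` when no shift matches instead of leaving `answer_end` unset.

```python
def align_answer(context, text, start, max_shift=2):
    """Return (start, end) with context[start:end] == text, trying small left shifts.

    Hypothetical helper mirroring add_end_idx; returns None if no shift matches.
    """
    for shift in range(max_shift + 1):
        s = start - shift
        e = s + len(text)  # exclusive end index
        if s >= 0 and context[s:e] == text:
            return s, e
    return None


context = "She rose to fame in the late 1990s as a singer."
text = "in the late 1990s"
true_start = context.index(text)

print(align_answer(context, text, true_start))      # (17, 34): label is exact
print(align_answer(context, text, true_start + 1))  # (17, 34): label off by one
print(align_answer(context, text, true_start + 5))  # None: too far off to repair
```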

In [24]:
train_contexts[0]

'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'

In [25]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

In [26]:
print(train_encodings[0])

Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


After finding the answer's start and end positions in characters, we need to convert them to start and end positions in tokens. The fast tokenizers in the Hugging Face library (🤗 Fast Tokenizers) provide a built-in method for this, `char_to_token()`, which maps a character index in the original text to the index of the token that contains it.
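To show what this mapping does, here is a pure-Python sketch that uses a toy regex tokenizer in place of a real fast tokenizer. The functions `tokenize_with_offsets` and `toy_char_to_token` are illustrative stand-ins, not library calls. The sketch also shows why the last character of the answer (`answer_end - 1`) is looked up rather than the exclusive end index.

```python
import re


def tokenize_with_offsets(text):
    """Toy tokenizer returning (token, (start, end)) pairs; end offsets are exclusive."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def toy_char_to_token(offsets, char_index):
    """Index of the token covering char_index, or None (e.g. whitespace or truncated text)."""
    for i, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return i
    return None


context = "Born and raised in Houston, Texas."
tokens = tokenize_with_offsets(context)
offsets = [span for _, span in tokens]

answer_start = context.index("Houston")
answer_end = answer_start + len("Houston, Texas")  # exclusive, as in add_end_idx

start_tok = toy_char_to_token(offsets, answer_start)
end_tok = toy_char_to_token(offsets, answer_end - 1)  # last character of the answer

print([tok for tok, _ in tokens[start_tok:end_tok + 1]])  # ['Houston', ',', 'Texas']
print(toy_char_to_token(offsets, answer_end))              # 7: the token after the answer
```

Looking up `answer_end` itself would land on the token after the answer, or on no token at all if the answer is followed by whitespace.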

In [27]:
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
        # if None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})


In [28]:
add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)

Our data is ready. Let’s put it in a PyTorch dataset so that we can easily use it for training.

In [29]:
class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

In [30]:
train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

## 3. Fine-tuning BERT for our dataset

In [32]:
from transformers import DefaultDataCollator

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
print(model)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForQuestionAnswering(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
      

In [33]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

In [34]:
batch_size = 16

**Trainer Function**

In [35]:
training_args = TrainingArguments(output_dir="my_qa_model", evaluation_strategy="epoch",
                                  learning_rate=5e-5, per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size, num_train_epochs=15, push_to_hub=False,
)

In [36]:
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=val_dataset, tokenizer=tokenizer)

In [37]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 2.235459 |
| 2 | No log | 2.038531 |
| 3 | 1.845800 | 2.205851 |
| 4 | 1.845800 | 2.389799 |
| 5 | 1.845800 | 2.662532 |
| 6 | 0.483800 | 2.994654 |
| 7 | 0.483800 | 3.224663 |
| 8 | 0.210500 | 3.119981 |
| 9 | 0.210500 | 3.3886 |
| 10 | 0.210500 | 3.587508 |
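In the logged epochs the validation loss rises while the training loss keeps falling, which is the usual sign of overfitting. A small sketch that finds the best epoch from the values logged above (only the ten epochs shown are used):

```python
# Validation loss per epoch, copied from the log above (epochs 1-10).
val_loss = [2.235459, 2.038531, 2.205851, 2.389799, 2.662532,
            2.994654, 3.224663, 3.119981, 3.3886, 3.587508]

# Epochs are 1-based, list indices are 0-based.
best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i]) + 1
epochs_since_best = len(val_loss) - best_epoch

print(best_epoch, val_loss[best_epoch - 1])  # 2 2.038531
print(epochs_since_best)                      # 8
```

The notebook does not use them, but `transformers` offers `load_best_model_at_end=True` in `TrainingArguments` and `EarlyStoppingCallback` for keeping the best checkpoint and stopping once the validation loss stops improving.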




TrainOutput(global_step=2820, training_loss=0.4846927855877166, metrics={'train_runtime': 2509.5254, 'train_samples_per_second': 35.863, 'train_steps_per_second': 1.124, 'total_flos': 1.175877900288e+16, 'train_loss': 0.4846927855877166, 'epoch': 15.0})
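The reported `global_step` can be sanity-checked from the dataset size and batch size. The arithmetic below reproduces 2820 only if the effective batch size is 32, which would be the case with two devices. The device count is an assumption inferred from the numbers and is not stated in the notebook.

```python
import math

num_examples = 6000          # training subset size used above
per_device_batch_size = 16   # per_device_train_batch_size
num_epochs = 15              # num_train_epochs
num_devices = 2              # assumption: not stated in the notebook

effective_batch = per_device_batch_size * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 188 2820
```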

## 4. Model Evaluation

In [38]:
trainer.evaluate()

{'eval_loss': 3.766305923461914,
 'eval_runtime': 5.6371,
 'eval_samples_per_second': 141.916,
 'eval_steps_per_second': 4.435,
 'epoch': 15.0}

Save the model

In [39]:
trainer.save_model('bert_qa_model')

In [47]:
loaded_model = AutoModelForQuestionAnswering.from_pretrained("bert_qa_model")
loaded_model.to(device)


## 5. Model Inference

In [57]:
question = "Who is an apple?"
answer_text = "It has a color of red. Apple is a fruit."

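The simplest way to try out a fine-tuned model is to use it in a `pipeline()`. Here we instead run the steps by hand: tokenize the question and context, run the model without gradients, take the most likely start and end token positions, and decode the tokens between them. Note that training encoded each pair as (context, question), while the cell below passes (question, context); using the same order as in training would be the more consistent choice.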

In [58]:
inputs = tokenizer(question, answer_text, return_tensors="pt").to(device)

In [59]:
with torch.no_grad():
    outputs = loaded_model(**inputs)

In [60]:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

In [61]:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

'apple is a fruit.'
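Taking the two argmaxes independently can produce an end index that comes before the start index, which decodes to an empty string. A common alternative, which this notebook does not use, is to search for the highest-scoring pair with `start <= end`. The sketch below works on plain Python lists with made-up logits. `best_span` and `max_answer_len` are hypothetical names.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Made-up logits where the independent argmaxes come out in the wrong order.
start_logits = [0.1, 0.2, 3.0, 0.5, 0.3]
end_logits = [0.0, 2.5, 0.4, 2.0, 0.1]

naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
print(naive)                                # (2, 1): end before start
print(best_span(start_logits, end_logits))  # (2, 3)
```

With real model outputs, the logits would first be converted to lists, for example with `outputs.start_logits[0].tolist()`.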

## 6. Conclusion

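As you can see, the result is not very good. The model was fine-tuned on a small part of the dataset (6,000 training examples) and without any hyperparameter tuning. The training log also shows the validation loss reaching its minimum at epoch 2 and rising afterwards while the training loss keeps falling, which suggests overfitting, so training for more epochs alone is unlikely to help.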