
<h1 align="center" style="color:green;font-size: 3em;" >Building a Question Answering System Using Bert</h1>

Question answering is an interesting problem in NLP that can be solved using BERT. In this article I am not going into the internal workings of BERT, since that would make the blog too long. Instead, I will focus on applying BERT to this problem using the Hugging Face library.

For question answering, BERT takes two inputs, the question and the text that contains the answer, as a single packed sequence. In this blog we will take the SQuAD dataset and train a question answering system on it using the Hugging Face library.

<img src ="https://external-preview.redd.it/xQNdh3sU1DLy_IWTG_0hyhDip-XupEQu1xGHG4gdDKM.jpg?auto=webp&s=6643c9803d945c0a2d1230d06a428d31e55ef9e3">

We will divide the article into 5 sections so that the concepts are easier to follow.

* [1. Understanding Question Answering Model Based On Bert](#section1)
* [2. Simple Inference Pipeline on Pretrained Model](#section2)
* [3. Understanding Data Preprocessing required for Question-Answering System](#section3)
* [4. Understanding Metrics needed for Evaluation](#section4)
* [5. Training the model](#section5)



<a class="anchor" id="section1"></a>
<h2 style="color:green;font-size: 2em;">Understanding Question Answering Model Based On Bert</h2>

Consider BERT-base. It produces a 768-dimensional output vector for each token. For a downstream task like question answering we add 2 linear layers on top: one for the start position and another for the end position (a start token classifier and an end token classifier). Each has its own weights, and during fine-tuning they are trained together.

During inference, we feed the final embedding of every token in the text into both the start token classifier and the end token classifier. For each token, a dot product with the start token weight vector produces that token's start logit, and the end token classifier produces its end logit in the same way. The model thus produces start logits and end logits for all the input tokens.





<img src = "http://www.mccormickml.com/assets/BERT/SQuAD/start_token_classification.png" width=500 height=1500>

<img src="http://www.mccormickml.com/assets/BERT/SQuAD/end_token_classification.png" width=400 height=1300>
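To make the two classifiers concrete, here is a minimal pure-Python sketch. It assumes a toy hidden size of 4 instead of 768, and the token embeddings and weight vectors are made-up numbers, not values taken from a real model.

```python
# Toy sketch: one dot product per token for each classifier head.
# All numbers are invented for illustration; BERT-base uses 768-dim vectors.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# final-layer embeddings for 3 tokens (hypothetical values)
token_embeddings = [
    [0.1, 0.3, -0.2, 0.5],
    [0.9, -0.1, 0.4, 0.2],
    [-0.3, 0.8, 0.6, -0.4],
]

# separate weight vectors for the two heads (hypothetical values)
start_weights = [1.0, 0.0, 0.5, 0.0]
end_weights = [0.0, 1.0, 0.5, 0.0]

# one start logit and one end logit per token
start_logits = [dot(e, start_weights) for e in token_embeddings]
end_logits = [dot(e, end_weights) for e in token_embeddings]

# the highest-scoring token under each head
best_start = max(range(len(start_logits)), key=start_logits.__getitem__)
best_end = max(range(len(end_logits)), key=end_logits.__getitem__)
print("start logits:", start_logits, "-> best start token:", best_start)
print("end logits:", end_logits, "-> best end token:", best_end)
```

In a real model these weight vectors are learned during fine-tuning, and the same two vectors are applied to every token position.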




<a class="anchor" id="section2"></a>
<h2 style="color:green;font-size: 2em;">Simple Inference Pipeline on Pretrained Model</h2>

Before going further, let me run a sample inference to show what the input and the prediction look like. Let's load the tokenizer and model first.

In [1]:
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
import warnings
warnings.simplefilter("ignore")

weight_path = "kaporter/bert-base-uncased-finetuned-squad"
# loading tokenizer
tokenizer = BertTokenizer.from_pretrained(weight_path)
#loading the model
model = BertForQuestionAnswering.from_pretrained(weight_path)


Let's take an example.

```
question = "How many parameters does BERT-large have?"

context = "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance."

answer = 340M

```



Now let's generate the token ids using the tokenizer and look at them.

In [2]:
question = "How many parameters does BERT-large have?"
context = "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance."

input_ids = tokenizer.encode(question, context)
print (f'We have about {len(input_ids)} tokens generated')

tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(" ")
print('Some examples of token-input_id pairs:')

for i, (token,inp_id) in enumerate(zip(tokens,input_ids)):
    print(token,":",inp_id)
    

We have about 70 tokens generated
 
Some examples of token-input_id pairs:
[CLS] : 101
how : 2129
many : 2116
parameters : 11709
does : 2515
bert : 14324
- : 1011
large : 2312
have : 2031
? : 1029
[SEP] : 102
bert : 14324
- : 1011
large : 2312
is : 2003
really : 2428
big : 2502
. : 1012
. : 1012
. : 1012
it : 2009
has : 2038
24 : 2484
- : 1011
layers : 9014
and : 1998
an : 2019
em : 7861
##bed : 8270
##ding : 4667
size : 2946
of : 1997
1 : 1015
, : 1010
02 : 6185
##4 : 2549
, : 1010
for : 2005
a : 1037
total : 2561
of : 1997
340 : 16029
##m : 2213
parameters : 11709
! : 999
altogether : 10462
it : 2009
is : 2003
1 : 1015
. : 1012
34 : 4090
##gb : 18259
, : 1010
so : 2061
expect : 5987
it : 2009
to : 2000
take : 2202
a : 1037
couple : 3232
minutes : 2781
to : 2000
download : 8816
to : 2000
your : 2115
cola : 15270
##b : 2497
instance : 6013
. : 1012
[SEP] : 102


According to the BERT paper, BERT's input is a combination of 3 embeddings:

* WordPiece (token) embedding
* Positional embedding
* Segment embedding

The token embeddings and positional embeddings are generated by the model itself (the Hugging Face library takes care of this). We only need to pass the segment ids. They are 0 for all tokens belonging to the question and 1 for all tokens belonging to the context.

Let's generate the segment ids.

Note: In the transformers library, Hugging Face calls these token_type_ids, so we will use the same name.


In [3]:
sep_idx = tokens.index('[SEP]')

# 0 for the question tokens, including the [SEP] token that separates question from context, and 1 for the rest.
token_type_ids = [0 for i in range(sep_idx+1)] + [1 for i in range(sep_idx+1,len(tokens))]
print(token_type_ids)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


Now let's pass our input through the model and look at the output.

In [4]:
# Run our example through the model.
out = model(torch.tensor([input_ids]), # The tokens representing our input text.
                token_type_ids=torch.tensor([token_type_ids]))

start_logits,end_logits = out['start_logits'],out['end_logits']
# Find the tokens with the highest `start` and `end` scores.
answer_start = torch.argmax(start_logits)
answer_end = torch.argmax(end_logits)

# the predicted end index is inclusive, so add 1 to keep the last answer token
ans = tokenizer.decode(input_ids[answer_start:answer_end + 1])
print('Predicted answer:', ans)

Predicted answer: 340m


In [5]:
del model
del tokenizer

We have seen how to predict with a fine-tuned BERT (bert-base) model. Now let's train a model on the SQuAD dataset.


<a class="anchor" id="section3"></a>
<h2 style="color:green;font-size: 2em;">Understanding Data Preprocessing required for Question-Answering System</h2>

In this section, before getting to the training part, let us understand how we process the training data and the validation data.

```
Note: For training and the demo we will use DistilBERT instead of BERT, as it has fewer parameters and so consumes less memory.
```

In [6]:
import transformers
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import seaborn as sns
import torch.nn.functional as F
import numpy as np
import pandas as pd
import os
import warnings
warnings.simplefilter("ignore")

**About dataset**

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Unanswerable questions appear only in SQuAD 2.0; the `squad` dataset loaded below is version 1.1, in which every question has an answer.

**Loading dataset**

In [7]:
from datasets import load_dataset
dataset = load_dataset("squad")
dataset


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

We have 87599 data points in train and 10570 in validation. 
Let's look at a sample.

In [8]:
# to make text bold
s_bold = '\033[1m'
e_bold = '\033[0;0m'

print(s_bold + 'Train Data Sample.....' + e_bold)
train_data = dataset["train"]
for data in train_data:
    print(' ')
    print(s_bold + 'ID -' + e_bold, data['id'])
    print(s_bold +'TITLE - '+ e_bold, data['title'])
    print(s_bold + 'CONTEXT - '+ e_bold,data['context'])
    print(s_bold + 'ANSWERS - ' + e_bold,data['answers']['text'])
    print(s_bold + 'ANSWERS START INDEX - ' + e_bold,data['answers']['answer_start'])
    print(' ')
    break
    
print('---'*30)   
print(s_bold + 'Validation Data Sample.....' + e_bold)
train_data = dataset["validation"]
for data in train_data:
    print(' ')
    print(s_bold + 'ID -' + e_bold, data['id'])
    print(s_bold +'TITLE - '+ e_bold, data['title'])
    print(s_bold + 'CONTEXT - '+ e_bold,data['context'])
    print(s_bold + 'ANSWERS - ' + e_bold,data['answers']['text'])
    print(s_bold + 'ANSWERS START INDEX - ' + e_bold,data['answers']['answer_start'])
    print(' ')
    break
    


Train Data Sample.....
 
ID - 5733be284776f41900661182
TITLE -  University_of_Notre_Dame
CONTEXT -  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
ANSWERS -  ['Saint Bernadette Soubirous']
ANSWERS START INDEX -  [515]
 
-----------------------------------------------------------------------

We can see a title, context and answers along with their start indexes. We might need a little preprocessing of the text before feeding it to the BERT model. As discussed in our earlier article, the input to our BERT model is a sum of token embeddings, positional embeddings and segment embeddings. Let's see what our input looks like for a question answering system.

Let's take an example:

```
Question: "Which is your favorite sport?"

Reference text: "I am John. My favorite sport is football. I happily live in Florida."

Answer: football

```

After tokenization our input looks like this (an illustrative word-level split; the real tokenizer may break rare words into WordPiece sub-tokens marked with `##`):


```
["[CLS]","Which","is","your","favorite","sport","?","[SEP]","I","am","John",".","My","favorite","sport","is","football",".","I","happily","live","in","Florida",".","[SEP]"]

```

Thus it is in the format [CLS] question [SEP] context [SEP].



As the next step, suitable padding is added and the tokens are converted to word ids. These ids are then looked up in the embedding matrix to generate the token embedding vectors. Similarly, we have a corresponding positional embedding vector for each position.

Next we will see what our segment ids look like:
```

[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]

```

They are 0 for tokens belonging to the first sequence (the question, including [CLS] and the first [SEP]) and 1 for tokens belonging to the second sequence (the context and the final [SEP]). They are mainly used to distinguish between the 2 inputs.
Finally, the three embeddings are summed and become the input to our BERT model.
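The packing step can be sketched in a few lines of plain Python. This is only an illustration: `pack_inputs` is a hypothetical helper, and the tokens are pre-split by hand rather than produced by a real WordPiece tokenizer.

```python
def pack_inputs(question_tokens, context_tokens):
    """Build [CLS] question [SEP] context [SEP] together with its segment ids."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # 0 for [CLS], the question and the first [SEP];
    # 1 for the context and the final [SEP]
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, segment_ids

# hand-split tokens, for illustration only
question_tokens = ["which", "is", "your", "favorite", "sport", "?"]
context_tokens = ["my", "favorite", "sport", "is", "football", "."]

tokens, segment_ids = pack_inputs(question_tokens, context_tokens)
for token, segment in zip(tokens, segment_ids):
    print(f"{token:10s} {segment}")
```

The two lists always have the same length, so each token has exactly one segment id.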

Another important thing to note is that a validation sample can have multiple answers. Let's analyze this further.

In [9]:
dataset["train"].filter(lambda x: len(x["answers"]["text"]) != 1)


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

In [10]:
dataset["validation"].filter(lambda x: len(x["answers"]["text"]) != 1)


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10567
})

We can see that in train every sample has exactly one answer, but in the validation data there are 10567 samples with multiple answers.

Before tackling that, let us understand how a question answering problem is solved.

In [11]:
## Let's sample part of the dataset so that we can reduce training time.
dataset["train"] = dataset["train"].select(range(8000))
dataset["validation"] = dataset["validation"].select(range(2000))
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 2000
    })
})

**Data Preprocessing**



**How to label the dataset?**

BERT needs to know the span of tokens in the text that contains the answer. In a question answering system, each input token has 2 labels, one for the start position and one for the end position, and each label is either 1 or 0. So the label pair of a token can be (0,0), (1,0) or (0,1), or (1,1) when the answer is a single token.

We label the token where the answer starts as (1,0) and the token where the answer ends as (0,1). All other tokens are labeled (0,0). Let me explain with an example.

```
eg: Question -> "Which is your favourite place?"
Context -> "My favourite place is Empire state building"
Answer -> "Empire state building"
```

After tokenization it may look as follows (if max_length = 20):

```
[CLS], [which], [is], [your], [favourite], [place], [?], [SEP], [My], [favourite], [place], [is], [Empire], [state], [building], [SEP], [PAD], [PAD], [PAD], [PAD]
```


So the labels of each token will be as follows:

```
[CLS] -> [0,0]
[which] -> [0,0]
[is] -> [0,0]
[your] -> [0,0]
[favourite] -> [0,0]
[place] -> [0,0]
[?] -> [0,0]
[SEP] -> [0,0]
[My] -> [0,0]
[favourite] -> [0,0]
[place] -> [0,0]
[is] -> [0,0]
[Empire] -> [1,0]
[state] -> [0,0]
[building] -> [0,1]
[SEP] -> [0,0]
[PAD] -> [0,0]
[PAD] -> [0,0]
[PAD] -> [0,0]
[PAD] -> [0,0]

```

Finally, the start labels and end labels will be:

start labels: 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0

end labels: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
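As a small sketch, the per-token labels can be derived from the two answer positions like this. `one_hot` is a hypothetical helper, and the tokens are the hand-written ones from the example above.

```python
def one_hot(position, length):
    """Return a list of `length` zeros with a single 1 at `position`."""
    return [1 if i == position else 0 for i in range(length)]

tokens = ["[CLS]", "which", "is", "your", "favourite", "place", "?", "[SEP]",
          "My", "favourite", "place", "is", "Empire", "state", "building", "[SEP]",
          "[PAD]", "[PAD]", "[PAD]", "[PAD]"]

# token indexes of the first and last answer token
start_position = tokens.index("Empire")
end_position = tokens.index("building")

start_labels = one_hot(start_position, len(tokens))
end_labels = one_hot(end_position, len(tokens))

# print the (start, end) label pair next to each token
for token, s, e in zip(tokens, start_labels, end_labels):
    print(f"{token:12s} ({s},{e})")
```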

The model will predict a start logit and an end logit for each token,
e.g. (start logits first, then end logits):
```
[0.2, 0.4, 0.3, 0.5, 0.6, 0.4, 0.1, 0.3, 0.2, 0.4, 0.3, 0.5, 0.9, 0.6, 0.7, 0.8, 0.23, 0.31, 0.12, 0.15]

[0.3, 0.1, 0.5, 0.5, 0.3, 0.2, 0.1, 0.13, 0.22, 0.61, 0.23, 0.51, 0.4, 0.45, 0.83, 0.12, 0.3, 0.51, 0.22, 0.15]

```


Note: 

With the Hugging Face library, we do not need to provide these per-token labels. We just need to pass the start and end positions of the answer tokens. For example, in the case above we have start_position = 12 and end_position = 14. The model returns start logits and end logits as output, and we can apply argmax to find the predicted start and end positions. From the logits above, argmax gives 12 and 14 for the start and end positions respectively. From these we can recover the answer from the context as "Empire state building".
At prediction time, a softmax over the logits gives a probability for each candidate start and end position, so we pick the start position with the highest probability and the end position with the highest probability.
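The decoding step described in this note can be sketched as follows. The logits are invented numbers in the spirit of the example (one value per token, 20 in total), and `argmax` is a small hypothetical helper written here so the snippet needs no external library.

```python
tokens = ["[CLS]", "which", "is", "your", "favourite", "place", "?", "[SEP]",
          "My", "favourite", "place", "is", "Empire", "state", "building", "[SEP]",
          "[PAD]", "[PAD]", "[PAD]", "[PAD]"]

# made-up logits, one per token
start_logits = [0.2, 0.4, 0.3, 0.5, 0.6, 0.4, 0.1, 0.3, 0.2, 0.4,
                0.3, 0.5, 0.9, 0.6, 0.7, 0.8, 0.23, 0.31, 0.12, 0.15]
end_logits = [0.3, 0.1, 0.5, 0.5, 0.3, 0.2, 0.1, 0.13, 0.22, 0.61,
              0.23, 0.51, 0.4, 0.45, 0.83, 0.12, 0.3, 0.51, 0.22, 0.15]

def argmax(values):
    """Index of the largest value in the list."""
    return max(range(len(values)), key=values.__getitem__)

start = argmax(start_logits)
end = argmax(end_logits)

# the end index is inclusive, hence end + 1 in the slice
answer = " ".join(tokens[start:end + 1])
print(start, end, answer)
```

Taking the two argmax values independently is the simplest possible decoding; it does not check that the start comes before the end.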



**How to handle long contexts?**

Here the context is relatively short. But what if it is very long? Then during truncation there is a chance that the answer gets cut off and the truncated context no longer contains it. To solve this problem we create several features from different pieces of the context. The only thing we must take care of is to keep enough overlap between consecutive pieces, so that an answer cut at a boundary is likely to appear whole in a neighbouring piece.
The tokenizer itself can do this.


In [12]:
from transformers import AutoTokenizer

# model_checkpoint = "bert-base-cased"
# tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

trained_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)

context = dataset["train"][0]["context"]
question = dataset["train"][0]["question"]
answer = dataset["train"][0]["answers"]["text"]


inputs = tokenizer(
    question,
    context,
    max_length=160,
    truncation="only_second",  # truncate only the context
    stride=70,  # number of overlapping tokens between consecutive context pieces
    return_overflowing_tokens=True,  # tell the tokenizer to return the overflowing pieces as extra features
)


print(f"The example gave {len(inputs['input_ids'])} features.")
print(f"Here is where each comes from: {inputs['overflow_to_sample_mapping']}.")

print('Question: ',question)
print(' ')
print('Context : ',context)
print(' ')
print('Answer: ', answer)
print('--'*25)

for i,ids in enumerate(inputs["input_ids"]):
    print('Context piece', i+1)
    # decode from the first [SEP] (id 102) onwards, i.e. only the context part
    print(tokenizer.decode(ids[ids.index(102):]))
    print(' ')


The example gave 2 features.
Here is where each comes from: [0, 0].
Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
 
Context :  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
 
Answer:  ['Saint Bernadette Soubirous']
--------------------------------------------------
Context piece 1
[SEP] architecturally, the s

We can see that the context is divided into overlapping pieces (2 in this case). Some pieces may not contain the answer; for those (question, context) pairs we set the labels to start_position = end_position = 0.
We also use the same labels in the unfortunate cases where the answer has been truncated at either the start or the end. For the examples where the answer is fully inside the context piece, the labels are the index of the token where the answer starts and the index of the token where the answer ends.

```
Note: 
offset_mapping -> the (start, end) character indexes of each token with respect to the original text.

overflow_to_sample_mapping -> indicates which base context each sub context came from. 
eg - [0,1,1] indicates that the first feature comes from the 1st context, and the 2nd and 3rd features come from the second context. 
```
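To illustrate the idea of overlapping pieces and the sample mapping, here is a simplified sketch in plain Python. It is not the tokenizer's actual implementation: `split_with_overlap` is a hypothetical helper that works on the context tokens only, whereas the real tokenizer also reserves room for the question and the special tokens inside max_length.

```python
def split_with_overlap(context_tokens, window_size, overlap):
    """Split tokens into windows of at most `window_size` tokens,
    where consecutive windows share `overlap` tokens."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    step = window_size - overlap
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + window_size])
        if start + window_size >= len(context_tokens):
            break  # this window already reaches the end of the context
        start += step
    return windows

# two toy "contexts": integers stand in for tokens
contexts = [list(range(10)), list(range(100, 104))]

features = []
sample_mapping = []  # plays the role of overflow_to_sample_mapping
for sample_idx, context_tokens in enumerate(contexts):
    for window in split_with_overlap(context_tokens, window_size=6, overlap=2):
        features.append(window)
        sample_mapping.append(sample_idx)

print(features)
print(sample_mapping)
```

The first context is too long for one window, so it yields two features that share two tokens, while the second context fits in a single feature.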

**How to label the dataset if we split longer contexts into smaller contexts?**

We already explained how the context is divided into pieces. Now we will see how to label each piece after the split. Here we need to label all tokens. 

* Every token that is not the start or end of the answer is labeled (0,0).


* If the answer is not inside the corresponding piece of the context, all its tokens are labeled (0,0). The same applies when only the start of the answer or only the end of the answer falls inside the piece, i.e. the answer is truncated.

* If both the start and the end of the answer are in the same context piece, the token at the start of the answer is labeled (1,0) and the token at the end of the answer is labeled (0,1).



Hugging Face actually takes care of this. We only need to pass the start index and end index for each input. If the answer is not present in the piece, we pass start and end positions of 0.
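Before looking at the full preprocessing function, here is a toy sketch of these rules. It assumes whitespace tokenization instead of WordPiece, uses `None` as the offset of question and special tokens, and returns an inclusive end index; `whitespace_offsets` and `label_piece` are hypothetical helpers, not library functions.

```python
def whitespace_offsets(text):
    """(start, end) character offsets of each whitespace-separated word."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    return offsets

def label_piece(offsets, answer_start, answer_end):
    """Return (start_token, end_token) for one piece, or (0, 0) if the
    answer is not fully inside it. `offsets` holds None for non-context tokens."""
    context_idxs = [i for i, o in enumerate(offsets) if o is not None]
    first, last = context_idxs[0], context_idxs[-1]
    if offsets[first][0] > answer_start or offsets[last][1] < answer_end:
        return 0, 0
    # last token that starts at or before the answer start
    start_token = max(i for i in context_idxs if offsets[i][0] <= answer_start)
    # first token that ends at or after the answer end
    end_token = min(i for i in context_idxs if offsets[i][1] >= answer_end)
    return start_token, end_token

context = "My favourite place is Empire state building"
answer = "Empire state building"
answer_start = context.index(answer)
answer_end = answer_start + len(answer)

ctx_offsets = whitespace_offsets(context)
# 8 non-context tokens ([CLS], 6 question tokens, [SEP]) before the context, 1 [SEP] after
full_piece = [None] * 8 + ctx_offsets + [None]
truncated_piece = [None] * 8 + ctx_offsets[:5] + [None]

print(label_piece(full_piece, answer_start, answer_end))
print(label_piece(truncated_piece, answer_start, answer_end))
```

The first piece contains the whole answer and gets the token positions of "Empire" and "building"; the second piece cuts the answer off and therefore falls back to (0, 0).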

In [13]:
from transformers import AutoTokenizer

del tokenizer
trained_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)

def train_data_preprocess(examples):
    
    """
    generate start and end indexes of answer in context
    """
    
    def find_context_start_end_index(sequence_ids):
        """
        returns the token indexes at which the context starts and ends
        """
        token_idx = 0
        while sequence_ids[token_idx] != 1:  # still on special tokens or question tokens
            token_idx += 1                   # the loop breaks at the first context token
        context_start_idx = token_idx
    
        while sequence_ids[token_idx] == 1:
            token_idx += 1
        context_end_idx = token_idx - 1
        return context_start_idx,context_end_idx  
    
    
    questions = [q.strip() for q in examples["question"]]
    context = examples["context"]
    answers = examples["answers"]
    
    inputs = tokenizer(
        questions,
        context,
        max_length=512,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,  #returns id of base context
        return_offsets_mapping=True,  # returns (start_index,end_index) of each token
        padding="max_length"
    )


    start_positions = []
    end_positions = []

    
    for i,mapping_idx_pairs in enumerate(inputs['offset_mapping']):
        context_idx = inputs['overflow_to_sample_mapping'][i]
    
        # from main context
        answer = answers[context_idx]
        answer_start_char_idx = answer['answer_start'][0]
        answer_end_char_idx = answer_start_char_idx + len(answer['text'][0])

    
        # now we have to find it in sub contexts
        tokens = inputs['input_ids'][i]
        sequence_ids = inputs.sequence_ids(i)
   
        # finding the context start and end indexes wrt sub context tokens
        context_start_idx,context_end_idx = find_context_start_end_index(sequence_ids)
    
        # character indexes (in the full context text) where this sub context starts and ends
        context_start_char_index = mapping_idx_pairs[context_start_idx][0]
        context_end_char_index = mapping_idx_pairs[context_end_idx][1]
    

        # if the answer is not fully inside this sub context, the label is (0, 0)
        if (context_start_char_index > answer_start_char_idx) or (
            context_end_char_index < answer_end_char_idx):
            start_positions.append(0)
            end_positions.append(0)
    
        else:

            # otherwise find the token positions of the answer
            # here idx is a token index
            idx = context_start_idx
            while idx <= context_end_idx and mapping_idx_pairs[idx][0] <= answer_start_char_idx:
                idx += 1
            start_positions.append(idx - 1)  
        

            idx = context_end_idx
            # note: with `>` this loop stops on the last answer token, so idx + 1 is
            # one past it (an exclusive end); use `>=` if an inclusive end index is needed
            while idx >= context_start_idx and mapping_idx_pairs[idx][1] > answer_end_char_idx:
                idx -= 1
            end_positions.append(idx + 1)
    
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
    
train_sample = dataset["train"].select([i for i in range(200)])
    
train_dataset = train_sample.map(
    train_data_preprocess,
    batched=True,
    remove_columns=dataset["train"].column_names
)

len(train_sample),len(train_dataset)

(200, 200)

The 200 sampled examples gave 200 features here: with max_length=512 none of these contexts had to be split. With a smaller max_length, the number of features would be larger than the number of examples. Let's compare the values before and after tokenization. We will print some of the questions, contexts and answers after tokenization and compare them with the original ones.

In [14]:
def print_context_and_answer(idx,mini_ds=dataset["train"]):
    
    print(idx)
    print('----')
    question = mini_ds[idx]['question']
    context = mini_ds[idx]['context']
    answer = mini_ds[idx]['answers']['text']
    print('Theoretical values :')
    print(' ')
    print('Question: ')
    print(question)
    print(' ')
    print('Context: ')
    print(context)
    print(' ')
    print('Answer: ')
    print(answer)
    print(' ')
    answer_start_char_idx = mini_ds[idx]['answers']['answer_start'][0]
    answer_end_char_idx = answer_start_char_idx + len(mini_ds[idx]['answers']['text'][0])
    print('Start and end index of text: ',answer_start_char_idx,answer_end_char_idx)
    print('----'*20)
    print('Values after tokenization:')
    

    #answer
    sep_tok_index = train_dataset[idx]['input_ids'].index(102) #get index for [SEP]
    question_ = train_dataset[idx]['input_ids'][:sep_tok_index+1]
    question_decoded = tokenizer.decode(question_) 
    context_ = train_dataset[idx]['input_ids'][sep_tok_index+1:]
    context_decoded = tokenizer.decode(context_) 
    start_idx = train_dataset[idx]['start_positions']
    end_idx = train_dataset[idx]['end_positions']
    answer_toks = train_dataset[idx]['input_ids'][start_idx:end_idx]
    answer_decoded = tokenizer.decode(answer_toks)
    print(' ')
    print('Question: ')
    print(question_decoded)
    print(' ')
    print('Context: ')
    print(context_decoded)
    print(' ')
    print('Answer: ')
    print(answer_decoded)
    print(' ')
    print('Start pos and end pos of tokens: ',train_dataset[idx]['start_positions'],train_dataset[idx]['end_positions'])
    print('____'*20)
    
    
print_context_and_answer(0)
print_context_and_answer(1)
print_context_and_answer(2)
print_context_and_answer(3)

0
----
Theoretical values :
 
Question: 
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
 
Context: 
Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
 
Answer: 
['Saint Bernadette Soubirous']
 
Start and end index of text:  515 541
--------------------------------------------------------------------------------
Values after tok



<a class="anchor" id="section4"></a>
<h2 style="color:green;font-size: 2em;">Understanding Metrics needed for Evaluation</h2>

**How to evaluate the model?**

Let's take a small eval set. Here we do not need to do much preprocessing. We will use the pretrained "distilbert-base-uncased" weights, which are not fine-tuned, and see how the model performs.

* We will set the offset to None for all tokens that are not part of the context (the question and the special tokens).
* We will also attach the base context id to each sample.


We will evaluate using this untuned DistilBERT model and look at its performance. 

In [15]:
from transformers import AutoTokenizer

def preprocess_validation_examples(examples):
    """
    preprocessing validation data
    """
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=512,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")

    base_ids = []

    for i in range(len(inputs["input_ids"])):
        
        # take the id of the base example (when overflow happens, several features share it)
        base_context_idx = sample_map[i]
        base_ids.append(examples["id"][base_context_idx])
        
        # sequence id indicates the input. 0 for first input and 1 for second input
        # and None for special tokens by default
        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        # keep the offsets of context tokens only; question and special tokens get None
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["base_id"] = base_ids
    return inputs


# del tokenizer

trained_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)

data_val_sample = dataset["validation"].select([i for i in range(100)])
eval_set = data_val_sample.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=dataset["validation"].column_names,
)
len(eval_set)


100

In [16]:
import torch
from transformers import DistilBertForQuestionAnswering

# del tokenizer
# take a small sample

eval_set_for_model = eval_set.remove_columns(["base_id", "offset_mapping"])
eval_set_for_model.set_format("torch")

checkpoint =  "distilbert-base-uncased"
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

batch = {k: eval_set_for_model[k].to(device) for k in eval_set_for_model.column_names}

model = DistilBertForQuestionAnswering.from_pretrained(checkpoint).to(
    device
)


with torch.no_grad():
    outputs = model(**batch)
    
start_logits = outputs.start_logits.cpu().numpy()
end_logits = outputs.end_logits.cpu().numpy()

start_logits.shape,end_logits.shape


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this mode

((100, 512), (100, 512))

We will evaluate our model using the Evaluate library. We use 2 metrics for evaluation.

1. Exact match
2. F1 score

**Exact Match**

For each question-answer pair, if the characters of the model's prediction exactly match the characters of the true answer then EM = 1, otherwise EM = 0. When assessing against a negative example (a question with no answer), if the model predicts any text at all, it automatically receives 0 for that example.
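A minimal sketch of exact match could look like this. It assumes a SQuAD-style normalization (lowercasing, dropping punctuation and English articles, collapsing whitespace) and takes the best match over several reference answers; `normalize` and `exact_match` are hypothetical helpers, and the `evaluate` library's own implementation may differ in detail.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, references):
    """1 if the normalized prediction equals any normalized reference, else 0."""
    return int(any(normalize(prediction) == normalize(ref) for ref in references))

print(exact_match("Saint Bernadette Soubirous", ["Saint Bernadette Soubirous"]))
print(exact_match("the Empire State Building.", ["Empire state building"]))
print(exact_match("Empire", ["Empire state building"]))
```

Passing a list of references is how the multiple answers of a validation sample can be handled: a prediction counts as correct if it matches any of them.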

**F1 score**

F1 score depends on precision and recall.
```
f1 score = 2 * (precision * recall) / (precision + recall)

```

The F1 score is based on the number of words shared between the theoretical (ground-truth) answer and the predicted answer. Precision is the ratio of the number of shared words to the total number of words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the ground truth.
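
The definitions above can be sketched in a few lines of Python. This is a simplified version that splits on whitespace and lowercases the text; the function name `f1_score` is illustrative, and the real SQuAD metric applies additional text normalization before counting shared words.

```python
from collections import Counter

def f1_score(prediction: str, truth: str) -> float:
    """Token-overlap F1 between a predicted answer and a ground-truth answer."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()

    # if either side is empty, F1 is 1 only when both are empty
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)

    # number of shared words, counting repeated words at most as often as they appear in both
    shared = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0

    precision = shared / len(pred_tokens)
    recall = shared / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

# one shared word out of five predicted words: precision 0.2, recall 1.0
score = f1_score("in the city of Paris", "Paris")
print(round(score, 4))
```

Unlike exact match, this gives partial credit: the prediction contains the right answer, but the extra words lower precision and so lower the score.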

In [17]:
!pip install evaluate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting evaluate
  Downloading evaluate-0.2.2-py3-none-any.whl (69 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.2.2

In [18]:
import numpy as np
import collections
import evaluate

def predict_answers_and_evaluate(start_logits,end_logits,eval_set,examples):
    """
    Turn start/end logits into answer texts and score them with the SQuAD metric.

    Args:
        start_logits: start-position prediction logits
        end_logits: end-position prediction logits
        eval_set: processed validation data (features with offsets)
        examples: unprocessed validation data with the context text
    """
    # group the indices of all features by the id of the example they came from
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(eval_set):
        example_to_features[feature["base_id"]].append(idx)

    n_best = 20
    max_answer_length = 30
    predicted_answers = []

    for example in examples:
        example_id = example["id"]
        context = example["context"]
        answers = []

        # loop through every sub-context (feature) of this example and collect
        # candidate answers
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = eval_set["offset_mapping"][feature_index]
        
            # sort the logits over all token positions and keep the indices of
            # the n_best (top 20) start tokens and end tokens
            start_indexes = np.argsort(start_logit).tolist()[::-1][:n_best]
            end_indexes = np.argsort(end_logit).tolist()[::-1][:n_best]
        
    
            for start_index in start_indexes:
                for end_index in end_indexes:
                
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length.
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                       ):
                        continue

                    answers.append({
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                        })

    
        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})
    
    metric = evaluate.load("squad")

    theoretical_answers = [
            {"id": ex["id"], "answers": ex["answers"]} for ex in examples
    ]
    
    metric_ = metric.compute(predictions=predicted_answers, references=theoretical_answers)
    return predicted_answers,metric_
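
To make the selection logic in `predict_answers_and_evaluate` easier to follow, here is a minimal pure-Python sketch of the same idea on toy logits. The function name `best_span` and the logit values are made up for illustration. It works on token indices only and leaves out the offset lookup and the grouping of features by example.

```python
def best_span(start_logit, end_logit, n_best=20, max_answer_length=30):
    """Return (start_index, end_index, score) of the best valid span, or None."""
    # indices of the n_best highest start and end logits
    start_indexes = sorted(range(len(start_logit)),
                           key=lambda i: start_logit[i], reverse=True)[:n_best]
    end_indexes = sorted(range(len(end_logit)),
                         key=lambda i: end_logit[i], reverse=True)[:n_best]

    best = None
    for s in start_indexes:
        for e in end_indexes:
            # skip spans that end before they start or that are too long
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logit[s] + end_logit[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# toy logits for a 6-token input (illustrative values)
start_logit = [0.1, 0.2, 3.0, 0.5, 2.5, 0.1]
end_logit = [0.3, 4.0, 0.2, 2.0, 0.1, 1.0]

print(best_span(start_logit, end_logit, n_best=3, max_answer_length=3))  # (2, 3, 5.0)
```

Note that the highest end logit (index 1) is never used here, because it comes before every candidate start token. This is why start and end positions are scored as pairs rather than taken independently.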



Let us evaluate the model. The metric expects the predicted answers in the format we saw above (a list of dictionaries with one key for the ID of the example and one key for the predicted text) and the theoretical answers as a list of dictionaries with one key for the ID of the example and one key for the possible answers.

In [19]:
pred_answers,metrics_ = predict_answers_and_evaluate(start_logits,end_logits,eval_set,data_val_sample)
metrics_

{'exact_match': 0.0, 'f1': 3.9449777275864233}

The score is very poor, as expected: the question-answering head (`qa_outputs`) is newly initialized and has not been trained yet. We will now fine-tune the model and check its performance on the validation dataset.


<a class="anchor" id="section5"></a>
<h2 style="color:green;font-size: 2em;">Training a Question Answering System based on Bert</h2>

Let's load the dataset again from scratch. We will sample a small portion of the dataset for training. You can train on the full data if you have enough resources.

In [20]:
from datasets import load_dataset
dataset = load_dataset("squad")

# let's sample a small subset to keep training fast
dataset['train'] = dataset['train'].select(range(5000))
dataset['validation'] = dataset['validation'].select(range(500))

dataset

  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 5000
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 500
    })
})

Let us define a PyTorch dataset.

In [21]:
from torch.utils.data import DataLoader, Dataset


class DataQA(Dataset):
    def __init__(self, dataset,mode="train"):
        self.mode = mode
        
        
        if self.mode == "train":
            # tokenize the training split and add start/end token positions
            self.dataset = dataset["train"]
            self.data = self.dataset.map(
                train_data_preprocess,
                batched=True,
                remove_columns=dataset["train"].column_names,
            )
        else:
            # tokenize the validation split
            self.dataset = dataset["validation"]
            self.data = self.dataset.map(
                preprocess_validation_examples,
                batched=True,
                remove_columns=dataset["validation"].column_names,
            )
            
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):

        out = {}
        example = self.data[idx]
        out['input_ids'] = torch.tensor(example['input_ids'])
        out['attention_mask'] = torch.tensor(example['attention_mask'])

        
        if self.mode == "train":

            out['start_positions'] = torch.unsqueeze(torch.tensor(example['start_positions']),dim=0)
            out['end_positions'] = torch.unsqueeze(torch.tensor(example['end_positions']),dim=0)
            
        return out
        

In [22]:
trained_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)


train_dataset = DataQA(dataset,mode="train")
val_dataset = DataQA(dataset,mode="validation")



for i,d in enumerate(train_dataset):
    for k in d.keys():
        print(k + ' : ', d[k].shape)
    print('--'*40)

    if i == 3:
        break
        
print('__'*50)

for i,d in enumerate(val_dataset):
    for k in d.keys():
        print(k + ' : ', len(d[k]))
    print('--'*40)
    
    if i == 3:
        break

  0%|          | 0/5 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

input_ids :  torch.Size([512])
attention_mask :  torch.Size([512])
start_positions :  torch.Size([1])
end_positions :  torch.Size([1])
--------------------------------------------------------------------------------
input_ids :  torch.Size([512])
attention_mask :  torch.Size([512])
start_positions :  torch.Size([1])
end_positions :  torch.Size([1])
--------------------------------------------------------------------------------
input_ids :  torch.Size([512])
attention_mask :  torch.Size([512])
start_positions :  torch.Size([1])
end_positions :  torch.Size([1])
--------------------------------------------------------------------------------
input_ids :  torch.Size([512])
attention_mask :  torch.Size([512])
start_positions :  torch.Size([1])
end_positions :  torch.Size([1])
--------------------------------------------------------------------------------
____________________________________________________________________________________________________
input_ids :  512
attention_mask :  

Let us load the data in batches using a PyTorch `DataLoader`.

In [23]:
from transformers import default_data_collator
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=2,
)
eval_dataloader = DataLoader(
    val_dataset, collate_fn=default_data_collator, batch_size=2
)




for batch in train_dataloader:
    print(batch['input_ids'].shape)
    print(batch['attention_mask'].shape)
    print(batch['start_positions'].shape)
    print(batch['end_positions'].shape)
    break

print('---'*20)

for batch in eval_dataloader:
    print(batch['input_ids'].shape)
    print(batch['attention_mask'].shape)
    break

torch.Size([2, 512])
torch.Size([2, 512])
torch.Size([2, 1])
torch.Size([2, 1])
------------------------------------------------------------
torch.Size([2, 512])
torch.Size([2, 512])


**Define Model**

In [24]:
from transformers import DistilBertForQuestionAnswering
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Available device: {device}')

checkpoint =  "distilbert-base-uncased"
model = DistilBertForQuestionAnswering.from_pretrained(checkpoint)
model = model.to(device)

Available device: cuda


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this mode

**Model Training**

In [25]:
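# transformers.AdamW is deprecated; torch.optim.AdamW is the maintained replacement
from torch.optim import AdamW
from tqdm.notebook import tqdm
import datetime
import numpy as np
import collections
import evaluate

# weight_decay=0.0 matches the default of the deprecated transformers.AdamW
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)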

epochs = 2

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs
print(total_steps)


def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))





5010
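
As a quick sanity check, the printed value can be reproduced by hand. The sketch below assumes the 2,505 batches per epoch reported in the training log later in this section, together with the batch size of 2 and the 2 epochs set above.

```python
batches_per_epoch = 2505  # taken from the training log ("Batch 40 of 2,505")
epochs = 2
batch_size = 2

# total optimizer steps = batches per epoch x epochs
total_steps = batches_per_epoch * epochs
print(total_steps)  # 5010

# with batch_size=2, 2,505 batches correspond to 5,009 or 5,010 training features
max_features = batches_per_epoch * batch_size
min_features = max_features - batch_size + 1
print(min_features, max_features)  # 5009 5010
```

The feature count is slightly larger than the 5,000 sampled examples, presumably because the preprocessing step splits long contexts into more than one feature.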


We need the processed validation data at evaluation time to get the offsets for each context.

In [26]:
# we need processed validation data to get offsets at the time of evaluation
validation_processed_dataset = dataset["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=dataset["validation"].column_names,
)

  0%|          | 0/1 [00:00<?, ?ba/s]
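
To see why the offsets matter, here is a small sketch of how a predicted token span is turned back into text. The context, the word-level offsets, and the helper name `span_to_text` are all made up for illustration; a real tokenizer produces subword tokens and its own offsets. As in the evaluation function, `None` is assumed to mark tokens that are not part of the context.

```python
context = "BERT was released by Google in 2018."

# one (start_char, end_char) pair per token; None marks tokens outside the context
# (for example special tokens and question tokens)
offsets = [None, None, None,
           (0, 4), (5, 8), (9, 17), (18, 20), (21, 27), (28, 30), (31, 35), (35, 36),
           None]

def span_to_text(context, offsets, start_index, end_index):
    """Map a token span back to the matching substring of the context."""
    # spans that touch a token outside the context are not valid answers
    if offsets[start_index] is None or offsets[end_index] is None:
        return ""
    start_char = offsets[start_index][0]
    end_char = offsets[end_index][1]
    return context[start_char:end_char]

print(span_to_text(context, offsets, 7, 7))  # Google
print(span_to_text(context, offsets, 8, 9))  # in 2018
```

The model only predicts token positions, so without the stored offsets there would be no way to recover the exact answer string from the original context.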

In [27]:
import random,time
import numpy as np

# to reproduce results
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)


#storing all training and validation stats
stats = []


#to measure total training time
total_train_time_start = time.time()

for epoch in range(epochs):
    print(' ')
    print(f'=====Epoch {epoch + 1}=====')
    print('Training....')
     
    # ===============================
    #    Train
    # ===============================   
    # measure how long training epoch takes
    t0 = time.time()
     
    training_loss = 0
    # loop through train data
    model.train()
    for step,batch in enumerate(train_dataloader):
         
        # report progress every 40 batches
        if step % 40 == 0 and step != 0:
            elapsed_time = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed_time))

         
       
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
            


        #set gradients to zero
        model.zero_grad()

        result = model(input_ids = input_ids, 
                        attention_mask = attention_mask,
                        start_positions = start_positions,
                        end_positions = end_positions,
                        return_dict=True)
         
        loss = result.loss
    
        #accumulate the loss over batches so that we can calculate avg loss at the end
        training_loss += loss.item()      

        # perform backpropagation
        loss.backward()

        # update the model weights
        optimizer.step()

    # calculate avg loss
    avg_train_loss = training_loss/len(train_dataloader) 
 
    # calculates training time
    training_time = format_time(time.time() - t0)
     
    
    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
    
    
    # ===============================
    #    Validation
    # ===============================
     
    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()
     

    start_logits,end_logits = [],[]
    for step,batch in enumerate(eval_dataloader):
         
       
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)

         
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():
            result = model(input_ids=input_ids,
                           attention_mask=attention_mask,
                           return_dict=True)
        


        start_logits.append(result.start_logits.cpu().numpy())
        end_logits.append(result.end_logits.cpu().numpy())
   

    start_logits = np.concatenate(start_logits)
    end_logits = np.concatenate(end_logits)
    # start_logits = start_logits[: len(val_dataset)]
    # end_logits = end_logits[: len(val_dataset)]




    # calculating metrics
    answers,metrics_ = predict_answers_and_evaluate(start_logits,end_logits,validation_processed_dataset,dataset["validation"])
    print(f'Exact match: {metrics_["exact_match"]}, F1 score: {metrics_["f1"]}')


    print('')
    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)

    print("  Validation took: {:}".format(validation_time))

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_train_time_start)))


 
=====Epoch 1=====
Training....
  Batch    40  of  2,505.    Elapsed: 0:00:03.
  Batch    80  of  2,505.    Elapsed: 0:00:06.
  Batch   120  of  2,505.    Elapsed: 0:00:08.
  Batch   160  of  2,505.    Elapsed: 0:00:11.
  Batch   200  of  2,505.    Elapsed: 0:00:14.
  Batch   240  of  2,505.    Elapsed: 0:00:17.
  Batch   280  of  2,505.    Elapsed: 0:00:20.
  Batch   320  of  2,505.    Elapsed: 0:00:22.
  Batch   360  of  2,505.    Elapsed: 0:00:25.
  Batch   400  of  2,505.    Elapsed: 0:00:28.
  Batch   440  of  2,505.    Elapsed: 0:00:31.
  Batch   480  of  2,505.    Elapsed: 0:00:33.
  Batch   520  of  2,505.    Elapsed: 0:00:36.
  Batch   560  of  2,505.    Elapsed: 0:00:39.
  Batch   600  of  2,505.    Elapsed: 0:00:42.
  Batch   640  of  2,505.    Elapsed: 0:00:44.
  Batch   680  of  2,505.    Elapsed: 0:00:47.
  Batch   720  of  2,505.    Elapsed: 0:00:50.
  Batch   760  of  2,505.    Elapsed: 0:00:53.
  Batch   800  of  2,505.    Elapsed: 0:00:56.
  Batch   840  of  2,505.  

**Note: Training for more epochs on the full dataset should give better results.**