<a href="https://www.kaggle.com/code/susantaghosh/fine-tuning-bert-for-extractive-qa?scriptVersionId=96924287" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<a href="https://colab.research.google.com/github/susantaghosh1/nlp-notebooks/blob/develop/Fine_Tuning_Extractive_QA_with_BERT_and_Friends.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning BERT/RoBERTa/DeBERTa/ALBERT/DistilBERT for Extractive QA on the SQuAD Dataset

In this section we will fine-tune an encoder-only model for extractive QA on the SQuAD dataset. Encoder-only models like BERT tend to be great at extracting answers to factoid questions like “Who invented the Transformer architecture?” but fare poorly when given open-ended questions like “Why is the sky blue?” In these more challenging cases, encoder-decoder models like T5 and BART are typically used to synthesize the information in a way that’s quite similar to text summarization.

All of these models work in the same way: they add a linear layer on top of the base model, which produces a tensor of shape (batch_size, sequence_length, 2) holding the unnormalized scores **[LOGITS]** for the start position and end position of the answer, for every token of every example in the batch.

Let's briefly discuss how the model works internally:

1. The tokenized question and context are passed together as a pair to the model. **Let's say the input to the model has shape (5,30), where 5 is the batch size and 30 is the sequence length (the number of tokens in each input).**
2. Vanilla BERT (or one of its friends) produces a contextualized embedding for every token in the sequence. The output of BERT has shape **(5,30,768), where 5 is the batch size, 30 is the sequence length and 768 is the embedding dimension of each token.**
3. A single linear head is applied on top of every token: it takes the 768-dimensional embedding as input and outputs 2 values per token, which we call the start logit and the end logit. The output now has shape **(5,30,2)**.
4. We split this tensor into start_logits and end_logits, each of shape **(5,30,1)**.
5. We squeeze the start and end logits along the last dimension to remove that singleton dimension, so each of them now has shape **(5,30)**.

**start_logits = tensor of shape (5,30)**
**end_logits = tensor of shape (5,30)**

6. The model takes the start_positions and end_positions of the answer in the tokenized data as labels:

start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the start of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence are not taken into account for computing the loss.

end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the end of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence are not taken into account for computing the loss.

**start_positions = tensor of shape (5,)**
**end_positions = tensor of shape (5,)**

7. Cross-entropy loss is computed between **start_logits and start_positions**, and between **end_logits and end_positions**.

8. The total loss is the average of these two losses (**start_logits vs. start_positions** and **end_logits vs. end_positions**). It is backpropagated through the model to compute the gradients and update the weights.
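The shape bookkeeping in the steps above can be checked with a small NumPy sketch. It is only an illustration under stated assumptions: random numbers stand in for BERT's output, `W` and `b` are hypothetical head weights, and `cross_entropy` is a hand-written helper rather than the PyTorch loss used by the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden = 5, 30, 768

# Stand-in for BERT's last hidden state: (5, 30, 768)
sequence_output = rng.normal(size=(batch_size, seq_len, hidden))

# Hypothetical linear head mapping 768 -> 2 for every token
W = rng.normal(scale=0.02, size=(hidden, 2))
b = np.zeros(2)
logits = sequence_output @ W + b                      # (5, 30, 2)

# Split into two (5, 30, 1) tensors, then squeeze the last dimension
start_logits, end_logits = np.split(logits, 2, axis=-1)
start_logits = start_logits.squeeze(-1)               # (5, 30)
end_logits = end_logits.squeeze(-1)                   # (5, 30)


def cross_entropy(token_logits, positions):
    """Mean negative log-likelihood of the labelled position in each row."""
    shifted = token_logits - token_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(positions)), positions].mean()


# Made-up labels: one start index and one end index per example, shape (5,)
start_positions = rng.integers(0, seq_len, size=batch_size)
end_positions = rng.integers(0, seq_len, size=batch_size)

total_loss = (cross_entropy(start_logits, start_positions)
              + cross_entropy(end_logits, end_positions)) / 2
print(logits.shape, start_logits.shape, end_logits.shape, total_loss)
```

Each row of `start_logits` is treated as a classification over the 30 token positions, which is why an ordinary cross-entropy loss applies.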

Pseudo code for QA Model with BERT

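from typing import Optional

import torch
from torch import nn
from torch.nn import CrossEntropyLoss
from transformers import BertModel, BertPreTrainedModel

# BertPreTrainedModel (unlike a plain nn.Module) accepts a config and provides post_init()
class PseudoQA(BertPreTrainedModel):

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels  # expected to be 2: one start score and one end score per token

        self.bert = BertModel(config, add_pooling_layer=False)
        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)

        # Initialize weights and apply final processing
        self.post_init()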
  
    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        start_positions: Optional[torch.Tensor] = None,
        end_positions: Optional[torch.Tensor] = None,
    ):
        # Only the inputs declared in the signature are forwarded to BERT
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )

        sequence_output = outputs[0]  # last hidden state output of BERT

        # ** shape of sequence_output : (batch_size,sequence_length,768) **

        logits = self.qa_outputs(sequence_output)
        # ** shape of logits : (batch_size,sequence_length,2) **
        start_logits, end_logits = logits.split(1, dim=-1)
        # ** shape of start_logits and end_logits : (batch_size,sequence_length,1) **
        start_logits = start_logits.squeeze(-1).contiguous() # ** shape : (batch_size,sequence_length) **
        end_logits = end_logits.squeeze(-1).contiguous() # ** shape : (batch_size,sequence_length) **

        total_loss = None
        if start_positions is not None and end_positions is not None:
            # On multi-GPU the labels can arrive with an extra dimension; squeeze it away
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)
            # Sometimes the start/end positions are outside our model inputs;
            # clamping them to ignored_index makes the loss skip those terms
            ignored_index = start_logits.size(1)
            start_positions = start_positions.clamp(0, ignored_index)
            end_positions = end_positions.clamp(0, ignored_index)

            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2

        # Simplified return: the loss (None at inference time) and both sets of logits
        return total_loss, start_logits, end_logits
  



Enough theory! Let's get our hands dirty.

In [3]:
%%capture
!pip install datasets transformers[sentencepiece]
!pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
!pip install scipy scikit-learn

In [4]:
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cpu')

In [3]:
!nvidia-smi

/bin/bash: nvidia-smi: command not found


In [6]:
# load the dataset

from datasets import load_dataset

raw_datasets = load_dataset("squad")


In [7]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [None]:
print("Context: ", raw_datasets["train"][0]["context"])
print("Question: ", raw_datasets["train"][0]["question"])
print("Answer: ", raw_datasets["train"][0]["answers"])

In [None]:
print(raw_datasets["train"][0]["answers"].keys())
print(type(raw_datasets["train"][0]["answers"]['text']))
print(raw_datasets["train"][0]["answers"]['text'][0])

In [None]:
answer = raw_datasets["train"][0]["answers"]['text'][0]
answer_start = raw_datasets["train"][0]["answers"]['answer_start'][0]
answer_end = answer_start + len(answer)
answer_from_context = raw_datasets["train"][0]["context"] [answer_start:answer_end]


In [None]:
answer_from_context

In the training set, there is only one possible answer for each question. We can double-check this by using the Dataset.filter() method:

In [None]:
raw_datasets["train"].filter(lambda x: len(x["answers"]["text"]) != 1)

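For evaluation, however, there are several possible answers for each sample, which may be the same or different. The evaluation metric compares a predicted answer to all the acceptable answers and takes the best score: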

In [None]:
print(raw_datasets["validation"][0]["answers"])
print(raw_datasets["validation"][2]["answers"])

# Preprocessing the training data

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
tokenizer.is_fast,tokenizer.special_tokens_map

We can pass to our tokenizer the question and the context together, and it will properly insert the special tokens to form a sentence like this:

[CLS] question [SEP] context [SEP]
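To make the layout concrete, here is a toy sketch of how such a pair is assembled. It assumes BERT-style special tokens and uses hand-split words instead of real WordPiece tokens; `build_pair` is a hypothetical helper, not part of the tokenizer API.

```python
def build_pair(question_tokens, context_tokens):
    """Assemble [CLS] question [SEP] context [SEP] with BERT-style segment ids."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the context and the final [SEP]
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(
    ["who", "wrote", "it", "?"],
    ["ada", "wrote", "it", "."],
)
for token, segment in zip(tokens, token_type_ids):
    print(f"{token:>6}  segment {segment}")
```

The segment ids play the role of `token_type_ids` in the real tokenizer output, which is how BERT tells the question apart from the context.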


In [None]:
context = raw_datasets["train"][0]["context"]
question = raw_datasets["train"][0]["question"]

inputs = tokenizer(question, context,return_offsets_mapping=True)


In [None]:
len(inputs['input_ids']),len(inputs['offset_mapping']),inputs

In [None]:
tokenizer.decode(inputs["input_ids"])


In [None]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

In this case the context is not too long, but some of the examples in the dataset have very long contexts that will exceed the maximum length we set (384 later in this notebook). We will deal with long contexts by creating several training features from one sample of our dataset, with a sliding window between them.

To see how this works using the current example, we can limit the length to 100 and use a sliding window of 50 tokens. As a reminder, we use:

- max_length to set the maximum length (here 100)
- truncation="only_second" to truncate the context (which is in the second position) when the question with its context is too long
- stride to set the number of overlapping tokens between two successive chunks (here 50)
- return_overflowing_tokens=True to let the tokenizer know we want the overflowing tokens
- return_offsets_mapping=True to get the character offsets of each token with respect to the tokenizer input (offsets into the question where sequence_id is 0, offsets into the context where it is 1)
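The sliding window itself can be sketched in a few lines of plain Python. This is a simplified illustration, not the tokenizer's actual implementation: `sliding_windows` is a hypothetical helper, and `window` is assumed to be the number of context tokens that fit in one feature (that is, `max_length` minus the question and special tokens).

```python
def sliding_windows(context_tokens, window, stride):
    """Split tokens into chunks of at most `window` tokens.

    Consecutive chunks share `stride` tokens, so an answer cut off at the
    edge of one chunk can still appear whole in the next one.
    """
    if stride >= window:
        raise ValueError("stride must be smaller than window")
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(context_tokens[start:start + window])
        if start + window >= len(context_tokens):
            break  # this chunk already reaches the end of the context
        start += step
    return chunks


chunks = sliding_windows(list(range(10)), window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

With 10 tokens, a window of 4 and an overlap of 2, each new chunk starts 2 tokens after the previous one.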

In [None]:
batch_encoding = tokenizer(question,context,max_length=100,truncation="only_second",stride=50,
                           return_overflowing_tokens=True,return_offsets_mapping=True)

In [None]:
batch_encoding.keys(),len(batch_encoding['input_ids'])

In [None]:
batch_encoding

In [None]:
batch_encoding['overflow_to_sample_mapping'] # the one long context has been split into 4 features

In [None]:
sequence_ids = batch_encoding.sequence_ids(0)
for idx, (token_id, positions) in enumerate(zip(batch_encoding['input_ids'][0], batch_encoding['offset_mapping'][0])):
  if sequence_ids[idx] == 0:    # token belongs to the question
    sliced_text = question[positions[0]:positions[1]]
  elif sequence_ids[idx] == 1:  # token belongs to the context
    sliced_text = context[positions[0]:positions[1]]
  else:                         # special token: its offset is (0, 0), so there is no text to slice
    sliced_text = ""
  print(f"token id :: {token_id} and decoded token :: {tokenizer.convert_ids_to_tokens(token_id)} and positions :: {positions} and sliced text :: {sliced_text}")

In [None]:
# let's try to encode few more samples together

sample_question =  raw_datasets["train"][2:6]["question"] # list of size 4
sample_context =  raw_datasets["train"][2:6]["context"] # list of size 4
sample_answers = raw_datasets["train"][2:6]["answers"]
sample_question,sample_question[0],sample_context[0],sample_answers[0]

In [None]:
sample_encoding = tokenizer(sample_question,sample_context,max_length=100,truncation="only_second",stride=50,
                           return_overflowing_tokens=True,return_offsets_mapping=True)
sample_encoding,sample_encoding.keys(),len(sample_encoding['offset_mapping'][0])

In [None]:
for k,v in sample_encoding.items():
  print(f"shape of {k} :: {len(v)}")  # 4 inputs  results in 19 samples

input_ids, token_type_ids, attention_mask and offset_mapping are each a list of lists, while overflow_to_sample_mapping is a flat list.

Let's make the labels. The labels are start_positions and end_positions, each of shape (batch_size,). For each feature they are:

- (0, 0) if the answer is not in the corresponding span of the context
- (start_position, end_position) if the answer is in the corresponding span of the context, with start_position being the index of the token (in the input IDs) at the start of the answer and end_position being the index of the token (in the input IDs) where the answer ends
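The labelling rule can be tried on a hand-made example before applying it to real tokenizer output. The sketch below is illustrative only: `char_span_to_token_span` is a hypothetical helper, and the offsets and sequence ids are written by hand for the toy context "Ada wrote code" rather than produced by a tokenizer.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map an answer's character span to (start_token, end_token), or (0, 0)."""
    context_indices = [i for i, s in enumerate(sequence_ids) if s == 1]
    context_start, context_end = context_indices[0], context_indices[-1]

    # Answer not fully inside this feature's slice of the context
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Last context token that starts at or before the answer start
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # First context token that ends at or after the answer end
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    return start_token, idx + 1


# Feature layout: [CLS] who ? [SEP] ada wrote code [SEP]
offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 3), (4, 9), (10, 14), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]

print(char_span_to_token_span(offsets, sequence_ids, 4, 14))  # answer "wrote code"
print(char_span_to_token_span(offsets, sequence_ids, 0, 3))   # answer "Ada"

# A second feature that only contains "wrote code": "Ada" falls outside it
partial_offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (4, 9), (10, 14), (0, 0)]
partial_sequence_ids = [None, 0, 0, None, 1, 1, None]
print(char_span_to_token_span(partial_offsets, partial_sequence_ids, 0, 3))
```

Labelling out-of-window answers as (0, 0) points both labels at the [CLS] token.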

In [None]:
sample_answers = raw_datasets["train"][2:6]["answers"]
sample_answers,sample_answers[0]

In [None]:
sample_encoding['overflow_to_sample_mapping']


In [None]:
# find the original sample
# find answers start and end char positions of that original sample
# Find the start and end of the context
# If the answer is not fully inside the context, label is (0, 0)
# Otherwise it's the start and end token positions
sample_mappings = sample_encoding['overflow_to_sample_mapping']
start_positions = []
end_positions = []
for i,offset in enumerate(sample_encoding['offset_mapping']):
  original_sample_id = sample_mappings[i] #find the original sample
  answer = sample_answers[original_sample_id]
  answer_start = answer['answer_start'][0]
  answer_end = answer_start+len(answer['text'][0])
  sequence_id = sample_encoding.sequence_ids(i)
  idx = 0
  while sequence_id[idx]!=1:
    idx +=1
  context_start = idx
  while sequence_id[idx]==1:
    idx +=1
  context_end = idx-1
  if offset[context_start][0]>answer_start or offset[context_end][1]<answer_end:
    start_positions.append(0)
    end_positions.append(0)
  else:
    idx = context_start
    while idx <= context_end and offset[idx][0] <= answer_start:
      idx +=1
    start_positions.append(idx-1)
    idx = context_end
    while idx >= context_start and offset[idx][1] >= answer_end:
      idx -= 1
    end_positions.append(idx+1)
start_positions, end_positions




Let’s take a look at a few results to verify that our approach is correct. For the first feature we find (83, 85) as labels, so let’s compare the theoretical answer with the decoded span of tokens from 83 to 85 (inclusive):

In [None]:
idx = 0
sample_idx = sample_encoding["overflow_to_sample_mapping"][idx]
answer = sample_answers[sample_idx]["text"][0]

start = start_positions[idx]
end = end_positions[idx]
labeled_answer = tokenizer.decode(sample_encoding["input_ids"][idx][start : end + 1])

print(f"Theoretical answer: {answer}, labels give: {labeled_answer}")

In [None]:
idx = 4
sample_idx = sample_encoding["overflow_to_sample_mapping"][idx]
answer = sample_answers[sample_idx]["text"][0]

decoded_example = tokenizer.decode(sample_encoding["input_ids"][idx])
print(f"Theoretical answer: {answer}, decoded example: {decoded_example}") #we don’t see the answer inside the context.

In [None]:
max_length = 384
stride = 128


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [None]:
tokenized_dataset = raw_datasets.map(
    preprocess_training_examples,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

In [None]:
tokenized_dataset

# Fine Tuning the **model**

In [None]:
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

In [54]:
from huggingface_hub import notebook_login

notebook_login()


In [55]:
!apt install git-lfs


In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-squad",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    push_to_hub=True,
)

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    tokenizer=tokenizer,
)
trainer.train()

# Evaluating the model

In the Hugging Face QA pipeline, or the TransformerReader in Haystack, inference proceeds in the following steps:

1. The model outputs a **start logit and an end logit** for each token in the batch
2. We mask the logits of the question tokens as well as the padding tokens
3. We convert the logits into probabilities by taking a softmax
4. We calculate the score of each **(start_logit, end_logit)** pair by taking the product of the two probabilities (an outer product over all start and end positions)
5. We look for the pair with the maximum score that yields a valid answer (e.g., a start token that does not come after the end token).
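The probability-based procedure can be sketched with NumPy for a single feature. This is a minimal illustration of the steps listed above, not the pipeline's actual code: `best_span_by_probability` and `context_mask` are hypothetical names, and the logits are made up.

```python
import numpy as np


def softmax(x):
    shifted = x - x.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def best_span_by_probability(start_logits, end_logits, context_mask):
    """Return (start, end, score) of the most probable span with start <= end."""
    # Step 2: question and padding tokens get -inf, so their probability is 0
    masked_start = np.where(context_mask, start_logits, -np.inf)
    masked_end = np.where(context_mask, end_logits, -np.inf)
    # Step 3: logits -> probabilities
    start_probs = softmax(masked_start)
    end_probs = softmax(masked_end)
    # Step 4: score of every (start, end) pair
    scores = np.outer(start_probs, end_probs)
    # Step 5: zero out pairs where the end comes before the start, then take the max
    scores = np.triu(scores)
    start, end = np.unravel_index(np.argmax(scores), scores.shape)
    return int(start), int(end), float(scores[start, end])


start_logits = np.array([0.0, 5.0, 1.0, 0.0])
end_logits = np.array([0.0, 0.0, 1.0, 4.0])
context_mask = np.array([False, True, True, True])  # position 0 is not context
start, end, score = best_span_by_probability(start_logits, end_logits, context_mask)
print(start, end, score)
```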

To speed up the evaluation step we will change the above steps a little bit:

1. We will skip the softmax step (the logit scores are sufficient)
2. Instead of calculating the score of every (start_logit, end_logit) pair, we will sort the start and end logits and keep only the **n_best** of each, where n_best is a user-defined parameter such as 5 or 20.
3. Since we skip the softmax, the scores will be logit scores, obtained by taking the sum of the start and end logits (instead of the product, because of the rule **log(ab) = log(a) + log(b)**).
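A quick numerical check shows why skipping the softmax does not change which span wins. The product of two softmax probabilities is proportional to exp(start_logit + end_logit), so ranking pairs by the sum of logits gives the same order as ranking them by the product of probabilities. The logits below are made up for illustration.

```python
import numpy as np


def softmax(x):
    exp = np.exp(x - x.max())
    return exp / exp.sum()


start_logits = np.array([0.1, 2.0, -1.0])
end_logits = np.array([0.5, -0.3, 1.7])

# Score of every (start, end) pair, computed both ways
prob_scores = np.outer(softmax(start_logits), softmax(end_logits))
logit_scores = start_logits[:, None] + end_logits[None, :]

best_by_prob = tuple(int(i) for i in np.unravel_index(np.argmax(prob_scores), prob_scores.shape))
best_by_logit = tuple(int(i) for i in np.unravel_index(np.argmax(logit_scores), logit_scores.shape))
print(best_by_prob, best_by_logit)
```

The equivalence holds as long as both variants consider the same set of candidate positions, for example after the same masking.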

Let's make a small batch of 16 documents from the validation set and evaluate our model

In [8]:
batch = raw_datasets["validation"].shuffle().select(range(16))
batch

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 16
})

In [9]:
each_document = batch[0]  # look at the first document in the batch
each_document

{'id': '570d529fb3d812140066d6bc',
 'title': 'Victoria_(Australia)',
 'context': 'Major events also play a big part in tourism in Victoria, particularly cultural tourism and sports tourism. Most of these events are centred on Melbourne, but others occur in regional cities, such as the V8 Supercars and Australian Motorcycle Grand Prix at Phillip Island, the Grand Annual Steeplechase at Warrnambool and the Australian International Airshow at Geelong and numerous local festivals such as the popular Port Fairy Folk Festival, Queenscliff Music Festival, Bells Beach SurfClassic and the Bright Autumn Festival.',
 'question': 'Besides cultural events, what other tourist attraction does Victoria have?',
 'answers': {'text': ['sports', 'sports tourism', 'sports'],
  'answer_start': [92, 92, 92]}}

In [29]:
trained_model_checkpoint = 'susghosh/bert-finetuned-squad'
from transformers import AutoTokenizer,AutoModelForQuestionAnswering
tokenizer = AutoTokenizer.from_pretrained(trained_model_checkpoint)
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = AutoModelForQuestionAnswering.from_pretrained(trained_model_checkpoint).to(device)

In [11]:
max_length = 384
stride = 128
def pre_process_small_batch(example):
   inputs = tokenizer(
        example['question'],
        example["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
   return inputs
batch_encoding = batch.map(pre_process_small_batch,batched=True,remove_columns=raw_datasets["validation"].column_names,)


In [16]:
batch_encoding, len(batch_encoding) ## here the 16 documents map to 16 features, so no context needed splitting

(Dataset({
     features: ['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'],
     num_rows: 16
 }),
 16)

In [13]:
batch_encoding['offset_mapping'][5]

[[0, 0],
 [0, 3],
 [4, 7],
 [8, 11],
 [12, 16],
 [17, 21],
 [22, 24],
 [25, 29],
 [30, 33],
 [34, 35],
 [36, 43],
 [44, 46],
 [47, 55],
 [55, 56],
 [0, 0],
 [0, 3],
 [4, 11],
 [12, 19],
 [20, 29],
 [30, 32],
 [33, 36],
 [37, 41],
 [42, 45],
 [46, 47],
 [47, 51],
 [52, 57],
 [58, 65],
 [65, 66],
 [67, 71],
 [72, 75],
 [76, 80],
 [81, 90],
 [91, 95],
 [96, 99],
 [100, 103],
 [104, 111],
 [111, 112],
 [113, 116],
 [117, 121],
 [122, 129],
 [130, 134],
 [135, 140],
 [141, 143],
 [144, 147],
 [148, 152],
 [153, 165],
 [166, 176],
 [176, 177],
 [178, 181],
 [182, 191],
 [192, 201],
 [202, 213],
 [214, 224],
 [225, 231],
 [232, 240],
 [240, 241],
 [242, 247],
 [248, 251],
 [252, 259],
 [259, 260],
 [261, 264],
 [265, 273],
 [274, 276],
 [277, 284],
 [285, 293],
 [294, 300],
 [301, 303],
 [304, 309],
 [310, 315],
 [316, 318],
 [319, 322],
 [323, 329],
 [329, 330],
 [331, 338],
 [339, 349],
 [350, 354],
 [355, 362],
 [363, 368],
 [369, 377],
 [378, 387],
 [388, 390],
 [391, 394],
 [395, 402],
 

In [17]:
batch_offset_mapping = batch_encoding['offset_mapping']
batch_sample_mapping = batch_encoding['overflow_to_sample_mapping']
batch_encoding = batch_encoding.remove_columns(['offset_mapping','overflow_to_sample_mapping'])
batch_encoding.set_format('torch')

In [18]:
batch_encoding,batch_encoding.column_names

(Dataset({
     features: ['input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 16
 }),
 ['input_ids', 'token_type_ids', 'attention_mask'])

In [19]:
input_for_model ={k : batch_encoding[k].to(device) for k in batch_encoding.column_names}

In [20]:
with torch.no_grad():
 output = model(**input_for_model)

In [21]:
output.start_logits.shape,output.end_logits.shape, ## each of the logits should be (batch_size,384)

(torch.Size([16, 384]), torch.Size([16, 384]))

In [15]:
# Let's grab the logits of the first feature
# Select the n_best start and end logits
# Score every valid (start, end) pair by the sum of its two logits
# Keep the pair with the highest score
n_best_size = 20
import numpy as np
idx =0 
start_logits = output.start_logits[idx].cpu().numpy()
end_logits = output.end_logits[idx].cpu().numpy()
start_indices = np.argsort(start_logits)[-1:-n_best_size-1:-1].tolist()
end_indices = np.argsort(end_logits)[-1:-n_best_size-1:-1].tolist()
answers = []
for each_start_index in start_indices:
    for each_end_index in end_indices:
        if(each_start_index<=each_end_index):
            logit_score = start_logits[each_start_index]+end_logits[each_end_index]
            context = batch['context'][batch_sample_mapping[idx]]
            answer_start,_ = batch_offset_mapping[idx][each_start_index]
            _,answer_end = batch_offset_mapping[idx][each_end_index]
            answer = context[answer_start:answer_end]
            answers.append({'logit_score':logit_score,'answer':answer})
best_answer = max(answers,key = lambda x: x['logit_score'])

In [16]:
best_answer

{'logit_score': -1.6018867, 'answer': 'Diffie–Hellman key exchange'}

In [17]:
# let's extract all the answers in the batch of 16 documents
n_best_size = 20
import numpy as np
start_logits = output.start_logits.cpu().numpy()
end_logits = output.end_logits.cpu().numpy()
print(start_logits.shape[0])
predicted_ansers = []
for idx in range(start_logits.shape[0]):
    start_indices = np.argsort(start_logits[idx])[-1:-n_best_size-1:-1].tolist()
    end_indices = np.argsort(end_logits[idx])[-1:-n_best_size-1:-1].tolist()
    answers = []
    for each_start_index in start_indices:
        for each_end_index in end_indices:
            if(each_start_index<=each_end_index):
                logit_score = start_logits[idx][each_start_index]+end_logits[idx][each_end_index]
                context = batch['context'][batch_sample_mapping[idx]]
                id = batch['id'][batch_sample_mapping[idx]]
                answer_start,_ = batch_offset_mapping[idx][each_start_index]
                _,answer_end = batch_offset_mapping[idx][each_end_index]
                answer = context[answer_start:answer_end]
                answers.append({'logit_score':logit_score,'answer':answer,'id':id})
    best_answer = max(answers,key = lambda x: x['logit_score'])
    predicted_ansers.append({'prediction_text':best_answer['answer'],'id':best_answer['id']})

16


In [18]:
predicted_ansers

[{'prediction_text': 'Diffie–Hellman key exchange',
  'id': '572996c73f37b319004784b3'},
 {'prediction_text': 'toward the Atlantic', 'id': '5725c071271a42140099d128'},
 {'prediction_text': 'Thomas Edison', 'id': '56e0d54a7aa994140058e76c'},
 {'prediction_text': 'over the age of 18', 'id': '572fdb17b2c2fd140056851f'},
 {'prediction_text': 'the convection of the mantle',
  'id': '57265d08708984140094c39a'},
 {'prediction_text': 'deep-level', 'id': '57268a8fdd62a815002e88d0'},
 {'prediction_text': 'Émile Girardeau', 'id': '56e108abe3433e1400422b0f'},
 {'prediction_text': 'Asia', 'id': '5726d4a45951b619008f7f6a'},
 {'prediction_text': 'Cape of Good Hope', 'id': '571077ecb654c5140001f909'},
 {'prediction_text': 'Galileo Ferraris', 'id': '56e0dbb57aa994140058e77b'},
 {'prediction_text': 'mercuric oxide', 'id': '571a4d1a4faf5e1900b8a95a'},
 {'prediction_text': "IPCC does not carry out its own research, it operates on the basis of scientific papers and independently documented results from oth

In [19]:
for gold_answer,predicted_answer in zip(batch['answers'],predicted_ansers):
    print(f"gold answer :: {gold_answer['text']} and predicted answer :::: {predicted_answer['prediction_text']}")

gold answer :: ['RSA', 'RSA', 'RSA', 'RSA'] and predicted answer :::: Diffie–Hellman key exchange
gold answer :: ['Water on the eastern side flowed toward the Atlantic,', 'toward the Atlantic', 'toward the Atlantic'] and predicted answer :::: toward the Atlantic
gold answer :: ['Thomas Edison', 'Thomas Edison', 'Thomas Edison'] and predicted answer :::: Thomas Edison
gold answer :: ['over the age of 18', 'over the age of 18', '18'] and predicted answer :::: over the age of 18
gold answer :: ['the convecting mantle', 'convection of the mantle', 'convection of the mantle', 'the convecting mantle'] and predicted answer :::: the convection of the mantle
gold answer :: ['deep-level', 'deep-level', 'deep-level tunnels'] and predicted answer :::: deep-level
gold answer :: ['Émile Girardeau', 'Émile Girardeau', 'Émile Girardeau,'] and predicted answer :::: Émile Girardeau
gold answer :: ['Asia', 'Asia', 'Asia'] and predicted answer :::: Asia
gold answer :: ['at the Cape of Good Hope', 'Cape of

To compute the metrics, we load the SQuAD metric:

In [22]:
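# Note: newer releases of datasets drop load_metric in favor of evaluate.load("squad") from the evaluate package
from datasets import load_metric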

metric = load_metric("squad")


In [21]:
theoretical_answers = [
    {"id": ex["id"], "answers": ex["answers"]} for ex in batch
]

In [22]:
metric.compute(predictions=predicted_ansers, references=theoretical_answers)

{'exact_match': 81.25, 'f1': 86.58719931271477}
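To see what these two numbers measure, here is a sketch of the per-question scoring, modeled on the SQuAD v1.1 evaluation logic: answers are normalized (lowercased, punctuation and articles removed), then compared by exact match and by token-level F1, taking the best score over all gold answers. The helper names are illustrative, and the `squad` metric loaded above remains the authoritative implementation.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_score(metric_fn, prediction, gold_answers):
    """Compare the prediction to every acceptable answer and keep the best score."""
    return max(metric_fn(prediction, gold) for gold in gold_answers)


# Pairs taken from the small batch printed above
print(best_score(exact_match, "the convection of the mantle",
                 ["the convecting mantle", "convection of the mantle"]))
print(best_score(f1_score, "Cape of Good Hope", ["at the Cape of Good Hope"]))
print(best_score(f1_score, "Diffie–Hellman key exchange", ["RSA"]))
```

Because articles are stripped during normalization, "the convection of the mantle" counts as an exact match for the gold answer "convection of the mantle".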

In [7]:
validation_dataset = raw_datasets["validation"]
validation_dataset

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10570
})

In [11]:
encoded_validation_data = validation_dataset.map(pre_process_small_batch,
                                                 batched=True,
                                                 remove_columns=raw_datasets["validation"].column_names,)


In [None]:
len(encoded_validation_data),len(encoded_validation_data['input_ids'])

In [None]:
eval_offset_mapping = encoded_validation_data['offset_mapping']
eval_sample_mapping = encoded_validation_data['overflow_to_sample_mapping']
encoded_validation_data = encoded_validation_data.remove_columns(['offset_mapping','overflow_to_sample_mapping'])
encoded_validation_data.set_format('torch')

In [None]:
encoded_validation_data

In [None]:
eval_for_model ={k : encoded_validation_data[k].to(device) for k in encoded_validation_data.column_names}

In [None]:
with torch.no_grad():
  eval_output = model(**eval_for_model)
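Sending every feature of the full validation set through the model in a single forward pass may not fit in memory, so one option is to run inference in chunks and concatenate the logits. The sketch below assumes NumPy inputs and a hypothetical `forward_fn` callable; in this notebook it would wrap `model(**chunk)` under `torch.no_grad()` and return the start and end logits as NumPy arrays. `fake_forward` is only a toy stand-in so the example runs without a model.

```python
import numpy as np


def run_in_chunks(forward_fn, inputs, chunk_size=32):
    """Call forward_fn on slices of `inputs` and stitch the logits back together."""
    num_rows = len(next(iter(inputs.values())))
    start_chunks, end_chunks = [], []
    for lo in range(0, num_rows, chunk_size):
        # Slice every input column to the same row range
        chunk = {name: values[lo:lo + chunk_size] for name, values in inputs.items()}
        start_logits, end_logits = forward_fn(chunk)
        start_chunks.append(np.asarray(start_logits))
        end_chunks.append(np.asarray(end_logits))
    return np.concatenate(start_chunks), np.concatenate(end_chunks)


def fake_forward(chunk):
    """Toy stand-in for the model: returns one start and one end score per token."""
    ids = np.asarray(chunk["input_ids"], dtype=float)
    return ids, -ids


toy_inputs = {"input_ids": np.arange(10 * 4).reshape(10, 4)}
start_all, end_all = run_in_chunks(fake_forward, toy_inputs, chunk_size=3)
print(start_all.shape, end_all.shape)
```

The chunk size is a trade-off between speed and memory and would need tuning for the available hardware.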

In [23]:
max_length = 384
stride = 128
# Set the offset_mapping entries of non-context tokens (question and special tokens) to None
# Store the id of the original sample in the example_id column
def pre_process_eval_data(example):
    questions = [q.strip() for q in example["question"]]
    inputs = tokenizer(
        questions,
        example["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    sample_mapping = inputs.pop('overflow_to_sample_mapping')
    example_ids = []
    for idx in range(len(inputs['input_ids'])):
        sequence_ids = inputs.sequence_ids(idx)
        original_sample = sample_mapping[idx]
        sample_id = example['id'][original_sample]
        example_ids.append(sample_id)
        offsets = inputs['offset_mapping'][idx]
        inputs['offset_mapping'][idx] = [o if sequence_ids[k]==1 else None for k,o in 
                                       enumerate(offsets)]
    inputs['example_id'] = example_ids
    return inputs
    

In [31]:
batch = raw_datasets["validation"].shuffle().select(range(16))

In [34]:
encoding = batch.map(pre_process_eval_data,batched=True,remove_columns=raw_datasets["validation"].column_names,)
encoding


Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'example_id'],
    num_rows: 16
})

In [35]:
import torch

eval_set_for_model = encoding.remove_columns(["example_id", "offset_mapping"])
eval_set_for_model.set_format("torch")

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
batch_model = {k: eval_set_for_model[k].to(device) for k in eval_set_for_model.column_names}

with torch.no_grad():
    outputs = model(**batch_model)

In [36]:
outputs.start_logits.shape,outputs.end_logits.shape,

(torch.Size([16, 384]), torch.Size([16, 384]))

In [18]:
tokenized_validation_dataset = validation_dataset.map(pre_process_eval_data,batched=True,
                                                     remove_columns=validation_dataset.column_names)


In [19]:
len(tokenized_validation_dataset),len(validation_dataset) # 10570 examples have been split into 10784 features

(10784, 10570)

In [2]:
tokenized_validation_dataset['example_id'][:5]


Once we feed our tokenized data to the model, we get start and end logits of shape (no_of_features, sequence_length). Because of long contexts, one original example may have been split into several features, so here is what we will do:

1. We will find out which features are associated with each original example
2. We will iterate through all the features linked with that example and produce one best answer
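The grouping step can be illustrated on toy data. The feature list and candidate answers below are made up; in the real evaluation the candidates come from the n_best start and end logits of each feature.

```python
import collections

# Three features: the first two were cut from the same original example "q1"
toy_features = [{"example_id": "q1"}, {"example_id": "q1"}, {"example_id": "q2"}]

# Made-up candidate answers per feature (same order as toy_features)
toy_candidates = [
    [{"text": "Paris", "logit_score": 3.1}],
    [{"text": "in Paris", "logit_score": 4.2}, {"text": "France", "logit_score": 0.5}],
    [],
]

# Step 1: map each original example to the indices of its features
example_to_features = collections.defaultdict(list)
for idx, feature in enumerate(toy_features):
    example_to_features[feature["example_id"]].append(idx)

# Step 2: pool the candidates of all features of an example and keep the best one
best_answers = {}
for example_id, feature_indices in example_to_features.items():
    pooled = [cand for i in feature_indices for cand in toy_candidates[i]]
    if pooled:
        best_answers[example_id] = max(pooled, key=lambda c: c["logit_score"])["text"]
    else:
        best_answers[example_id] = ""  # no valid candidate: predict an empty string

print(dict(example_to_features))
print(best_answers)
```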

In [51]:

max_answer_length = 30
n_best = 20
import collections
from tqdm.auto import tqdm
import numpy as np
def compute_metrics(start_logits,end_logits,features,examples):
    example_to_features = collections.defaultdict(list)
    for idx,feature in enumerate(features):
        example_to_features[feature['example_id']].append(idx)
    predicted_answers =[]
    for example in tqdm(examples):
        example_id = example['id']
        context = example['context']
        answers = []
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
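            offsets = features[feature_index]['offset_mapping']  # index the row first instead of materializing the whole column
            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()  # use n_best rather than a hard-coded 20
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()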
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue
                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)
        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})
    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)
    

In [52]:
compute_metrics(outputs.start_logits.cpu().numpy(), outputs.end_logits.cpu().numpy(), encoding, batch)


{'exact_match': 93.75, 'f1': 96.06481481481481}

In [56]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "bert-finetuned-squad-tpu"
repo_name = get_full_repo_name(model_name)
repo_name

'susghosh/bert-finetuned-squad-tpu'