# Artificial Intelligence II - Homework 4
# Fine-tuning BERT for question answering with SQuAD 2.0 

**Notes:** 
1. Some paths will need to be changed to match your environment.
2. I ran the fine-tuning first, saved the model, and did the evaluation in a separate run, because the tokenizer used a lot of RAM and crashed the session.


In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# PATH = '/content/drive/MyDrive/Colab Notebooks/Artificial Intelligence II/bert/'

# Import Libraries and Read Datasets

Import libraries that will be used in this notebook, define a seeding function and set device to cuda if available.


In [None]:
# Imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import os

import numpy as np
from numpy import unravel_index
import pandas as pd
import math

import random
import sys
from IPython.display import Image
import time

# for text preprocessing
import re
import string

os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:2' # for cuda deterministic behavior (must be set in this process, a "!" shell command would not do that)

######### BERT ############
# first install transformers from hugging face
!pip install transformers

# imports
from transformers import BertTokenizer, BertForQuestionAnswering

# dataloaders 
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

def set_seed(seed = 1234):
    '''Sets the seed of the entire notebook so results are the same every time we run.
    This is for REPRODUCIBILITY.'''
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.enabled = False
    torch.backends.cudnn.benchmark = False
    # torch.use_deterministic_algorithms(False)
    # Set a fixed value for the hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)
    
set_seed()

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print('Working on:', device)


Load SQuAD 2.0 Dataset

I used the `datasets` library from Hugging Face.

In [None]:
!pip install datasets
from datasets import load_dataset


In [None]:
train_dataset = load_dataset('squad_v2', split='train')

Downloading and preparing dataset squad_v2/squad_v2 (download: 44.34 MiB, generated: 122.41 MiB, post-processed: Unknown size, total: 166.75 MiB) to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d...


Dataset squad_v2 downloaded and prepared to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d. Subsequent calls will reuse this data.


In [None]:
validation_dataset = load_dataset('squad_v2', split='validation')

Reusing dataset squad_v2 (/root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d)


Overview of the feature names of the dataset.

In [None]:
train_dataset

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 130319
})

Let's print the first example.

We see that the 'answers' column holds a dictionary with keys 'text' and 'answer_start', each of which contains a list with one element.

In [None]:
train_dataset[0]

{'answers': {'answer_start': [269], 'text': ['in the late 1990s']},
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'id': '56be85543aeaaa14008c9063',
 'question': 'When did Beyonce start becoming popular?',
 'title': 'Beyoncé'}

Same features for the validation dataset.

In [None]:
validation_dataset

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 11873
})

In the validation set, the 'answers' column also contains a dictionary with keys 'text' and 'answer_start', but each list holds multiple elements, one per reference answer.

In [None]:
validation_dataset[0]

{'answers': {'answer_start': [159, 159, 159, 159],
  'text': ['France', 'France', 'France', 'France']},
 'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.',
 'id': '56ddde6b9a695914005b9628',
 'question': 'In what country is Normandy located?',
 'title': 'Normans'}

If there is no answer, the lists are empty.

In [None]:
validation_dataset[-10]

{'answers': {'answer_start': [], 'text': []},
 'context': 'The connection between macroscopic nonconservative forces and microscopic conservative forces is described by detailed treatment with statistical mechanics. In macroscopic closed systems, nonconservative forces act to change the internal energies of the system, and are often associated with the transfer of heat. According to the Second law of thermodynamics, nonconservative forces necessarily result in energy transformations within closed systems from ordered to more random conditions as entropy increases.',
 'id': '5ad28a57d7d075001a4299b3',
 'question': 'What does not change macroscopic closed systems?',
 'title': 'Force'}
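
A small helper can make this check explicit. The sketch below is only an illustration: `is_answerable` is a hypothetical name, and the two dictionaries are abbreviated copies of the examples printed above.

```python
def is_answerable(example):
    """Return True if the example has at least one reference answer."""
    return len(example['answers']['text']) > 0

# abbreviated copies of validation_dataset[0] and validation_dataset[-10]
answerable = {'question': 'In what country is Normandy located?',
              'answers': {'text': ['France', 'France', 'France', 'France'],
                          'answer_start': [159, 159, 159, 159]}}
unanswerable = {'question': 'What does not change macroscopic closed systems?',
                'answers': {'text': [], 'answer_start': []}}

print(is_answerable(answerable))
print(is_answerable(unanswerable))
```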

In [None]:
df = pd.DataFrame(train_dataset)

In [None]:
df.tail(10)

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 130309 | 5a7e05ef70df9f001a875425 | Matter | These quarks and leptons interact through four... | How many quarks and leptons are there? | {'text': [], 'answer_start': []} |
| 130310 | 5a7e05ef70df9f001a875426 | Matter | These quarks and leptons interact through four... | What model satisfactorily explains gravity? | {'text': [], 'answer_start': []} |
| 130311 | 5a7e05ef70df9f001a875427 | Matter | These quarks and leptons interact through four... | Interactions between quarks and leptons are th... | {'text': [], 'answer_start': []} |
| 130312 | 5a7e05ef70df9f001a875428 | Matter | These quarks and leptons interact through four... | Mass and energy can always be compared to what? | {'text': [], 'answer_start': []} |
| 130313 | 5a7e05ef70df9f001a875429 | Matter | These quarks and leptons interact through four... | What relation explains the carriers of the ele... | {'text': [], 'answer_start': []} |
| 130314 | 5a7e070b70df9f001a875439 | Matter | The term "matter" is used throughout physics i... | Physics has broadly agreed on the definition o... | {'text': [], 'answer_start': []} |
| 130315 | 5a7e070b70df9f001a87543a | Matter | The term "matter" is used throughout physics i... | Who coined the term partonic matter? | {'text': [], 'answer_start': []} |
| 130316 | 5a7e070b70df9f001a87543b | Matter | The term "matter" is used throughout physics i... | What is another name for anti-matter? | {'text': [], 'answer_start': []} |
| 130317 | 5a7e070b70df9f001a87543c | Matter | The term "matter" is used throughout physics i... | Matter usually does not need to be used in con... | {'text': [], 'answer_start': []} |
| 130318 | 5a7e070b70df9f001a87543d | Matter | The term "matter" is used throughout physics i... | What field of study has a variety of unusual c... | {'text': [], 'answer_start': []} |


# Preprocessing the dataset

In [None]:
train_dataset[1]

{'answers': {'answer_start': [207], 'text': ['singing and dancing']},
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'id': '56be85543aeaaa14008c9065',
 'question': 'What areas did Beyonce compete in when she was growing up?',
 'title': 'Beyoncé'}

For the training dataset, I noticed that each question has at most one answer, so there is no need to keep the values of the answers dictionary in lists. For example, `'answers': {'text': ['singing and dancing'], 'answer_start': [207]}` can be reformatted to `'answers': {'text': 'singing and dancing', 'answer_start': 207}`. Questions that are unanswerable (they look like this: `'answers': {'text': [], 'answer_start': []}`) can simply become `'answers': {'text': "", 'answer_start': 0}`. The function below also stores the end character index of the answer under the key `answer_end`.

In [None]:
def find_end(example):
    '''Unpack the single-element answer lists and add the end character index of the answer.'''
    answers = example['answers'] # modified in place

    if (len(answers['text']) != 0):
        text = answers['text'][0]
        start_idx = answers['answer_start'][0]

        answers['answer_end'] = start_idx + len(text) # exclusive end index
        answers['answer_start'] = start_idx # [num]->num
        answers['text'] = text # ['text']->text
    
    else:
        # unanswerable question
        answers['answer_end'] = 0 # []->0
        answers['answer_start'] = 0 # []->0
        answers['text'] = "" # []->""
        
    return example

train_dataset = train_dataset.map(find_end)

0ex [00:00, ?ex/s]

Check some examples:

In [None]:
train_dataset[1]

{'answers': {'answer_end': 226,
  'answer_start': 207,
  'text': 'singing and dancing'},
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'id': '56be85543aeaaa14008c9065',
 'question': 'What areas did Beyonce compete in when she was growing up?',
 'title': 'Beyoncé'}

In [None]:
train_dataset[10]

{'answers': {'answer_end': 524,
  'answer_start': 505,
  'text': 'Dangerously in Love'},
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'id': '56d43c5f2ccc5a1400d830ab',
 'question': 'What was the first album Beyoncé released as a solo artist?',
 'title': 'Beyoncé'}
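
Because `answer_end` is computed as `answer_start + len(text)`, slicing the context with the two indices should give back the answer text. A minimal sketch of that sanity check follows; it uses a shortened, made-up context rather than a dataset row, and `span_matches` is a hypothetical helper.

```python
def span_matches(context, answers):
    """Check that context[answer_start:answer_end] reproduces the answer text."""
    return context[answers['answer_start']:answers['answer_end']] == answers['text']

# shortened context, for illustration only
context = "She performed in various singing and dancing competitions as a child."
text = "singing and dancing"
start = context.index(text)
answers = {'text': text, 'answer_start': start, 'answer_end': start + len(text)}

print(span_matches(context, answers))

# convention for unanswerable questions: the empty slice equals the empty text
no_answer = {'text': "", 'answer_start': 0, 'answer_end': 0}
print(span_matches(context, no_answer))
```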

Example with no answer

In [None]:
train_dataset[-10]

{'answers': {'answer_end': 0, 'answer_start': 0, 'text': ''},
 'context': 'These quarks and leptons interact through four fundamental forces: gravity, electromagnetism, weak interactions, and strong interactions. The Standard Model of particle physics is currently the best explanation for all of physics, but despite decades of efforts, gravity cannot yet be accounted for at the quantum level; it is only described by classical physics (see quantum gravity and graviton). Interactions between quarks and leptons are the result of an exchange of force-carrying particles (such as photons) between quarks and leptons. The force-carrying particles are not themselves building blocks. As one consequence, mass and energy (which cannot be created or destroyed) cannot always be related to matter (which can be created out of non-matter particles such as photons, or even out of pure energy, such as kinetic energy). Force carriers are usually not considered matter: the carriers of the electric force (p

Tokenize the training dataset and find the start and end tokens of each answer. Sequences are truncated to at most 512 tokens, the maximum length for BERT.

In [None]:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

tokenized_train = tokenizer(train_dataset['context'], train_dataset['question'], truncation=True, padding=True)

In [None]:
def find_token_indexes(tokenized, dataset):
    start_token_list = []
    end_token_list = []
    answers = dataset['answers']
    for i in range(len(answers)):
        if (answers[i]['text'] != ''):
            start_token = tokenized.char_to_token(i, answers[i]['answer_start'])
            end_token = tokenized.char_to_token(i, answers[i]['answer_end'] - 1)
            
            # if start token is None, the answer passage has been truncated
            if start_token is None:
                start_token = tokenizer.model_max_length
            if end_token is None:
                end_token = tokenizer.model_max_length
        else:
            start_token = 0
            end_token = 0
            
        start_token_list.append(start_token)
        end_token_list.append(end_token)

    return start_token_list, end_token_list
    
s, e = find_token_indexes(tokenized_train, train_dataset)
train_dataset = train_dataset.add_column("start_position", s)
train_dataset = train_dataset.add_column("end_position", e)
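
`char_to_token` maps a character position in the context to the index of the token that covers it. The idea can be sketched in plain Python over a list of `(start, end)` character offsets. The offsets and the helper name below are made up for illustration and do not come from the real tokenizer.

```python
def char_to_token_sketch(offsets, char_idx):
    """Return the index of the token whose [start, end) range covers char_idx, else None."""
    for token_idx, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return token_idx
    return None # e.g. the character was truncated away or is whitespace

# toy tokenization of "rose to fame"; (0, 0) stands in for a special token like [CLS]
offsets = [(0, 0), (0, 4), (5, 7), (8, 12)]
answer_start, answer_end = 5, 12 # characters of "to fame", end index exclusive

start_token = char_to_token_sketch(offsets, answer_start)
end_token = char_to_token_sketch(offsets, answer_end - 1) # last character of the answer
print(start_token, end_token)
```

The `- 1` mirrors the code above: `answer_end` is exclusive, so the last character of the answer sits at `answer_end - 1`.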

In [None]:
train_dataset

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers', 'start_position', 'end_position'],
    num_rows: 130319
})

In [None]:
batch_size = 8
train_data = TensorDataset(torch.tensor(tokenized_train['input_ids'], dtype=torch.int64), 
                           torch.tensor(tokenized_train['token_type_ids'], dtype=torch.int64), 
                           torch.tensor(tokenized_train['attention_mask'], dtype=torch.float), 
                           torch.tensor(train_dataset['start_position'], dtype=torch.int64), 
                           torch.tensor(train_dataset['end_position'], dtype=torch.int64))

train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

The validation dataset does not need as much preprocessing. I pass only the input_ids, token_type_ids and attention masks to the dataloader, which feeds them to the BERT model in batches. I use a SequentialSampler so that the order of the predictions matches the order of the validation dataset. We will also need the offsets mapping to reconstruct the answer text from the predicted start and end tokens and compare it with the actual answers.
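
To illustrate how an offsets mapping turns a predicted token span back into text, here is a minimal sketch. The context, the offsets and the helper name `span_to_text` are hypothetical and only stand in for what the tokenizer returns with `return_offsets_mapping=True`.

```python
def span_to_text(context, offsets, start_token, end_token):
    """Slice the context from the first character of start_token to the last of end_token."""
    char_start = offsets[start_token][0]
    char_end = offsets[end_token][1]
    return context[char_start:char_end]

context = "a region in France."
# one (start, end) pair per token; (0, 0) stands in for a special token
offsets = [(0, 0), (0, 1), (2, 8), (9, 11), (12, 18), (18, 19)]

print(span_to_text(context, offsets, 4, 4))
print(span_to_text(context, offsets, 2, 4))
```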

In [None]:
tokenized_validation = tokenizer(validation_dataset['context'], validation_dataset['question'], truncation=True, padding=True, return_offsets_mapping=True)

In [None]:
batch_size = 8
val_data = TensorDataset(torch.tensor(tokenized_validation['input_ids'], dtype=torch.int64), 
                        torch.tensor(tokenized_validation['token_type_ids'], dtype=torch.int64), 
                        torch.tensor(tokenized_validation['attention_mask'], dtype=torch.float))
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

# Fine-Tune

Let's load the BERT model for question answering. This model outputs the start and end logits (the scores before the softmax), as described in the BERT paper.
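
As a reminder of what "before the softmax" means, the sketch below turns a list of made-up start logits into probabilities in plain Python. The numbers are arbitrary and the helper is only illustrative; the notebook itself uses `F.softmax` from PyTorch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits) # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

start_logits = [0.1, 2.0, -1.0, 0.5] # one made-up logit per token
start_probs = softmax(start_logits)

# the most likely start token is the one with the largest logit
best_start = max(range(len(start_probs)), key=start_probs.__getitem__)
print(best_start, round(sum(start_probs), 6))
```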

In [None]:
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a

Training took approx. 2 hours per epoch, so I couldn't try many epochs or do many runs when using the whole dataset.

For the optimizer, I used AdamW (Adam with weight decay), the same kind of optimizer that was used for BERT pre-training. 
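
For intuition, a single AdamW update on one scalar parameter can be sketched as follows. This is a simplified illustration under stated assumptions: the hyperparameter defaults are arbitrary, the function name is hypothetical, and the "decay first, then Adam step" order is one common formulation rather than a copy of any library's code.

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter p with gradient g at step t (t starts at 1)."""
    p = p * (1 - lr * weight_decay)      # decoupled weight decay, applied directly to the weight
    m = beta1 * m + (1 - beta1) * g      # first moment (running mean of gradients)
    v = beta2 * v + (1 - beta2) * g * g  # second moment (running mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# large learning rate chosen only to make the effect visible
p, m, v = adamw_step(p=1.0, g=0.5, m=0.0, v=0.0, t=1, lr=0.1)
print(p)
```

The point of the decoupling is that the decay term does not pass through the adaptive scaling by `sqrt(v_hat)`.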

In [None]:
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
epochs = 3
model.to(device)
optimizer = optim.AdamW(model.parameters(), lr=1e-5)


In [None]:
from tqdm import tqdm

epoch_loss = [] # average training loss of every epoch

for epoch in range(epochs):
    total_loss = 0
    model.train()

    progress_bar = tqdm(train_dataloader, leave=True, position=0)
    progress_bar.set_description(f"Epoch {epoch+1}")
    for step, batch in enumerate(progress_bar):
        input_ids, segment_ids, mask, start, end  = tuple(t.to(device) for t in batch)

        model.zero_grad()
        # with return_dict=False and the positions given, the model returns (loss, start_logits, end_logits)
        loss, start_logits, end_logits = model(input_ids = input_ids, 
                                                token_type_ids = segment_ids, 
                                                attention_mask = mask, 
                                                start_positions = start, 
                                                end_positions = end,
                                                return_dict = False)           

        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # clip the gradient norm to 1.0
        optimizer.step()
        
        if (step % 20 == 0 and step != 0):
            avg = total_loss/(step+1) # step+1 batches have been processed so far
            progress_bar.set_postfix(Loss=avg)
            
    torch.save(model.state_dict(), "./bert2_" + str(epoch) + ".h5") # save for later use
    avg_train_loss = total_loss / len(train_dataloader)
    epoch_loss.append(avg_train_loss)
    print(f"Epoch {epoch} Loss: {avg_train_loss}\n")

Epoch 1: 100%|██████████| 16290/16290 [2:05:11<00:00,  2.17it/s, Loss=1.41] 


Epoch 0 Loss: 1.4090361924827135



Epoch 2: 100%|██████████| 16290/16290 [2:05:08<00:00,  2.17it/s, Loss=0.877] 


Epoch 1 Loss: 0.8765695926311962



Epoch 3: 100%|██████████| 16290/16290 [2:04:56<00:00,  2.17it/s, Loss=0.634] 


Epoch 2 Loss: 0.6342329508682142



# Evaluation

In [None]:
from tqdm import tqdm
# model.load_state_dict(torch.load("../input/bert-weights/bert2_2.h5"))

threshold = 1.0 # margin by which the best span must beat the no-answer score
correct = 0 
pred_dict = {}
na_prob_dict = {}

model.eval()
row = 0
for test_batch in tqdm(val_dataloader):
    input_ids, segment_ids, masks = tuple(t.to(device) for t in test_batch)

    with torch.no_grad():
        # prediction logits
        start_logits, end_logits = model(input_ids=input_ids,
                                        token_type_ids=segment_ids,
                                        attention_mask=masks,
                                        return_dict=False)

    # to cpu
    start_logits = start_logits.detach().cpu()
    end_logits = end_logits.detach().cpu()

    # for every sequence in batch 
    for bidx in range(len(start_logits)):
        example = validation_dataset[row] # fetch the row once
        gold_answers = example['answers']['text']

        # apply softmax to logits to get scores
        start_scores = F.softmax(start_logits[bidx], dim = 0).numpy()
        end_scores = F.softmax(end_logits[bidx], dim = 0).numpy()

        # scores[i,j] = start_scores[i] + end_scores[j] for start<=end (upper triangle), 0 elsewhere
        scores = np.triu(start_scores[:, None] + end_scores[None, :]).astype(np.float64)

        # find best i and j
        start_pred, end_pred = unravel_index(scores.argmax(), scores.shape)
        answer_pred = ""
        # scores[0,0] is the score of the [CLS] token, used as the no-answer score
        if (scores[start_pred, end_pred] > scores[0,0]+threshold):

            offsets = tokenized_validation.offset_mapping[row]
            pred_char_start = offsets[start_pred][0]

            if end_pred < len(offsets):
                pred_char_end = offsets[end_pred][1]
                answer_pred = example['context'][pred_char_start:pred_char_end]
            else:
                answer_pred = example['context'][pred_char_start:]

            if answer_pred in gold_answers:
                correct += 1

        else:
            if (len(gold_answers) == 0):
                correct += 1    

        pred_dict[example['id']] = answer_pred
        na_prob_dict[example['id']] = float(scores[0,0]) # plain float for json

        row+=1


accuracy = correct/validation_dataset.num_rows
print("accuracy is: ", accuracy)

100%|██████████| 1485/1485 [28:54<00:00,  1.17s/it]

accuracy is:  0.6711025014739325
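
The span selection used in the evaluation loop can be illustrated on a toy example. The sketch below follows the same rule (best `start + end` score with start <= end, and an answer is predicted only if it beats the score at position 0 by `threshold`), but the function name and all scores are made up for illustration.

```python
def best_span(start_scores, end_scores, threshold=1.0):
    """Return (start, end) of the best span with start <= end, or None for 'no answer'."""
    best, best_score = (0, 0), float('-inf')
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)): # only spans that end at or after their start
            if s + end_scores[j] > best_score:
                best, best_score = (i, j), s + end_scores[j]
    null_score = start_scores[0] + end_scores[0] # position 0 plays the role of [CLS]
    if best_score > null_score + threshold:
        return best
    return None

# probability mass concentrated on tokens 2..3: an answer is predicted
confident = best_span([0.02, 0.03, 0.90, 0.05], [0.01, 0.04, 0.05, 0.90])
# probability mass concentrated on position 0: treated as unanswerable
unsure = best_span([0.60, 0.10, 0.20, 0.10], [0.55, 0.05, 0.10, 0.30])
print(confident, unsure)
```

Since each score is a sum of two probabilities, it lies between 0 and 2, so a threshold of 1.0 only lets through spans that the model is fairly sure about.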





Save prediction dictionary and no answer probability dictionary as .json

In [None]:
import json 
with open("pred.json", "w") as outfile:
    json.dump(pred_dict, outfile)

In [None]:
with open("na_prob.json", "w") as outfile:
    json.dump(na_prob_dict, outfile)

In [None]:
print(f"Context: {validation_dataset[0]['context']}\n")
for i in range(5):
    print(f"Question: {validation_dataset[i]['question']}")
    print(f"Predicted answer: {pred_dict[validation_dataset[i]['id']]}")
    print(f"Answers: {validation_dataset[i]['answers']['text']}\n")

Download and run the official evaluation script

In [None]:
!wget https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ -O evaluation.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
--2022-03-03 19:33:12--  https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
Resolving worksheets.codalab.org (worksheets.codalab.org)... 13.68.212.115
Connecting to worksheets.codalab.org (worksheets.codalab.org)|13.68.212.115|:443... connected.
HTTP request sent, awaiting response... 200 OK
Syntax error in Set-Cookie: codalab_session=""; expires=Thu, 01 Jan 1970 00:00:00 GMT; Max-Age=-1; Path=/ at position 70.
Length: unspecified [text/x-python]
Saving to: ‘evaluation.py’

evaluation.py           [ <=>                ]  10.30K  --.-KB/s    in 0s      

2022-03-03 19:33:13 (120 MB/s) - ‘evaluation.py’ saved [10547]



In [None]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O dev-v2.0.json

--2022-03-03 19:33:52--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.111.153, 185.199.108.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.111.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2022-03-03 19:33:53 (50.7 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



In [None]:
!python evaluation.py dev-v2.0.json pred.json --na-prob-file na_prob.json --na-prob-thresh 1 --out-image-dir ./

{
  "exact": 68.01145456076813,
  "f1": 70.031884842807,
  "total": 11873,
  "HasAns_exact": 55.3306342780027,
  "HasAns_f1": 59.37728892352334,
  "HasAns_total": 5928,
  "NoAns_exact": 80.6560134566863,
  "NoAns_f1": 80.6560134566863,
  "NoAns_total": 5945,
  "best_exact": 68.02829950307421,
  "best_exact_thresh": 0.4342222809791565,
  "best_f1": 70.04872978511327,
  "best_f1_thresh": 0.4342222809791565,
  "pr_exact_ap": 34.938071259967515,
  "pr_f1_ap": 40.40336353982701,
  "pr_oracle_ap": 76.64267454519707
}
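
For reference, the "exact" and "f1" metrics compare normalized strings rather than raw ones. The sketch below is a simplified illustration in the spirit of the SQuAD metrics, not a copy of the official script: it lowercases, strips punctuation and English articles, and computes token-level F1 against a single gold answer. Normalization of this kind is likely part of the reason the official "exact" score is slightly higher than the raw string-match accuracy computed earlier.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def exact_match(prediction, gold):
    return int(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # an empty answer only matches another empty answer
        return float(pred_tokens == gold_tokens)
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The late 1990s.", "late 1990s"))
print(f1_score("in the late 1990s", "late 1990s"))
```

With several gold answers per question, as in the validation set, one would take the maximum of each metric over the gold answers.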
