# Fine-tuning of **bert-base-uncased** on SQuAD dataset 
## Vassilis Panagakis

In [1]:
! pip install -U sentence-transformers

import json
import torch
import pandas as pd
import numpy as np
import transformers

### Device

In [2]:
# enable gpu for faster execution
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device available for running: ")
print(device)

Device available for running: 
cuda


## Load Data

In [3]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Load SQuAD data to dataframes

In [4]:
# load a SQuAD-format json file into a pandas dataframe with one row per (question, answer) pair
def squad_data_to_dataframe(file):
    with open(file, "r", encoding="utf-8") as f:   # the context manager closes the file for us
        data = json.load(f)                         # load the json file

    rows = []
    for article in data['data']:                    # each article holds a 'title' tag & a 'paragraphs' list
        title = article['title']
        for paragraph in article['paragraphs']:     # each paragraph holds a 'context' tag & a 'qas' list
            context = paragraph['context']
            for qa in paragraph['qas']:             # each qa holds 'question', 'id' tags & an 'answers' list
                # note: a question whose 'answers' list is empty contributes no rows
                for answer in qa['answers']:        # each answer holds 'answer_start' & 'text' tags
                    rows.append({
                        'Id': qa['id'],
                        'title': title,
                        'context': context,
                        'question': qa['question'],
                        'ans_start': answer['answer_start'],
                        'text': answer['text'],
                    })

    df = pd.DataFrame(rows, columns=['Id', 'title', 'context', 'question', 'ans_start', 'text'])

    squad_df = df.drop_duplicates(keep='first')  # drop duplicate rows from the created dataframe

    return squad_df
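
**To make the nesting of the SQuAD json format concrete, here is a minimal, standard-library-only sketch of the same flattening on a tiny hand-made payload. The payload, the ids `q1` / `q2` and the helper name `flatten_squad` are hypothetical and only mirror the structure of the real files.**

```python
# Toy SQuAD-shaped payload (hypothetical values, same nesting as the real files)
context = "The Normans gave their name to Normandy, a region in France."
start = context.index("France")  # character offset of the answer in the context

toy = {
    "data": [{
        "title": "Normans",
        "paragraphs": [{
            "context": context,
            "qas": [
                # the same answer given twice, as happens in the dev set
                {"id": "q1", "question": "In what country is Normandy located?",
                 "answers": [{"answer_start": start, "text": "France"},
                             {"answer_start": start, "text": "France"}]},
                # a question without answers yields no rows at all
                {"id": "q2", "question": "Who ruled Normandy in 2020?", "answers": []},
            ],
        }],
    }]
}

def flatten_squad(data):
    """Return one (id, title, question, answer_start, text) tuple per answer, duplicates removed."""
    rows = []
    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    rows.append((qa["id"], article["title"], qa["question"],
                                 answer["answer_start"], answer["text"]))
    # dict.fromkeys keeps the first occurrence of each row and preserves order
    return list(dict.fromkeys(rows))

rows = flatten_squad(toy)
print(rows)
```

**The duplicated answer collapses into a single row, and the question with an empty `answers` list disappears, which is also what happens in the dataframe version above.**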

In [5]:
# load train data
train_path = 'gdrive/My Drive/Colab Notebooks/train-v2.0.json'

train = squad_data_to_dataframe(train_path)
train.head()

| | Id | title | context | question | ans_start | text |
|---|---|---|---|---|---|---|
| 0 | 56be85543aeaaa14008c9063 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | When did Beyonce start becoming popular? | 269 | in the late 1990s |
| 1 | 56be85543aeaaa14008c9065 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | What areas did Beyonce compete in when she was... | 207 | singing and dancing |
| 2 | 56be85543aeaaa14008c9066 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | When did Beyonce leave Destiny's Child and bec... | 526 | 2003 |
| 3 | 56bf6b0f3aeaaa14008c9601 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | In what city and state did Beyonce grow up? | 166 | Houston, Texas |
| 4 | 56bf6b0f3aeaaa14008c9602 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | In which decade did Beyonce become famous? | 276 | late 1990s |


In [6]:
# load test data
test_path = 'gdrive/My Drive/Colab Notebooks/dev-v2.0.json'

test = squad_data_to_dataframe(test_path)
test.head()

| | Id | title | context | question | ans_start | text |
|---|---|---|---|---|---|---|
| 0 | 56ddde6b9a695914005b9628 | Normans | The Normans (Norman: Nourmands; French: Norman... | In what country is Normandy located? | 159 | France |
| 4 | 56ddde6b9a695914005b9629 | Normans | The Normans (Norman: Nourmands; French: Norman... | When were the Normans in Normandy? | 94 | 10th and 11th centuries |
| 5 | 56ddde6b9a695914005b9629 | Normans | The Normans (Norman: Nourmands; French: Norman... | When were the Normans in Normandy? | 87 | in the 10th and 11th centuries |
| 8 | 56ddde6b9a695914005b962a | Normans | The Normans (Norman: Nourmands; French: Norman... | From which countries did the Norse originate? | 256 | Denmark, Iceland and Norway |
| 12 | 56ddde6b9a695914005b962b | Normans | The Normans (Norman: Nourmands; French: Norman... | Who was the Norse leader? | 308 | Rollo |


**We follow the official Hugging Face example [Fine-tuning a model on a question-answering task](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb), in which `distilbert-base-uncased` is trained on the SQuAD dataset.
Our goal is to create a lighter version of it, in which `bert-base-uncased` is trained on the SQuAD dataset with a reduced amount of data. For a more detailed explanation of what the code does, follow the link above.**

In [7]:
! pip install datasets

### Load to datasets

In [8]:
from datasets import Dataset

# load datasets from pandas dataframes
train_dataset = Dataset.from_pandas(train)
test_dataset = Dataset.from_pandas(test)

## Data Pre-processing 

**We preprocess the data with a Transformers `Tokenizer`, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs that the model requires.**

In [9]:
from transformers import AutoTokenizer
    
tokenizer1 = AutoTokenizer.from_pretrained("bert-base-uncased")

In [10]:
# assertion that ensures that we use a fast tokenizer 
assert isinstance(tokenizer1, transformers.PreTrainedTokenizerFast)

max_length = 384 # maximum length of a feature (question and context)
doc_stride = 128 # overlap between two parts of the context when splitting is needed

pad_on_right = tokenizer1.padding_side == "right" # True when the tokenizer pads on the right, so the question goes first and the context second
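
**The interplay of `max_length` and `doc_stride` can be illustrated with a small sketch. It assumes, as we understand the tokenizer's `stride` argument, that `doc_stride` is the number of tokens shared by two consecutive windows. The helper `split_with_overlap` is hypothetical and works on plain lists; in the real tokenizer the room left for the context is `max_length` minus the question tokens and the special tokens.**

```python
def split_with_overlap(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items, where
    consecutive chunks share `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last chunk reached the end of the sequence
        start += window - overlap  # advance, keeping `overlap` tokens in common
    return chunks

# toy context of 10 "tokens", windows of 4 tokens that overlap by 2
context_tokens = list(range(10))
chunks = split_with_overlap(context_tokens, window=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

**Each chunk corresponds to one feature, so a long context produces several features for the same question.**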

In [11]:
# function that tokenizes the train data and labels each feature with its answer span
def prepare_train_features(examples):
    # we tokenize our examples with truncation and padding, but keep the overflows using a stride. As a result,
    # one example may produce several features when its context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature
    tokenized_examples = tokenizer1(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # since one example might give us several features if it has a long context, we need a map from a feature to its corresponding example
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # the offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start and end positions
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # examples labeling
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # we label impossible answers with the index of the CLS token
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer1.cls_token_id)

        # get the sequence corresponding to each example, in order to know what is the context and what is the question
        sequence_ids = tokenized_examples.sequence_ids(i)

        # the index of the example containing this span of text
        sample_index = sample_mapping[i]
        # get the start id and the text of this example
        answers_start = examples["ans_start"][sample_index]
        answers_text = examples["text"][sample_index]
        # if no answers are given, set the CLS token's index as answer
        if answers_start < 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # start / end character index of the answer in the text
            start_char = answers_start
            end_char = start_char + len(answers_text)

            # start token index of the current span in the text
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # end token index of the current span in the text
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # if the answer is out of the span, the feature is labeled with the CLS index
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # else move the token_start_index and token_end_index to the two ends of the answer
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
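
**The labeling step above converts a character span into a token span by scanning the offsets. The following sketch reproduces that scan on a toy whitespace tokenization; the helper `char_span_to_token_span` and the example sentence are hypothetical, and real BERT offsets come from word pieces rather than whitespace.**

```python
import re

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (first_token, last_token) using per-token
    (start, end) character offsets. Returns None when the answer does not
    lie fully inside the window covered by `offsets`."""
    first, last = 0, len(offsets) - 1
    if not (offsets[first][0] <= start_char and offsets[last][1] >= end_char):
        return None  # the notebook labels such features with the CLS index
    # move past every token that starts at or before the answer start
    while first < len(offsets) and offsets[first][0] <= start_char:
        first += 1
    # move back past every token that ends at or after the answer end
    while last >= 0 and offsets[last][1] >= end_char:
        last -= 1
    return first - 1, last + 1

context = "the normans were in france"
# whitespace "tokens" with their character offsets
offsets = [m.span() for m in re.finditer(r"\S+", context)]

answer = "normans were"
start_char = context.index(answer)
end_char = start_char + len(answer)
span = char_span_to_token_span(offsets, start_char, end_char)
print(span)
```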

**To apply this function to every example in our dataset, we use the `map` method of the `Dataset` objects we created earlier. Since our preprocessing changes the number of samples, we need to remove the old columns when applying it. We also pass `batched=True`, so that the texts are encoded in batches. This lets us get the full benefit of the fast tokenizer, which uses multi-threading to process the texts of a batch concurrently.**

In [12]:
# tokenize the train data
tokenized_train = train_dataset.map(prepare_train_features, batched=True, remove_columns=train_dataset.column_names)

In [13]:
# tokenize the test data
tokenized_test = test_dataset.map(prepare_train_features, batched=True, remove_columns=test_dataset.column_names)

## Model fine-tuning

**Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class.** 

In [14]:
from transformers import AutoModelForQuestionAnswering

model1 = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a

**We define the `TrainingArguments`, a class that contains all the attributes used to customize the training. We use the learning rate and weight decay values suggested in the Hugging Face example. We also use the biggest batch size that lets us train our model without exhausting Google Colab's RAM. We train the model for just 3 epochs, as each epoch takes about an hour to execute, while the Google Colab resources are time-limited and overuse leads to a temporary ban.**

In [15]:
from transformers import TrainingArguments

batch_size = 16 # batch size

args = TrainingArguments(
    output_dir="gdrive/My Drive/Colab Notebooks/checkpoints", # directory to save checkpoints
    evaluation_strategy = "epoch",
    learning_rate=2e-5, # learning rate
    per_device_train_batch_size=batch_size, # the batch size per GPU core for training
    per_device_eval_batch_size=batch_size,  # the batch size per GPU core for evaluation
    num_train_epochs=3, # number of epochs
    weight_decay=0.01, # weight decay
    warmup_steps=1000, # number of steps used for a linear warmup from 0 to learning_rate 
    save_steps=1000, # number of update steps between two checkpoint saves
    save_total_limit=10, # if a value is passed, the total amount of checkpoints is limited
    load_best_model_at_end=True # load the best model found during training at the end of training
)
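
**To see what `warmup_steps=1000` means for the learning rate, here is a rough sketch of a linear warmup followed by a linear decay. It assumes the `Trainer` uses its default linear scheduler, and it takes the total of 16452 update steps from the training output reported further down; the function name `linear_schedule_lr` is hypothetical.**

```python
def linear_schedule_lr(step, base_lr=2e-5, warmup_steps=1000, total_steps=16452):
    """Learning rate at a given update step: linear warmup from 0 to
    `base_lr`, then linear decay back to 0 at `total_steps`."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

for step in (0, 500, 1000, 8726, 16452):
    print(step, linear_schedule_lr(step))
```

**With these values the warmup covers roughly the first 6% of the run (1000 of 16452 steps), after which the learning rate shrinks steadily until training ends.**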

**We use the default data collator to batch our processed examples together.**

In [16]:
from transformers import default_data_collator

data_collator = default_data_collator

### Train model

**We initialize the `Trainer` using the `bert-base-uncased` model we want to fine-tune, the training arguments, the data collator, the tokenizer and the tokenized data.**

In [17]:
from transformers import Trainer

trainer = Trainer(
    model1,
    args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=data_collator,
    tokenizer=tokenizer1,
)

**We fine-tune our model by just calling the `train` method. As we mentioned before, training is a very time-consuming operation. We train our model for only 3 epochs and the whole procedure takes a little more than 3 hours, which means that each epoch needs about 1 hour. Due to the time limitation of the Colab notebook, we don't have the resources to train our model for a large number of epochs. However, we are still able to make some important observations about the training's progress.**

**The most important observation is the contrast between the progress of the Training and Validation losses. The Training loss converges fast, dropping by approximately 0.2 to 0.3 per epoch: it is 1.10 after the first epoch and ends up at 0.59 after the third. The Validation loss, on the other hand, does not improve: it is 1.53 after the first epoch and ends up at 1.70 after the third. This means that the model tends to overfit, losing some of its ability to generalize as it gets too familiar with the training data. We will see at the end of our project whether, and how much, the overfitting affects our model's predictive ability.**

In [18]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Runtime | Samples Per Second |
|---|---|---|---|---|
| 1 | 1.0987 | 1.53185 | 137.6959 | 77.047 |
| 2 | 0.8094 | 1.536165 | 137.7233 | 77.031 |
| 3 | 0.5907 | 1.697627 | 137.694 | 77.048 |


TrainOutput(global_step=16452, training_loss=0.9971871741230169, metrics={'train_runtime': 11399.905, 'train_samples_per_second': 1.443, 'total_flos': 66038486951490048, 'epoch': 3.0})
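
**The overfitting trend can be read off the logged losses with a few lines of code. This is only a small sketch over the three rows reported above; the variable names are ours.**

```python
# per-epoch losses copied from the training log above
history = [
    {"epoch": 1, "train_loss": 1.0987, "val_loss": 1.53185},
    {"epoch": 2, "train_loss": 0.8094, "val_loss": 1.536165},
    {"epoch": 3, "train_loss": 0.5907, "val_loss": 1.697627},
]

# epoch with the lowest validation loss
best = min(history, key=lambda row: row["val_loss"])

# gap between validation and training loss; a widening gap suggests overfitting
gaps = [row["val_loss"] - row["train_loss"] for row in history]

print("best epoch by validation loss:", best["epoch"])
for row, gap in zip(history, gaps):
    print(f"epoch {row['epoch']}: gap = {gap:.3f}")
```

**The gap between the two losses grows with every epoch, while the lowest Validation loss is already reached after the first one.**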

**After training is completed we save the model, in order to avoid re-training from the last checkpoint.**

In [19]:
trainer.save_model("gdrive/My Drive/Colab Notebooks/final-squad-trained")

### Load fine-tuned model

**Now that our model is fine-tuned, we can load it directly from our Google Drive, without having to train it again.**

In [20]:
from transformers import AutoModelForQuestionAnswering

model2 = AutoModelForQuestionAnswering.from_pretrained("gdrive/My Drive/Colab Notebooks/final-squad-trained")

**This time we initialize a new `Trainer` object using our fine-tuned model `final-squad-trained`. We also use the same training arguments, data collator, tokenizer and tokenized data that we used to fine-tune the model. We use this `Trainer` object for the Evaluation tasks that follow.**

In [21]:
from transformers import Trainer

new_trainer = Trainer(
    model2,
    args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=data_collator,
    tokenizer=tokenizer1,
)

## Evaluation

In [22]:
# function that tokenizes the data used for evaluation and keeps what is needed to map predictions back to text
def prepare_validation_features(examples):
    # we tokenize our examples with truncation and padding, but keep the overflows using a stride. As a result,
    # one example may produce several features when its context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature
    tokenized_examples = tokenizer1(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # since one example might give us several features if it has a long context, we need a map from a feature to its corresponding example
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # keep the id of the example that produced each feature, next to the feature's offset mapping
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # get the sequence corresponding to each example, in order to know what is the context and what is the question
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # the index of the example containing this span of text
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["Id"][sample_index])

        # Set the offset_mapping that are not part of the context to None
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
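
**The masking of the offsets at the end of this function is easier to follow on a toy feature. The sketch below uses made-up `sequence_ids` and offsets (values are hypothetical), where `None` marks special tokens, `0` the question and `1` the context.**

```python
def mask_non_context(offsets, sequence_ids, context_index=1):
    """Keep the offsets of context tokens and replace all others with None."""
    return [o if sid == context_index else None
            for o, sid in zip(offsets, sequence_ids)]

# [CLS] question question [SEP] context context [SEP]
sequence_ids = [None, 0, 0, None, 1, 1, None]
offsets = [(0, 0), (0, 4), (5, 8), (0, 0), (0, 3), (4, 11), (0, 0)]

masked = mask_non_context(offsets, sequence_ids)
print(masked)
```

**During post-processing, a `None` offset is then enough to tell that a predicted position falls outside the context and has to be ignored.**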

In [23]:
# tokenize the test data
validation_features = test_dataset.map(prepare_validation_features, batched=True, remove_columns=test_dataset.column_names)

**The `Trainer` can't handle the data of the `example_id` and `offset_mapping` columns, because they contain `NoneType` values. These columns are not needed by the `predict` method of the `Trainer` class, so we can drop them temporarily and call the `predict` method to get the model's predictions on the test set.**

In [24]:
val_features = validation_features.map(remove_columns=['example_id', 'offset_mapping'])

In [25]:
raw_predictions = new_trainer.predict(val_features)

**The `Trainer` hides the columns that are not used by the model (here `example_id` and `offset_mapping`). We can set them back as follows.**

In [26]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

In [27]:
import collections
from tqdm.auto import tqdm

# function that processes raw predictions and returns the best answer for each example based on a score
def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # build a map example to its corresponding features
    example_id_to_index = {k: i for i, k in enumerate(examples["Id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # initialize dictionaries to store the predictions 
    predictions = collections.OrderedDict()

    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # repeat for every example
    for example_index, example in enumerate(tqdm(examples)):
        # get the indices of the features associated to the current example
        feature_indices = features_per_example[example_index]

        valid_answers = []
        
        context = example["context"]
        # loop through all the features associated to the current example
        for feature_index in feature_indices:
            # get the predictions of the model for this feature
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # map some of the positions in our logits to span of texts in the original context
            offset_mapping = features[feature_index]["offset_mapping"]

            # score of the null answer (CLS token) for this feature; it is computed here but not used below
            cls_index = features[feature_index]["input_ids"].index(tokenizer1.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            
            # traverse through the possibilities of the 'n_best_size' best start and end logits
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # ignore out-of-scope answers
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # ignore answers with a length that is either < 0 or > max_answer_length
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # in case we don't have a non-null prediction, we create a zero prediction to avoid failure
            best_answer = {"text": "", "score": 0.0}
        
        predictions[example["Id"]] = best_answer["text"]
        
    return predictions
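
**The core of the post-processing is the search over the best start and end positions. Below is a stripped-down sketch of that search on toy logits, without offsets or multiple features; the helper `best_span` and the logit values are hypothetical.**

```python
def best_span(start_logits, end_logits, n_best_size=20, max_answer_length=30):
    """Return (start, end, score) of the highest-scoring valid span, or None.
    A span is valid when end >= start and its length is at most
    `max_answer_length` tokens."""
    def top_indexes(logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best_size]

    best = None
    for s in top_indexes(start_logits):
        for e in top_indexes(end_logits):
            if e < s or e - s + 1 > max_answer_length:
                continue  # negative-length or too long answers are skipped
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

start_logits = [0.1, 2.0, 0.3, 1.5]
end_logits = [0.2, 0.1, 3.0, 2.5]
result = best_span(start_logits, end_logits, n_best_size=2, max_answer_length=2)
print(result)
```

**Here the candidates are the start positions 1 and 3 and the end positions 2 and 3; the pair (1, 3) is too long, the pair (3, 2) is reversed, and of the two remaining spans the one from token 1 to token 2 has the higher score.**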

**We apply our post-processing function to our raw predictions.**

In [30]:
final_predictions = postprocess_qa_predictions(test_dataset, validation_features, raw_predictions.predictions)

Post-processing 10388 example predictions split into 10609 features.


**We load the `squad` metric from the datasets library, which scores predictions on the SQuAD dataset with the `exact match` and `f1 score` measures.**

In [31]:
from datasets import load_metric

metric = load_metric("squad")

**We have to format the model's predictions and the SQuAD answers in order to pass them to the `metric.compute` method.**

In [32]:
predictions = [{"prediction_text": v, "id": k} for k, v in final_predictions.items()]

In [33]:
references = [] # initialize list to store references 
prevId = "" # variable to store Id of previous checked example

for ex in test_dataset:
  if ex["Id"] != prevId: # keep only first answer for examples with multiple answers
    references.append({"answers": {'answer_start': [ex['ans_start']], 'text': [ex["text"]]}, "id": ex["Id"]})
  prevId=ex["Id"]

**We notice that some questions of the SQuAD dataset have more than one possible answer, while our model predicts one answer per question. In order to compare each prediction with a single SQuAD answer, we keep just one answer for each question. Where more than one answer is provided, the answers usually look very similar, and most of the time they differ only in a leading preposition or article. For example, for the question 'When were the Normans in Normandy?' the possible answers are '10th and 11th centuries' and 'in the 10th and 11th centuries'. The first answer is a shortened version of the second, with the preposition 'in' and the article 'the' removed. The same goes for many QAs in the dataset. Moreover, our model's prediction for this question matches the first, shorter answer '10th and 11th centuries', and we observe the same for other questions as well. That's why we keep the first SQuAD answer for each question and ignore the others. Note that the `squad` metric also accepts several reference answers per question and scores each prediction against its best match, so keeping a single reference can only lower the scores or leave them unchanged; it cannot raise them.**

In [34]:
# first squad answer for the question: When were the Normans in Normandy?
test_dataset[1]

{'Id': '56ddde6b9a695914005b9629',
 '__index_level_0__': 4,
 'ans_start': 94,
 'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.',
 'question': 'When were the Normans in Normandy?',
 'text': '10th and 11th centuries',
 'title': 'Normans'}

In [35]:
# second squad answer for the question: When were the Normans in Normandy?
test_dataset[2]

{'Id': '56ddde6b9a695914005b9629',
 '__index_level_0__': 5,
 'ans_start': 87,
 'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.',
 'question': 'When were the Normans in Normandy?',
 'text': 'in the 10th and 11th centuries',
 'title': 'Normans'}

In [36]:
# model's predicted answer for the question: When were the Normans in Normandy?
predictions[1]

{'id': '56ddde6b9a695914005b9629',
 'prediction_text': '10th and 11th centuries'}
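
**For comparison, here is a sketch of how all the answers of a question could be grouped into one reference instead of keeping only the first. It runs on three rows copied from the test dataframe above; the variable names are ours, and this is an alternative to the cell above, not what the notebook does.**

```python
# rows copied from the test dataframe shown earlier (Id, ans_start, text)
rows = [
    {"Id": "56ddde6b9a695914005b9628", "ans_start": 159, "text": "France"},
    {"Id": "56ddde6b9a695914005b9629", "ans_start": 94, "text": "10th and 11th centuries"},
    {"Id": "56ddde6b9a695914005b9629", "ans_start": 87, "text": "in the 10th and 11th centuries"},
]

# group every answer under its question id (dicts preserve insertion order)
grouped = {}
for row in rows:
    answers = grouped.setdefault(row["Id"], {"answer_start": [], "text": []})
    answers["answer_start"].append(row["ans_start"])
    answers["text"].append(row["text"])

all_references = [{"id": k, "answers": v} for k, v in grouped.items()]
for ref in all_references:
    print(ref)
```

**The second question then carries both '10th and 11th centuries' and 'in the 10th and 11th centuries' as acceptable answers.**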

**Finally, we call the `metric.compute` method, which computes the `f1 score` and `exact match` metrics. We achieve a fairly high f1 score of about 78%. The exact match score of about 62.5% is also decent, meaning that our model often predicts the retained SQuAD answer exactly. We should also keep in mind that we trained our model for only 3 epochs, which makes the results more encouraging. The overfitting suggested by the rising Validation loss doesn't seem to have a strongly negative effect on our model's predictions. It would be really interesting to examine the model's behaviour over a larger number of epochs, focusing especially on the extent of overfitting and its effect on the prediction results.**

In [37]:
metric.compute(predictions=predictions, references=references)

{'exact_match': 62.53373819163293, 'f1': 78.27969934966353}
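
**To show what the two numbers measure, here is a self-contained sketch of exact match and token-level F1 in the spirit of the SQuAD v1.1 evaluation: answers are lowercased, stripped of punctuation and of the articles a / an / the, and then compared. This is our own re-implementation for illustration, not the code of the `squad` metric itself.**

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize_answer(prediction) == normalize_answer(reference))

def f1_score(prediction, reference):
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # shared tokens with counts
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

prediction = "10th and 11th centuries"
first_answer = "10th and 11th centuries"
second_answer = "in the 10th and 11th centuries"

em_first = exact_match(prediction, first_answer)
em_second = exact_match(prediction, second_answer)
f1_second = f1_score(prediction, second_answer)
print(em_first, em_second, round(f1_second, 4))
```

**Against the first answer the prediction is an exact match, while against the second it is not an exact match but still shares 4 of the 5 normalized reference tokens, which is why the f1 score is generally higher than the exact match score.**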