# Question Answering on SQuAD dataset


we will use a transformer-based architecture.<br>
The transformer used will be pre-trained on a generic task and then finetuned on the task at hand.<br>
The transformers' implementation that will be used will be provided by **HuggingFace** library.<br>
Let's start by installing it.

In [1]:
! pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
[K     |████████████████████████████████| 365 kB 30.6 MB/s 
[?25hCollecting transformers
  Downloading transformers-4.21.0-py3-none-any.whl (4.7 MB)
[K     |████████████████████████████████| 4.7 MB 57.3 MB/s 
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[K     |████████████████████████████████| 212 kB 67.5 MB/s 
[?25hCollecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
[K     |████████████████████████████████| 101 kB 11.5 MB/s 
[?25hCollecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting fsspec[http]>=2021.11.1
  Downloading fsspec-2022.7.1-py3-none-any.whl (141 kB)
[K     |████████████████████████████████| 141 kB 60.8 MB/s 
Collecting pyyaml>=5.1
  Downlo

## Loading the Dataset

### Dataset Downloading 
The dataset is a .json file loaded in a google drive.


In [2]:
!gdown --id "1aURk7-EAowXK-KXy7Ut1Y3z1X18kHv0E"

Downloading...
From: https://drive.google.com/uc?id=1aURk7-EAowXK-KXy7Ut1Y3z1X18kHv0E
To: /content/training_set.json
100% 30.3M/30.3M [00:00<00:00, 269MB/s]


### Dataset Creation

The dataset will be loaded using HuggingFace's loading function.

In [3]:
from datasets import load_dataset

json_file_path = "training_set.json"
ds_original = load_dataset('json', data_files= json_file_path, field='data')

Using custom data configuration default-b16d45e41968cbb2


Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-b16d45e41968cbb2/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-b16d45e41968cbb2/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

HuggingFace's loading function returns a dict-link object called `DatasetDict` that incapsulate the real dataset.
The dataset loaded will be stored under the key "train", as such it will subsequently splitted according to the projects requirenmets.

In [4]:
ds_original

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

In [5]:
# Print the 1st row
ds_original['train'][0]

{'paragraphs': [{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
   'qas': [{'answers': [{'answer_start': 515,
       'text': 'Saint Bernadette Soubirous'}],
     'id': '5733be284776f41900661182',
     'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'},
    {'answers': [{'answer_start': 188, 'text': 

we need to convert json file to dataframe to facilitaye dealing wuth it.

In [6]:
def generate_dataset(dataset, test = False):
  for data in dataset["train"]:
    title = data.get("title", "").strip()
    for paragraph in data["paragraphs"]:
      context = paragraph["context"].strip()
      for qa in paragraph["qas"]:
          # Handling questions
          question = qa["question"].strip()
          id_ = qa["id"]
          # Answers won't be present in the testing (compute_answers.py)
          if not test:
              # Handling answers
              for answer in qa["answers"]:
                answer_start = [answer["answer_start"]]
              for answer in qa["answers"]:
                answer_text = [answer["text"].strip()]

              yield id_, {
                "title": title,
                "context": context,
                "question": question,
                "id": id_,
                "answers": {
                    "answer_start": answer_start,
                    "text": answer_text,
                },
              }
          else:
              yield id_, {
              "title": title,
              "context": context,
              "question": question,
              "id": id_,
            }

The `generate_dataset` is then used to create a `DataFrame` that will contain the whole dataset framed as described above.

In [7]:
import pandas as pd

# Create a pandas dataframe that contains all the data
df = pd.DataFrame(
    [value[1] for value in generate_dataset(ds_original)]
)

The result is:

In [8]:
from IPython.display import display, HTML

def display_dataframe(df):
    display(HTML(df.to_html()))

In [9]:
display_dataframe(df.head())

Unnamed: 0,title,context,question,id,answers
0,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?,5733be284776f41900661182,"{'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}"
1,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",What is in front of the Notre Dame Main Building?,5733be284776f4190066117f,"{'answer_start': [188], 'text': ['a copper statue of Christ']}"
2,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",The Basilica of the Sacred heart at Notre Dame is beside to which structure?,5733be284776f41900661180,"{'answer_start': [279], 'text': ['the Main Building']}"
3,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",What is the Grotto at Notre Dame?,5733be284776f41900661181,"{'answer_start': [381], 'text': ['a Marian place of prayer and reflection']}"
4,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",What sits on top of the Main Building at Notre Dame?,5733be284776f4190066117e,"{'answer_start': [92], 'text': ['a golden statue of the Virgin Mary']}"


In [10]:
df.to_csv('Q_A.csv')

Number of newly generated rows:

In [11]:
n_answers = df['answers'].count()
print("Total samples:\n{}".format(n_answers))

Total samples:
87599


### Dataset Split
The dataset has to be splitted into training set and validation set.

In [12]:
from datasets import Dataset, DatasetDict

def split_train_validation(df, train_size):
    """
    Returns a DatasetDict with the train and validation splits.

    Parameters
    ----------
    df: Pandas.Dataframe
        Dataframe to split.
    train_size : int or float
        A number that specifies the size of the train split.
        If it is less or equal than 1, represents a percentage, else
        the train's number of samples 
    
    Returns
    -------
    DatasetDict(**dataset) : datasets.dataset_dict
        Dictionary containing as keys the train and validation split and 
        as values a dataset.

    """

    dataset = {}
    # Number of samples in df
    n_answers = df['answers'].count()
    if train_size <= 1 : s_train = n_answers * train_size 
    else: s_train= train_size
    # Count of answers by title, output is sorted asc
    df_bytitle = df.groupby(by='title')['answers'].count()
    # Cumulative sum over the DataFrame in order to select the train/validation titles
    # according to the train size
    train_title = df_bytitle[df_bytitle.sort_values().cumsum() < s_train]
    # Splitting the two dataframes
    df_train = df[df.title.isin(train_title.index.tolist())].reset_index(drop=True)
    df_validation = df[~df.title.isin(train_title.index.tolist())].reset_index(drop=True)
    # Building the two HuggingFace's datasets using train and validation dataframes
    dataset["train"]= Dataset.from_pandas(df_train)
    dataset["validation"]= Dataset.from_pandas(df_validation)

    return DatasetDict(**dataset)

Call `split_train_validation` in order to split in training and validation set the previously created `DataFrame`.

In [13]:
datasets = split_train_validation(df, 0.9)

The result is:

In [14]:
datasets

DatasetDict({
    train: Dataset({
        features: ['title', 'context', 'question', 'id', 'answers'],
        num_rows: 78428
    })
    validation: Dataset({
        features: ['title', 'context', 'question', 'id', 'answers'],
        num_rows: 9171
    })
})

## Preprocessing the Data

### Choosing the Model
As stated in the beginning what will be used is a transformer that has been pretrained on a generic task. Hence, in order to finetune it, it is important to faithfully **repeat the preprocessing steps used during the pre-training phase**. As such it's needed to define the model that it's going to be used straight from the preprocessing phase.<br>
Since in this context it's required to answer the questions not by generating new text but by extracting substring from a paragraph, the ideal type of transformer to be used is the **encoder** kind.
<figure class="image">
<img src="https://drive.google.com/uc?export=view&id=1A9BFo4m5zuVNceYccmS_thUiUhwQfJmm">
<figcaption>Typical structure of an encoder-based transformer.</figcaption>
</figure>

From this family of transformers it has been decided to use DistilBERT.

In [15]:
model_checkpoint = "distilbert-base-uncased"

### Loading the Tokenizer
The preprocessing it's handled by HuggingFace's `Tokenizer` class.<br>
This class is able to handle the preprocessing of the dataset in conformity with the specification of each pre-trained model present in HuggingFace's model hub. In particular they hold the vocabulary built in the pre-training phase and the tokenization methodology used: it generally is word-based, character-based or subword-based. DistilBERT uses the same as BERT, namely, end-to-end tokenization: punctuation splitting and wordpiece (subword segmentation).<br>
The method `AutoTokenizer.from_pretrained` will download the appropriate tokenizer.

In [16]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/455k [00:00<?, ?B/s]

### Handling Long Sequences
The transformer models have a maximum number of tokens they are able to process with this quantity varying depending on the architecture.<br>
A solution usually adopted in case of sequences longer than the limited amount (other than choosing a model that can handle longer sequences) is to **truncate** the sentence.<br>
While this approach may be effective for some tasks in this case it's **not a valid solution** since there would be the risk of truncating out from the context the answer to the question.<br>
In order to overcome this limitation what will be done is **sliding** the input sentence over the model with a certain **stride** allowing a certain degree of **overlap**. The overlap is necessary as to avoid the truncation of a sentence in a point where an answer lies.

In [17]:
max_length = 384 # Max length of the input sequence
stride = 128 # Overlap of the context

HuggingFace's tokenizer allow to perform this kind of operation by passing to the tokenizer the argument `return_overflowing_tokens=True` and by specifying the stride through the argument `stride`.

In [18]:
def tokenize(tokenizer, max_length, stride, row):
    pad_on_right = tokenizer.padding_side == "right"
    
    return tokenizer(
        row["question" if pad_on_right else "context"],
        row["context" if pad_on_right else "question"],
        max_length=max_length,
        truncation="only_second" if pad_on_right else "only_first",
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        stride=stride,
        padding="max_length"
    )

The division of a context in numerous truncated context create some issues regarding the detection of the answer inside the context since a pair of question-context may generate multiple pairs question-truncated context. This implies that using `answers["answer_start"]` is not sufficient anymore. As such, an ulterior preprocessing steps needs to be integrated in the preprocessing pipeline: the detection of the answers in the truncated contexts.

In [19]:
import collections

# This structure is used as an aid to the following functions since they will have to deal with a lot of start and end indexes.
Position = collections.namedtuple("Position", ["start","end"])

The first step is to retrieve the answer position in the original context.

In [20]:
def get_answer_position_in_context(answers):
    # Index of the answer starting character inside the context.
    start_char = answers["answer_start"][0]
    # Index of the answer ending character inside the context.
    end_char = start_char + len(answers["text"][0])
    
    return Position(start=start_char, end=end_char)

Since the tokenized input sequence encodes both the question and the context it is necessary to indentify which part of the sequence match the context.<br>
In order to complete this task the method `sequence_ids()` come into aid.<br>
In particular `sequence_ids()` tags the input tokens as `0` if they belong to the quesiton and `1` if they belong to the context (the reverse is instead true in the case the model pad the sequence to the left); `None` is for special tokens.

In [21]:
def get_context_position_in_tokenized_input(tokenized_row, i, pad_on_right):
    # List that holds for each index (up to the lenght of the tokenized input sequence)
    # 1 if its corresponding token is a context's token, 0 if it's a question's token
    # (the contrair if pad_on_right is true). Null for the special tokens.
    sequence_ids = tokenized_row.sequence_ids(i)

    # Start context's token's index inside the input sequence.
    token_start_index = sequence_ids.index(1 if pad_on_right else 0)

    # End context's token's index inside the input sequence.
    token_end_index = len(sequence_ids)-1 - list(reversed(sequence_ids)).index(1 if pad_on_right else 0)

    return Position(start=token_start_index, end=token_end_index)

In order to properly tag the position of an answer in a truncated context the answer itself needs to be fully included inside the truncated context, since partial answers may not be fully explicative, nor have grammatical consistence, ecc...<br>
Having the start and end answer's indexes inside the original context and the position of the truncated context inside the tokenized input sequence (which is composed by the question and the context), what's left it to identify the position of the answer in the tokenized and truncated context.<br>
This is done through the aid of the tokenized sequence attribute `offset_mapping` (obtained using the argument `return_offsets_mapping=True` to call the tokenizer) which indicates for each tokenized word its starting and ending index in the original sequence.

In [22]:
def get_answer_position_in_tokenized_input(offsets, char_pos, token_pos, cls_index):
    # Check if the answer fully included in the context.
    if offsets[token_pos.start][0] <= char_pos.start and offsets[token_pos.end][1] >= char_pos.end:
        # Starting token's index of the answer with respect to the input sequence.
        start_position = token_pos.start + next(i for i,v in enumerate([offset[0] for offset in offsets[token_pos.start:]]) if v > char_pos.start or i==token_pos.end+1) - 1
        # Ending token's index of the answer with respect to the input sequence.
        end_position = next(i for i,v in reversed(list(enumerate([offset[1] for offset in offsets[:token_pos.end+1]]))) if v < char_pos.end or i==token_pos.start-1) + 1

        return Position(start=start_position, end=end_position)
    else:
        return Position(start=cls_index, end=cls_index)

In [23]:
def preprocess_train(tokenizer, max_length, stride):
    pad_on_right = tokenizer.padding_side == "right"

    def preprocess_train_impl(rows):
        tokenized_rows = tokenize(tokenizer, max_length, stride, rows)
        # overflow_to_sample_mapping keeps the corrispondence between a feature and the row it was generated by.
        sample_mapping = tokenized_rows.pop("overflow_to_sample_mapping")
        # offset_mapping hold for each input token it's position in the textual counterpart
        # (be it the question or the context).
        offset_mapping = tokenized_rows.pop("offset_mapping")

        tokenized_rows["start_positions"] = []
        tokenized_rows["end_positions"] = []
        for i, offsets in enumerate(offset_mapping):
            input_ids = tokenized_rows["input_ids"][i]

            # cls is a special token. It will be used to label "impossible answers".
            cls_index = input_ids.index(tokenizer.cls_token_id)

            # One row can generate several truncated context, this is the index of the row containing this portion of context.
            sample_index = sample_mapping[i]
            answers = rows["answers"][sample_index]
            # If no answers are given, set the cls_index as answer.
            if len(answers["answer_start"]) == 0:
                pos = Position(cls_index,cls_index)
            else:
                char_pos = get_answer_position_in_context(answers)
                token_pos = get_context_position_in_tokenized_input(tokenized_rows, i, pad_on_right)
                pos = get_answer_position_in_tokenized_input(offsets, char_pos, token_pos, cls_index)

            tokenized_rows["start_positions"].append(pos.start)
            tokenized_rows["end_positions"].append(pos.end)

        return tokenized_rows
    return preprocess_train_impl

### Calling the Preprocessing Method
The `map` method of the DatasetDict apply a given function to each row of the dataset (to each dataset's split).

In [24]:
tokenized_datasets = datasets.map(preprocess_train(tokenizer, max_length, stride),
                                  batched=True,
                                  remove_columns=datasets["train"].column_names)

  0%|          | 0/79 [00:00<?, ?ba/s]

  0%|          | 0/10 [00:00<?, ?ba/s]

The result is:

In [25]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 79245
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 9279
    })
})

## Training

As previously mentioned it's going to be used a pretrained model and then finetuned on the task at hand. In particular DistilBERT, just like BERT, is trained to be used mainly on masked language modeling and next sentence prediction tasks.<br>
Since the model has already been defined during the preprocessing phase, it's now possible to direcly download it for HuggingFace Model Hub using the `from_pretrained` method.<br>
`AutoModel` is the class that instantiate the correct architecture based on the model downloaded from the hub. `AutoModelForQuestionAnswering` in addition attaches to the pretrained backbone the head needed to perform this kind of task (which is not pretrained).

In [26]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
import zipfile

#model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

!gdown --id "1ThyHyaFwci_SXLB6jrBnm6aacN74_YCd"

with zipfile.ZipFile('squad_trained.zip', 'r') as zip_ref:
    zip_ref.extractall('./')

model = AutoModelForQuestionAnswering.from_pretrained("squad_trained")

Downloading...
From: https://drive.google.com/uc?id=1ThyHyaFwci_SXLB6jrBnm6aacN74_YCd
To: /content/squad_trained.zip
100% 245M/245M [00:03<00:00, 80.7MB/s]


### Trainer Class Definition
The pretraining of the model will be handled by the class `Trainer`.<br>
Still, some things needs to be defined before being able to use the `Trainer` class.<br>
The first thing is the `TrainingArguments` which specify the saving folder, batch's size, learning rate, ecc...

In [27]:
batch_size = 16

args = TrainingArguments(
    "squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01
)

The second and last thing to define is the data collator, which is used to batch together sequences having different length.

In [28]:
from transformers import default_data_collator

data_collator = default_data_collator

Now it's finally possible to define the Trainer class.

In [29]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

### Finetuning
The method `train` of the `Trainer` class is used to trigger the finetuning process.

In [30]:
trainer.train()

***** Running training *****
  Num examples = 79245
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 14859


Epoch,Training Loss,Validation Loss
1,0.6399,1.391012
2,0.6449,1.335486
3,0.4452,1.46179


Saving model checkpoint to squad/checkpoint-500
Configuration saved in squad/checkpoint-500/config.json
Model weights saved in squad/checkpoint-500/pytorch_model.bin
tokenizer config file saved in squad/checkpoint-500/tokenizer_config.json
Special tokens file saved in squad/checkpoint-500/special_tokens_map.json
Saving model checkpoint to squad/checkpoint-1000
Configuration saved in squad/checkpoint-1000/config.json
Model weights saved in squad/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in squad/checkpoint-1000/tokenizer_config.json
Special tokens file saved in squad/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to squad/checkpoint-1500
Configuration saved in squad/checkpoint-1500/config.json
Model weights saved in squad/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in squad/checkpoint-1500/tokenizer_config.json
Special tokens file saved in squad/checkpoint-1500/special_tokens_map.json
Saving model checkpoint to squad/checkpoint-2000

TrainOutput(global_step=14859, training_loss=0.5720936070301065, metrics={'train_runtime': 8939.0667, 'train_samples_per_second': 26.595, 'train_steps_per_second': 1.662, 'total_flos': 2.329561105208064e+16, 'train_loss': 0.5720936070301065, 'epoch': 3.0})

Saving the model.

In [31]:
trainer.save_model("squad-trained")

Saving model checkpoint to squad-trained
Configuration saved in squad-trained/config.json
Model weights saved in squad-trained/pytorch_model.bin
tokenizer config file saved in squad-trained/tokenizer_config.json
Special tokens file saved in squad-trained/special_tokens_map.json


## Evaluation

The evaluation phase it's not straightforward and requires some additional steps in order to perform it.<br>
In particular the output of the model are the loss and two scores indicating the likelihood of a token being the start and end of the answer.<br>
Simply taking the argmax of both will not do since it may create unfeasible situations: start position greater than end position and/or start position at question (remember that the input senquence is composed by the union of the tokenized answer and tokenized context).

### Preprocessing the Evaluation Data
Before evaluating the model some processing steps are required: all the data necessary to avoid the aforementioned problems needs to be added to the dataset.<br>
The problem of the answer being located inside the question is addressed by adding the starting token of the context inside the unified input sequence.<br>
Thanks to the column `overflow_to_sample_mapping` it's also possible to have a reference between the features and the corresponding row.

In [32]:
def preprocess_eval(tokenizer, max_length, stride):
    pad_on_right = tokenizer.padding_side == "right"
    def preprocess_eval_impl(rows):
        # Tokenize the rows
        tokenized_rows = tokenize(tokenizer, max_length, stride, rows)

        # overflow_to_sample_mapping keeps the corrispondence between a feature and the row it was generated by.
        sample_mapping = tokenized_rows.pop("overflow_to_sample_mapping")

        # For each feature save the row that generated it.
        tokenized_rows["row_id"] = [rows["id"][sample_index] for sample_index in sample_mapping]

        # Save the start and end context's token's position inside the tokenized input sequence (composed by question plus context)
        context_pos = [get_context_position_in_tokenized_input(tokenized_rows,i,pad_on_right) for i in range(len(tokenized_rows["input_ids"]))]
        tokenized_rows["context_start"], tokenized_rows["context_end"] = [index.start for index in context_pos], [index.end for index in context_pos]

        return tokenized_rows
    return preprocess_eval_impl

In [33]:
validation_features = datasets["validation"].map(
    preprocess_eval(tokenizer, max_length, stride),
    batched=True,
    remove_columns=datasets["validation"].column_names
)

  0%|          | 0/10 [00:00<?, ?ba/s]

The validation's features generated from the preprocessing are used to compute the predictions.

In [34]:
raw_valid_predictions = trainer.predict(validation_features)

The following columns in the test set don't have a corresponding argument in `DistilBertForQuestionAnswering.forward` and have been ignored: context_start, row_id, offset_mapping, context_end. If context_start, row_id, offset_mapping, context_end are not expected by `DistilBertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 9279
  Batch size = 16


Since the `Trainer` class hides the columns not used during the prediction they have to be set back.

In [35]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

### Posprocessing the Evaluation Data
The aim of the posprocessing is: given the raw prediction (composed by the likelihoods of each input token to be the starting and ending token of the answer) the function retrieve the portion of the context's text corresponding to the predicted answer.

`get_best_feasible_position` function select the best possible pairs of starting and ending tokens for each answer.<br>
The problem is easily shapeable as a linear optimization problem.<br>
The function has been originally implemented by using `z3` library, but it has been sucessively discarded because of performance issues.<br>
The used implementation can be found after `z3`'s.

In [36]:
!pip install z3-solver

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting z3-solver
  Downloading z3_solver-4.10.1.0-py2.py3-none-manylinux1_x86_64.whl (52.9 MB)
[K     |████████████████████████████████| 52.9 MB 71 kB/s 
[?25hInstalling collected packages: z3-solver
Successfully installed z3-solver-4.10.1.0


In [37]:
from z3 import *

Score = collections.namedtuple("Score", ["index","score"])

def get_best_feasible_position(context_start, context_end, start_logits, end_logits):
    start_index = Int("start_index")
    end_index = Int("end_index")
    st_log = Array('st_log', IntSort(), RealSort())
    e_log = Array('e_log', IntSort(), RealSort())
    for i,sl in enumerate(start_logits):
        st_log = Store(st_log, i, sl)
    for i,el in enumerate(end_logits):
        e_log = Store(e_log, i, el)

    constraint = And(start_index < end_index,
                     start_index >= context_start,
                     end_index <= context_end)
    opt = Optimize()
    opt.add(constraint)
    opt.maximize(st_log[start_index]+e_log[end_index])
    if opt.check() == sat:
        model = opt.model()
        return Score(index=Position(start=model.evaluate(start_index).as_long(),
                                    end=model.evaluate(end_index).as_long()),
                     score=st_log[start_index]+e_log[end_index])
    else:
        raise StopIteration


In [38]:
Score = collections.namedtuple("Score", ["index","score"])

def get_best_feasible_position(start_logits, end_logits, context_start, context_end, n_logits=0.15):
    #Sort logits in ascending order
    sorted_start_logit = sorted(enumerate(start_logits), key=lambda x: x[1], reverse=True)[:int(len(start_logits)*n_logits)]
    sorted_end_logit = sorted(enumerate(end_logits), key=lambda x: x[1], reverse=True)[:int(len(end_logits)*n_logits)]

    # Associate the positions of each pair of start and end tokens to their score and sort them in descending order of score
    sorted_scores = collections.OrderedDict(
                            sorted({Position(start=i, end=j):sl+el for i,sl in sorted_start_logit for j,el in sorted_end_logit}.items(),
                                    key=lambda x: x[1],
                                    reverse=True)
                    )
    
    # Return the position of the pair of higher score that respects the consistency constraints
    return next(Score(index=pos, score=score) for pos,score in sorted_scores.items() \
                if pos.start <= pos.end and pos.start >= context_start and pos.end <= context_end)

`map_feature_to_row` uses the `row_id` that has been added during the preprocessing step in order to create a corrispondence between a feature and the row it belong to.

In [39]:
def map_feature_to_row(dataset, features):
    # Associate rows' id with an index
    row_id_to_index = {k: i for i, k in enumerate(dataset["id"])}
    features_per_row = collections.defaultdict(list)
    # Create a corrispondence beween the previously computed rows' index with
    # the index of the features that belong to the said rows
    for i, feature in enumerate(features):
        features_per_row[row_id_to_index[feature["row_id"]]].append(i)

    return features_per_row

The `postprocess_eval` function use the two function defined above and for each raw prediction returns a portion of context's text that best match it taking into account:
- The logits values outputted by the model.
- The consistency constraints mentioned above.

In [40]:
from tqdm.notebook import tqdm

def postprocess_eval(dataset, features, raw_predictions, verbose=True):
    all_start_logits, all_end_logits = raw_predictions

    # Map the dataset's rows to their corresponding features.
    features_per_row = map_feature_to_row(dataset, features)

    predictions = collections.OrderedDict()

    if verbose:
        print(f"Post-processing {len(dataset)} dataset predictions split into {len(features)} features.")

    for row_index, row in enumerate(tqdm(dataset)):
        valid_answers = []

        # Indices of the features associated to the current row.
        feature_indices = features_per_row[row_index]
        
        context = row["context"]
        # Loop on the features associated to the current row.
        for feature_index in feature_indices:
            context_start = features[feature_index]["context_start"]
            context_end = features[feature_index]["context_end"]

            offsets = features[feature_index]["offset_mapping"]

            # Computation of the answer from the raw preditions.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            try:
                valid_answers.append(get_best_feasible_position(start_logits, end_logits, context_start, context_end))
            except StopIteration:
                continue

        # For each row use as answer the best candidate generated by the row's features
        if len(valid_answers) > 0:
            answer_pos = sorted(valid_answers, key=lambda x: x.score, reverse=True)[0].index
            answer = context[offsets[answer_pos.start][0]: offsets[answer_pos.end][1]]
        # In case no candidates are found return an empty string
        else:
            print("Not found any consistent answer's start and/or end")
            answer = ""

        predictions[row["id"]] = answer

    return predictions

Calling the post-processing function over the validation set.

In [41]:
validation_predictions = postprocess_eval(datasets["validation"],
                                          validation_features,
                                          raw_valid_predictions.predictions)

Post-processing 9171 dataset predictions split into 9279 features.


  0%|          | 0/9171 [00:00<?, ?it/s]

### Compute Metrics
The metrics that are those provided from HuggingFace for the squad dataset: exact match and f1 score.

In [42]:
from datasets import load_metric

metric = load_metric("squad")

Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

In [43]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in validation_predictions.items()]
references = [{"id": r["id"], "answers": r["answers"]} for r in datasets["validation"]]

metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 69.1745720204994, 'f1': 80.48996193217269}

###  Error analysis
In order to analyze what kind of errors the model made, the mistaken predictions should first be retrieved.<br>
With "mistaken predictions" are intended those predictions that do not exactly match with the ground truth.

In [44]:
import re
import string

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
        return re.sub(regex, ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_articles(remove_punc(lower(s))))

In [46]:
actual_match = pd.DataFrame([{"question":row["question"], "context":row["context"], "ground_truth":row["answers"]["text"][0], "prediction":validation_predictions[row["id"]]}
                       for row in datasets["validation"] \
                       if normalize_answer(row["answers"]["text"][0]) == normalize_answer(validation_predictions[row["id"]])])

In [47]:
display_dataframe(actual_match.head(30))

Unnamed: 0,question,context,ground_truth,prediction
0,In what city and state did Beyonce grow up?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".","Houston, Texas","Houston, Texas"
1,In what R&B group was she the lead singer?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",Destiny's Child,Destiny's Child
2,What album made her a worldwide known artist?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",Dangerously in Love,Dangerously in Love
3,Who managed the Destiny's Child group?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",Mathew Knowles,Mathew Knowles
4,In what city did Beyonce grow up?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",Houston,Houston
5,What was the name of Beyonce's first solo album?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",Dangerously in Love,Dangerously in Love
6,On what date was Beyonce born?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".","September 4, 1981","September 4, 1981"
7,What is Beyonce's full name?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",Beyoncé Giselle Knowles-Carter,Beyoncé Giselle Knowles-Carter
8,When did Beyoncé rise to fame?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",late 1990s,late 1990s
9,What role did Beyoncé have in Destiny's Child?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",lead singer,lead singer


In [48]:
errors = pd.DataFrame([{"question":row["question"], "context":row["context"], "ground_truth":row["answers"]["text"][0], "prediction":validation_predictions[row["id"]]}
                       for row in datasets["validation"] \
                       if normalize_answer(row["answers"]["text"][0]) != normalize_answer(validation_predictions[row["id"]])])

Total number of mistaken predictions.

In [49]:
print("Wrong answers: {}/{}".format(len(errors),len(datasets["validation"])))

Wrong answers: 2827/9171


In order to check what kind of mistakes the model made, some of the errors will be displayed.<br>
First 30 errors:

In [50]:
# display_dataframe is defined in the Datast Creation paragraph
display_dataframe(errors.head(30))

Unnamed: 0,question,context,ground_truth,prediction
0,When did Beyonce start becoming popular?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",in the late 1990s,late 1990s
1,What areas did Beyonce compete in when she was growing up?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",singing and dancing,singing and dancing competitions
2,When did Beyonce leave Destiny's Child and become a solo singer?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",2003,1990s
3,In which decade did Beyonce become famous?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",late 1990s,1990s
4,"After her second solo album, what other entertainment venture did Beyonce explore?","Following the disbandment of Destiny's Child in June 2005, she released her second solo album, B'Day (2006), which contained hits ""Déjà Vu"", ""Irreplaceable"", and ""Beautiful Liar"". Beyoncé also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for ""Single Ladies (Put a Ring on It)"". Beyoncé took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyoncé (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.",acting,B'Day
5,Which album was darker in tone from her previous work?,"Following the disbandment of Destiny's Child in June 2005, she released her second solo album, B'Day (2006), which contained hits ""Déjà Vu"", ""Irreplaceable"", and ""Beautiful Liar"". Beyoncé also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for ""Single Ladies (Put a Ring on It)"". Beyoncé took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyoncé (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.",Beyoncé,4
6,"After what movie portraying Etta James, did Beyonce create Sasha Fierce?","Following the disbandment of Destiny's Child in June 2005, she released her second solo album, B'Day (2006), which contained hits ""Déjà Vu"", ""Irreplaceable"", and ""Beautiful Liar"". Beyoncé also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for ""Single Ladies (Put a Ring on It)"". Beyoncé took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyoncé (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.",Cadillac Records,Destiny's Child
7,"In her music, what are some recurring elements in them?","A self-described ""modern-day feminist"", Beyoncé creates songs that are often characterized by themes of love, relationships, and monogamy, as well as female sexuality and empowerment. On stage, her dynamic, highly choreographed performances have led to critics hailing her as one of the best entertainers in contemporary popular music. Throughout a career spanning 19 years, she has sold over 118 million records as a solo artist, and a further 60 million with Destiny's Child, making her one of the best-selling music artists of all time. She has won 20 Grammy Awards and is the most nominated woman in the award's history. The Recording Industry Association of America recognized her as the Top Certified Artist in America during the 2000s decade. In 2009, Billboard named her the Top Radio Songs Artist of the Decade, the Top Female Artist of the 2000s and their Artist of the Millennium in 2011. Time listed her among the 100 most influential people in the world in 2013 and 2014. Forbes magazine also listed her as the most powerful female musician of 2015.","love, relationships, and monogamy","love, relationships, and monogamy, as well as female sexuality and empowerment"
8,Time magazine named her one of the most 100 what people of the century?,"A self-described ""modern-day feminist"", Beyoncé creates songs that are often characterized by themes of love, relationships, and monogamy, as well as female sexuality and empowerment. On stage, her dynamic, highly choreographed performances have led to critics hailing her as one of the best entertainers in contemporary popular music. Throughout a career spanning 19 years, she has sold over 118 million records as a solo artist, and a further 60 million with Destiny's Child, making her one of the best-selling music artists of all time. She has won 20 Grammy Awards and is the most nominated woman in the award's history. The Recording Industry Association of America recognized her as the Top Certified Artist in America during the 2000s decade. In 2009, Billboard named her the Top Radio Songs Artist of the Decade, the Top Female Artist of the 2000s and their Artist of the Millennium in 2011. Time listed her among the 100 most influential people in the world in 2013 and 2014. Forbes magazine also listed her as the most powerful female musician of 2015.",influential,Time listed her among the 100 most influential people in the world
9,Which magazine declared her the most dominant woman musician?,"A self-described ""modern-day feminist"", Beyoncé creates songs that are often characterized by themes of love, relationships, and monogamy, as well as female sexuality and empowerment. On stage, her dynamic, highly choreographed performances have led to critics hailing her as one of the best entertainers in contemporary popular music. Throughout a career spanning 19 years, she has sold over 118 million records as a solo artist, and a further 60 million with Destiny's Child, making her one of the best-selling music artists of all time. She has won 20 Grammy Awards and is the most nominated woman in the award's history. The Recording Industry Association of America recognized her as the Top Certified Artist in America during the 2000s decade. In 2009, Billboard named her the Top Radio Songs Artist of the Decade, the Top Female Artist of the 2000s and their Artist of the Millennium in 2011. Time listed her among the 100 most influential people in the world in 2013 and 2014. Forbes magazine also listed her as the most powerful female musician of 2015.",Forbes,Forbes magazine


Random 30 errors:

In [51]:
display_dataframe(errors.sample(frac=1).reset_index(drop=True).head(30))

Unnamed: 0,question,context,ground_truth,prediction
0,What was a way in which a free peasant might become an aristocrat?,"Peasant society is much less documented than the nobility. Most of the surviving information available to historians comes from archaeology; few detailed written records documenting peasant life remain from before the 9th century. Most the descriptions of the lower classes come from either law codes or writers from the upper classes. Landholding patterns in the West were not uniform; some areas had greatly fragmented landholding patterns, but in other areas large contiguous blocks of land were the norm. These differences allowed for a wide variety of peasant societies, some dominated by aristocratic landholders and others having a great deal of autonomy. Land settlement also varied greatly. Some peasants lived in large settlements that numbered as many as 700 inhabitants. Others lived in small groups of a few families and still others lived on isolated farms spread over the countryside. There were also areas where the pattern was a mix of two or more of those systems. Unlike in the late Roman period, there was no sharp break between the legal status of the free peasant and the aristocrat, and it was possible for a free peasant's family to rise into the aristocracy over several generations through military service to a powerful lord.",military service,military service to a powerful lord
1,Under whom did the Western part of Umayyad Caliphate's empire gain its independence?,"After defeating the Visigoths in only a few months, the Umayyad Caliphate started expanding rapidly in the peninsula. Beginning in 711, the land that is now Portugal became part of the vast Umayyad Caliphate's empire of Damascus, which stretched from the Indus river in the Indian sub-continent (now Pakistan) up to the South of France, until its collapse in 750. That year the west of the empire gained its independence under Abd-ar-Rahman I with the establishment of the Emirate of Córdoba. After almost two centuries, the Emirate became the Caliphate of Córdoba in 929, until its dissolution a century later in 1031 into no less than 23 small kingdoms, called Taifa kingdoms.",Abd-ar-Rahman,Abd-ar-Rahman I
2,"After Gaddafi stepped down from the GPC, what title did he take?","In December 1978, Gaddafi stepped down as Secretary-General of the GPC, announcing his new focus on revolutionary rather than governmental activities; this was part of his new emphasis on separating the apparatus of the revolution from the government. Although no longer in a formal governmental post, he adopted the title of ""Leader of the Revolution"" and continued as commander-in-chief of the armed forces. He continued exerting considerable influence over Libya, with many critics insisting that the structure of Libya's direct democracy gave him ""the freedom to manipulate outcomes"".",Leader of the Revolution,Secretary-General
3,What did researcher Geng Qingguo say was sent to the State Seismological Bureau?,"Malaysia-based Yazhou Zhoukan conducted an interview with former researcher at the China Seismological Bureau Geng Qingguo (耿庆国), in which Geng claimed that a confidential written report was sent to the State Seismological Bureau on April 30, 2008, warning about the possible occurrence of a significant earthquake in Ngawa Prefecture region of Sichuan around May 8, with a range of 10 days before or after the quake. Geng, while acknowledging that earthquake prediction was broadly considered problematic by the scientific community, believed that ""the bigger the earthquake, the easier it is to predict."" Geng had long attempted to establish a correlation between the occurrence of droughts and earthquakes; Premier Zhou Enlai reportedly took an interest in Geng's work. Geng's drought-earthquake correlation theory was first released in 1972, and said to have successfully predicted the 1975 Haicheng and 1976 Tangshan earthquakes. The same Yazhou Zhoukan article pointed out the inherent difficulties associated with predicting earthquakes. In response, an official with the Seismological Bureau stated that ""earthquake prediction is widely acknowledged around the world to be difficult from a scientific standpoint."" The official also denied that the Seismological Bureau had received reports predicting the earthquake.",written report,a confidential written report
4,What replaced Russell's administration?,"Russell's ministry, though Whig, was not favoured by the Queen. She found particularly offensive the Foreign Secretary, Lord Palmerston, who often acted without consulting the Cabinet, the Prime Minister, or the Queen. Victoria complained to Russell that Palmerston sent official dispatches to foreign leaders without her knowledge, but Palmerston was retained in office and continued to act on his own initiative, despite her repeated remonstrances. It was only in 1851 that Palmerston was removed after he announced the British government's approval of President Louis-Napoleon Bonaparte's coup in France without consulting the Prime Minister. The following year, President Bonaparte was declared Emperor Napoleon III, by which time Russell's administration had been replaced by a short-lived minority government led by Lord Derby.",a short-lived minority government led by Lord Derby,a short-lived minority government
5,Who did a work for Italian television about Chopin's life?,"Chopin's life was covered in a BBC TV documentary Chopin – The Women Behind The Music (2010), and in a 2010 documentary realised by Angelo Bozzolini and Roberto Prosseda for Italian television.",Angelo Bozzolini and Roberto Prosseda,Roberto Prosseda
6,When did the last British troops leave Egypt?,"Nasser was informed of the British–American withdrawal via a news statement while aboard a plane returning to Cairo from Belgrade, and took great offense. Although ideas for nationalizing the Suez Canal were in the offing after the UK agreed to withdraw its military from Egypt in 1954 (the last British troops left on 13 June 1956), journalist Mohamed Hassanein Heikal asserts that Nasser made the final decision to nationalize the waterway between 19 and 20 July. Nasser himself would later state that he decided on 23 July, after studying the issue and deliberating with some of his advisers from the dissolved RCC, namely Boghdadi and technical specialist Mahmoud Younis, beginning on 21 July. The rest of the RCC's former members were informed of the decision on 24 July, while the bulk of the cabinet was unaware of the nationalization scheme until hours before Nasser publicly announced it. According to Ramadan, Nasser's decision to nationalize the canal was a solitary decision, taken without consultation.",1956,13 June 1956
7,What was the dress rehearsal for?,"At the funeral of the tenor Adolphe Nourrit in Paris in 1839, Chopin made a rare appearance at the organ, playing a transcription of Franz Schubert's lied Die Gestirne. On 26 July 1840 Chopin and Sand were present at the dress rehearsal of Berlioz's Grande symphonie funèbre et triomphale, composed to commemorate the tenth anniversary of the July Revolution. Chopin was reportedly unimpressed with the composition.",Berlioz's Grande symphonie funèbre et triomphale,"Grande symphonie funèbre et triomphale, composed to commemorate the tenth anniversary of the July Revolution"
8,What were all these tents and quilts for?,"The Red Cross Society of China flew 557 tents and 2,500 quilts valued at 788,000 yuan (US$113,000) to Wenchuan County. The Amity Foundation already began relief work in the region and has earmarked US$143,000 for disaster relief. The Sichuan Ministry of Civil Affairs said that they have provided 30,000 tents for those left homeless.",those left homeless,disaster relief
9,Whose lock of hair was concealed in her left hand by flowers?,"In 1897, Victoria had written instructions for her funeral, which was to be military as befitting a soldier's daughter and the head of the army, and white instead of black. On 25 January, Edward VII, the Kaiser and Prince Arthur, Duke of Connaught, helped lift her body into the coffin. She was dressed in a white dress and her wedding veil. An array of mementos commemorating her extended family, friends and servants were laid in the coffin with her, at her request, by her doctor and dressers. One of Albert's dressing gowns was placed by her side, with a plaster cast of his hand, while a lock of John Brown's hair, along with a picture of him, was placed in her left hand concealed from the view of the family by a carefully positioned bunch of flowers. Items of jewellery placed on Victoria included the wedding ring of John Brown's mother, given to her by Brown in 1883. Her funeral was held on Saturday, 2 February, in St George's Chapel, Windsor Castle, and after two days of lying-in-state, she was interred beside Prince Albert in Frogmore Mausoleum at Windsor Great Park. As she was laid to rest at the mausoleum, it began to snow.",John Brown,John Brown's


Retrieve an error by querying by question.

In [52]:
def get_error(errors, question):
    return errors[errors['question']==question]

In [53]:
display_dataframe(get_error(errors, 'What genre of movie did Beyonce star in with Cuba Gooding, Jr?'))

Unnamed: 0,question,context,ground_truth,prediction
30,"What genre of movie did Beyonce star in with Cuba Gooding, Jr?","In July 2002, Beyoncé continued her acting career playing Foxxy Cleopatra alongside Mike Myers in the comedy film, Austin Powers in Goldmember, which spent its first weekend atop the US box office and grossed $73 million. Beyoncé released ""Work It Out"" as the lead single from its soundtrack album which entered the top ten in the UK, Norway, and Belgium. In 2003, Beyoncé starred opposite Cuba Gooding, Jr., in the musical comedy The Fighting Temptations as Lilly, a single mother whom Gooding's character falls in love with. The film received mixed reviews from critics but grossed $30 million in the U.S. Beyoncé released ""Fighting Temptation"" as the lead single from the film's soundtrack album, with Missy Elliott, MC Lyte, and Free which was also used to promote the film. Another of Beyoncé's contributions to the soundtrack, ""Summertime"", fared better on the US charts.",musical comedy,The Fighting Temptations
