# MobileBERT for Question Answering on the SQuAD dataset

### 1. Understanding the SQuAD dataset 

In these notebooks we are going to use [MobileBERT implemented by HuggingFace](https://huggingface.co/docs/transformers/model_doc/mobilebert) for extractive question answering on the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/). The data consists of a set of questions and paragraphs that contain the answers. The model will be trained to locate the answer in the context by predicting the positions where the answer starts and ends.

In this notebook we are going to explore the dataset and see how to set it up for fine-tuning.

More info from HuggingFace docs:
- [Question Answering](https://huggingface.co/tasks/question-answering)
- [Glossary](https://huggingface.co/transformers/glossary.html#model-inputs)
- [Question Answering chapter of NLP course](https://huggingface.co/learn/nlp-course/chapter7/7?fw=pt)

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer
from rich.pretty import pprint



In [2]:
from datasets.utils import disable_progress_bar
from datasets import disable_caching


disable_progress_bar()
disable_caching()

## The raw data

In [3]:
# Load the dataset
hf_dataset = load_dataset('squad')

In [4]:
# Display the dataset to check how it is partitioned
hf_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [5]:
# Let's check five train set samples to see how they look
for _squad_example in hf_dataset['train'].select(range(5)):
    pprint(_squad_example)

In [6]:
# Let's check five validation samples to see how they look
for _squad_example in hf_dataset['validation'].select(range(5)):
    pprint(_squad_example)

In [7]:
# select() returns a one-row Dataset; indexing it by column name
# returns a list with a single value
squad_ex = hf_dataset['train'].select([20584])

In [8]:
squad_ex['title']

['Alps']

In [9]:
squad_ex['context']

['The Alps (/ælps/; Italian: Alpi [ˈalpi]; French: Alpes [alp]; German: Alpen [ˈʔalpm̩]; Slovene: Alpe [ˈáːlpɛ]) are the highest and most extensive mountain range system that lies entirely in Europe, stretching approximately 1,200 kilometres (750 mi) across eight Alpine countries: Austria, France, Germany, Italy, Liechtenstein, Monaco, Slovenia, and Switzerland. The Caucasus Mountains are higher, and the Urals longer, but both lie partly in Asia. The mountains were formed over tens of millions of years as the African and Eurasian tectonic plates collided. Extreme shortening caused by the event resulted in marine sedimentary rocks rising by thrusting and folding into high mountain peaks such as Mont Blanc and the Matterhorn. Mont Blanc spans the French–Italian border, and at 4,810 m (15,781 ft) is the highest mountain in the Alps. The Alpine region area contains about a hundred peaks higher than 4,000 m (13,123 ft), known as the "four-thousanders".']

In [10]:
squad_ex['question']

['How long has it taken for the Alps to form? ']

In [11]:
squad_ex['answers']

[{'text': ['over tens of millions of years'], 'answer_start': [475]}]
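The `answer_start` value is a character offset into the context, counted from 0. The following minimal sketch illustrates this on a single sentence excerpted from the context above, so the offset it finds differs from the 475 reported for the full paragraph. The variable names are illustrative only.

```python
# One sentence excerpted from the context above; the offset inside this
# excerpt is not the 475 of the full paragraph.
excerpt = ("The mountains were formed over tens of millions of years "
           "as the African and Eurasian tectonic plates collided.")
answer_text = "over tens of millions of years"

# Character offset (0-based) of the answer inside the excerpt
answer_start = excerpt.find(answer_text)
answer_end = answer_start + len(answer_text)

# Slicing the text with the two offsets recovers the answer
recovered = excerpt[answer_start:answer_end]
print(answer_start, repr(recovered))
```

The same slicing applied to the full context with `answer_start = 475` should return the answer text stored in the dataset.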

## The tokenizer

## Processing the data for training
Now we process the data so we can later feed it to the model.
The idea is to replace the words (and some word parts) with numbers using the model's tokenizer and to organize the training data as pairs of paragraphs and questions.

In [12]:
# We will work with this model
hf_model = 'google/mobilebert-uncased'

In [75]:
# Load the tokenizer that was used to pretrain that model,
# so that the text is split and numbered in the same way
tokenizer = AutoTokenizer.from_pretrained(hf_model)

In [76]:
tokenizer

MobileBertTokenizerFast(name_or_path='google/mobilebert-uncased', vocab_size=30522, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

## Question

1. Check the `tokenizer` object and find out its vocabulary length

In [77]:
tokenizer.vocab_size

30522

## Processing the data

There are a few preprocessing operations that we need to apply to the dataset so it can be fed to the HuggingFace MobileBERT model class:
 1. Tokenize the contexts and the questions with the tokenizer we loaded (it already outputs a dictionary in the shape the model expects).
 2. Convert the start and end positions of the answers from character offsets to token positions. For example, in the string `"the cat sat on the mat"`, the answer to the question 'Where did the cat sit?', `"the mat"`, starts at character 15 (counting from 0). In the dataset it would appear as `{'answer_start': [15]}`. If the sentence is tokenized as `["the", "cat", "sat", "on", "the", "mat"]`, the answer starts at the fifth token, which is position 5 in the model input once the leading `[CLS]` token is counted.
 3. Discard the question/context pairs where the answer falls outside the truncation length of the tokenizer. This is done only to keep the tutorial simple, as it can result in loss of information and potentially hurt the performance of the model. Instead of discarding these pairs, one can build smaller contexts by removing the beginning so that the answers fit. Find more info in the [Question Answering chapter of HuggingFace's NLP course](https://huggingface.co/learn/nlp-course/chapter7/7?fw=pt).
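The conversion in step 2 can be sketched without the real tokenizer. The example below is a simplified illustration that assumes whitespace splitting as a stand-in for WordPiece; `char_to_token_span` is a hypothetical helper, not part of any library. Its indices are 0-based and do not count the `[CLS]` token that the real tokenizer prepends, which shifts every position by one.

```python
def char_to_token_span(context, answer_text, answer_start):
    """Map a character-level answer span to token indices, using
    whitespace splitting as a stand-in for the real tokenizer."""
    # Number of whole tokens that precede the answer
    token_start = len(context[:answer_start].split())
    # Exclusive end index: one past the last answer token
    token_end = token_start + len(answer_text.split())
    return token_start, token_end


context = "the cat sat on the mat"
tokens = context.split()

# With 0-based character offsets, "the mat" starts at character 15
start, end = char_to_token_span(context, "the mat", 15)
print(start, end, tokens[start:end])
```

With a subword tokenizer the counts differ from the number of words, but the idea is the same: count the tokens in front of the answer, then add the number of tokens in the answer.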

In [47]:
squad_ex = hf_dataset['train'].select([20584])
answer_start = squad_ex['answers'][0]['answer_start'][0]
answer_start

475

In [49]:
squad_ex = hf_dataset['train'].select([20584])
# The full "answers" entry holds both the answer text and its start offset
answers = squad_ex['answers']
answers

[{'text': ['over tens of millions of years'], 'answer_start': [475]}]

In [83]:
# Maximum sequence length, in tokens, of each model input
MAX_SEQ_LEN = 300
# `tokenize_dataset` handles one example at a time, so apply it with
# `Dataset.map` rather than calling it on a whole dataset.
# padding='max_length' pads every input to the same length, and
# truncation='only_first' truncates only the context if the pair is too long.
def tokenize_dataset(squad_example, tokenizer=tokenizer):
    """Tokenize the text in the dataset and convert
    the start and ending positions of the answers
    from text to tokens"""
    max_len = MAX_SEQ_LEN
    context = squad_example['context']
    answer_start = squad_example['answers']['answer_start'][0]
    answer = squad_example['answers']['text'][0]
    squad_example_tokenized = tokenizer(
        context, squad_example['question'],
        padding='max_length',
        max_length=max_len,
        truncation='only_first',
    )
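    # Tokenizing the context up to (and including) the first character of
    # the answer gives the number of tokens before the answer plus one.
    # The extra one accounts for the [CLS] token at position 0 of the input.
    token_start = len(tokenizer.tokenize(context[:answer_start + 1]))
    # token_start plus the number of answer tokens, i.e. the position
    # right after the last answer token
    token_end = len(tokenizer.tokenize(answer)) + token_start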

    # Add the "start_token_idx" and "end_token_idx" keys to the 
    # `squad_example_tokenized` dictionary
    squad_example_tokenized['start_token_idx'] = token_start
    squad_example_tokenized['end_token_idx'] = token_end

    return squad_example_tokenized


def filter_samples_by_max_seq_len(squad_example):
    """Filter out the samples where the answers are
    not within the first `MAX_SEQ_LEN` tokens"""
    max_len = MAX_SEQ_LEN
    answer_start = squad_example['answers']['answer_start'][0]
    answer = squad_example['answers']['text'][0]
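    token_start = len(tokenizer.tokenize(squad_example['context'][:answer_start]))
    token_end = len(tokenizer.tokenize(answer)) + token_start
    # Note: this check counts context tokens only; the question and the
    # special tokens also take up room in the final input
    return token_end < max_len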

## Questions

1. In the function `tokenize_dataset`, what does the following code do? Try it outside of the function with one of the context and questions we extracted above. Make sure you understand all the arguments ;)
```python
    squad_example_tokenized = tokenizer(
        context, squad_example['question'],
        padding='max_length',
        max_length=max_len,
        truncation='only_first',
    )
```
2. Make sure you understand how `token_start` and `token_end` are obtained.
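As a hint for the `padding` and `truncation` arguments, here is a toy imitation of what `padding='max_length'` with `truncation='only_first'` does to a context/question pair. It is only a sketch: `encode_pair` is a hypothetical helper that works on lists of whitespace-split words rather than on real token ids.

```python
def encode_pair(context_tokens, question_tokens, max_len):
    """Toy imitation of padding='max_length' with truncation='only_first'."""
    # Layout: [CLS] context [SEP] question [SEP] -> 3 special tokens
    context_budget = max_len - len(question_tokens) - 3
    if context_budget < 0:
        raise ValueError("max_len is too small for the question alone")
    # Only the first sequence (the context) is shortened
    kept = context_tokens[:context_budget]
    tokens = ["[CLS]", *kept, "[SEP]", *question_tokens, "[SEP]"]
    # Right-pad up to max_len
    return tokens + ["[PAD]"] * (max_len - len(tokens))


context_tokens = "the cat sat on the mat".split()
question_tokens = "where did the cat sit ?".split()

# max_len=12 leaves room for only three context tokens
short = encode_pair(context_tokens, question_tokens, max_len=12)
# max_len=16 fits the whole pair and needs one padding token
long = encode_pair(context_tokens, question_tokens, max_len=16)
print(short)
print(long)
```

In the shorter encoding the end of the context, and with it the answer, is cut off. This is the situation that the filtering step is meant to avoid. Note that the room left for the context depends on the length of the question.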

In [95]:
# Try the tokenizer call from `tokenize_dataset` on the example above
context = squad_ex['context']
max_len = 300
squad_example_tokenized = tokenizer(
    context, squad_ex['question'],
    padding='max_length',
    max_length=max_len,
    truncation='only_first',
)

In [96]:
squad_example_tokenized

{'input_ids': [[101, 1996, 13698, 1006, 1013, 1097, 14277, 2015, 1013, 1025, 3059, 1024, 2632, 8197, 1031, 1149, 2389, 8197, 1033, 1025, 2413, 1024, 2632, 10374, 1031, 2632, 2361, 1033, 1025, 2446, 1024, 2632, 11837, 1031, 1149, 29705, 2389, 9737, 1033, 1025, 18326, 1024, 2632, 5051, 1031, 1149, 2050, 23432, 14277, 29275, 1033, 1007, 2024, 1996, 3284, 1998, 2087, 4866, 3137, 2846, 2291, 2008, 3658, 4498, 1999, 2885, 1010, 10917, 3155, 1015, 1010, 3263, 3717, 1006, 9683, 2771, 1007, 2408, 2809, 10348, 3032, 1024, 5118, 1010, 2605, 1010, 2762, 1010, 3304, 1010, 26500, 1010, 14497, 1010, 10307, 1010, 1998, 5288, 1012, 1996, 16512, 4020, 2024, 3020, 1010, 1998, 1996, 24471, 9777, 2936, 1010, 2021, 2119, 4682, 6576, 1999, 4021, 1012, 1996, 4020, 2020, 2719, 2058, 15295, 1997, 8817, 1997, 2086, 2004, 1996, 3060, 1998, 23399, 8915, 28312, 2594, 7766, 17745, 1012, 6034, 2460, 7406, 3303, 2011, 1996, 2724, 4504, 1999, 3884, 25503, 5749, 4803, 2011, 21468, 1998, 12745, 2046, 2152, 3137, 11373, 2

In [98]:
# Apply the filtering function through a filter
dataset_filtered = hf_dataset.filter(
    filter_samples_by_max_seq_len,
    num_proc=24,
)

# Display the dataset and compare with the original dataset
# to see how many samples were filtered out
dataset_filtered

# This makes sure that the data has the right format

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87289
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10511
    })
})

In [99]:
# Apply the tokenizing function through a map
# and remove the text-containing entries of the dataset:
# `remove_columns` drops the original columns, including "answers"
dataset_tok = dataset_filtered.map(
    tokenize_dataset,
    remove_columns=hf_dataset['train'].column_names,
    num_proc=24,
)

# Convert the internal format of the dataset to pytorch
dataset_tok.set_format('pt')

# Display the tokenized dataset and compare its features
# with those of the filtered dataset
dataset_tok

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_token_idx', 'end_token_idx'],
        num_rows: 87289
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_token_idx', 'end_token_idx'],
        num_rows: 10511
    })
})

## The training set

In [100]:
train_dataset = dataset_tok["train"]
train_dataset

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_token_idx', 'end_token_idx'],
    num_rows: 87289
})

In [101]:
# Print the sample 20299 of the training set to see how it looks
train_sample = train_dataset.select([20299])[0]
pprint(train_sample)
# In input_ids, 101 is [CLS], 102 is [SEP] and 0 is [PAD];
# all other numbers are the ids of the text tokens

### The `input_ids` key (Question)

1. What is the `"input_ids"` key in the tokenized dataset? Use `tokenizer.decode()` to "de-tokenize" the sample `train_sample['input_ids']` back to text.

The input ids are the token ids of the context, followed by those of the question after the first `[SEP]`.

### The `attention_mask` key
The attention masks differentiate what is text and what is padding. More info [here](https://huggingface.co/transformers/glossary.html#attention-mask).

The attention mask is 1 at the positions that hold real tokens (context, question and special tokens) and 0 at the padding positions.
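The relation between padding and the attention mask can be sketched in plain Python. This is a simplified illustration: `pad_with_mask` is a hypothetical helper, and it assumes the input is not longer than `max_len`. The ids 101, 102 and 0 are those of `[CLS]`, `[SEP]` and `[PAD]` in the tokenizer printed above, and 1996 and 13698 are the first two text ids of the tokenized example shown earlier.

```python
PAD_ID = 0  # id of [PAD] in the tokenizer printed above


def pad_with_mask(input_ids, max_len, pad_id=PAD_ID):
    """Right-pad a list of token ids and build the matching attention mask."""
    n_pad = max_len - len(input_ids)
    padded = input_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding
    mask = [1] * len(input_ids) + [0] * n_pad
    return padded, mask


# [CLS], two text tokens, [SEP]
ids = [101, 1996, 13698, 102]
padded, mask = pad_with_mask(ids, max_len=8)

# Keeping only the positions where the mask is 1 removes the padding
unpadded = [i for i, m in zip(padded, mask) if m == 1]
print(padded, mask, unpadded)
```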

In [90]:
train_sample['attention_mask']

tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [91]:
# Filter out the padding tokens by evaluating the train sample on
# train_sample['attention_mask'] == 1
context_encoded = train_sample['input_ids'][train_sample['attention_mask'] == 1]
tokenizer.decode(context_encoded)

'[CLS] in 1374 king louis of hungary approved the privilege of koszyce ( polish : " przywilej koszycki " or " ugoda koszycka " ) in kosice in order to guarantee the polish throne for his daughter jadwiga. he broadened the definition of who was a member of the nobility and exempted the entire class from all but one tax ( łanowy, which was limited to 2 grosze from łan ( an old measure of land size ) ). in addition, the king \' s right to raise taxes was abolished ; no new taxes could be raised without the agreement of the nobility. henceforth, also, district offices ( polish : " urzedy ziemskie " ) were reserved exclusively for local nobility, as the privilege of koszyce forbade the king to grant official posts and major polish castles to foreign knights. finally, this privilege obliged the king to pay indemnities to nobles injured or taken captive during a war outside polish borders. [SEP] why did king louis approve the privilege? [SEP]'

### The `token_type_ids` key
The token type ids differentiate two types of tokens: the ones that correspond to the context and the ones that correspond to the question. More info [here](https://huggingface.co/transformers/glossary.html#token-type-ids).

They mark where the question is. This is needed because the tokenizer combines the context and the question into a single sequence.
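The layout of the token type ids can be sketched as follows. This is a minimal illustration that assumes the `[CLS] context [SEP] question [SEP]` layout followed by padding; `build_token_type_ids` is a hypothetical helper that takes token counts rather than real tokens.

```python
def build_token_type_ids(n_context, n_question, max_len):
    """Segment ids for [CLS] context [SEP] question [SEP] plus padding."""
    first = [0] * (1 + n_context + 1)   # [CLS], context tokens, first [SEP]
    second = [1] * (n_question + 1)     # question tokens, final [SEP]
    # Padding positions get 0, like the first segment
    padding = [0] * (max_len - len(first) - len(second))
    return first + second + padding


# 3 context tokens and 2 question tokens, padded to 10 positions
type_ids = build_token_type_ids(n_context=3, n_question=2, max_len=10)
print(type_ids)
```

Because padding positions are also 0, selecting `token_type_ids == 0` picks up the padding together with the context, while `token_type_ids == 1` isolates the question and its closing `[SEP]`.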

In [92]:
# Display the token type ids of the sample:
# 1 marks the question tokens, 0 everything else
train_sample['token_type_ids']

tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [93]:
# Filter out the question tokens by evaluating the train sample on
# train_sample['token_type_ids'] == 0
paragraph_encoded = train_sample['input_ids'][train_sample['token_type_ids'] == 0]
tokenizer.decode(paragraph_encoded, skip_special_tokens=True)

'in 1374 king louis of hungary approved the privilege of koszyce ( polish : " przywilej koszycki " or " ugoda koszycka " ) in kosice in order to guarantee the polish throne for his daughter jadwiga. he broadened the definition of who was a member of the nobility and exempted the entire class from all but one tax ( łanowy, which was limited to 2 grosze from łan ( an old measure of land size ) ). in addition, the king \' s right to raise taxes was abolished ; no new taxes could be raised without the agreement of the nobility. henceforth, also, district offices ( polish : " urzedy ziemskie " ) were reserved exclusively for local nobility, as the privilege of koszyce forbade the king to grant official posts and major polish castles to foreign knights. finally, this privilege obliged the king to pay indemnities to nobles injured or taken captive during a war outside polish borders.'

In [94]:
# Filter out the context tokens by evaluating the train sample on
# train_sample['token_type_ids'] == 1
question_encoded = train_sample['input_ids'][train_sample['token_type_ids'] == 1]
tokenizer.decode(question_encoded, skip_special_tokens=True)

'why did king louis approve the privilege?'