# MobileBERT for Question Answering on the SQuAD dataset

### 1. Understanding the SQuAD dataset 

In these notebooks we are going to use [MobileBERT as implemented by HuggingFace](https://huggingface.co/docs/transformers/model_doc/mobilebert) for extractive question answering on the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/). The data consists of questions paired with paragraphs that contain the answers. The model will be trained to locate the answer in the context by predicting the positions where the answer starts and ends.

In this notebook we are going to explore the dataset and see how to set it up for fine-tuning.
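
To make the task concrete, here is a minimal sketch of one record in the SQuAD field layout and of how the character position recovers the answer. The record is hand-written for illustration and is not taken from the dataset; only the field names follow SQuAD.

```python
# A hand-written record that mimics the SQuAD field layout (illustrative only)
toy_example = {
    "id": "toy-0",
    "title": "Cats",
    "context": "the cat sat on the mat",
    "question": "Where did the cat sit?",
    "answers": {"text": ["the mat"], "answer_start": [15]},
}


def extract_answer(example):
    """Slice the answer out of the context using its character offsets."""
    start = example["answers"]["answer_start"][0]  # 0-based character index
    end = start + len(example["answers"]["text"][0])  # exclusive end
    return example["context"][start:end]


print(extract_answer(toy_example))
```

Slicing the context with `answer_start` and the answer length should give back exactly the answer text, which is a handy sanity check on any sample.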

More info from HuggingFace docs:
- [Question Answering](https://huggingface.co/tasks/question-answering)
- [Glossary](https://huggingface.co/transformers/glossary.html#model-inputs)
- [Question Answering chapter of NLP course](https://huggingface.co/learn/nlp-course/chapter7/7?fw=pt)

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
from rich.pretty import pprint

In [None]:
from datasets.utils import disable_progress_bar
from datasets import disable_caching


disable_progress_bar()
disable_caching()

## The raw data

In [None]:
# Load the dataset
hf_dataset = load_dataset('squad')

In [None]:
# Display the dataset to check how it is partitioned
hf_dataset

In [None]:
# Let's check five train set samples to see how they look
for _squad_example in hf_dataset['train'].select(range(5)):
    pprint(_squad_example)

In [None]:
# Let's check five validation set samples to see how they look
for _squad_example in hf_dataset['validation'].select(range(5)):
    pprint(_squad_example)

In [None]:
# Select a single sample; its fields can then be accessed
# by name, like the keys of a dictionary
squad_ex = hf_dataset['train'].select([20584])

In [None]:
squad_ex['title']

In [None]:
squad_ex['context']

In [None]:
squad_ex['question']

In [None]:
squad_ex['answers']

# The tokenizer

## Processing the data for training
Now we process the data so that we can feed it to the model later.
The idea is to replace the words (and some word parts) by numbers using a tokenizer, and to organize the training data as a set of paragraphs and questions.
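
As a rough illustration of "replacing words and word parts by numbers", the sketch below implements a greedy longest-match split in the WordPiece style used by BERT-family tokenizers. The vocabulary `toy_vocab` and the helper names are hypothetical and far smaller than any real vocabulary; the real tokenizer also handles punctuation and normalization, which this sketch ignores.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, then map every piece to its id."""
    tokens = [p for w in text.lower().split() for p in wordpiece_tokenize(w, vocab)]
    ids = [vocab[t] for t in tokens]
    return tokens, ids


# Hypothetical toy vocabulary: token -> id
toy_vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5, "##s": 6}

tokens, ids = encode("The cats sat on the mat", toy_vocab)
print(tokens)
print(ids)
```

Note how `cats` is not in the toy vocabulary but is still representable as `cat` followed by the continuation piece `##s`.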

In [None]:
# We will work with this model
hf_model = 'google/mobilebert-uncased'

In [None]:
# Extract the tokenizer that was used for pretraining that model
tokenizer = AutoTokenizer.from_pretrained(hf_model)

## Question

1. Check the `tokenizer` object and find out its vocabulary length

## Processing the data

There are a few preprocessing operations that we need to apply to the dataset so it can be fed to the HuggingFace MobileBERT model class:
 1. Tokenize the contexts and the questions with the tokenizer we extracted (it already outputs a dictionary in the shape the model expects).
 2. Convert the start and end positions from relative-to-character to relative-to-token. For example, in the string `"the cat sat on the mat"`, the answer `"the mat"` to the question 'Where did the cat sit' starts at character 15 (counting from 0). In the dataset it would appear as `{'answer_start': [15]}`. If the sentence is tokenized as `["[CLS]", "the", "cat", "sat", "on", "the", "mat"]`, the answer starts at token 5 (also counting from 0, with the special `[CLS]` token at position 0).
 3. Discard the question/context pairs where the answer appears outside of the truncation length of the tokenizer. This is done only to keep the tutorial simple, as it can result in loss of information and potentially impact the performance of the model. Instead of discarding those samples, one can build smaller contexts by removing the beginning so that the answers fit. Find more info in the [Question Answering chapter of HuggingFace's NLP course](https://huggingface.co/learn/nlp-course/chapter7/7?fw=pt).
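
The character-to-token conversion in step 2 can be sketched as follows. This is a simplified illustration, not the approach used in the cells below: it assumes tokens are whitespace-separated words, and the function name `char_span_to_token_span` and its `offset` parameter are hypothetical. The `offset` accounts for special tokens such as `[CLS]` placed before the context.

```python
import re


def char_span_to_token_span(context, answer_start, answer_text, offset=1):
    """Map a character span to inclusive token indices.

    Tokens are whitespace-separated words here (a simplification); `offset`
    shifts the indices to make room for special tokens before the context.
    """
    answer_end = answer_start + len(answer_text)  # exclusive character end
    # (start_char, end_char) for every toy token in the context
    spans = [m.span() for m in re.finditer(r"\S+", context)]
    # Keep the tokens that overlap the answer's character range
    hits = [i for i, (s, e) in enumerate(spans) if s < answer_end and e > answer_start]
    if not hits:
        raise ValueError("answer span does not overlap any token")
    return hits[0] + offset, hits[-1] + offset


context = "the cat sat on the mat"
print(char_span_to_token_span(context, 15, "the mat"))
```

With real sub-word tokenizers the same idea applies, but the character span of each token has to come from the tokenizer itself; fast HuggingFace tokenizers can return it when called with `return_offsets_mapping=True`.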

In [None]:
# Maximum sequence length
MAX_SEQ_LEN = 300

def tokenize_dataset(squad_example, tokenizer=tokenizer):
    """Tokenize the text in the dataset and convert
    the start and ending positions of the answers
    from text to tokens"""
    max_len = MAX_SEQ_LEN
    context = squad_example['context']
    answer_start = squad_example['answers']['answer_start'][0]
    answer = squad_example['answers']['text'][0]
    squad_example_tokenized = tokenizer(
        context, squad_example['question'],
        padding='max_length',
        max_length=max_len,
        truncation='only_first',
    )
    token_start = len(tokenizer.tokenize(context[:answer_start + 1]))
    token_end = len(tokenizer.tokenize(answer)) + token_start

    # Add the "start_token_idx" and "end_token_idx" keys to the 
    # `squad_example_tokenized` dictionary
    squad_example_tokenized['start_token_idx'] = token_start
    squad_example_tokenized['end_token_idx'] = token_end

    return squad_example_tokenized


def filter_samples_by_max_seq_len(squad_example):
    """Filter out the samples where the answers are
    not within the first `MAX_SEQ_LEN` tokens"""
    max_len = MAX_SEQ_LEN
    answer_start = squad_example['answers']['answer_start'][0]
    answer = squad_example['answers']['text'][0]
    token_start = len(tokenizer.tokenize(squad_example['context'][:answer_start]))
    token_end = len(tokenizer.tokenize(answer)) + token_start
    return token_end < max_len
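
As an alternative to dropping long samples like the filter above does, the context can be cropped from the beginning so that the answer still fits. The sketch below shows that idea on a plain list of tokens; `crop_context` is a hypothetical helper, and it assumes the answer is given as inclusive token indices into the context.

```python
def crop_context(tokens, answer_start, answer_end, max_tokens):
    """Drop tokens from the beginning so the answer fits in `max_tokens`.

    `answer_start` and `answer_end` are inclusive token indices.
    Returns the cropped tokens and the answer indices shifted accordingly.
    """
    if answer_end - answer_start + 1 > max_tokens:
        raise ValueError("the answer alone is longer than max_tokens")
    # Nothing is dropped while the answer ends inside the first max_tokens tokens
    shift = max(0, answer_end + 1 - max_tokens)
    cropped = tokens[shift:shift + max_tokens]
    return cropped, answer_start - shift, answer_end - shift


tokens = "the cat sat on the mat".split()
cropped, start, end = crop_context(tokens, 4, 5, max_tokens=3)
print(cropped, start, end)
```

The NLP course chapter linked above takes a more general route, splitting long contexts into overlapping windows with the tokenizer arguments `stride` and `return_overflowing_tokens=True`.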

## Questions

1. In the function `tokenize_dataset`, what does the following code do? Try it outside of the function with one of the contexts and questions we extracted above. Make sure you understand all the arguments ;)
```python
    squad_example_tokenized = tokenizer(
        context, squad_example['question'],
        padding='max_length',
        max_length=max_len,
        truncation='only_first',
    )
```
2. Make sure you understand how `token_start` and `token_end` are obtained.

In [None]:
# Apply the filtering function through a filter
dataset_filtered = hf_dataset.filter(
    filter_samples_by_max_seq_len,
    num_proc=24,
)

# Display the dataset and compare with the original dataset
# to see how many samples were filtered out
dataset_filtered

In [None]:
# Apply the tokenizing function through a map
# and remove the text-containing entries of the dataset
dataset_tok = dataset_filtered.map(
    tokenize_dataset,
    remove_columns=hf_dataset['train'].column_names,
    num_proc=24,
)

# Convert the internal format of the dataset to pytorch
dataset_tok.set_format('pt')

# Display the dataset and compare its features with
# those of the original dataset
dataset_tok

## The training set

In [None]:
train_dataset = dataset_tok["train"]
train_dataset

In [None]:
# Print the sample 20299 of the training set to see how it looks
train_sample = train_dataset.select([20299])[0]
pprint(train_sample)

### The `input_ids` key (Question)

1. What is the `"input_ids"` key in the tokenized dataset? Use `tokenizer.decode()` to "de-tokenize" the sample `train_sample['input_ids']` back to text.

### The `attention_mask` key
The attention masks differentiate what is text and what is padding. More info [here](https://huggingface.co/transformers/glossary.html#attention-mask).
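
A minimal sketch of how padding and the attention mask relate, using plain Python lists. The token ids and the padding id `PAD_ID` below are made up for illustration, and `pad_with_mask` is a hypothetical helper rather than a library function.

```python
PAD_ID = 0  # assumed padding id for this sketch


def pad_with_mask(token_ids, max_length, pad_id=PAD_ID):
    """Right-pad `token_ids` to `max_length` and build the attention mask."""
    token_ids = token_ids[:max_length]  # truncate if too long
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad  # 1 = text, 0 = padding
    return input_ids, attention_mask


input_ids, attention_mask = pad_with_mask([101, 7, 8, 9, 102], max_length=8)
print(input_ids)
print(attention_mask)

# Keep only the real tokens, as mask-based indexing does on tensors
real_ids = [i for i, m in zip(input_ids, attention_mask) if m == 1]
print(real_ids)
```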

In [None]:
train_sample['attention_mask']

In [None]:
# Filter out the padding tokens by evaluating the train sample on
# train_sample['attention_mask'] == 1
context_encoded = train_sample['input_ids'][train_sample['attention_mask'] == 1]
tokenizer.decode(context_encoded)

### The `token_type_ids` key
This key differentiates two types of tokens: the ones that correspond to the context (paragraph) and the ones that correspond to the question. More info [here](https://huggingface.co/transformers/glossary.html#token-type-ids).
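
The layout of a BERT-style sequence pair can be sketched by hand as below. The helper `build_pair_inputs` is hypothetical and works on already-split tokens; it assumes the first sequence is the context and the second is the question, matching the order in which they are passed to the tokenizer in this notebook.

```python
def build_pair_inputs(first_tokens, second_tokens, max_length):
    """Lay out a sequence pair BERT-style and mark each token's segment."""
    tokens = ["[CLS]"] + first_tokens + ["[SEP]"] + second_tokens + ["[SEP]"]
    # Segment 0: [CLS] + first sequence + its [SEP]
    # Segment 1: second sequence + its [SEP]
    token_type_ids = [0] * (len(first_tokens) + 2) + [1] * (len(second_tokens) + 1)
    n_pad = max_length - len(tokens)
    if n_pad < 0:
        raise ValueError("max_length is too small for this pair")
    tokens += ["[PAD]"] * n_pad
    token_type_ids += [0] * n_pad  # padding is labelled 0 as well
    return tokens, token_type_ids


context_tokens = ["the", "cat", "sat", "on", "the", "mat"]
question_tokens = ["where", "did", "the", "cat", "sit", "?"]
tokens, token_type_ids = build_pair_inputs(context_tokens, question_tokens, max_length=18)

# Selecting segment 1 keeps the question and its closing [SEP]
question_part = [t for t, s in zip(tokens, token_type_ids) if s == 1]
print(question_part)
```

Because padding shares segment id 0 with the context, selecting `token_type_ids == 0` also returns the padding tokens, which is why `skip_special_tokens=True` is useful when decoding.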

In [None]:
# Display the token type ids of the train sample
train_sample['token_type_ids']

In [None]:
# Filter out the question tokens by evaluating the train sample on
# train_sample['token_type_ids'] == 0
paragraph_encoded = train_sample['input_ids'][train_sample['token_type_ids'] == 0]
tokenizer.decode(paragraph_encoded, skip_special_tokens=True)

In [None]:
# Filter out the context tokens by evaluating the train sample on
# train_sample['token_type_ids'] == 1
question_encoded = train_sample['input_ids'][train_sample['token_type_ids'] == 1]
tokenizer.decode(question_encoded, skip_special_tokens=True)