## CS310 Natural Language Processing
## Lab 13: Explore Question-Answering Models and Datasets

In this lab, we will practice running pretrained models on question-answering tasks. The model we demonstrate with is `distilbert-base-uncased`, which is a smaller version of BERT.

We will use the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset provided in the [Datasets](https://github.com/huggingface/datasets) library. Make sure to install the library:

```bash
pip install datasets
```

In [1]:
from pprint import pprint

### T1. Explore the SQuAD dataset

First, let's load the SQuAD dataset:

In [2]:
from datasets import load_dataset, load_metric

squad_dataset = load_dataset('./squad/')

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

The `squad_dataset` object is a `DatasetDict` that contains keys for the train and validation splits.

In [None]:
squad_dataset

To access a data instance, you can specify the split and index:

In [None]:
squad_dataset['train'][0]

We can see that the answer is located by the character index at which it starts in the passage text (here, character `515`).
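To make the character-offset convention concrete, here is a minimal sketch on a made-up record. The context, question, and answer below are hypothetical and not taken from SQuAD, and `answer_char_span` is an illustrative helper, not part of the Datasets library. The record only mirrors the `answers` layout, where `answer_start` is a character index into `context`.

```python
# A toy record in the SQuAD layout (hypothetical content, for illustration only).
toy_context = "The Eiffel Tower is located in Paris, France."
toy_answer = "Paris"
toy_record = {
    "context": toy_context,
    "question": "Where is the Eiffel Tower located?",
    "answers": {"text": [toy_answer], "answer_start": [toy_context.index(toy_answer)]},
}

def answer_char_span(record, k=0):
    """Return (start, end) character indices of the k-th answer; end is exclusive."""
    start = record["answers"]["answer_start"][k]
    end = start + len(record["answers"]["text"][k])
    return start, end

span_start, span_end = answer_char_span(toy_record)
# Slicing the context with the span gives back the answer text.
print(span_start, span_end, repr(toy_record["context"][span_start:span_end]))
```

The same slicing applied to a real training instance should reproduce its `answers["text"][0]`.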

To get a sense of what the data looks like, the following function shows a few examples picked at random from the dataset:

In [None]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(squad_dataset["train"], num_examples=3)

### T2. Preprocess the data

Before we feed the data to a model for fine-tuning, there is some preprocessing needed: 
- Tokenize the input text
- Put it in the format expected by the model
- Generate other inputs the model requires

To do all of this, we need to instantiate a tokenizer that is compatible with the model we want to use, i.e., `distilbert-base-uncased`.

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "./distilbert-base-uncased" # If loaded locally, make sure you have the model downloaded first
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

You can directly call this tokenizer on two sentences (e.g., question and context):

In [None]:
# Pass the question first and the context second, as in the rest of this lab
tokenizer('To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'Architecturally, the school has a Catholic character.')

An important step in QA is dealing with very **long documents**. If the question and context together are longer than the model's maximum input size, simply truncating the context might remove the answer.

To handle this, we will allow a long document to give several input *features*, each of length shorter than the maximum size. 

Also, in case the answer is split between two features, we allow some overlap between features, controlled by `doc_stride`.

In [None]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The allowed overlap between two parts of the context when splitting is needed.
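The effect of `doc_stride` can be sketched in plain Python. The helper `split_with_overlap` below is a hypothetical illustration, not the tokenizer's actual implementation. It assumes that the stride is the number of tokens shared by two consecutive chunks, and it ignores the question and special tokens that the real tokenizer adds to every feature.

```python
def split_with_overlap(tokens, window, stride):
    """Split `tokens` into chunks of at most `window` items that overlap by `stride` items."""
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this chunk already reaches the end of the sequence
        start += step
    return chunks

# Ten toy "tokens", windows of 4 with an overlap of 2.
for chunk in split_with_overlap(list(range(10)), window=4, stride=2):
    print(chunk)
```

Because neighbouring chunks share `stride` tokens, an answer shorter than the overlap that straddles a chunk boundary is still fully contained in at least one chunk.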

Let's examine one long example:

In [None]:
for i, example in enumerate(squad_dataset["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > 384:
        break
example = squad_dataset["train"][i]

Without truncation, its length is:

In [None]:
len(tokenizer(example['question'], example['context'])['input_ids'])

If we truncate, the resulting length is:

In [None]:
len(tokenizer(example["question"], example["context"], max_length=max_length, truncation="only_second")["input_ids"])

Note that we never want to truncate the question, so we specify `truncation="only_second"`.
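The idea behind truncating only the second sequence can be sketched without the tokenizer. The function `truncate_only_second` below is a hypothetical stand-in that works on plain token lists. It assumes a BERT-style pair layout, `[CLS] question [SEP] context [SEP]`, which uses three special tokens.

```python
def truncate_only_second(question_tokens, context_tokens, max_length, num_special=3):
    """Keep the question intact and cut the context so the pair fits in max_length.

    num_special=3 assumes the layout [CLS] question [SEP] context [SEP].
    """
    budget = max_length - len(question_tokens) - num_special
    if budget <= 0:
        raise ValueError("The question alone does not fit into max_length.")
    return question_tokens, context_tokens[:budget]

toy_question = ["who", "wrote", "it", "?"]
toy_context_tokens = [f"w{i}" for i in range(20)]
q_tokens, c_tokens = truncate_only_second(toy_question, toy_context_tokens, max_length=12)
# The question keeps all 4 tokens; the context is cut to the remaining budget.
print(len(q_tokens), len(c_tokens), len(q_tokens) + len(c_tokens) + 3)
```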

Now, we further tell the tokenizer to return the overlapping features by setting `return_overflowing_tokens=True` and `stride=doc_stride`.

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

print([len(x) for x in tokenized_example["input_ids"]])

We can look at the two features decoded:

In [None]:
for x in tokenized_example["input_ids"][:2]:
    pprint(tokenizer.decode(x))

Now, we need to find out which of the two features contains the answer, and where exactly it starts and ends.

Thankfully, the tokenizer can help us by returning the `offset_mapping` that gives the start and end character of each token:

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)

offsets = tokenized_example["offset_mapping"][0]
print(offsets[:10])

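In the above output, the very first token (`[CLS]`) has `(0, 0)` because it doesn't correspond to any part of the question or the context.

The second token is the first token of the question, so it corresponds to the span from character 0 to 3 in the question, and so on. Offsets are relative to the sequence a token comes from, so context tokens start again from character 0 of the context.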

In [None]:
token_id = tokenized_example["input_ids"][0][1]
print(tokenizer.convert_ids_to_tokens(token_id))

token_offsets = tokenized_example["offset_mapping"][0][1]
print(example["question"][token_offsets[0]:token_offsets[1]])
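To see what an offset mapping encodes, here is a minimal sketch with a toy tokenizer. `toy_tokenize_with_offsets` is a hypothetical helper that splits on words and punctuation. The real tokenizer also lowercases the text and splits words into subword pieces, but the offset convention (start inclusive, end exclusive) is the same.

```python
import re

def toy_tokenize_with_offsets(text):
    """Split on words and punctuation; return (token, (start, end)) pairs, end exclusive."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\w+|[^\w\s]", text)]

toy_text = "To whom did the Virgin Mary appear?"
toy_pairs = toy_tokenize_with_offsets(toy_text)
for token, (start, end) in toy_pairs[:3]:
    # Slicing the original text with an offset pair recovers the token.
    print(token, (start, end), toy_text[start:end])
```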

Before going on to the next step, we just have to distinguish between the offsets for `question` and those for `context`. The `sequence_ids` method can be helpful:

In [None]:
sequence_ids = tokenized_example.sequence_ids()

print('len(sequence_ids):', len(sequence_ids))
print(sequence_ids)

It returns `None` for the special tokens, `0` for tokens from the first sequence (i.e., the `question`), and `1` for tokens from the second sequence (i.e., the `context`).

This tells us that the answer span must be searched for among the tokens marked `1`.
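As a minimal sketch of how such a list can be used, the snippet below locates the first and last context token in a made-up list of sequence ids. The list and the helper `context_token_range` are hypothetical and only illustrate the idea on a ten-token feature.

```python
# Hypothetical sequence ids for a short feature: [CLS] q q q [SEP] c c c c [SEP]
toy_sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, None]

def context_token_range(sequence_ids):
    """Return (first, last) indices of tokens whose sequence id is 1; last is inclusive."""
    first = sequence_ids.index(1)
    # Search the reversed list to find the last occurrence of 1.
    last = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
    return first, last

print(context_token_range(toy_sequence_ids))
```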

Now, we are ready to use `offset_mapping` to find the position of the start and end tokens of the `answer` in a given feature.

In [None]:
answers = example["answers"]
ans_start = answers["answer_start"][0]
ans_end = ans_start + len(answers["text"][0])

print(answers)
print('ans_start:', ans_start)
print('ans_end:', ans_end)

Let `token_start_index` and `token_end_index` delimit the initial search range for the answer span, i.e., the first and last context tokens in the feature. Initialize them properly:

In [None]:
# Find the position of the first `1` token
### START YOUR CODE ###
# token_start_index = None
# First token whose sequence id is 1, i.e., the first context token in feature 0
token_start_index = tokenized_example.sequence_ids(0).index(1)
### END YOUR CODE ###

print('token_start_index:', token_start_index)
print('offsets[token_start_index]:', offsets[token_start_index])
# Expected output
# token_start_index: 16
# offsets[token_start_index]: (0, 3)

In [None]:
# Find the position of the last `1` token
### START YOUR CODE ###
# token_end_index = None
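# Last token whose sequence id is 1, i.e., the last context token in feature 0
seq_ids = tokenized_example.sequence_ids(0)
token_end_index = len(seq_ids) - 1 - seq_ids[::-1].index(1)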
### END YOUR CODE ###

print('token_end_index:', token_end_index)
print('offsets[token_end_index]:', offsets[token_end_index])
# Expected output
# token_end_index: 382
# offsets[token_end_index]: (1665, 1669)

First, detect whether `ans_start` and `ans_end` are within the initial search range.

If they are, then find the indices of the tokens whose offsets encompass `ans_start` and `ans_end`, respectively.

In [None]:
offsets = tokenized_example["offset_mapping"][0]
token_start_index = 16
token_end_index = 382 # reset

# Detect if the answer is within the initial search range
### START YOUR CODE ###
# if None: # Change `None` to your condition
if not (offsets[token_start_index][0] <= ans_start and offsets[token_end_index][1] >= ans_end):
    print('The answer is not in this feature.')
### END YOUR CODE ###
else:
    # Find the start and end indices of the tokens, whose offsets encompass the ans_start and ans_end
    ### START YOUR CODE ###
    start_position = None
    end_position = None
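    for i in range(token_start_index, token_end_index + 1):  # context tokens only
        start, end = offsets[i]
        if start <= ans_start < end:
            start_position = i  # token containing the first answer character
        if start < ans_end <= end:
            end_position = i  # token containing the last answer character (inclusive)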
    ### END YOUR CODE ###

# Test
print(start_position, end_position)
print(offsets[start_position], offsets[end_position])

# Expected output
# 23 26

We can double check that it is indeed the answer:

In [None]:
print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
print(answers["text"][0])
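As a recap, the steps of this section can be put together in one function. The sketch below runs on a hand-made feature: the offsets, sequence ids, and text are hypothetical, and `char_span_to_token_span` is an illustrative helper rather than a library function. Returning `None` when the answer is not fully inside the feature is an assumption made here for simplicity.

```python
def char_span_to_token_span(offsets, sequence_ids, ans_start, ans_end):
    """Map a character span [ans_start, ans_end) to inclusive token indices.

    Returns None when the span is not fully covered by the context tokens.
    """
    context_idx = [i for i, s in enumerate(sequence_ids) if s == 1]
    first, last = context_idx[0], context_idx[-1]
    if ans_start < offsets[first][0] or ans_end > offsets[last][1]:
        return None  # the answer lies (partly) outside this feature
    start_tok = end_tok = None
    for i in context_idx:
        start, end = offsets[i]
        if start <= ans_start < end:
            start_tok = i  # token containing the first answer character
        if start < ans_end <= end:
            end_tok = i  # token containing the last answer character
    return start_tok, end_tok

# Hypothetical feature: [CLS] who ? [SEP] it was marie curie [SEP]
toy_ctx = "it was marie curie"
toy_seq_ids = [None, 0, 0, None, 1, 1, 1, 1, None]
toy_offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 2), (3, 6), (7, 12), (13, 18), (0, 0)]
toy_ans_start = toy_ctx.index("marie curie")
toy_ans_end = toy_ans_start + len("marie curie")
print(char_span_to_token_span(toy_offsets, toy_seq_ids, toy_ans_start, toy_ans_end))
```

Restricting the loop to tokens with sequence id `1` matters because offsets are relative to each sequence, so a question token could otherwise match the answer's character positions by coincidence.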