## CS310 Natural Language Processing
## Lab 13: Explore Question-Answering Models and Datasets

In this lab, we will practice running pretrained models on question-answering tasks. The model we demonstrate with is `distilbert-base-uncased`, which is a smaller version of BERT.

We will use the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset provided in the [Datasets](https://github.com/huggingface/datasets) library. Make sure to install the library:

```bash
pip install datasets
```

In [1]:
from pprint import pprint

### T1. Explore the SQuAD dataset

First, let's load the SQuAD dataset:

In [2]:
from datasets import load_dataset, load_metric

squad_dataset = load_dataset('./squad')

Generating train split: 100%|██████████| 87599/87599 [00:00<00:00, 195998.19 examples/s]
Generating validation split: 100%|██████████| 10570/10570 [00:00<00:00, 184188.46 examples/s]


The `squad_dataset` object is a `DatasetDict` that contains keys for the train and validation splits.

In [3]:
squad_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

To access a data instance, you can specify the split and index:

In [4]:
squad_dataset['train'][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

We can see that the answer is indicated by its span start index (at character `515`) in the passage text.
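As a minimal sketch of how `answer_start` is meant to be used, the snippet below recovers the answer text by slicing the context. The record is a shortened, hypothetical stand-in for the example above (its start index is recomputed for the shorter context), and `answer_span` is an illustrative helper, not part of the Datasets library.

```python
def answer_span(record):
    """Return the (start, end) character span of the first answer in a SQuAD-style record."""
    text = record["answers"]["text"][0]
    start = record["answers"]["answer_start"][0]
    return start, start + len(text)  # end index is exclusive

# Shortened context, so the start index differs from the 515 shown above.
context = ("It is a replica of the grotto at Lourdes, France where the Virgin Mary "
           "reputedly appeared to Saint Bernadette Soubirous in 1858.")
answer_text = "Saint Bernadette Soubirous"
record = {
    "context": context,
    "answers": {"text": [answer_text], "answer_start": [context.index(answer_text)]},
}

start, end = answer_span(record)
extracted = record["context"][start:end]
print(extracted)
```

The end index is not stored in the dataset; it is derived from the start index and the length of the answer text.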

To get a sense of what the data looks like, the following function shows some examples picked randomly from the dataset.

In [5]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Sample distinct row indices without replacement
    picks = random.sample(range(len(dataset)), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [6]:
show_random_elements(squad_dataset["train"], num_examples=3)

Unnamed: 0,id,title,context,question,answers
0,56f974f69e9bad19000a0942,Brain,"Brain tissue consumes a large amount of energy in proportion to its volume, so large brains place severe metabolic demands on animals. The need to limit body weight in order, for example, to fly, has apparently led to selection for a reduction of brain size in some species, such as bats. Most of the brain's energy consumption goes into sustaining the electric charge (membrane potential) of neurons. Most vertebrate species devote between 2% and 8% of basal metabolism to the brain. In primates, however, the percentage is much higher—in humans it rises to 20–25%. The energy consumption of the brain does not vary greatly over time, but active regions of the cerebral cortex consume somewhat more energy than inactive regions; this forms the basis for the functional brain imaging methods PET, fMRI, and NIRS. The brain typically gets most of its energy from oxygen-dependent metabolism of glucose (i.e., blood sugar), but ketones provide a major alternative source, together with contributions from medium chain fatty acids (caprylic and heptanoic acids), lactate, acetate, and possibly amino acids.",The energy used for metabolism of the brain in humans is what percentage?,"{'text': ['20–25%'], 'answer_start': [559]}"
1,572841bd3acd2414000df7db,LaserDisc,"LaserDiscs potentially had a much longer lifespan than videocassettes. Because the discs were read optically instead of magnetically, no physical contact needs to be made between the player and the disc, except for the player's clamp that holds the disc at its center as it is spun and read. As a result, playback would not wear the information-bearing part of the discs, and properly manufactured LDs would theoretically last beyond one's lifetime. By contrast, a VHS tape held all of its picture and sound information on the tape in a magnetic coating which is in contact with the spinning heads on the head drum, causing progressive wear with each use (though later in VHS's lifespan, engineering improvements allowed tapes to be made and played back without contact). Also, the tape was thin and delicate, and it was easy for a player mechanism, especially on a low quality or malfunctioning model, to mishandle the tape and damage it by creasing it, frilling (stretching) its edges, or even breaking it.",Where do VHS tapes store their information?,"{'text': ['magnetic coating'], 'answer_start': [537]}"
2,5709ee056d058f1900182c37,United_States_dollar,"The monetary base consists of coins and Federal Reserve Notes in circulation outside the Federal Reserve Banks and the U.S. Treasury, plus deposits held by depository institutions at Federal Reserve Banks. The adjusted monetary base has increased from approximately 400 billion dollars in 1994, to 800 billion in 2005, and over 3000 billion in 2013. The amount of cash in circulation is increased (or decreased) by the actions of the Federal Reserve System. Eight times a year, the 12-person Federal Open Market Committee meet to determine U.S. monetary policy. Every business day, the Federal Reserve System engages in Open market operations to carry out that monetary policy. If the Federal Reserve desires to increase the money supply, it will buy securities (such as U.S. Treasury Bonds) anonymously from banks in exchange for dollars. Conversely, it will sell securities to the banks in exchange for dollars, to take dollars out of circulation.",What was the monetary base value in 1994?,"{'text': ['400 billion dollars'], 'answer_start': [266]}"


### T2. Preprocess the data

Before we feed the data to a model for fine-tuning, there is some preprocessing needed: 
- Tokenize the input text
- Put it in the format expected by the model
- Generate other inputs the model requires

To do all of this, we need to instantiate a tokenizer that is compatible with the model we want to use, i.e., `distilbert-base-uncased`.

In [7]:
from transformers import AutoTokenizer

model_checkpoint = "./distilbert-base-uncased" # If loaded locally, make sure you have the model downloaded first
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

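You can call this tokenizer directly on a pair of texts, such as a question and a context. Note that the call below passes a context sentence first and the question second, whereas the rest of this lab passes the question first: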

In [8]:
tokenizer('Architecturally, the school has a Catholic character.', 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?')

{'input_ids': [101, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2839, 1012, 102, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
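To make the structure of this output easier to read, here is a small sketch that splits the `input_ids` printed above at the separator tokens. It assumes that ids `101` and `102` are `[CLS]` and `[SEP]`, as in the standard uncased BERT vocabulary; the variable names are illustrative.

```python
CLS_ID, SEP_ID = 101, 102  # assumed ids of [CLS] and [SEP] in the uncased BERT vocabulary

# Copied from the tokenizer output above
input_ids = [101, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2839, 1012, 102,
             2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999,
             10223, 26371, 2605, 1029, 102]

# Each of the two texts is terminated by a [SEP] token
sep_positions = [i for i, tok in enumerate(input_ids) if tok == SEP_ID]
first_segment = input_ids[1:sep_positions[0]]
second_segment = input_ids[sep_positions[0] + 1:sep_positions[1]]

print(sep_positions, len(first_segment), len(second_segment))
```

Under that assumption, the encoded pair has the layout `[CLS] first text [SEP] second text [SEP]`.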

An important step in QA is dealing with very **long documents**. If an input is longer than the maximum input size of the model, simply truncating the context might remove the answer.

To handle this, we let a long document produce several input *features*, each no longer than the maximum size.

Also, in case the answer would be split between two features, we allow some overlap between consecutive features, controlled by `doc_stride`.

In [9]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The allowed overlap between two parts of the context when it has to be split.

Let's examine one long example:

In [10]:
for i, example in enumerate(squad_dataset["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > max_length:
        break
example = squad_dataset["train"][i]

Without truncation, its length is:

In [14]:
print(len(tokenizer(example['question'], example['context'])['input_ids']))
print(example['question'])
print(example['context'])

396
How many wins does the Notre Dame men's basketball team have?
The men's basketball team has over 1,600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 NCAA tournaments. Former player Austin Carr holds the record for most points scored in a single game of the tournament with 61. Although the team has never won the NCAA Tournament, they were named by the Helms Athletic Foundation as national champions twice. The team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending UCLA's record 88-game winning streak in 1974. The team has beaten an additional eight number-one teams, and those nine wins rank second, to UCLA's 10, all-time in wins against the top team. The team plays in newly renovated Purcell Pavilion (within the Edmund P. Joyce Center), which reopened for the beginning of the 2009–2010 season. The team is coached by Mike Brey, who, as of the 2014–15 season, his fifteenth at Notre Dame, has achieved 

If we truncate, the resulting length is:

In [12]:
len(tokenizer(example["question"], example["context"], max_length=max_length, truncation="only_second")["input_ids"])

384

Note that we never want to truncate the question, so we specify `truncation='only_second'`.

Now, we further tell the tokenizer to return the overlapping features by setting `return_overflowing_tokens=True` and `stride=doc_stride`.

In [15]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

print([len(x) for x in tokenized_example["input_ids"]])

[384, 157]
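The two lengths above are consistent with a sliding window over the context tokens. The sketch below is a plain-Python illustration, not the tokenizer's actual implementation: `split_with_overlap` is a hypothetical helper, and the overhead of 17 tokens assumes that the question takes 14 tokens plus `[CLS]` and two `[SEP]` tokens.

```python
def split_with_overlap(tokens, window, overlap):
    """Split `tokens` into windows of at most `window` items; consecutive windows share `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window reached the end of the sequence
        start += window - overlap
    return chunks

max_length, doc_stride = 384, 128
overhead = 14 + 3  # assumed: 14 question tokens plus [CLS] and two [SEP]
context_tokens = list(range(396 - overhead))  # stand-in ids for the context tokens

chunks = split_with_overlap(context_tokens, max_length - overhead, doc_stride)
feature_lengths = [len(chunk) + overhead for chunk in chunks]
print(feature_lengths)
```

Under these assumptions the sketch reproduces `[384, 157]`, and the last 128 context tokens of the first window reappear at the start of the second one.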


We can look at the two features decoded:

In [16]:
for x in tokenized_example["input_ids"][:2]:
    pprint(tokenizer.decode(x))

("[CLS] how many wins does the notre dame men's basketball team have? [SEP] "
 "the men's basketball team has over 1, 600 wins, one of only 12 schools who "
 'have reached that mark, and have appeared in 28 ncaa tournaments. former '
 'player austin carr holds the record for most points scored in a single game '
 'of the tournament with 61. although the team has never won the ncaa '
 'tournament, they were named by the helms athletic foundation as national '
 'champions twice. the team has orchestrated a number of upsets of number one '
 "ranked teams, the most notable of which was ending ucla's record 88 - game "
 'winning streak in 1974. the team has beaten an additional eight number - one '
 "teams, and those nine wins rank second, to ucla's 10, all - time in wins "
 'against the top team. the team plays in newly renovated purcell pavilion ( '
 'within the edmund p. joyce center ), which reopened for the beginning of the '
 '2009 – 2010 season. the team is coached by mike brey, who,

Now, we need to find out in which of the two features the answer is, and where exactly it starts and ends.

Thankfully, the tokenizer can help us by returning the `offset_mapping` that gives the start and end character of each token:

In [17]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)

offsets = tokenized_example["offset_mapping"][0]
print(offsets[:10])

[(0, 0), (0, 3), (4, 8), (9, 13), (14, 18), (19, 22), (23, 28), (29, 33), (34, 37), (37, 38)]


In the above output, the very first token (`[CLS]`) has `(0, 0)` because it doesn't correspond to any part of the question or context.

The second token corresponds to the span from character 0 to 3 in the question, and so on.

In [18]:
token_id = tokenized_example["input_ids"][0][1]
print(tokenizer.convert_ids_to_tokens(token_id))

token_offsets = tokenized_example["offset_mapping"][0][1]
print(example["question"][token_offsets[0]:token_offsets[1]])

how
How
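To illustrate what an offset mapping encodes, here is a minimal sketch with a toy word-and-punctuation tokenizer built on `re`. It is only a stand-in for the WordPiece tokenizer used in this lab (`simple_offsets` is a hypothetical helper), but the idea is the same: each token carries the character span it came from, so slicing the original text with that span returns the token's surface form.

```python
import re

def simple_offsets(text):
    """Return (token, (start, end)) pairs for a toy word/punctuation tokenizer."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

question = "How many wins does the Notre Dame men's basketball team have?"
pairs = simple_offsets(question)
spans = [span for _, span in pairs]
print(spans[:9])

# Slicing the original text with a span gives back the original (cased) token
for token, (start, end) in pairs[:3]:
    print(token, "->", question[start:end])
```

For this question, the first nine spans of the toy tokenizer coincide with the offsets printed earlier (after the `(0, 0)` entry for `[CLS]`), because none of these words is split into sub-word pieces.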


Before going on to the next step, we just have to distinguish between the offsets for `question` and those for `context`. The `sequence_ids` method can be helpful:

In [19]:
sequence_ids = tokenized_example.sequence_ids()

print('len(sequence_ids):', len(sequence_ids))
print(sequence_ids)

len(sequence_ids): 384
[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

It returns `None` for the special tokens, `0` for tokens from the first sequence (i.e., the `question`), and `1` for tokens from the second sequence (i.e., the `context`).

This tells us that we need to look for the answer span only among the tokens labelled `1`.

Now, we are ready to use `offset_mapping` to find the position of the start and end tokens of the `answer` in a given feature.

In [20]:
answers = example["answers"]
ans_start = answers["answer_start"][0]
ans_end = ans_start + len(answers["text"][0])

print(answers)
print('ans_start:', ans_start)
print('ans_end:', ans_end)

{'text': ['over 1,600'], 'answer_start': [30]}
ans_start: 30
ans_end: 40


Let `token_start_index` and `token_end_index` delimit the initial search range for the answer span, i.e., the first and last context tokens. Initialize them properly:

In [21]:
# Find the position of the first `1` token
### START YOUR CODE ###
token_start_index = sequence_ids.index(1) 
### END YOUR CODE ###

print('token_start_index:', token_start_index)
print('offsets[token_start_index]:', offsets[token_start_index])
# Expected output
# token_start_index: 16
# offsets[token_start_index]: (0, 3)

token_start_index: 16
offsets[token_start_index]: (0, 3)


In [22]:
# Find the position of the last `1` token
### START YOUR CODE ###
token_end_index = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
### END YOUR CODE ###

print('token_end_index:', token_end_index)
print('offsets[token_end_index]:', offsets[token_end_index])
# Expected output
# token_end_index: 382
# offsets[token_end_index]: (1665, 1669)

token_end_index: 382
offsets[token_end_index]: (1665, 1669)
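The two lookups above can be wrapped into one helper. This is only a sketch: `context_token_range` is a hypothetical name, and the toy `sequence_ids` list is built by hand to mimic the layout of the first feature (one special token, 14 question tokens, one special token, 367 context tokens, one special token).

```python
def context_token_range(sequence_ids, context_id=1):
    """Return the indices of the first and last tokens whose sequence id equals `context_id`."""
    if context_id not in sequence_ids:
        raise ValueError("no context tokens in this feature")
    first = sequence_ids.index(context_id)
    # Search the reversed list, then map the position back to the original indexing
    last = len(sequence_ids) - 1 - sequence_ids[::-1].index(context_id)
    return first, last

# Hand-built stand-in for tokenized_example.sequence_ids()
toy_ids = [None] + [0] * 14 + [None] + [1] * 367 + [None]
first, last = context_token_range(toy_ids)
print(first, last)
```

With this layout the helper returns the same indices as the two cells above, `16` and `382`.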


First, detect whether `ans_start` and `ans_end` are within the initial search range.

If they are, then find the start and end indices of the tokens whose offsets encompass `ans_start` and `ans_end`, respectively.

In [55]:
offsets = tokenized_example["offset_mapping"][0]
token_start_index = 16
token_end_index = 382 # reset

# Detect if the answer is within the initial search range
### START YOUR CODE ###
if not (offsets[token_start_index][0] <= ans_start and
        ans_end <= offsets[token_end_index][1]):
    print('The answer is not in this feature.')
### END YOUR CODE ###
else:
    # Find the start and end indices of the tokens whose offsets encompass ans_start and ans_end.
    # Search only the context tokens: question tokens carry offsets relative to the question text,
    # so scanning all offsets could match a question token by accident.
    ### START YOUR CODE ###
    context_range = range(token_start_index, token_end_index + 1)
    start_position = next(i for i in context_range if offsets[i][0] <= ans_start < offsets[i][1])
    end_position = next(i for i in context_range if offsets[i][0] < ans_end <= offsets[i][1])
    ### END YOUR CODE ###

# Test
print(start_position, end_position)
print(offsets[start_position], offsets[end_position])

# Expected output
# 23 26

23 26
(30, 34) (37, 40)


We can double check that it is indeed the answer:

In [56]:
print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
print(answers["text"][0])

over 1, 600
over 1,600
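Putting the steps of this task together, the sketch below labels a single feature with its answer token positions. It is a plain-Python illustration on hand-made toy offsets, not the lab's reference solution: `locate_answer` is a hypothetical helper, and returning the `[CLS]` index when the answer is not inside the feature is a common convention assumed here, not something stated in this lab.

```python
def locate_answer(offsets, sequence_ids, ans_start, ans_end, cls_index=0):
    """Return (start_token, end_token) of the answer, or (cls_index, cls_index) if it is not in this feature."""
    context_idx = [i for i, s in enumerate(sequence_ids) if s == 1]
    if not context_idx:
        return cls_index, cls_index
    first, last = context_idx[0], context_idx[-1]
    # The answer must lie entirely inside the characters covered by this feature
    if ans_start < offsets[first][0] or ans_end > offsets[last][1]:
        return cls_index, cls_index
    # Scan only context tokens; question offsets refer to a different text
    start_pos = next((i for i in range(first, last + 1)
                      if offsets[i][0] <= ans_start < offsets[i][1]), cls_index)
    end_pos = next((i for i in range(last, first - 1, -1)
                    if offsets[i][0] < ans_end <= offsets[i][1]), cls_index)
    return start_pos, end_pos

# Toy feature: [CLS] q q [SEP] the quick blue whale [SEP]
offsets_a = [(0, 0), (0, 3), (4, 8), (0, 0), (0, 3), (4, 9), (10, 14), (15, 20), (0, 0)]
seq_ids_a = [None, 0, 0, None, 1, 1, 1, 1, None]
# Toy overflow feature that only covers context characters 10 to 20
offsets_b = [(0, 0), (0, 3), (4, 8), (0, 0), (10, 14), (15, 20), (0, 0)]
seq_ids_b = [None, 0, 0, None, 1, 1, None]

# Answer "quick blue" occupies context characters 4 to 14
in_feature = locate_answer(offsets_a, seq_ids_a, ans_start=4, ans_end=14)
not_in_feature = locate_answer(offsets_b, seq_ids_b, ans_start=4, ans_end=14)
print(in_feature, not_in_feature)
```

In the first toy feature, the question token at index 2 has offsets `(4, 8)`, which also contain character 4; restricting the scan to context tokens is what keeps it from being picked as the answer start.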
