Use the SQuAD dataset for extractive question answering

In [4]:
from datasets import load_dataset

raw_datasets = load_dataset("squad")

Found cached dataset squad (C:/Users/Raj/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)



In [5]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [6]:
print("Context: ", raw_datasets["train"][0]["context"])
print("Question: ", raw_datasets["train"][0]["question"])
print("Answer: ", raw_datasets["train"][0]["answers"])

Context:  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer:  {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}
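
The `answer_start` value is a character offset into the context, so slicing the context from that offset for the length of the answer text should recover the answer. A minimal sketch of that relationship follows. The shortened `context` string and the use of `str.index` to locate the answer are illustrative assumptions, not values taken from the dataset.

```python
# Shortened stand-in for a SQuAD context (illustrative, not the full dataset entry).
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer_text = "Saint Bernadette Soubirous"

# SQuAD stores the character offset of the answer; here we compute it ourselves.
answer_start = context.index(answer_text)
answer_end = answer_start + len(answer_text)  # exclusive end offset

# Slicing with the two character offsets gives back the answer text.
recovered = context[answer_start:answer_end]
print(recovered)
```

The same start/end arithmetic is what the labeling code later in this notebook relies on when it converts character offsets to token positions.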


In [7]:
# During training there is only one possible answer per example. We can check this:
raw_datasets["train"].filter(lambda example: len(example["answers"]["text"]) > 1)

Loading cached processed dataset at C:\Users\Raj\.cache\huggingface\datasets\squad\plain_text\1.0.0\d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453\cache-8e85a9cedb11e432.arrow


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

In [8]:
# For evaluation, however, each sample has several possible answers, which may be identical or different.
print("answers: ", raw_datasets["validation"][0]["answers"])
print("answers: ", raw_datasets["validation"][2]["answers"])

answers:  {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}
answers:  {'text': ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], 'answer_start': [403, 355, 355]}


In [9]:
# look at the sample at index 2 in the validation set for an example of multiple answers
print("Context: ", raw_datasets["validation"][2]["context"])
print("Question: ", raw_datasets["validation"][2]["question"])
print("Answer: ", raw_datasets["validation"][2]["answers"])

Context:  Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
Question:  Where did Super Bowl 50 take place?
Answer:  {'text': ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], 'answer_start': [403, 355,

Preprocessing the data

In [10]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) # defaults to fast tokenizer

In [11]:
tokenizer.is_fast

True

In [12]:
# we can pass the tokenizer the question and the context together, and it will insert the special tokens to form a sequence like this: [CLS] question [SEP] context [SEP]
# let's double-check this
context = raw_datasets["train"][0]["context"]
question = raw_datasets["train"][0]["question"]

inputs = tokenizer(question, context)
tokenizer.decode(inputs["input_ids"])

'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP]'

In [13]:
# Deal with long contexts by sliding a window over them: one long context is split into several overlapping training features
inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second", # only truncate the context
    stride=50,
    return_overflowing_tokens=True,
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basi [SEP]
[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin [SEP]
[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Next to the Main Building is the B
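
The overlap between consecutive features can be pictured with a toy model. The sketch below is an assumption-laden simplification, not the tokenizer's implementation: `sliding_windows` is a hypothetical helper, `window` stands for the number of context tokens that fit in one feature, and `stride` is treated as the number of tokens shared by neighbouring chunks.

```python
def sliding_windows(tokens, window, stride):
    """Split `tokens` into chunks of at most `window` items that overlap by `stride`."""
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride  # how far each new chunk advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the current chunk already reaches the end of the sequence
        start += step
    return chunks


# Ten stand-in token ids, windows of 4 tokens, 2 tokens of overlap.
chunks = sliding_windows(list(range(10)), window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last `stride` tokens of the previous one, which is why an answer near a chunk boundary can still appear whole in at least one feature.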

In [14]:
# we will use the return_offsets_mapping=True to get the mapping between the tokens and the characters in the context
inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second", # only truncate the context
    stride=50,
    return_overflowing_tokens=True, # return the overflowing tokens
    return_offsets_mapping=True, # return the mapping between the tokens and the context
)
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

In [42]:
inputs["overflow_to_sample_mapping"]

[0, 0, 0, 0]

In [43]:
inputs["offset_mapping"]

[[(0, 0),
  (0, 2),
  (3, 7),
  (8, 11),
  (12, 15),
  (16, 22),
  (23, 27),
  (28, 37),
  (38, 44),
  (45, 47),
  (48, 52),
  (53, 55),
  (56, 59),
  (59, 63),
  (64, 70),
  (70, 71),
  (0, 0),
  (0, 13),
  (13, 15),
  (15, 16),
  (17, 20),
  (21, 27),
  (28, 31),
  (32, 33),
  (34, 42),
  (43, 52),
  (52, 53),
  (54, 56),
  (56, 58),
  (59, 62),
  (63, 67),
  (68, 76),
  (76, 77),
  (77, 78),
  (79, 83),
  (84, 88),
  (89, 91),
  (92, 93),
  (94, 100),
  (101, 107),
  (108, 110),
  (111, 114),
  (115, 121),
  (122, 126),
  (126, 127),
  (128, 139),
  (140, 142),
  (143, 148),
  (149, 151),
  (152, 155),
  (156, 160),
  (161, 169),
  (170, 173),
  (174, 180),
  (181, 183),
  (183, 184),
  (185, 187),
  (188, 189),
  (190, 196),
  (197, 203),
  (204, 206),
  (207, 213),
  (214, 218),
  (219, 223),
  (224, 226),
  (226, 229),
  (229, 232),
  (233, 237),
  (238, 241),
  (242, 248),
  (249, 250),
  (250, 251),
  (251, 254),
  (254, 256),
  (257, 259),
  (260, 262),
  (263, 264),
  (264, 2

In [15]:
# Above, the mapping is all zeros because the inputs contain only one sample. With more than one sample, overflow_to_sample_mapping tells us which sample each feature comes from.
inputs = tokenizer(
    raw_datasets["train"][2:6]["question"],
    raw_datasets["train"][2:6]["context"],
    max_length=100,
    truncation="only_second", # only truncate the context
    stride=50,
    return_overflowing_tokens=True, # return the overflowing tokens
    return_offsets_mapping=True, # return the mapping between the tokens and the context
)

print(f"The 4 examples gave {len(inputs['input_ids'])} features instead of 4.")
print(f"Here is where each overflow to sample mapping comes from: {inputs['overflow_to_sample_mapping']}")

The 4 examples gave 19 features instead of 4.
Here is where each overflow to sample mapping comes from: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]
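
To read the mapping at a glance, one option is to count how many features each sample produced. This is a small sketch using only the standard library. The list is copied from the output above, and `features_per_sample` is a name introduced here for illustration.

```python
from collections import Counter

# Mapping printed above: one entry per feature, holding the index of its source sample.
overflow_to_sample_mapping = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]

# Count the features generated by each of the 4 samples.
features_per_sample = Counter(overflow_to_sample_mapping)
for sample_idx, n_features in sorted(features_per_sample.items()):
    print(f"sample {sample_idx}: {n_features} features")
```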


In [45]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

In [31]:
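# sequence_ids tells us which part of the input each token belongs to: 0 for the question, 1 for the context, None for special tokens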
inputs.sequence_ids(0)

[None,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 None,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 None]
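
Given a list like the one above, the context is the run of positions marked 1. A minimal sketch of locating its first and last token index follows. `find_context_bounds` is a hypothetical helper and the short `sequence_ids` list is made up for illustration.

```python
def find_context_bounds(sequence_ids):
    """Return (first, last) token index of the context, i.e. the positions marked 1."""
    context_start = sequence_ids.index(1)
    # Search the reversed list to find the last position marked 1.
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
    return context_start, context_end


# Toy layout: [CLS] question x3 [SEP] context x4 [SEP] [PAD]
sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, None, None]
context_start, context_end = find_context_bounds(sequence_ids)
print(context_start, context_end)
```

Relying on `sequence_ids` rather than `token_type_ids` keeps the logic usable with models whose tokenizers do not return token type IDs.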

In [16]:
# determine the start and end token positions in the context for each answer
# we will use the sequence_ids method to get the mapping between the tokens and the context since not all models have token_type_ids, for example distilbert

answers = raw_datasets["train"][2:6]["answers"]
start_positions, end_positions = [], []

for i, offset in enumerate(inputs["offset_mapping"]):
    sample_idx = inputs["overflow_to_sample_mapping"][i]
    answer = answers[sample_idx]
    start_char = answer["answer_start"][0]
    end_char = answer["answer_start"][0] + len(answer["text"][0])
    sequence_ids = inputs.sequence_ids(i)
    
    # find the start and end of the context
    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx
    while sequence_ids[idx] == 1:
        idx += 1
    context_end = idx - 1

    # if the answer is out of the span of the context, label is (0, 0)
    if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
        start_positions.append(0)
        end_positions.append(0) # no answer
    else:
        # otherwise, get the start and end token positions in the context
        idx = context_start
        while idx <= context_end and offset[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)

        idx = context_end
        while idx >= context_start and offset[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)

start_positions, end_positions

([83, 51, 19, 0, 0, 64, 27, 0, 34, 0, 0, 0, 67, 34, 0, 0, 0, 0, 0],
 [85, 53, 21, 0, 0, 70, 33, 0, 40, 0, 0, 0, 68, 35, 0, 0, 0, 0, 0])
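
The character-to-token conversion used above can be checked on a toy input. The sketch below reuses the same two scanning loops, but the whitespace "tokenizer", the fake question offset, and the helper name `char_span_to_token_span` are assumptions made for illustration, not the behaviour of the real tokenizer.

```python
import re


def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span to (start_token, end_token); (0, 0) if not fully inside the chunk."""
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    return start_token, idx + 1


context = "Next to the Main Building is the Basilica"
answer = "the Main Building"
start_char = context.index(answer)
end_char = start_char + len(answer)

# Toy offsets: [CLS], one fake question token, [SEP], whitespace-split context tokens, [SEP].
context_offsets = [m.span() for m in re.finditer(r"\S+", context)]
offsets = [(0, 0), (0, 5), (0, 0)] + context_offsets + [(0, 0)]
context_start, context_end = 3, 3 + len(context_offsets) - 1

span = char_span_to_token_span(offsets, context_start, context_end, start_char, end_char)
print(span)

# A chunk that starts after the answer (here at "is") cannot contain it, so it is labeled (0, 0).
late_offsets = [(0, 0), (0, 5), (0, 0)] + context_offsets[5:] + [(0, 0)]
late_span = char_span_to_token_span(late_offsets, 3, 3 + len(context_offsets[5:]) - 1, start_char, end_char)
print(late_span)
```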

In [17]:
# let's look at a few results to verify that our approach is correct. For the first feature we find (83, 85) as labels, so let's compare the theoretical answer with the decoded span of tokens from 83 to 85 (inclusive)
idx = 0
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]["text"][0]

start = start_positions[idx]
end = end_positions[idx]
labeled_answer = tokenizer.decode(inputs["input_ids"][idx][start : end+1])

print(f"Theoretical answer: {answer}, labeled answer: {labeled_answer}")

Theoretical answer: the Main Building, labeled answer: the Main Building


In [18]:
# let's do the same for the feature at index 5; this time we decode the whole feature to check that it contains the answer
idx = 5
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]["text"][0]

start = start_positions[idx]
end = end_positions[idx]
labeled_answer = tokenizer.decode(inputs["input_ids"][idx])

print(f"Theoretical answer: {answer}, decoded feature: {labeled_answer}")

Theoretical answer: a Marian place of prayer and reflection, decoded feature: [CLS] What is the Grotto at Notre Dame? [SEP] it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette So [SEP]


In [None]:
# TODO: Adapt the code above to the XLNet tokenizer. Add padding = True. Be aware that the CLS token is at the end of the sequence for XLNet. (May not be at the 0 position with padding applied)

In [20]:
# Define a function based on the above learning to preprocess the training data for the entire dataset

max_length = 384 # max length of the input tokens
doc_stride = 128 # the stride when splitting up a long document into chunks

def prepare_train_features(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second", # only truncate the context
        stride=doc_stride,  # we use doc_stride here to generate features from overlapping chunks of the context
        return_overflowing_tokens=True, 
        return_offsets_mapping=True,  # we need to get the mapping between tokens and the context
        padding="max_length", # pad the input to max_length
    )

    # since one example might give us several features if it has a long context, we need a map from a feature to its corresponding example. This key gives us just that
    sample_mapping = inputs.pop("overflow_to_sample_mapping")
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions, end_positions = [], []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_mapping[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # if the answer is out of the span of the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # otherwise, get the start and end token positions in the context
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [21]:
# apply this function to the entire training dataset with the dataset.map method
train_dataset = raw_datasets["train"].map(
    prepare_train_features,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
len(raw_datasets["train"]), len(train_dataset)


(87599, 88729)

In [22]:
# In preprocessing we could tell the context and the question apart using sequence IDs. In postprocessing, however, we won't have any way to know which part of the input IDs corresponds to the context and which to the question. So we set every offset that does not belong to the context (question and special tokens) to None

def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second", # only truncate the context
        stride=doc_stride,  # we use doc_stride here to generate features from overlapping chunks of the context
        return_overflowing_tokens=True,
        return_offsets_mapping=True,  # we need to get the mapping between tokens and the context
        padding="max_length", # pad the input to max_length
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            (o if sequence_ids[k] == 1 else None)
            for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs
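
The masking step inside the loop can be seen in isolation on made-up values. In this sketch `mask_non_context_offsets` is a hypothetical helper, and the `sequence_ids` and `offsets` lists are invented for illustration.

```python
def mask_non_context_offsets(offsets, sequence_ids):
    """Keep an offset only where the token belongs to the context (sequence id 1)."""
    return [o if s == 1 else None for o, s in zip(offsets, sequence_ids)]


# Toy feature: [CLS] question x2 [SEP] context x3 [SEP]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 5), (6, 11), (0, 0), (0, 6), (7, 14), (15, 18), (0, 0)]

masked = mask_non_context_offsets(offsets, sequence_ids)
print(masked)
```

After masking, a simple `is None` test is enough to reject any predicted start or end position that falls outside the context.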

In [24]:
# apply this function to the entire validation dataset with the dataset.map method
validation_dataset = raw_datasets["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)
len(raw_datasets["validation"]), len(validation_dataset)

Loading cached processed dataset at C:\Users\Raj\.cache\huggingface\datasets\squad\plain_text\1.0.0\d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453\cache-ed067dd8935f35cf.arrow


(10570, 10822)

Fine tuning the model with Trainer API

In [82]:
# the hardest thing would be to write the compute_metrics() function. Since we padded all the samples to the maximum length we set, there is no data collator to define, so this metric computation is really the only thing to worry about.

# The difficult part will be to post process the model predictions into spans of text in the original examples; once we have done that, the metric from the HF Datasets library will do most of the work for us.

# generate some prediction on a small part of the validation set
small_eval_set = raw_datasets["validation"].select(range(100))
trained_checkpoint = "distilbert-base-cased-distilled-squad"

tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)
eval_set = small_eval_set.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)

Loading cached processed dataset at C:\Users\Raj\.cache\huggingface\datasets\squad\plain_text\1.0.0\d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453\cache-1ae2d01aebf3a981.arrow


In [83]:
# now that the preprocessing is done, we change the tokenizer back to the one we originally picked
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [84]:
# remove columns from our eval_set that are not expected by the model, build a batch with all of that small validation set, and pass it through the model. If a GPU is available, we will use it to go faster.

import torch
from transformers import AutoModelForQuestionAnswering

eval_set_for_model = eval_set.remove_columns(["example_id", "offset_mapping"])
eval_set_for_model.set_format("torch")

device = "cuda" if torch.cuda.is_available() else "cpu"
batch = {k: eval_set_for_model[k].to(device) for k in eval_set_for_model.column_names}
trained_model = AutoModelForQuestionAnswering.from_pretrained(trained_checkpoint).to(device)

with torch.no_grad():
    outputs = trained_model(**batch)

In [85]:
# since the trainer will give us predictions as NumPy arrays, we grab the start and end logits and convert them to that format
start_logits = outputs.start_logits.detach().cpu().numpy()
end_logits = outputs.end_logits.detach().cpu().numpy()

In [86]:
# now we need to find the predicted answer for each example in our small_eval_set. One example may have been split into several features in eval_set, so the first step is to map each example in small_eval_set to the corresponding features in eval_set
import collections

example_to_features = collections.defaultdict(list)
for idx, feature in enumerate(eval_set):
    example_to_features[feature["example_id"]].append(idx)

In [91]:
# now loop through all the examples and, for each example, through all the associated features. For each feature, we grab the n_best start and end logits, excluding positions that give:
# An answer that wouldn't be inside the context
# An answer with negative length
# An answer that is too long (we limit the possibilities at max_answer_length=30)

# Once we have all the scored possible answers for one example, we just pick the one with the best logit score

import numpy as np

n_best = 20
max_answer_length = 30
predicted_answers = []

for example in small_eval_set:
    example_id = example["id"]
    context = example["context"]
    answers = []

    for feature_idx in example_to_features[example_id]:
        start_logit = start_logits[feature_idx]
        end_logit = end_logits[feature_idx]
        offsets = eval_set["offset_mapping"][feature_idx]

        # get the best start and end logits from the n_best results
        start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist() # get the indexes of the n_best start logits
        end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist() # get the indexes of the n_best end logits

        # create a cartesian product of the start and end indexes to get all possible combinations
        for start_index in start_indexes:
            for end_index in end_indexes:
                # skip answers that are not fully in the context
                if offsets[start_index] is None or offsets[end_index] is None:
                    continue
                # skip answers with a length that is either < 0 or > max_answer_length
                if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                    continue
                answers.append(
                    {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "score": start_logit[start_index] + end_logit[end_index],
                    }
                )

    best_answer = max(answers, key=lambda x: x["score"])
    predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})
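
The span selection for a single feature can be sketched without NumPy or a model. Everything below is illustrative: the logits, offsets, and context are invented, `best_span` is a hypothetical helper, and the empty-answer fallback is an assumption added here (the loop above would raise an error if no candidate survived the filters).

```python
def best_span(start_logits, end_logits, offsets, context, n_best=20, max_answer_length=30):
    """Return the highest-scoring valid answer span for one feature."""
    # Indexes of the n_best highest logits, best first.
    start_indexes = sorted(range(len(start_logits)), key=lambda i: start_logits[i], reverse=True)[:n_best]
    end_indexes = sorted(range(len(end_logits)), key=lambda i: end_logits[i], reverse=True)[:n_best]

    candidates = []
    for s in start_indexes:
        for e in end_indexes:
            # Skip positions outside the context (their offsets were set to None).
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans with negative length or more than max_answer_length tokens.
            if e < s or e - s + 1 > max_answer_length:
                continue
            candidates.append(
                {"text": context[offsets[s][0]:offsets[e][1]], "score": start_logits[s] + end_logits[e]}
            )
    if not candidates:
        # Assumed fallback: return an empty answer when nothing is valid.
        return {"text": "", "score": float("-inf")}
    return max(candidates, key=lambda c: c["score"])


context = "Denver Broncos won"
offsets = [None, None, None, (0, 6), (7, 14), (15, 18), None]
start_logits = [0.1, 0.0, 0.0, 5.0, 1.0, 0.5, 0.0]
end_logits = [0.2, 0.0, 0.0, 1.0, 6.0, 0.5, 0.0]

best = best_span(start_logits, end_logits, offsets, context)
print(best)
```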


In [92]:
# the final format of the predicted answers is the one that will be expected by the metric we will use. As usual, we can load it with the help of hf evaluate library

import evaluate

metric = evaluate.load("squad")

In [93]:
# the metric expects predicted answers and theoretical answers.
# the predicted answers in the format: a list of dictionaries with one key for the ID of the example and one key for the predicted answer
# the theoretical answers in the format: a list of dictionaries with one key for the ID of the example and one key for the possible answers

theoritical_answers = [
    {"id": ex["id"], "answers": ex["answers"]} for ex in small_eval_set
]

In [94]:
# let's check the results
print(predicted_answers[0])
print(theoritical_answers[0])

{'id': '56be4db0acb8001400a502ec', 'prediction_text': 'Denver Broncos'}
{'id': '56be4db0acb8001400a502ec', 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}}


In [95]:
# not too bad! let's have a look at the score the metric gives us
metric.compute(predictions=predicted_answers, references=theoritical_answers)

{'exact_match': 83.0, 'f1': 88.25000000000004}
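
To get a feel for what the two numbers measure, here is a simplified, standard-library sketch of exact match and token-level F1 for a single prediction scored against several references. It is modeled on my understanding of the SQuAD v1.1 evaluation (lowercasing, stripping punctuation and English articles, taking the best score over the references). The function names are introduced here, scores are on a 0 to 1 scale rather than percentages, and the metric loaded with `evaluate.load("squad")` remains the reference implementation.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """1.0 if the normalized prediction equals any normalized reference, else 0.0."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(ref) for ref in references))


def f1_score(prediction, references):
    """Best token-overlap F1 between the prediction and any reference."""
    pred_tokens = normalize_answer(prediction).split()
    best = 0.0
    for ref in references:
        ref_tokens = normalize_answer(ref).split()
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


em = exact_match("the Denver Broncos", ["Denver Broncos"])
partial_f1 = f1_score("Santa Clara", ["Santa Clara, California"])
print(em, partial_f1)
```

A prediction that shares only some tokens with a reference lowers exact match but still earns partial F1 credit, which is why F1 is the higher of the two scores above.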