Referenced: https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb




In [None]:
# install dependencies (required when running on Google Colab)
!pip install transformers==4.8.1
!pip install datasets 

Collecting transformers==4.8.1
  Downloading transformers-4.8.1-py3-none-any.whl (2.5 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Installing collected packages: tokenizers, sacremoses, huggingface-hub, transformers
Successfully installed huggingface-hub-0.0.12 sacremoses-0.0.45 tokenizers-0.10.3 transformers-4.8.1
Collecting datasets
  Downloading datasets-1.12.1-py3-none-any.whl (270 kB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m

In [None]:
from datasets import load_dataset, load_metric 

In [None]:
# load dataset 
datasets = load_dataset("squad_v2")

Downloading and preparing dataset squad_v2/squad_v2 (download: 44.34 MiB, generated: 122.41 MiB, post-processed: Unknown size, total: 166.75 MiB) to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d...

Dataset squad_v2 downloaded and prepared to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d. Subsequent calls will reuse this data.

In [None]:
# check contents of dataset 
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

In [None]:
# look at first entry in train set 
datasets['train'][0]

{'answers': {'answer_start': [269], 'text': ['in the late 1990s']},
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'id': '56be85543aeaaa14008c9063',
 'question': 'When did Beyonce start becoming popular?',
 'title': 'Beyoncé'}

# Preprocess train data 

- In order to train a model on our text data we first need to tokenize it, that is, convert each string into the sequence of integer token ids that the model maps to embedding vectors. 

- In Hugging Face each model comes with a tokenizer that does most of the heavy lifting for us. 

In [None]:
from transformers import AutoTokenizer

# instantiate tokenizer instance for same model being used 
tokenizer = AutoTokenizer.from_pretrained('google/electra-small-discriminator')


In [None]:
type(tokenizer) # confirm fast tokenizer 

transformers.models.electra.tokenization_electra_fast.ElectraTokenizerFast



"Fast" tokenizers in the Hugging Face library provide several additional features that allow us to go back and forth between the original string representation and the token space. This will prove useful when trying to convert predictions from the model back to a string representation for evaluation purposes.   


In [None]:
tokenizer.model_max_length # check max size of tokens accepted by model 

512



The model we have selected accepts at most 512 tokens. A question-context pair longer than 512 tokens cannot be represented by a single feature, and if we simply discarded every token past the maximum we could end up discarding the answer. 

We can work around this by allowing examples with longer text input to be split into multiple features, each no longer than the maximum allowed by the model. 

To handle the case where the answer sits at or near the split point of a long context, we allow some overlap between the features generated from the same example. The amount of overlap between two consecutive features is often referred to as the stride or doc stride. 

Fortunately, the tokenizer class in Hugging Face has built-in functionality for working with inputs longer than the maximum allowed by the model.

For our purposes we will use 512 as the max length allowed for a feature. By using the largest possible length we minimize the number of features for each example. 

We will use a doc stride of 128, a quarter of the max token length. With such a large overlap we are trying to minimize the chance that the answer to a question gets cut in two when a long piece of text is split. 

In [None]:
max_length = 512 # The maximum length of a feature (question and context)
doc_stride = 128 # The allowed overlap between two parts of the context when it needs to be split.
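
To make the idea of splitting with overlap concrete, here is a minimal sketch on a toy list of integers standing in for tokens. The function `sliding_windows` and its parameter names are hypothetical and written only for illustration; this is a simplified picture of the mechanism, not the tokenizer's actual implementation.

```python
def sliding_windows(tokens, window_size, stride):
    """Split `tokens` into windows of at most `window_size` items,
    where consecutive windows share `stride` items."""
    if not 0 <= stride < window_size:
        raise ValueError("stride must be non-negative and smaller than window_size")
    step = window_size - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # this window reached the end of the input
        start += step
    return windows


# Toy input: 10 "tokens", windows of 6 with an overlap of 2.
toy_tokens = list(range(10))
toy_windows = sliding_windows(toy_tokens, window_size=6, stride=2)
for window in toy_windows:
    print(window)
# [0, 1, 2, 3, 4, 5]
# [4, 5, 6, 7, 8, 9]
```

The last two items of the first window reappear at the start of the second one, so anything that straddles the cut is still seen whole in at least one window, provided it is no longer than the stride.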

Let's consider an example from the dataset whose text input is longer than the max token length supported by the model. 


In [None]:
# get an example from the dataset that is longer than max_length 
for i,example in enumerate(datasets['train']):
  # check if the encoded question + context has more tokens than the model's max  
  if len(tokenizer(example["question"], example["context"])['input_ids']) > max_length:
    break 
long_example = datasets['train'][i]
long_example


Token indices sequence length is longer than the specified maximum sequence length for this model (518 > 512). Running this sequence through the model will result in indexing errors


{'answers': {'answer_start': [3], 'text': ['1565']},
 'context': 'In 1565, the powerful Rinbung princes were overthrown by one of their own ministers, Karma Tseten who styled himself as the Tsangpa, "the one of Tsang", and established his base of power at Shigatse. The second successor of this first Tsang king, Karma Phuntsok Namgyal, took control of the whole of Central Tibet (Ü-Tsang), reigning from 1611–1621. Despite this, the leaders of Lhasa still claimed their allegiance to the Phagmodru as well as the Gelug, while the Ü-Tsang king allied with the Karmapa. Tensions rose between the nationalistic Ü-Tsang ruler and the Mongols who safeguarded their Mongol Dalai Lama in Lhasa. The fourth Dalai Lama refused to give an audience to the Ü-Tsang king, which sparked a conflict as the latter began assaulting Gelug monasteries. Chen writes of the speculation over the fourth Dalai Lama\'s mysterious death and the plot of the Ü-Tsang king to have him murdered for "cursing" him with illness, a

Let's tokenize `long_example` to get a better idea of how the tokenizer works. 

In [None]:
tokenized_example = tokenizer(
    long_example['question'],
    long_example['context'],)
print(len(tokenized_example['input_ids']))

518


If we do not truncate the input, the tokenizer returns a token sequence that is longer than what the model will accept. 

Since we want to feed this data to our model, let's truncate the input so the output is restricted to the maximum size allowed by the model. 
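
The cell that follows passes `truncation="only_second"`, which shortens only the second sequence of the pair (the context) and leaves the first (the question) intact. A rough sketch of that behaviour on plain Python lists is shown here; `truncate_only_second` is a hypothetical helper, and the default of 3 special tokens assumes a BERT-style layout of `[CLS] question [SEP] context [SEP]`.

```python
def truncate_only_second(first, second, max_length, num_special_tokens=3):
    """Shorten `second` so that the pair plus special tokens fits in
    `max_length`. `first` is never shortened."""
    budget = max_length - num_special_tokens - len(first)
    if budget < 0:
        raise ValueError("the first sequence alone does not fit in max_length")
    return first, second[:budget]


toy_question = ["when", "?"]
toy_context = ["in", "1565", ",", "the", "princes", "were", "overthrown"]
kept_question, kept_context = truncate_only_second(
    toy_question, toy_context, max_length=8
)
print(kept_question)  # ['when', '?']
print(kept_context)   # ['in', '1565', ',']
```

Truncating the question instead would change what is being asked, which is why only the context is cut for question answering.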

In [None]:
tokenized_example = tokenizer(
    long_example['question'],
    long_example['context'],
    max_length= max_length,
    truncation= "only_second",
    )
print(tokenized_example.keys())
print('token_type_ids: ',tokenized_example['token_type_ids'])
print(f"size of feature: {len(tokenized_example['input_ids'])}")

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
token_type_ids:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

The `input_ids` are the vocabulary indices of the tokens in our input. The model uses these indices to look up the embedding vector of each token. 

We gave the tokenizer two separate string inputs (question and context) and it returned a single sequence in `input_ids` representing the question-context pair. 
  - The `token_type_ids` are a binary mask that tells us whether a particular input id belongs to the first text input (question) or the second (context). A 0 indicates the token is part of the question and a 1 indicates it is part of the context. 
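
As an illustration of how the two inputs end up in one sequence, the sketch below lays out a toy pair by hand. It assumes the BERT-style convention in which `[CLS]`, the question, and the first `[SEP]` get type 0 while the context and the final `[SEP]` get type 1; `build_pair` is a hypothetical helper, not a library function.

```python
def build_pair(question_tokens, context_tokens):
    """Lay out a question-context pair in BERT style and return the
    tokens together with their token type ids."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # [CLS], the question, and the first [SEP] belong to segment 0;
    # the context and the final [SEP] belong to segment 1.
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, token_type_ids


pair_tokens, pair_type_ids = build_pair(["when", "?"], ["in", "1565"])
for token, type_id in zip(pair_tokens, pair_type_ids):
    print(type_id, token)
```

This matches the output above, where the 12 leading zeros cover `[CLS]`, the 10 question tokens, and the first `[SEP]`.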

https://huggingface.co/transformers/preprocessing.html


In [None]:
tokenized_example = tokenizer(
    long_example['question'],
    long_example['context'],
    max_length= max_length,
    truncation= "only_second",
    return_overflowing_tokens=True,
    stride=doc_stride,
    padding=True
    )

For our running example, adding the following arguments to the tokenizer call lets us work with text inputs that are longer than the maximum allowed by the model. 
  - The `padding=True` argument ensures that every feature has the same length for modeling purposes. It does this by appending a placeholder (padding) token after the end of the text until the feature is as long as the longest feature in the batch, which here is the 512-token first span. 
  - As mentioned earlier, the `stride` argument defines the overlap between consecutive features from the same example. This gives the model context around portions of the text that are close to the cut-off point. 
  - The `return_overflowing_tokens` argument makes the tokenizer return the tokens that did not fit as additional features instead of discarding them, together with an `overflow_to_sample_mapping` that lets us go from each span or feature back to the original text input. 



In [None]:
tokenized_example.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'])

In [None]:
print('number of features/spans: ',len(tokenized_example['input_ids'])) # number of features for example 
print('length of first feature: ',len(tokenized_example['input_ids'][0])) # size of first feature 
print('length of second feature: ', len(tokenized_example['input_ids'][1])) # size of second feature 



number of features/spans:  2
length of first feature:  512
length of second feature:  512


After adding the arguments to pad and to return overflowing tokens we see that 
  - For the long example (longer than the max allowed by the model) we now have two features to represent the text. This is evident from the fact that we get two lists in `input_ids` from the single input. 
  - Each feature is of the same size as a result of padding 

- Without padding and truncation the text had 518 tokens. After padding and truncation the second span carries the 6 tokens that did not fit in the first span (in addition to the question and the 128 overlapping stride tokens), while the remainder of its positions are simply padding placeholders. 
  - If we could tell our model which tokens are actual input and which are simply placeholders, it could attend only to the actual input and avoid predicting that the answer is in the padding. 
  - The attention mask provides exactly this information!  

In [None]:
import numpy as np
np.set_printoptions(threshold=np.inf)

np.array(tokenized_example['attention_mask'])

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

- We can see that the attention mask for the first span is all 1's, indicating that every position holds an actual input token. 
- The attention mask for the second span is 1 for the tokens that represent actual input and 0 where the value in that place is only padding. 
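
The relationship between padding and the attention mask can be sketched with plain lists. The helper `pad_with_mask` is hypothetical, and the padding id of 0 is an assumption made for the toy data; the real value comes from the tokenizer.

```python
def pad_with_mask(features, pad_id=0):
    """Pad every list of token ids to the length of the longest one and
    build the matching attention masks (1 = real token, 0 = padding)."""
    target = max(len(ids) for ids in features)
    padded, masks = [], []
    for ids in features:
        n_pad = target - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks


# Two toy features of different lengths (the ids are made up).
toy_features = [[101, 7, 8, 9, 102], [101, 7, 102]]
toy_padded, toy_masks = pad_with_mask(toy_features)
print(toy_padded)  # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(toy_masks)   # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The longest feature needs no padding, so its mask is all ones, just like the first span above.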

In [None]:
tokenized_example.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'])

When the input is too long, it is converted into a batch of inputs with overflowing tokens and a stride of overlap between the inputs. If a batch of inputs is given, the special output `overflow_to_sample_mapping` indicates which member of the encoded batch belongs to which original batch sample.

For each span in the encoded batch the `overflow_to_sample_mapping` tells us which original batch sample that span corresponds to. 

In [None]:
tokenized_example['overflow_to_sample_mapping']

[0, 0]

In this particular case, since we only tokenized one example which resulted in two spans the `overflow_to_sample_mapping` gives a value of 0 for each span. This indicates that both tokenized spans originated from the same input.  
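
When a whole batch is tokenized, the mapping contains one entry per feature, and grouping by its values recovers which features belong to which example. The sketch below uses a made-up mapping; `features_per_sample` is a hypothetical helper.

```python
from collections import defaultdict


def features_per_sample(overflow_to_sample_mapping):
    """Map each original sample index to the list of feature indices
    that were produced from it."""
    groups = defaultdict(list)
    for feature_index, sample_index in enumerate(overflow_to_sample_mapping):
        groups[sample_index].append(feature_index)
    return dict(groups)


# Hypothetical mapping for a batch of three samples in which only the
# second sample was long enough to be split into two features.
toy_mapping = [0, 1, 1, 2]
print(features_per_sample(toy_mapping))  # {0: [0], 1: [1, 2], 2: [3]}

# The mapping from the cell above: one sample, two features.
print(features_per_sample([0, 0]))       # {0: [0, 1]}
```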

Now that we have split the original question-context pair into multiple spans, we need to find which of the spans contains the answer and where exactly in that span it is located. Thankfully, the tokenizer we're using can help with that: passing the `return_offsets_mapping` argument makes it return an `offset_mapping`. 


In [None]:
# tokenize with offset mappings 
tokenized_example = tokenizer(
    long_example["question"],
    long_example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride,
    return_offsets_mapping=True,
)
print(tokenized_example["offset_mapping"][0][:10])
print(tokenized_example['input_ids'][0][:10])
print(len(tokenized_example['input_ids'][0])==len(tokenized_example["offset_mapping"][0]))

[(0, 0), (0, 4), (5, 9), (10, 13), (14, 16), (16, 18), (18, 21), (22, 29), (30, 39), (39, 40)]
[101, 2043, 2020, 1996, 15544, 27698, 5575, 12000, 16857, 2078]
True


In [None]:
# testing offset mapping 
print(long_example["question"][:4])
print(long_example["question"][5:9])
print(long_example["question"])


When
were
When were the Rinbung princes overthrown?


For each token id in our `input_ids` the offset mapping gives us the corresponding start and end character index in the original text. 

The very first token ([CLS]) has (0, 0) because it doesn't correspond to any part of the question or the context.


In [None]:
# show how to go from offset mapping back to original tokens 
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
# slice the question of the example we tokenized (long_example) with the token's character offsets
print(tokenizer.convert_ids_to_tokens([first_token_id]), long_example["question"][offsets[0]:offsets[1]])

['when'] When


The offset mapping gives us a way to convert between token index and character index in original text. This mapping can be used to get the position of start/end tokens of our answer in a particular feature. 
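
To see what an offset mapping contains without loading a model, here is a toy tokenizer that splits on whitespace and records character positions. It is only an analogy: `whitespace_tokenize_with_offsets` is a hypothetical helper, and the real tokenizer splits text into sub-word pieces rather than whole words.

```python
import re


def whitespace_tokenize_with_offsets(text):
    """Toy tokenizer: split on whitespace and record, for every token,
    the (start, end) character positions it came from."""
    tokens, offsets = [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group().lower())
        offsets.append((match.start(), match.end()))
    return tokens, offsets


toy_text = "When were the Rinbung princes overthrown?"
toy_tokens, toy_offsets = whitespace_tokenize_with_offsets(toy_text)

# Slicing the original string with each offset pair recovers the surface
# form of the token, including the casing that lower() removed.
recovered = [toy_text[start:end] for start, end in toy_offsets]
print(toy_tokens[:3])   # ['when', 'were', 'the']
print(toy_offsets[:3])  # [(0, 4), (5, 9), (10, 13)]
print(recovered[:3])    # ['When', 'were', 'the']
```

The first three offsets agree with the ones printed by the real tokenizer above, because those words are not split into smaller pieces.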

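To use the offsets to locate the answer, we first have to distinguish which offsets correspond to the question and which correspond to the context, because the offsets of both sequences are character positions counted from the start of their own string.

This is where the `sequence_ids` method of our `tokenized_example` is useful: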

In [None]:
sequence_ids = tokenized_example.sequence_ids()
sequence_ids[:20]

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1]

The `sequence_ids` method tells us which part of the text input each token came from: 
  - returns None for special tokens
  - returns 0 for tokens from the first sequence
  - returns 1 for tokens from the second sequence. 

In our case, we pass the question as the first sequence and the context as the second sequence to the tokenizer. Therefore tokens with a `sequence_ids` value of 0 come from the question and tokens with a value of 1 come from the context. 

When looking for answers we should make sure they are not part of the question. This avoids possibly returning the question as the predicted answer.  
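
Given a list shaped like the output of `sequence_ids`, the first and last context tokens can be located with two list searches. This is a small sketch on made-up data; `context_token_bounds` is a hypothetical helper.

```python
def context_token_bounds(sequence_ids):
    """Return the indices of the first and last token whose sequence id is 1."""
    first = sequence_ids.index(1)
    # Search the reversed list to find the last occurrence.
    last = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
    return first, last


# Made-up feature: [CLS], 3 question tokens, [SEP], 4 context tokens,
# [SEP], then 2 padding positions (special tokens and padding are None).
toy_sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, None, None, None]
print(context_token_bounds(toy_sequence_ids))  # (5, 8)
```

Restricting the search for an answer to this index range keeps both the question and the padding out of consideration.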



https://huggingface.co/transformers/main_classes/tokenizer.html

In [None]:
# get answer to question
answers = long_example['answers']
# get index where answer starts 
start_char_index = answers['answer_start'][0]
# end index = start index + length of answer 
end_char_index = start_char_index + len(answers['text'][0])


# get the index of the first context token in the current span 
token_start_index = 0 # start at 0 
while sequence_ids[token_start_index] != 1: # while token_start_index is not in the context 
  token_start_index +=1 # keep incrementing until it hits the first token in the context  

# get the index of the last context token in the current span 
token_end_index = len(tokenized_example['input_ids'][0]) -1 # start at the last token  
while sequence_ids[token_end_index] != 1:  # while token_end_index is not in the context 
  token_end_index -=1  # shift left (look at a smaller index) 

# Detect if the answer is out of the span(in which case this feature is labeled with CLS)
offsets = tokenized_example["offset_mapping"][0] 
# if answer starts and ends within current span  
if (offsets[token_start_index][0] <= start_char_index and offsets[token_end_index][1] >= end_char_index): 
    # Move the token_start_index and token_end_index to the two ends of the answer.
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char_index: 
        token_start_index += 1 # keep shifting to the right until token_start_index points to start index of answer in context
    start_position = token_start_index - 1 
    while offsets[token_end_index][1] >= end_char_index: 
        token_end_index -= 1 # keep shifting left until token_end_index points to index of end of answer in context 
    end_position = token_end_index + 1 
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")


13 14




The tokenizer class in the Hugging Face library also lets us go from input ids back to a string using the `decode` method. We can use this to confirm that we got the correct start and end positions in terms of token index. 

In [None]:
# confirm answer
print(tokenizer.decode(tokenized_example['input_ids'][0][start_position:end_position+1]))
print(answers['text'][0])

1565
1565



Now that we know how to prepare features for modeling we can put all the previous steps into a function for use on our entire dataset. 

When the answer is not within the current span/feature, we set the start and end position to 0, the index of the [CLS] token, indicating that this feature contains no answer. 
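
The labelling rule for a single feature can be condensed into one small function and checked on toy data before applying it to the whole dataset. This is a simplified sketch of the same logic; `label_span` and the toy offsets are hypothetical and assume that offsets increase from left to right within the context.

```python
def label_span(offsets, sequence_ids, answer_start, answer_text):
    """Return (start_token, end_token) for the answer inside one feature,
    or (0, 0) when the answer is not fully contained in the feature."""
    answer_end = answer_start + len(answer_text)
    context = [i for i, s in enumerate(sequence_ids) if s == 1]
    first, last = context[0], context[-1]
    if offsets[first][0] > answer_start or offsets[last][1] < answer_end:
        return 0, 0  # answer lies (partly) outside this feature
    # Last context token that starts at or before the answer start.
    start = max(i for i in context if offsets[i][0] <= answer_start)
    # First context token that ends at or after the answer end.
    end = min(i for i in context if offsets[i][1] >= answer_end)
    return start, end


# Toy feature for the context "In 1565," split as In | 15 | 65 | ,
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, None]
toy_offsets = [(0, 0), (0, 4), (4, 5), (0, 0), (0, 2), (3, 5), (5, 7), (7, 8), (0, 0)]

print(label_span(toy_offsets, toy_sequence_ids, 3, "1565"))  # (5, 6)
print(label_span(toy_offsets, toy_sequence_ids, 20, "xyz"))  # (0, 0)
```

The second call asks for an answer that starts at character 20, beyond the end of this feature's context, so the feature is labelled with position 0.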


In [None]:
def prepare_train_features(examples):
    ''' Tokenize a batch of examples from the SQuAD dataset for training. 
    Examples longer than max_length tokens are split into several overlapping features. 
    returns: tokenized examples with: input ids, attention_mask, answer start position index, answer end position index''' 

    # tokenize examples, accounting for some of them being too long to fit in a single feature 
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # get mapping from features to corresponding example in dataset  
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    #  get offset_mapping to map tokens to character position in original context  
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # add keys for start_positions and end_positions of answers to tokenized_examples 
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    # iterate through all offset_mappings (for each token, the start and end character in the original text that produced it)
    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i] # input ids of this feature 
        #cls_index = input_ids.index(tokenizer.cls_token_id)         # get cls index to label impossible answers 

        # Grab the sequence ids of this feature (to know which tokens are context and which are question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, so look up the example this span came from.
        sample_index = sample_mapping[i] # index of the example containing this span of text.
        answers = examples["answers"][sample_index] # get answers for this example 
        # If no answers are given for this example set the answer start and end position to 0 
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(0)
            tokenized_examples["end_positions"].append(0)
        else: # answers are given 
            # get start and end character index of the answer in the context.
            start_char_index = answers["answer_start"][0]
            end_char_index = start_char_index + len(answers["text"][0])

            # get start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # get end token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # if the answer is out of the span 
            if not (offsets[token_start_index][0] <= start_char_index and offsets[token_end_index][1] >= end_char_index):
                # set start and end position for answer to 0 since it is not in this span
                tokenized_examples["start_positions"].append(0)
                tokenized_examples["end_positions"].append(0)
            else: # the answer is in the span 
                # Move the token_start_index and token_end_index to the two ends of the answer.
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char_index: 
                    token_start_index += 1 # keep shifting to the right until token_start_index refers to where the answer starts in text 
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char_index: # start on right side and work down
                    token_end_index -= 1 # keep shifting left until token_end_index points to where answer ends in text 
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

In [None]:
tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)
