# Question answering with BERT (HuggingFace)

Transformer models have reshaped deep learning. Transformer-based models such as BERT are widely used in NLP because of the rich numerical representations of text they provide. Here we use Hugging Face's `transformers` library, which makes it convenient to explore a range of transformer-based NLP models, to train a question answering model on the well-known SQuAD v1 dataset.


<table align="left">
    <td>
        <a target="_blank" href="https://colab.research.google.com/github/thushv89/manning_tf2_in_action/blob/master/Ch13-Transormers-with-TF2-and-Huggingface/13.2_Question_answering_with_BERT.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
    </td>
</table>


## Import libraries

In [1]:
import random
import numpy as np
import transformers
from datasets import load_dataset
from transformers import DistilBertTokenizerFast
from transformers import DistilBertConfig, TFDistilBertForQuestionAnswering
import tensorflow as tf
import time

def fix_random_seed(seed):
    """ Setting the random seed of various libraries """
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy is not imported. Setting the seed for Numpy failed.")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow is not imported. Setting the seed for TensorFlow failed.")
    try:
        random.seed(seed)
    except NameError:
        print("Warning: random module is not imported. Setting the seed for random failed.")
    try:
        transformers.trainer_utils.set_seed(seed)
    except NameError:
        print("Warning: transformers module is not imported. Setting the seed for transformers failed.")
        
# Fixing the random seed
random_seed=4321
fix_random_seed(random_seed)


## Download the dataset

For this we will be using the [SQuAD v1 dataset](https://rajpurkar.github.io/SQuAD-explorer/), a question answering dataset. Each example provides a question, a context (e.g. a paragraph in which the answer can be found) and the answer itself. The goal is to predict the answer given the question and the context.

In [4]:
# Section 13.3

from datasets import load_dataset

dataset = load_dataset("squad")
print("")

Reusing dataset squad (C:\Users\carlos\.cache\huggingface\datasets\squad\plain_text\1.0.0\d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)



## Inspect the dataset and the first 5 answers in the training set

In [5]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


In [6]:
dataset["train"]["answers"][:5]

[{'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]},
 {'text': ['a copper statue of Christ'], 'answer_start': [188]},
 {'text': ['the Main Building'], 'answer_start': [279]},
 {'text': ['a Marian place of prayer and reflection'], 'answer_start': [381]},
 {'text': ['a golden statue of the Virgin Mary'], 'answer_start': [92]}]

## Correcting incorrect offsets of the provided answers

Each answer is provided as a starting index (`answer_start`) and the answer text itself (`text`). However, for some examples the starting index can be slightly off from the actual position. The function below corrects that. It also adds `answer_end`, the (exclusive) index at which the answer ends, i.e. `answer_start` plus the length of the answer.

1. Track how many were correct and fixed.
2. New fixed answers will be held in this variable.
3. Iterate through each answer context pair.
4. Convert the answer from a list of strings to a string.
5. Convert the start of the answer from a list of integers to an integer.
6. Compute the end index by adding the answer’s length to the `start_idx`.
7. If the slice from `start_idx` to `end_idx` exactly matches the answer text, no changes are required.
8. If the slice from `start_idx` to `end_idx` needs to be offset by 1 to match the answer, offset accordingly.
9. If the slice from `start_idx` to `end_idx` needs to be offset by 2 to match the answer, offset accordingly.
10. Print the number of correct answers (requires no change).
11. Print the number of answers that required fixing.

In [7]:
# Section 13.3

# Code listing 13.5
def correct_indices_add_end_idx(answers, contexts):
    """ Correct the answer index of the samples (if wrong) """
    
    # Track how many were correct and fixed
    n_correct, n_fix = 0, 0
    fixed_answers = []
    for answer, context in zip(answers, contexts):

        gold_text = answer['text'][0]
        answer['text'] = gold_text
        start_idx = answer['answer_start'][0]
        answer['answer_start'] = start_idx
        if start_idx <0 or len(gold_text.strip())==0:
            print(answer)
        end_idx = start_idx + len(gold_text)        
        
        # sometimes squad answers are off by a character or two – fix this
        if context[start_idx:end_idx] == gold_text:
            answer['answer_end'] = end_idx
            n_correct += 1
        elif context[start_idx-1:end_idx-1] == gold_text:
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
            n_fix += 1
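        elif context[start_idx-2:end_idx-2] == gold_text:
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters
            n_fix += 1
        else:
            # No match within two characters: keep the nominal end index so that
            # 'answer_end' always exists for the later processing steps
            answer['answer_end'] = end_idx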
        
        fixed_answers.append(answer)
        
    # Print how many samples were fixed
    print("\t{}/{} examples had the correct answer indices".format(n_correct, len(answers)))
    print("\t{}/{} examples had the wrong answer indices".format(n_fix, len(answers)))
    return fixed_answers, contexts

train_questions = dataset["train"]["question"]
print("Training data corrections")
train_answers, train_contexts = correct_indices_add_end_idx(
    dataset["train"]["answers"], dataset["train"]["context"]
)
test_questions = dataset["validation"]["question"]
print("\nValidation data correction")
test_answers, test_contexts = correct_indices_add_end_idx(
    dataset["validation"]["answers"], dataset["validation"]["context"]
)

Training data corrections
	87599/87599 examples had the correct answer indices
	0/87599 examples had the wrong answer indices

Validation data correction
	10570/10570 examples had the correct answer indices
	0/10570 examples had the wrong answer indices
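
Since every label in this copy of the dataset is already exact, the correction branches above never fire. The following is a minimal sketch of the same idea on made-up data, so the off-by-one case can be seen in action. The helper `locate_answer` and the toy sentence are hypothetical and are not part of the notebook or of SQuAD.

```python
def locate_answer(context, gold_text, start_idx, max_shift=2):
    """Return (start, end) with context[start:end] == gold_text, or None.

    Tries the labelled start first, then shifts left by 1..max_shift characters,
    mirroring the checks in correct_indices_add_end_idx.
    """
    for shift in range(max_shift + 1):
        start = start_idx - shift
        end = start + len(gold_text)
        if start >= 0 and context[start:end] == gold_text:
            return start, end
    return None


# Toy data, made up for illustration (not taken from SQuAD)
toy_context = "The tower is 324 metres tall."
toy_answer = "324 metres"

exact = locate_answer(toy_context, toy_answer, 13)       # label is exact
off_by_one = locate_answer(toy_context, toy_answer, 14)  # label is one character too far right
no_match = locate_answer(toy_context, toy_answer, 20)    # nothing within two characters

print(exact, off_by_one, no_match)
```

The first two calls resolve to the same span, while the third returns `None`, which is the case the notebook function does not count as either correct or fixed.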


## Question answering with DistilBert

Next, we train a question answering model. The pretrained model we'll be using is known as [DistilBERT](https://arxiv.org/pdf/1910.01108.pdf). It is a variant of BERT trained using a knowledge distillation mechanism (a type of transfer learning).

### Defining the tokenizer

In [8]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')


### Convert some text to tokens with the tokenizer

In [9]:
context = "This is the context"
question = "This is the question"

token_ids = tokenizer(context, question, return_tensors='tf')
print(token_ids)
print(tokenizer.convert_ids_to_tokens(token_ids['input_ids'].numpy()[0]))

{'input_ids': <tf.Tensor: shape=(1, 11), dtype=int32, numpy=array([[ 101, 2023, 2003, 1996, 6123,  102, 2023, 2003, 1996, 3160,  102]])>, 'attention_mask': <tf.Tensor: shape=(1, 11), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>}
['[CLS]', 'this', 'is', 'the', 'context', '[SEP]', 'this', 'is', 'the', 'question', '[SEP]']


## Converting the inputs to tokens

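In addition to converting inputs to tokens and adding special tokens, the tokenizer can truncate and pad the inputs. With `truncation=True`, sequences longer than the model's maximum length are truncated; with `padding=True`, shorter sequences are padded to the length of the longest sequence in the batch. You can check the maximum length with `tokenizer.model_max_length`.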
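
To make truncation and padding concrete, here is a simplified plain-Python sketch of what happens to a batch of token id lists. It is only an illustration under stated assumptions: `truncate_and_pad` is a hypothetical helper, the pad id of `0` is assumed, and the real tokenizer handles sequence pairs and the trailing `[SEP]` token more carefully than this.

```python
def truncate_and_pad(batch, max_length, pad_id=0):
    """Truncate each id list to max_length, then pad to the longest remaining list.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens, 0 for padding.
    """
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return input_ids, attention_mask


# Two toy sequences of different lengths (ids are arbitrary)
toy_batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]
input_ids, attention_mask = truncate_and_pad(toy_batch, max_length=6)

print(input_ids)
print(attention_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside `input_ids`.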

In [10]:
# Code listing 13.6

# Encode train data
# train_encodings -> transformers.tokenization_utils_base.BatchEncoding
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True, return_tensors='tf')
print("train_encodings.shape: {}".format(train_encodings["input_ids"].shape))
# Encode test data
test_encodings = tokenizer(test_contexts, test_questions, truncation=True, padding=True, return_tensors='tf')
print("test_encodings.shape: {}".format(test_encodings["input_ids"].shape))


train_encodings.shape: (87599, 512)
test_encodings.shape: (10570, 512)


### Dealing with truncated answers

In the original dataset, `answer_start` and `answer_end` denote the *character*-level position of the answer. But the model works with tokens, so we need the *token*-level position of the answer. For that, we will use the `char_to_token` method of the encodings returned by the tokenizer. It converts a character index to a token index.

Because we enforce a maximum sequence length of 512, an answer that lies beyond the 512th token is inevitably truncated. Although this is rare, we still need to handle it, as it can otherwise result in numerical errors. Therefore, if a position is `None` (i.e. the answer could not be found), it is set to the last valid position, `tokenizer.model_max_length - 1`.
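
The mapping from character positions to token positions, including the `None` case for truncated answers, can be sketched without the library. The example below is a simplified stand-in that assumes whitespace tokenization and no special tokens; `whitespace_offsets` and `char_to_token` here are hypothetical helpers, not the Hugging Face implementation.

```python
def whitespace_offsets(text, max_tokens):
    """Return (start, end) character spans of whitespace-separated tokens, truncated to max_tokens."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    return offsets[:max_tokens]


def char_to_token(offsets, char_idx):
    """Return the index of the token covering char_idx, or None if no kept token covers it."""
    for token_idx, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return token_idx
    return None


toy_text = "the answer is forty two"
answer_start, answer_end = 14, 23   # character span of "forty two" (end is exclusive)
max_tokens = 4                      # truncation drops the last token, "two"

offsets = whitespace_offsets(toy_text, max_tokens)
start_pos = char_to_token(offsets, answer_start)
end_pos = char_to_token(offsets, answer_end - 1)

# Same fallback as in the notebook: a missing position becomes the last valid index
if start_pos is None:
    start_pos = max_tokens - 1
if end_pos is None:
    end_pos = max_tokens - 1

print(start_pos, end_pos)
```

Using `max_tokens - 1` rather than `max_tokens` keeps the label inside the range of valid output positions, which is the point made in the code comments of the next cell.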

In [11]:
# Code listing 13.7
def update_char_to_token_positions_inplace(encodings, answers):
    start_positions = []
    end_positions = []
    n_updates = 0
    # Go through all the answers
    for i in range(len(answers)):        
        
        # Get the token position for both start end char positions
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
        
        if start_positions[-1] is None or end_positions[-1] is None:
            n_updates += 1
        # if start position is None, the answer passage has been truncated
        # In the guide, https://huggingface.co/transformers/custom_datasets.html#qa-squad
        # they set it to model_max_length, but this will result in NaN losses as the last
        # available label is model_max_length-1 (zero-indexed)
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length -1
            
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length -1
            
    print("{}/{} had answers truncated".format(n_updates, len(answers)))
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

update_char_to_token_positions_inplace(train_encodings, train_answers)
update_char_to_token_positions_inplace(test_encodings, test_answers)

10/87599 had answers truncated
8/10570 had answers truncated


### Creating TensorFlow dataset

In [12]:
# Section 13.3

import tensorflow as tf
from functools import partial


def data_gen(input_ids, attention_mask, start_positions, end_positions):
    """ Generator for data """
    for inps, attn, start_pos, end_pos in zip(input_ids, attention_mask, start_positions, end_positions):
        
        yield (inps, attn), (start_pos, end_pos)
        
print("Creating train data")

# Define the generator as a callable (not the generator itself)
train_data_gen = partial(data_gen,
    input_ids=train_encodings['input_ids'], attention_mask=train_encodings['attention_mask'],
    start_positions=train_encodings['start_positions'], end_positions=train_encodings['end_positions']
)

# Define the dataset
train_dataset = tf.data.Dataset.from_generator(
    train_data_gen, output_types=(('int32', 'int32'), ('int32', 'int32'))
)
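# Shuffling the data. The order is kept fixed across iterations so that the
# take()/skip() split below always yields the same, non-overlapping subsets
train_dataset = train_dataset.shuffle(1000, seed=random_seed, reshuffle_each_iteration=False)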
print('\tDone')

batch_size = 8
# Valid set is taken as the first 10000 samples in the shuffled set
valid_dataset = train_dataset.take(10000)
valid_dataset = valid_dataset.batch(batch_size)

# Rest is kept as the training data
train_dataset = train_dataset.skip(10000)
train_dataset = train_dataset.batch(batch_size)

# Creating test data
print("Creating test data")

# Define the generator as a callable
test_data_gen = partial(data_gen,
    input_ids=test_encodings['input_ids'], attention_mask=test_encodings['attention_mask'],
    start_positions=test_encodings['start_positions'], end_positions=test_encodings['end_positions']
)
test_dataset = tf.data.Dataset.from_generator(
    test_data_gen, output_types=(('int32', 'int32'), ('int32', 'int32'))
)
test_dataset = test_dataset.batch(batch_size)
print("\tDone")

Creating train data
	Done
Creating test data
	Done
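
The validation split above only makes sense if the "first 10000" and "the rest" are taken from the same shuffled order. In `tf.data`, whether `take` and `skip` see the same order depends on how `shuffle` is configured (for example, its `reshuffle_each_iteration` argument). The plain-Python sketch below illustrates the intended behaviour, shuffling once and then splitting; `split_once` is a hypothetical helper and the sizes are toy values.

```python
import random


def split_once(n_examples, n_valid, seed):
    """Shuffle indices a single time, then split them into train and validation parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return indices[n_valid:], indices[:n_valid]


train_idx, valid_idx = split_once(n_examples=20, n_valid=5, seed=4321)

print(len(train_idx), len(valid_idx))
print(set(train_idx) & set(valid_idx))  # empty set: no example is in both splits
```

If the order were reshuffled independently for each split, the two subsets could share examples, and the validation metrics would then be measured partly on training data.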


### Defining the model

Here we define a DistilBERT model (specifically, the TensorFlow variant) and wrap it in a Keras model.

In [13]:
from transformers import DistilBertConfig, TFDistilBertForQuestionAnswering

config = DistilBertConfig.from_pretrained("distilbert-base-uncased", return_dict=False)
model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased", config=config)

# Code listing 13.8
def tf_wrap_model(model):
    """ Wraps the Hugging Face model within the Keras Functional API """
    
    # If this is not wrapped in a keras model by taking the correct tensors from
    # TFQuestionAnsweringModelOutput produced, you will get the following error
    # setting return_dict did not seem to work as it should
    
    # TypeError: The two structures don't have the same sequence type. 
    # Input structure has type <class 'tuple'>, while shallow structure has type 
    # <class 'transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput'>.
    
    # Define inputs
    input_ids = tf.keras.layers.Input([None,], dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.layers.Input([None,], dtype=tf.int32, name="attention_mask")
    
    # Define the output (TFQuestionAnsweringModelOutput)
    out = model([input_ids, attention_mask])
    
    # Get the correct attributes in the produced object to generate an output tuple
    wrap_model = tf.keras.models.Model([input_ids, attention_mask], outputs=(out.start_logits, out.end_logits))
    
    return wrap_model


# Define and compile the model

# Keras will assign a separate loss for each output and add them together. So we'll just use the standard CE loss
# instead of using the built-in model.compute_loss, which expects a dict of outputs and averages the two terms.
# Note that this means the loss will be 2x of when using TFTrainer since we're adding instead of averaging them.
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
acc = tf.keras.metrics.SparseCategoricalAccuracy()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

model_v2 = tf_wrap_model(model)
model_v2.compile(optimizer=optimizer, loss=loss, metrics=[acc])


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForQuestionAnswering: ['vocab_layer_norm', 'vocab_transform', 'activation_13', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_19', 'qa_outputs']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
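
The comments in the cell above note that Keras adds the start and end losses, whereas averaging them would give half that value. The sketch below works this out for a single made-up example in plain Python; the logits and labels are arbitrary toy numbers, and `sparse_ce_from_logits` is a hypothetical helper that mimics what a sparse categorical cross-entropy with `from_logits=True` computes for one example.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


# Toy logits over 4 token positions (arbitrary numbers)
start_logits = [2.0, 0.5, -1.0, 0.0]
end_logits = [-0.5, 0.0, 3.0, 1.0]
start_label, end_label = 0, 2

start_loss = sparse_ce_from_logits(start_logits, start_label)
end_loss = sparse_ce_from_logits(end_logits, end_label)

summed = start_loss + end_loss   # what is reported when the two losses are added
averaged = summed / 2            # what an averaging scheme would report

print(start_loss, end_loss, summed, averaged)
```

The two conventions differ only by a constant factor of 2, so they rank models identically; the factor matters mainly when comparing loss values across training setups.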


### Training the model

In [11]:
# Section 13.3

import time

t1 = time.time()

model_v2.fit(
    train_dataset, 
    validation_data=valid_dataset,    
    epochs=3
)

t2 = time.time()

print("It took {} seconds to complete the training".format(t2-t1))

Epoch 1/3


2022-07-27 11:20:02.709340: W tensorflow/core/common_runtime/forward_type_inference.cc:231] Type inference failed. This indicates an invalid graph that escaped type checking. Error message: INVALID_ARGUMENT: expected compatible input types, but input 1:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_BOOL
    }
  }
}
 is neither a subtype nor a supertype of the combined inputs preceding it:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_LEGACY_VARIANT
    }
  }
}

	while inferring type of node 'sparse_categorical_crossentropy/cond/output/_10'


Epoch 2/3
Epoch 3/3
It took 9943.647599935532 seconds to complete the training


### Save the model

In [12]:
print(model_v2.summary())

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, None)]       0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 tf_distil_bert_for_question_an  TFQuestionAnswering  66364418   ['input_ids[0][0]',              
 swering (TFDistilBertForQuesti  ModelOutput(loss=No              'attention_mask[0][0]']         
 onAnswering)                   ne, start_logits=(N                                               
                                one, None),                                                   

**Note**: We cannot save `model_v2` as is, because doing so raises an error about not finding the config for the transformer model layer. Therefore, we will save just the transformer model layer, so that we can call the `tf_wrap_model()` function at any time and get the wrapped model.

In [13]:
import os

# Create folders (no error is raised if they already exist)
os.makedirs('models', exist_ok=True)
os.makedirs('tokenizers', exist_ok=True)

# Save the model
model_v2.get_layer("tf_distil_bert_for_question_answering").save_pretrained(os.path.join('models', 'distilbert_qa'))

# Save the tokenizer
tokenizer.save_pretrained(os.path.join('tokenizers', 'distilbert_qa'))



('tokenizers/distilbert_qa/tokenizer_config.json',
 'tokenizers/distilbert_qa/special_tokens_map.json',
 'tokenizers/distilbert_qa/vocab.txt',
 'tokenizers/distilbert_qa/added_tokens.json',
 'tokenizers/distilbert_qa/tokenizer.json')

### Testing on unseen data

In [14]:
model_v2.evaluate(test_dataset)



[2.477313995361328,
 1.271211862564087,
 1.2061036825180054,
 0.6571428775787354,
 0.6939451098442078]
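
The five numbers returned by `evaluate` are not labelled. Assuming Keras's usual ordering for a model with two outputs (total loss, then one loss per output, then one metric per output), they can be read as shown in the sketch below. The names in `names` are hypothetical labels chosen for readability, not identifiers reported by the model; the check that the first value equals the sum of the next two is consistent with that reading.

```python
import math

# Values copied from the evaluate() output above
results = [2.477313995361328,
           1.271211862564087,
           1.2061036825180054,
           0.6571428775787354,
           0.6939451098442078]

# Assumed ordering: total loss, per-output losses, per-output accuracies
names = ["total_loss", "start_loss", "end_loss", "start_accuracy", "end_accuracy"]
labelled = dict(zip(names, results))

for name, value in labelled.items():
    print(f"{name:>15}: {value:.4f}")

# The total loss should be (approximately) the sum of the two per-output losses
loss_sum = labelled["start_loss"] + labelled["end_loss"]
print(math.isclose(labelled["total_loss"], loss_sum, rel_tol=1e-5))
```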

## Ask BERT a question ...

In [15]:
# Section 13.3

# Code listing 13.9
i = 5

# Define sample question
sample_q = test_questions[i]
# Define sample context
sample_c = test_contexts[i]
# Define sample answer 
sample_a = test_answers[i]

# Get the input in the format BERT accepts
sample_input = (test_encodings["input_ids"][i:i+1], test_encodings["attention_mask"][i:i+1])

def ask_bert(sample_input, tokenizer, model):
    """ This function takes an input, a tokenizer, a model and returns the prediction """
    out = model.predict(sample_input)
    pred_ans_start = tf.argmax(out[0][0])
    pred_ans_end = tf.argmax(out[1][0])
    print("{}-{} token ids contain the answer".format(pred_ans_start, pred_ans_end))
    ans_tokens = sample_input[0][0][pred_ans_start:pred_ans_end+1]
    
    return " ".join(tokenizer.convert_ids_to_tokens(ans_tokens))

print("Question")
print("\t", sample_q, "\n")
print("Context")
print("\t", sample_c, "\n")
print("Answer (char indexed)")
print("\t", sample_a, "\n")
print('='*50,'\n')

sample_pred_ans = ask_bert(sample_input, tokenizer, model_v2)

print("Answer (predicted)")
print(sample_pred_ans)
print('='*50,'\n')

Question
	 What was the theme of Super Bowl 50? 

Context
	 Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50. 

Answer (char indexed)
	 {'text': '"golden anniversary"', 'answer_start': 487, 'answer_end': 507} 


98-99 token ids contain the answer
Answer (predicted)
golden a
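
`ask_bert` picks the start and end positions with two independent `argmax` calls, so nothing forces the end to come at or after the start. A common alternative, which this notebook does not use, is to score every valid (start, end) pair and keep the best one. The sketch below shows that idea on toy logits; `best_span` and `max_answer_len` are hypothetical names introduced here for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logit, subject to start <= end."""
    best_score, best = float("-inf"), None
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best


# Toy logits where the independent argmax would put the end before the start
toy_start = [0.1, 4.0, 0.2, 0.3]
toy_end = [5.0, 0.5, 1.0, 3.0]

independent = (toy_start.index(max(toy_start)), toy_end.index(max(toy_end)))
joint = best_span(toy_start, toy_end)

print(independent)  # end index is smaller than start index
print(joint)        # a valid span
```

The nested loop is quadratic in the sequence length, which is why a cap such as `max_answer_len` is typically applied in practice.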

### Debugging the model for NaN losses

Here are a few checks you can run to debug your model if you get NaN losses. The loop below looks for labels that fall outside the valid range of positions and for NaN values in the logits.

In [19]:
from transformers import DistilBertConfig, TFDistilBertForQuestionAnswering

config = DistilBertConfig.from_pretrained("distilbert-base-uncased", return_dict=True)
model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased", config=config)

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
acc = tf.keras.metrics.SparseCategoricalAccuracy()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
model.compile(optimizer=optimizer, loss=loss, metrics=[acc])

for i,(x,y) in enumerate(train_dataset.take(100)):
    print(f"Checked data point at index {i}", end='\r')
    
    
    # Get the model output
    out = model(x)
    
    # Check any index in the labels is greater than the max length
    if tf.reduce_sum(tf.cast(y[0]>511, 'int32'))>0:
        print('start label out of range >')
        print(out.start_logits)
        print(x)
        print(y[0])
        break
    # Check if any index in the labels is smaller than zero
    if tf.reduce_sum(tf.cast(y[0]<0, 'int32'))>0:
        print('start label out of range <')
        print(out.start_logits)
        print(x)
        print(y[0])
        break
    # Check any index in the labels is greater than the max length
    if tf.reduce_sum(tf.cast(y[1]>511, 'int32'))>0:
        print('end label out of range >')
        print(out.end_logits)
        print(x)
        print(y[1])
        break
    # Check if any index in the labels is smaller than zero
    if tf.reduce_sum(tf.cast(y[1]<0, 'int32'))>0:
        print('end label out of range <')
        print(out.end_logits)
        print(x)
        print(y[1])
        break
    # Check if any start logit is NaN
    if tf.math.is_nan(tf.reduce_sum(out.start_logits)):
        print('start_logits were nan')
        print(out.start_logits)
        print(x)
        print(y)
        break
    # Check if any end logit is NaN
    if tf.math.is_nan(tf.reduce_sum(out.end_logits)):
        print('end_logits were nan')
        print(out.end_logits)
        print(x)
        print(y)
        break

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForQuestionAnswering: ['vocab_layer_norm', 'vocab_transform', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs', 'dropout_99']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Checked data point at index 99
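
The label checks in the loop above can also be run on the plain Python lists of positions before any TensorFlow dataset is built. The sketch below is one way to do that; `out_of_range_labels` is a hypothetical helper, and the sample labels are made up to show each failure case.

```python
def out_of_range_labels(labels, seq_len):
    """Return (index, label) pairs whose label is missing or not a valid position in [0, seq_len)."""
    return [
        (i, label)
        for i, label in enumerate(labels)
        if label is None or not 0 <= label < seq_len
    ]


# Made-up labels: two valid, one too large, one negative, one missing
toy_labels = [5, 511, 512, -1, None]
problems = out_of_range_labels(toy_labels, seq_len=512)

print(problems)
```

With the encodings built earlier, the same helper could be applied to `train_encodings['start_positions']` and `train_encodings['end_positions']`, which would flag a problematic label without running the model at all.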