# Q&A with finetuned BERT

In this section, let's learn how to perform question answering with a BERT model finetuned for Q&A. First, let us install the transformers library and import the necessary modules:

In [None]:
%%capture
!pip install transformers==3.5.1

In [None]:
import torch
from transformers import BertForQuestionAnswering, BertTokenizer


Now, we download and load the model. We use the bert-large-uncased-whole-word-masking-finetuned-squad model, which is finetuned on SQuAD (the Stanford Question Answering Dataset).


In [None]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Next, we download and load the tokenizer that was used to train the bert-large-uncased-whole-word-masking-finetuned-squad model:


In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')



Now that we have downloaded the model and tokenizer, let's preprocess the input.

## Preprocessing the input
First, we define the input to BERT, which is the question and the paragraph text:


In [None]:
question = "What is the immune system?"
paragraph = "The immune system is a system of many biological structures and processes within an organism that protects against disease. To function properly, an immune system must detect a wide variety of agents, known as pathogens, from viruses to parasitic worms, and distinguish them from the organism's own healthy tissue."

Add the [CLS] token to the beginning of the question and the [SEP] token to the end of both the question and the paragraph:

In [None]:
# A space before [SEP] keeps the special token visually separate from the text
question = '[CLS] ' + question + ' [SEP]'
paragraph = paragraph + ' [SEP]'


Now, tokenize the question and paragraph:


In [None]:
question_tokens = tokenizer.tokenize(question)
paragraph_tokens = tokenizer.tokenize(paragraph)

In [None]:
question_tokens

['[CLS]', 'what', 'is', 'the', 'immune', 'system', '?', '[SEP]']
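To give a feel for what `tokenizer.tokenize` does with words that are not in the vocabulary as a whole, here is a minimal sketch of greedy longest-match-first WordPiece splitting. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, chosen only for illustration; the real tokenizer uses the full BERT vocabulary and also handles lowercasing and punctuation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT vocabulary.
toy_vocab = {"immune", "system", "path", "##ogen", "##s"}

print(wordpiece_tokenize("immune", toy_vocab))     # ['immune']
print(wordpiece_tokenize("pathogens", toy_vocab))  # ['path', '##ogen', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

All the question tokens above happen to be whole words, so no `##` pieces appear in that output.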



Combine the question and paragraph tokens and convert them to input_ids:

In [None]:
tokens = question_tokens + paragraph_tokens
input_ids = tokenizer.convert_tokens_to_ids(tokens)



Next, we define the segment_ids. They tell the model which tokens belong to the question and which belong to the paragraph: the segment_ids are 0 for all the tokens of the question and 1 for all the tokens of the paragraph:


In [None]:
segment_ids = [0] * len(question_tokens)
segment_ids += [1] * len(paragraph_tokens)
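The same step can be wrapped in a small helper with a length check, since the segment_ids must line up one-to-one with the combined tokens. This is only a sketch: `build_segment_ids` is a hypothetical name, and the paragraph tokens below are shortened for illustration.

```python
def build_segment_ids(question_tokens, paragraph_tokens):
    """Return 0 for each question token and 1 for each paragraph token."""
    return [0] * len(question_tokens) + [1] * len(paragraph_tokens)


q = ['[CLS]', 'what', 'is', 'the', 'immune', 'system', '?', '[SEP]']
p = ['the', 'immune', 'system', '[SEP]']  # shortened paragraph, for illustration only

ids = build_segment_ids(q, p)
# One segment id per token, otherwise the model inputs would be misaligned.
assert len(ids) == len(q) + len(p)
print(ids)  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```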


Now we convert the input_ids and segment_ids to tensors. Wrapping each list in an outer list adds the batch dimension that the model expects:

In [None]:
input_ids = torch.tensor([input_ids])
segment_ids = torch.tensor([segment_ids])



Now that we have processed the input, let's feed it to the model and get the result.

## Getting the answer
We feed the input_ids and segment_ids to the model, which returns a start score and an end score for every token:


In [None]:
start_scores, end_scores = model(input_ids, token_type_ids=segment_ids, return_dict=False)

In [None]:
start_scores

tensor([[-6.2588, -4.6880, -6.7744, -6.3712, -5.8096, -8.4909, -9.0369, -6.2588,
          2.3760, -0.8670, -4.0859,  2.1112,  7.0353,  3.1633, -2.0016,  1.8844,
          2.4239, -0.8321, -4.7245, -0.6628, -0.9607, -1.5406, -0.9789, -1.5247,
          1.5805, -3.6135, -1.7062, -6.2587, -4.3460, -5.7781, -6.2772, -7.2236,
         -2.5216, -2.8306, -5.5702, -4.4567, -3.9796, -6.1513, -5.8940, -6.4212,
         -7.3876, -5.6694, -7.7685, -4.6375, -6.5613, -3.7148, -7.0651, -8.1083,
         -5.4551, -4.3829, -7.9004, -4.8883, -5.8361, -7.9597, -6.8583, -4.6028,
         -7.3392, -7.3848, -6.5887, -5.8965, -5.8692, -7.9263, -6.7758, -5.4052,
         -5.2147, -7.6892, -6.2588]], grad_fn=<CloneBackward0>)
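These scores are unnormalized logits, so they are not probabilities. If probabilities are easier to read, a softmax can be applied. The following is a minimal pure-Python sketch; the `softmax` helper is written here for illustration, and `example_scores` is just a six-value slice copied from the output above (positions 8 to 13), so the resulting probabilities are relative to that slice only, not to the whole sequence.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Slice of start_scores shown above (positions 8 to 13), for illustration only.
example_scores = [2.3760, -0.8670, -4.0859, 2.1112, 7.0353, 3.1633]

probs = softmax(example_scores)
# Index of the largest probability within the slice.
best = max(range(len(probs)), key=probs.__getitem__)
print(best)  # 4, i.e. position 12 in the full sequence
```

The softmax does not change which token has the highest score, so the argmax used in the next step gives the same index either way.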


Now, we select the start_index, which is the index of the token with the highest start score, and the end_index, which is the index of the token with the highest end score:


In [None]:
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)

In [None]:
end_index

tensor(26)
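Taking the two argmax values independently works for this example, but in general nothing guarantees that the end index comes after the start index. One common remedy, sketched below under that assumption, is to search for the pair with the highest combined score subject to start <= end. The `best_span` function, its `max_answer_len` parameter, and the toy scores are hypothetical and are not part of the transformers library.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start score + end score with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        # Only consider end positions at or after the start, within the length cap.
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


# Toy scores where the independent argmax would give start=2, end=1 (invalid).
toy_start = [0.1, 0.2, 5.0, 0.3]
toy_end = [0.0, 6.0, 0.5, 4.0]

print(best_span(toy_start, toy_end))  # (2, 3)
```

With the real model output, the lists could be obtained with `start_scores[0].tolist()` and `end_scores[0].tolist()`.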


That's it! Now, we print the text span between the start and end index as our answer:

In [None]:
print(' '.join(tokens[start_index:end_index+1]))

a system of many biological structures and processes within an organism that protects against disease
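Joining tokens with spaces reads well here because every token in the answer is a whole word. When an answer contains WordPiece continuation tokens (those starting with `##`), they would show up as separate fragments. A minimal sketch of merging them back is shown below; `join_wordpieces` is a hypothetical helper, and the example tokens are made up for illustration.

```python
def join_wordpieces(tokens):
    """Join tokens with spaces, gluing '##' continuation pieces onto the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # drop the ## marker and append to the previous word
        else:
            words.append(tok)
    return " ".join(words)


# Made-up tokens, for illustration only.
print(join_wordpieces(["path", "##ogen", "##s", "cause", "disease"]))
# pathogens cause disease
```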
