# Question Answering Using the SQuAD v2.0 Dataset
### Wyatt Cupp
#### wyattcupp@gmail.com

In [4]:
%pip install transformers



In [5]:
import numpy as np
import pandas as pd
import nltk

import os

import json

### Loading SQuAD v2.0

In [6]:
# load the train set
with open('train-v2.0.json') as data_file:
    train_json = json.load(data_file)
    
train_json['data'][0]['paragraphs'][0]['context'] #examine frist context for an example

'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'

In [7]:
# load the test set
with open('dev-v2.0.json') as f:
    test_json = json.load(f)

### Preprocessing SQuAD v2.0

In this section, I will dive into the SQuAD dataset, explore the data, and perform preprocessing on the data.

In [8]:
import transformers
from transformers import BertTokenizer, BertModel, BertForQuestionAnswering

from tqdm import tqdm

import torch
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu' # if CPU, try Google Colab

device

'cuda'

The tokenizer we will use is the `BertTokenizer`. This tokenizer is used across a variety of premade BERT models.

In [9]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=28.0, style=ProgressStyle(description_w…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=466062.0, style=ProgressStyle(descripti…




In [10]:
def char_index_mapper(context, context_tokenized):
    """
    Maps from a start str index, to a tokenized index within a list of tokens.
    """
    # map from a start index in a str, to a tokenized index in token list
    mapper = {}
    curr = ''
    token_idx = 0
    for i, char in enumerate(context):
        if char != ' ' and char != '\n' and char != '\t' and char != '\r': # make sure current char is not whitespace
            curr += char
            if curr == context_tokenized[token_idx]:
                start = i - len(curr) + 1
                for j in range(start, i+1):
                    mapper[j] = (curr, token_idx)                
                curr = ''
                token_idx += 1
    if token_idx != len(context_tokenized): # problems with doc span within the data, skip example.
        return None
            
    return mapper
            
def preprocess_data(dataset, is_training=True, tokenized=False):
    """
    Parse the json_data object into a pandas readable data representation (list of dicts)
    
    Params:
        dataset -- JSON dataset object
        tokenized -- Boolean representing whether to tokenize context & question or not (default False)
    Returns:
        examples -- list of dicts representing SQuAD dataset
    """
    
    def _tokenize(seq):
        """
        Minimizes errors between tokenizers and encodings.
        Recommended in the paper BiDAF (Seo et al., 2016)
        """
        return [t.replace("``", '"').replace("''", '"') for t in seq.split()]
    
    examples = [] # store rows of data here for qa
    
    tokenization_errors = 0
    misaligned_ans_errors = 0
    num_impossibles = 0
    num_questions = 0
    
    for article_id in tqdm(range(len(dataset['data']))): # for each context
        paragraphs = dataset['data'][article_id]['paragraphs']
        for paragraph_id in range(len(paragraphs)):
            questions = dataset['data'][article_id]['paragraphs'][paragraph_id]['qas']
            
            context = paragraphs[paragraph_id]['context']
            context_tokenized = _tokenize(context)
                    
            for qid in range(len(questions)): # loop through questions
                num_questions += 1
                
                question = questions[qid]['question']
                question_tokenized = _tokenize(question)
                qas_id = questions[qid]['id']
                
                is_impossible = questions[qid]['is_impossible']
                
                if is_impossible: # check if question is impossible to answer
                    num_impossibles += 1
                    examples.append({'qas_id': qas_id, 
                                     'question':question_tokenized if tokenized else question, 
                                     'context': context_tokenized if tokenized else context, 
                                     'answer':'', 
                                     'is_impossible': is_impossible,
                                     'start_pos': -1, 
                                     'end_pos':-1,
                                    'santiy_check': context_tokenized[-1:0]})
                    continue
                    
                # question is not impossible, continue parsing
                answers = questions[qid]['answers']
                
                for ans_id in range(len(answers)): # for each answer
                    answer = answers[ans_id]['text']
                    start_pos = answers[ans_id]['answer_start'] # inclusive start index in raw context
                    end_pos = start_pos + len(answer) #exclusive end index in raw context
                          
                    if context[start_pos:end_pos] != answer:
                        misaligned_ans_errors += 1
                        continue
                        
                    if tokenized:
                        mapper = char_index_mapper(context, context_tokenized)
                        if mapper is None:
                            tokenization_errors += 1
                            continue
                        
                        start_pos = mapper[start_pos][1]
                        end_pos = mapper[end_pos-1][1] # inclusive
                    
                    examples.append({'qas_id': qas_id, 
                                     'question':question_tokenized if tokenized else question, 
                                     'context': context_tokenized if tokenized else context, 
                                     'answer':answer, 
                                     'is_impossible': is_impossible,
                                     'start_pos': start_pos, 
                                     'end_pos':end_pos,
                                    'santiy_check': context_tokenized[start_pos:end_pos+1] if tokenized else context[start_pos:end_pos+1]})
            
                    
    print('Total number of questions:{}'.format(num_questions))
    print('Total number of impossible questions:{}'.format(num_impossibles))
    print('Number of tokenization errors:{}'.format(tokenization_errors))
    print('Number of misaligned answer errors:{}'.format(misaligned_ans_errors))
    return examples

Create the training examples:

In [11]:
training_examples = preprocess_data(train_json, tokenized=True)

100%|██████████| 442/442 [00:28<00:00, 15.29it/s]

Total number of questions:130319
Total number of impossible questions:43498
Number of tokenization errors:102
Number of misaligned answer errors:0





Now we convert to pandas dataframe to get a better look at the data:

In [12]:
df = pd.DataFrame(training_examples)
df[df.start_pos == df.end_pos][:40]

Unnamed: 0,qas_id,question,context,answer,is_impossible,start_pos,end_pos,santiy_check
2,56be85543aeaaa14008c9066,"[When, did, Beyonce, leave, Destiny's, Child, ...","[Beyoncé, Giselle, Knowles-Carter, (/biːˈjɒnse...",2003,False,82,82,"[(2003),]"
11,56d43c5f2ccc5a1400d830ac,"[When, did, Beyoncé, release, Dangerously, in,...","[Beyoncé, Giselle, Knowles-Carter, (/biːˈjɒnse...",2003,False,82,82,"[(2003),]"
12,56d43c5f2ccc5a1400d830ad,"[How, many, Grammy, awards, did, Beyoncé, win,...","[Beyoncé, Giselle, Knowles-Carter, (/biːˈjɒnse...",five,False,92,92,[five]
15,56be86cf3aeaaa14008c9076,"[After, her, second, solo, album,, what, other...","[Following, the, disbandment, of, Destiny's, C...",acting,False,30,30,"[acting,]"
17,56be86cf3aeaaa14008c9079,"[To, set, the, record, for, Grammys,, how, man...","[Following, the, disbandment, of, Destiny's, C...",six,False,87,87,[six]
18,56bf6e823aeaaa14008c9627,"[For, what, movie, did, Beyonce, receive, her,...","[Following, the, disbandment, of, Destiny's, C...",Dreamgirls,False,37,37,[Dreamgirls]
19,56bf6e823aeaaa14008c9629,"[When, did, Beyonce, take, a, hiatus, in, her,...","[Following, the, disbandment, of, Destiny's, C...",2010,False,91,91,"[2010,]"
20,56bf6e823aeaaa14008c962a,"[Which, album, was, darker, in, tone, from, he...","[Following, the, disbandment, of, Destiny's, C...",Beyoncé,False,26,26,[Beyoncé]
23,56d43da72ccc5a1400d830be,"[What, was, the, name, of, Beyoncé's, second, ...","[Following, the, disbandment, of, Destiny's, C...",B'Day,False,15,15,[B'Day]
24,56d43da72ccc5a1400d830bf,"[What, was, Beyoncé's, first, acting, job,, in...","[Following, the, disbandment, of, Destiny's, C...",Dreamgirls,False,37,37,[Dreamgirls]


In [13]:
def examples_to_feats(examples, tokenizer, max_len, max_query_len):
    """
    Converts examples of data into BERT input format tensors.
    """
    context_length_errors = 0
    feats = [] # list of dicts containing feature data
    for ex in tqdm(examples):
        question_raw = ' '.join(ex['question'])
        context_raw = ' '.join(ex['context'])
        if len(question_raw) > max_query_len: # check length of question, truncate if necessary
            question_raw = question_raw[:max_query_len]
        
        # encode the data using the BERT tokenizer
        encoded = tokenizer.encode_plus(question_raw, context_raw,
                                        max_length=max_len,
                                        padding='max_length',
                                        truncation=True,
                                        return_token_type_ids=True)
        if ex['is_impossible']:
            start = -1
            end = -1
        else: # Adjust the start_pos and end_pos to align with new tokenized question+context encodings
            input_ids = encoded['input_ids']
            answer_ids = tokenizer.encode(ex['answer']) # get token ids for answer to compare
            start, end = 0, 0 # defaults to this, if encode_plus performed truncation which included answer
            for i in range(len(input_ids)):
                if input_ids[i: i+len(answer_ids[1:-1])] == answer_ids[1:-1]:
                    start = i
                    end = i + len(answer_ids[1:-1]) - 1
                    break # found start and end pos ids in context
            
        ids = encoded['input_ids']
        token_type_ids = encoded['token_type_ids']
        mask = encoded['attention_mask']
        
        assert len(ids) == max_len
        assert len(token_type_ids) == max_len
        assert len(mask) == max_len
        
        feats.append({'ids': ids, # X
                      'token_type_ids': token_type_ids, # X
                      'mask': mask, # X
                      'start_pos': start, # y (target)
                      'end_pos': end}) # y (target)        
    return feats

### BERT Fine Tuning for the SQuAD v2.0 Dataset

We now must fine tune a BERT model using our training dataset:

In [22]:
# PROPERTIES
EPOCHS = 3
MAX_SEQ_LENGTH = 256
MAX_QUESTION_LENGTH = 64
BATCH_SIZE = 16
NUM_OUT = 2 # our output will be probability distribution over start token and end token
LEARNING_RATE = 2e-05

### Loading the Training and Test Datasets

In [23]:
from torch.utils.data import TensorDataset, DataLoader

Load the train dataset:

In [24]:
# file names for pickled features
training_feats_file = 'training_features'
testing_feats_file = 'test_features'

In [34]:
if not os.path.exists(training_feats_file):
    training_features = examples_to_feats(training_examples, tokenizer, MAX_SEQ_LENGTH, MAX_QUESTION_LENGTH)
    torch.save(training_features, training_feats_file)
    print('Saved training_features to pickle (file): {}'.format(training_feats_file))

else:
    training_features = torch.load(training_feats_file)
    print('Training features loaded from file: {}\nNumber of features:{}'.format(training_feats_file, len(training_features)))

all_input_ids = torch.tensor([f['ids'] for f in training_features], dtype=torch.long)
all_input_mask = torch.tensor([f['mask'] for f in training_features], dtype=torch.long)
all_segment_ids = torch.tensor([f['token_type_ids'] for f in training_features], dtype=torch.long)


all_start_positions = torch.tensor([f['start_pos'] for f in training_features], dtype=torch.long)
all_end_positions = torch.tensor([f['end_pos'] for f in training_features], dtype=torch.long)
dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
                        all_start_positions, all_end_positions)

train_params = train_params = {'batch_size': BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(dataset, **train_params)

[A
 99%|█████████▉| 129376/130217 [10:04<00:04, 176.20it/s][A
 99%|█████████▉| 129395/130217 [10:04<00:04, 179.70it/s][A
 99%|█████████▉| 129417/130217 [10:04<00:04, 189.50it/s][A
 99%|█████████▉| 129437/130217 [10:04<00:04, 185.82it/s][A
 99%|█████████▉| 129456/130217 [10:04<00:04, 182.85it/s][A
 99%|█████████▉| 129475/130217 [10:04<00:04, 176.77it/s][A
 99%|█████████▉| 129493/130217 [10:05<00:04, 164.91it/s][A
 99%|█████████▉| 129510/130217 [10:05<00:05, 133.63it/s][A
 99%|█████████▉| 129526/130217 [10:05<00:04, 139.53it/s][A
 99%|█████████▉| 129545/130217 [10:05<00:04, 150.83it/s][A
 99%|█████████▉| 129564/130217 [10:05<00:04, 160.61it/s][A
100%|█████████▉| 129587/130217 [10:05<00:03, 176.09it/s][A
100%|█████████▉| 129611/130217 [10:05<00:03, 190.19it/s][A
100%|█████████▉| 129632/130217 [10:05<00:03, 191.42it/s][A
100%|█████████▉| 129653/130217 [10:05<00:02, 195.39it/s][A
100%|█████████▉| 129674/130217 [10:06<00:02, 198.98it/s][A
100%|█████████▉| 129695/130217 [10:0

Saved training_features to pickle (file): training_features


Load the test dataset:

In [26]:
test_examples = preprocess_data(test_json, tokenized=True)

100%|██████████| 35/35 [00:06<00:00,  5.43it/s]

Total number of questions:11873
Total number of impossible questions:5945
Number of tokenization errors:15
Number of misaligned answer errors:0





In [40]:
if not os.path.exists(testing_feats_file):
    test_features = examples_to_feats(test_examples, tokenizer, MAX_SEQ_LENGTH, MAX_QUESTION_LENGTH)
    torch.save(test_features, testing_feats_file)
    print('Saved test_features to pickle (file): {}'.format(testing_feats_file))
else:
    test_features = torch.load(testing_feats_file)
    print('Testing features loaded from file: {}\nNumber of features:{}'.format(testing_feats_file, len(test_features)))

all_input_ids = torch.tensor([f['ids'] for f in test_features], dtype=torch.long)
all_input_mask = torch.tensor([f['mask'] for f in test_features], dtype=torch.long)
all_segment_ids = torch.tensor([f['token_type_ids'] for f in test_features], dtype=torch.long)

all_start_positions = torch.tensor([f['start_pos'] for f in test_features], dtype=torch.long)
all_end_positions = torch.tensor([f['end_pos'] for f in test_features], dtype=torch.long)

test_dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, 
                             all_start_positions, all_end_positions)

test_params = {'batch_size': BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }    

testing_loader = DataLoader(test_dataset, **train_params)

Testing features loaded from file: test_features
Number of features:26232


In [80]:
def train(model, training_loader, optimizer):
    print('Starting training...')
    step = 0
    loss_tracker = 0
    model.zero_grad()
    model.train()
    for data in tqdm(training_loader):
        data = tuple(d.to(device) for d in data)
        inputs = {'input_ids':     data[0],
                'attention_mask':  data[1], 
                'token_type_ids':  data[2],  
                'start_positions': data[3], 
                'end_positions':   data[4]}
        outputs = model(**inputs)
        loss = outputs[0]
        loss.backward() # back propagation
        optimizer.step()
        model.zero_grad()
        loss_tracker += loss.item()
        step += 1
        if step % 100 == 0:
            print("Train loss: {}".format(loss_tracker/step))

    return loss

def validation(model, testing_loader):
    print('Starting validation...')
    model.eval()
    preds = []
    targs = []
    with torch.no_grad():
        for data in tqdm(testing_loader):
            data = tuple(d.to(device) for d in data)
            inputs = {'input_ids': data[0],
                'attention_mask':  data[1], 
                'token_type_ids':  data[2]}
            output = model(**inputs)
            starts = output[0]
            ends = output[1]
            start_preds = []
            end_preds = []

            for s in starts:
                start_pred = torch.argmax(s)
                start_preds.append(start_pred)
            for e in ends:
                end_pred = torch.argmax(e)
                end_preds.append(end_pred)
            
            for s, e in zip(start_preds, end_preds):
                preds.append((s,e))

            target_starts = data[3]
            target_ends = data[4]

            for s,e in zip(target_starts, target_ends):
                targs.append((s,e))
            
    return preds, targs

### Fine Tuning the BertForQuestionAnswering Model

In [36]:
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_

In [37]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

In [38]:
for epoch in range(EPOCHS): #TRAINING
    loss = train(model, training_loader, optimizer)
    print(f'Epoch: {epoch}, Loss:  {loss.item()}')  


 99%|█████████▉| 8096/8139 [1:48:44<00:34,  1.24it/s][A
 99%|█████████▉| 8097/8139 [1:48:45<00:33,  1.24it/s][A
 99%|█████████▉| 8098/8139 [1:48:46<00:33,  1.24it/s][A
100%|█████████▉| 8099/8139 [1:48:46<00:32,  1.24it/s][A
100%|█████████▉| 8100/8139 [1:48:47<00:31,  1.24it/s][A

Train loss: 1.4102776688171759



100%|█████████▉| 8101/8139 [1:48:48<00:30,  1.24it/s][A
100%|█████████▉| 8102/8139 [1:48:49<00:29,  1.24it/s][A
100%|█████████▉| 8103/8139 [1:48:50<00:29,  1.24it/s][A
100%|█████████▉| 8104/8139 [1:48:51<00:28,  1.24it/s][A
100%|█████████▉| 8105/8139 [1:48:51<00:27,  1.24it/s][A
100%|█████████▉| 8106/8139 [1:48:52<00:26,  1.23it/s][A
100%|█████████▉| 8107/8139 [1:48:53<00:25,  1.23it/s][A
100%|█████████▉| 8108/8139 [1:48:54<00:25,  1.24it/s][A
100%|█████████▉| 8109/8139 [1:48:55<00:24,  1.23it/s][A
100%|█████████▉| 8110/8139 [1:48:55<00:23,  1.24it/s][A
100%|█████████▉| 8111/8139 [1:48:56<00:22,  1.23it/s][A
100%|█████████▉| 8112/8139 [1:48:57<00:21,  1.24it/s][A
100%|█████████▉| 8113/8139 [1:48:58<00:20,  1.24it/s][A
100%|█████████▉| 8114/8139 [1:48:59<00:20,  1.24it/s][A
100%|█████████▉| 8115/8139 [1:48:59<00:19,  1.24it/s][A
100%|█████████▉| 8116/8139 [1:49:00<00:18,  1.24it/s][A
100%|█████████▉| 8117/8139 [1:49:01<00:17,  1.24it/s][A
100%|█████████▉| 8118/8139 [1:

Epoch: 0, Loss:  0.9089405536651611





Save the model with new weights for future use:

In [76]:
model_pretrained = model.module if hasattr(model, 'module') else model
model_pretrained.save_pretrained('BERTQAPretrained')

### Validation on the Test Dataset

In [82]:
preds, targs = validation(model, testing_loader)
preds



 82%|████████▏ | 1339/1640 [05:49<01:22,  3.64it/s][A[A

 82%|████████▏ | 1340/1640 [05:49<01:21,  3.67it/s][A[A

 82%|████████▏ | 1341/1640 [05:49<01:21,  3.66it/s][A[A

 82%|████████▏ | 1342/1640 [05:50<01:21,  3.67it/s][A[A

 82%|████████▏ | 1343/1640 [05:50<01:21,  3.66it/s][A[A

 82%|████████▏ | 1344/1640 [05:50<01:20,  3.67it/s][A[A

 82%|████████▏ | 1345/1640 [05:50<01:20,  3.66it/s][A[A

 82%|████████▏ | 1346/1640 [05:51<01:20,  3.66it/s][A[A

 82%|████████▏ | 1347/1640 [05:51<01:20,  3.65it/s][A[A

 82%|████████▏ | 1348/1640 [05:51<01:19,  3.66it/s][A[A

 82%|████████▏ | 1349/1640 [05:52<01:19,  3.66it/s][A[A

 82%|████████▏ | 1350/1640 [05:52<01:19,  3.66it/s][A[A

 82%|████████▏ | 1351/1640 [05:52<01:18,  3.66it/s][A[A

 82%|████████▏ | 1352/1640 [05:52<01:18,  3.67it/s][A[A

 82%|████████▎ | 1353/1640 [05:53<01:18,  3.65it/s][A[A

 83%|████████▎ | 1354/1640 [05:53<01:18,  3.66it/s][A[A

 83%|████████▎ | 1355/1640 [05:53<01:18,  3.65it/s][A

[(tensor(65, device='cuda:0'), tensor(0, device='cuda:0')),
 (tensor(53, device='cuda:0'), tensor(57, device='cuda:0')),
 (tensor(0, device='cuda:0'), tensor(0, device='cuda:0')),
 (tensor(0, device='cuda:0'), tensor(0, device='cuda:0')),
 (tensor(0, device='cuda:0'), tensor(0, device='cuda:0')),
 (tensor(14, device='cuda:0'), tensor(14, device='cuda:0')),
 (tensor(0, device='cuda:0'), tensor(0, device='cuda:0')),
 (tensor(24, device='cuda:0'), tensor(24, device='cuda:0')),
 (tensor(142, device='cuda:0'), tensor(145, device='cuda:0')),
 (tensor(0, device='cuda:0'), tensor(0, device='cuda:0')),
 (tensor(26, device='cuda:0'), tensor(26, device='cuda:0')),
 (tensor(0, device='cuda:0'), tensor(0, device='cuda:0')),
 (tensor(0, device='cuda:0'), tensor(123, device='cuda:0')),
 (tensor(41, device='cuda:0'), tensor(41, device='cuda:0')),
 (tensor(14, device='cuda:0'), tensor(15, device='cuda:0')),
 (tensor(49, device='cuda:0'), tensor(73, device='cuda:0')),
 (tensor(31, device='cuda:0'), tens

In [83]:
torch.save(preds, 'predictions')
torch.save(targs, 'targets')

In [88]:
assert len(preds) == len(test_features) # ensure complete results after validation for accuracy comparison

(26232, 26232)

In [105]:
import string
def normalize(tokens):
    """
    Used to normalize tokenized answers for comparison in our accuracy evaluation.
    """
    ans = ' '.join(tokens)
    ans = ''.join(char for char in ans.lower() if char not in set(string.punctuation))
    ans = ' '.join(ans.split())
    return ans

Now we can iterate over our results and compute our accuracy score. 

Accuracy will be computed the following way (may change in the future):

- First, we normalize the predicted answer and the true answer.
- We then compare and check if the predicted answer is *at least* contained inside of the true answer, resulting in an accuracy point

Future evaluation will be calculated by running the SQuAD test evaluation script available at:
https://rajpurkar.github.io/SQuAD-explorer/

In [110]:
# calculate accuracy based on if pred is in answer at all:
count = 0
for i, ((s,e), (st,et)) in enumerate(zip(preds,targs)):
    start_pred = s.item()
    end_pred = e.item()
  
    start_true = st.item()
    end_true = et.item()

    pred_answer = tokenizer.convert_ids_to_tokens(test_features[i]['ids'][start_pred:end_pred+1])
    true_answer = tokenizer.convert_ids_to_tokens(test_features[i]['ids'][start_true:end_true+1])

    if normalize(pred_answer) in normalize(true_answer):
    count += 1
    
    # elif normalize(true_answer)in normalize(pred_answer):
    #   count += 1

print('Accuracy Score: {}'.format(count/ len(test_features)))

Accuracy Score: 0.46187862153095455


As you can see, when we evaluate accuracy using the *predicted answer in true answer* approach, we get **~46%** accuracy. When we stretch acceptance criteria a little further (seen in the commented out code above), we can get upwards of **~77%** accuracy.

### Test on a Random Question

We will use a sample input question about BERT itself, found as an example [here](https://mccormickml.com/2020/03/10/question-answering-with-a-fine-tuned-BERT/#bert-input-format).

In [113]:
# Using an example input question seen at: https://mccormickml.com/2020/03/10/question-answering-with-a-fine-tuned-BERT/#bert-input-format
question = "How many parameters does BERT-large have?"
answer_text = "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance."

input_ids = tokenizer.encode(question, answer_text)
tokens = tokenizer.convert_ids_to_tokens(input_ids)

sep_index = input_ids.index(tokenizer.sep_token_id)
num_seg_a = sep_index + 1

num_seg_b = len(input_ids) - num_seg_a

segment_ids = [0]*num_seg_a + [1]*num_seg_b

assert len(segment_ids) == len(input_ids)

result = model(torch.tensor([input_ids]).to(device), # The tokens representing our input text.
                                 token_type_ids=torch.tensor([segment_ids]).to(device)) # The segment IDs to differentiate question from answer_text

# Find the tokens with the highest `start` and `end` scores.
answer_start = torch.argmax(result[0])
answer_end = torch.argmax(result[1])

# Combine the tokens in the answer and print it out.
answer = ' '.join(tokens[answer_start:answer_end+1])

print('Answer: "' + answer + '"')

Answer: "340 ##m"


It appears our fine-tuned BERT model answered **correctly**.