## Question-Answering with BERT test

I implement and fine-tune a DistilBERT model for question answering on a subset of the [SQuAD v2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset. The work has two steps:
1. Convert the data to tensors using the BERT tokenizer
2. Train a question-answering model by fine-tuning a pre-trained BERT model

I use DistilBERT in this lab because it is significantly smaller and faster than BERT while performing similarly. For brevity, I refer to it as BERT throughout this notebook.

I run this on Google Colab with a GPU backend.

In [1]:
!pip install pulp
!pip install transformers
from google.colab import drive
drive.mount('/content/gdrive')

Collecting pulp
  Downloading PuLP-2.4-py3-none-any.whl (40.6MB)
Collecting amply>=0.1.2
  Downloading amply-0.1.4-py3-none-any.whl
Installing collected packages: amply, pulp
Successfully installed amply-0.1.4 pulp-2.4
Collecting transformers
  Downloading transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)

## Getting Started

Import the relevant modules.

In [1]:
import numpy as np
import torch
import pulp
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score
import os

In [3]:
# seed the pseudo-random number generators for reproducibility
import torch
manual_seed = 77
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)

cuda


## Download the SQuAD data from [google drive](https://drive.google.com/file/d/1tzpxoIW9ES33nUN_jouBaOfockXodWTX/view?usp=sharing)

In [4]:
squad_path = '/content/gdrive/MyDrive/Colab Notebooks/data/squad/'

## Convert data to BERT tensors

In [5]:

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')


def convert_to_BERT_tensors(questions, contexts):
    '''takes parallel lists of question strings and context strings (or a single
    question/context pair) and returns the padded input ids and the attention mask'''
    # pad or truncate every question/context pair to exactly 512 tokens
    tok = tokenizer(questions, contexts, padding='max_length', truncation=True, max_length=512, return_tensors="pt")
    return tok['input_ids'], tok['attention_mask']


In [6]:
test_questions = ["Why?", "How?"]
test_contexts = ["I think it is because we can bluminate", "It was done " + " ".join(["very"]*1000) + " well"]

ids, mask = convert_to_BERT_tensors(test_questions,test_contexts)
assert ids.shape == (2,512) # 512 because that's the max allowed
assert ids[0][3] == 102 # fourth token is separator
assert list(ids[0][-100:]) == [0]*100 # first row is mostly padding
assert list(ids[1][-100:]) != [0]*100 # second row is not
assert list(mask[0][-100:]) == [0]*100 # first row padding is masked
assert list(mask[1][-100:]) != [0]*100 # second row is not padding, no mask
print("Success!")

Success!
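
To make the assertions above concrete, here is a minimal sketch of the layout produced for one question/context pair: `[CLS] question [SEP] context [SEP]` followed by padding, with a mask that is 1 on real tokens and 0 on padding. The helper `pack_pair` and the toy token ids are hypothetical. The special ids 101, 102 and 0 are assumed to be those of `distilbert-base-uncased` (the test above confirms 102 and 0). Truncation here simply cuts the context, which is a simplification of what the real tokenizer does.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids


def pack_pair(question_ids, context_ids, max_length=512):
    """Toy pair encoding: returns (input_ids, attention_mask) as lists.

    Assumes the question alone fits within max_length.
    """
    # Reserve room for [CLS], the question, and the two [SEP] tokens.
    room = max_length - len(question_ids) - 3
    context_ids = context_ids[:max(room, 0)]
    ids = [CLS_ID] + question_ids + [SEP_ID] + context_ids + [SEP_ID]
    mask = [1] * len(ids)
    # Pad both lists on the right up to max_length.
    pad = max_length - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad


# Arbitrary toy ids stand in for real vocabulary entries.
ids, mask = pack_pair([2339, 1029], [1045, 2228, 2009], max_length=10)
print(ids)
print(mask)
```

With a two-token question the separator lands at index 3, which is what `assert ids[0][3] == 102` checks in the cell above.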


### Create question and answer spans

In [53]:
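def get_answer_span_tensor(question,context,answer):
    '''returns a tensor [start, end] of token indices locating the answer in the
    tokenized "cls question sep context" sequence, or [0, 0] if it is not found'''

    # "cls" and "sep" are plain-word placeholders occupying the positions of [CLS] and [SEP]
    ques_context = tokenizer.tokenize(question + " sep " + context)
    answer = tokenizer.tokenize(answer)
    ques_context.insert(0, "cls")
    ques_context = ques_context[:512]

    # join tokens with "*" so the answer can be located with a substring search
    match_str = "*".join(answer)
    context_str = "*".join(ques_context)
    
    start_index = context_str.find(match_str)
    
    # empty answer, or answer not found (e.g. truncated away): point at position 0
    if start_index == -1 or not match_str:
        return torch.tensor([0, 0])
        
    end_index = start_index + len(match_str) - 1
    
    # convert character offsets to token indices by counting "*" separators
    start_token, end_token = 0, 0
    token = 0
    for index,char in enumerate(context_str):
        if char == "*":
            token+=1

        if index == start_index:
            start_token = token

        if index == end_index:    
            end_token = token

    return torch.tensor([start_token, end_token])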


In [8]:
test_question = "Why?"
test_context1 = "I think it is because we can bluminate"
test_context2 = "I think it is because we can because we can bluminate"

test_answer = "because we can bluminate"
bad_answer  = "because we can fumiage"
span = get_answer_span_tensor(test_question,test_context1,test_answer)

assert span.shape == (2,)
assert list(span) == [8,12]

span = get_answer_span_tensor(test_question,test_context2,test_answer)
print(span)
assert list(span) == [11,15]

span = get_answer_span_tensor(test_question,test_context1,bad_answer)
assert list(span) == [0,0]

test_context = """after a three-year hiatus , a fifth digimon series began airing on april 2 , 2006 . like frontier , savers has no connection with the previous installments , and also marks a new start for the digimon franchise , with a drastic change in character designs and story-line , in order to reach a broader audience . the story focuses on the challenges faced by the members of d.a.t.s . ( " digital accident tactics squad " ) , an organization created to conceal the existence of the digital world and digimon from the rest of mankind , and secretly solve any digimon-related incidents occurring on earth . later the d.a.t.s . is dragged into a massive conflict between earth and the digital world , triggered by an ambitious human scientist named akihiro kurata , determined to make use of the digimon for his own personal gains . the english version was dubbed by studiopolis and it premiered on the jetix block on toon disney on october 1 , 2007 . digivolution in data squad requires the human partner 's dna ( " digital natural ability " in the english version and " digisoul " in the japanese version ) to activate , a strong empathy with their digimon and a will to succeed . 'digimon savers ' also introduces a new form of digivolving called burst mode which is essentially the level above mega ( previously the strongest form a digimon could take ) . like previously in tamers , this plot takes on a dark tone throughout the story and the anime was aimed , originally in japan , at an older audience consisting of late teens and people in their early twenties from ages 16 to 21 . because of that , along with the designs , the anime being heavily edited and localized for western us audiences like past series , and the english dub being aimed mostly toward younger audiences of children aged 6 to 10 and having a lower tv-y7-fv rating just like past dubs , studiopolis dubbed the anime on jetix with far more edits , changes , censorship , and cut footage . 
this included giving the japanese characters full americanized names and american surnames as well as applying far more americanization ( marcus damon as opposed to the japanese daimon masaru ) , cultural streamlining and more edits to their version similar to the changes 4kids often made ( such as removal of japanese text for the purpose of cultural streamlining ) . despite all that , the setting of the country was still in japan and the characters were japanese in the dub . this series was the first to show any japanese cultural concepts that were unfamiliar with american audiences ( such as the manju ) , which were left unedited and used in the english dub . also despite the heavy censorship and the english dub aimed at young children , some of the digimon 's attacks named after real weapons such as rizegreymon 's trident revolver are not edited and used in the english dub . well go usa released it on dvd instead of disney . the north american english dub was televised on jetix in the u.s. and on the family channel in canada ."""
test_question = "what was the original target age for the digimon series ?"
test_answer = "children aged 6 to 10"
span = get_answer_span_tensor(test_question,test_context,test_answer)
print(span)


test_question = "when did universal inaugurate its studio tour subsidiary ?"
test_context = "the long-awaited takeover of universal pictures by mca , inc. happened in mid-1962 as part of the mca-decca records merger . the company reverted in name to universal pictures . as a final gesture before leaving the talent agency business , virtually every mca client was signed to a universal contract . in 1964 mca formed universal city studios , inc. , merging the motion pictures and television arms of universal pictures company and revue productions ( officially renamed as universal television in 1966 ) . and so , with mca in charge , universal became a full-blown , a-film movie studio , with leading actors and directors under contract ; offering slick , commercial films ; and a studio tour subsidiary launched in 1964 . television production made up much of the studio 's output , with universal heavily committed , in particular , to deals with nbc ( which later merged with universal to form nbc universal ; see below ) providing up to half of all prime time shows for several seasons . an innovation during this period championed by universal was the made-for-television movie ."
test_answer = "1964"
span = get_answer_span_tensor(test_question,test_context,test_answer)
print(span)

print('Success!')


tensor([11, 15])
tensor([395, 399])
tensor([74, 74])
Success!
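
One limitation of locating the answer with a substring search over `*`-joined tokens is that a match can begin inside a longer token (for example `can` inside `cannot`). A minimal sketch of an alternative that compares whole tokens instead; `find_token_span` is a hypothetical helper, not part of the original notebook, and the token list is made up for illustration rather than produced by the tokenizer.

```python
def find_token_span(tokens, answer_tokens):
    """Return (start, end) token indices of the first exact match, or (0, 0)."""
    n, m = len(tokens), len(answer_tokens)
    if m == 0:
        return (0, 0)
    # Slide a window of len(answer_tokens) over the sequence and compare whole tokens.
    for start in range(n - m + 1):
        if tokens[start:start + m] == answer_tokens:
            return (start, start + m - 1)
    return (0, 0)


# Hand-written tokens for illustration only.
tokens = ["cls", "why", "?", "sep", "we", "cannot", "but", "they", "can"]
print(find_token_span(tokens, ["can"]))          # (8, 8)
print(find_token_span(tokens, ["they", "can"]))  # (7, 8)
print(find_token_span(tokens, ["never"]))        # (0, 0)
```

On the same input, a substring search for `can` would stop at the token `cannot` (index 5), whereas the token-level comparison only accepts index 8.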


### Dataloader QAdataset

In [9]:
batch_size = 16

class QAdataset(Dataset):
    '''A dataset for housing QA data, including input_data, output_data, and padding mask'''
    def __init__(self, input_data, output_data,mask):
        self.input_data = input_data
        self.output_data = output_data
        self.mask = mask
        
    def __len__(self):
        return len(self.input_data)
    
    def __getitem__(self, index):
        target = self.output_data[index]
        data_val = self.input_data[index]
        mask = self.mask[index]
        return data_val,target,mask 

In [10]:
train_files = ['train/' + filename for filename in os.listdir(squad_path + 'train')]
dev_files = ['dev/' + filename for filename in os.listdir(squad_path + 'dev')]
test_files = ['test/' + filename for filename in os.listdir(squad_path + 'test')]
train_files, dev_files, test_files = sorted(train_files), sorted(dev_files), sorted(test_files)
dev_and_train = sorted(train_files + dev_files)

In [11]:
train_files

['train/train.answer',
 'train/train.context',
 'train/train.question',
 'train/train.span']

In [12]:
def read_SQUAD_data(squad_files):
    '''reads the parallel SQuAD files in squad_files (.context, .question, .answer, .span),
    one example per line, and returns a tuple of lists: (contexts, questions, answers, spans)'''
    questions = []
    answers = []
    spans = []
    context = []

    for file in squad_files:
        file_path = squad_path + file
        # the with-block closes each file once it has been read
        with open(file_path,'r') as f:
            for line in f:
                if file.endswith(".question"):
                    questions.append(line)
                elif file.endswith(".answer"):
                    answers.append(line)
                elif file.endswith(".span"):
                    spans.append(line)
                elif file.endswith(".context"):
                    context.append(line)

    return context, questions, answers, spans
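
Because the four lists are later zipped together, a silent length mismatch would misalign questions and answers. A minimal sketch of a sanity check, assuming one example per line in each file; `check_parallel` is a hypothetical helper, not part of the original notebook.

```python
def check_parallel(**columns):
    """Raise ValueError unless all named lists have the same length; return that length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("SQuAD files are not parallel: %s" % lengths)
    # All lengths agree (or no columns were given).
    return next(iter(lengths.values()), 0)


n_examples = check_parallel(
    contexts=["c1", "c2"],
    questions=["q1", "q2"],
    answers=["a1", "a2"],
    spans=["0 1", "3 4"],
)
print(n_examples)
```

It could be called on the four lists returned by `read_SQUAD_data` before any tensors are built.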

In [13]:
def convert2tensors(questions,contexts,answers,spans):
  id_tensors = []
  span_tensors = []
  mask_tensors = []
  count = 0
  for question, context, span, answer in zip(questions, contexts, spans, answers):
    count +=1
    if count % 10000   == 0:
      print("Processing row %s of %s" % (count, len(questions)),end="\n")
    ids, mask = convert_to_BERT_tensors(question,context)
    span = get_answer_span_tensor(question, context, answer)
    if list(span)[0] > 512 or list(span)[1] > 512:
      print("WARNING on ",question)

    id_tensors.append(ids)
    span_tensors.append(span)
    mask_tensors.append(mask)

  return id_tensors, span_tensors, mask_tensors

In [14]:
def prepare_dataset(files):
    '''given a list of SQuAD file names, reads the files, converts them to tensors, and loads them into a pytorch Dataset'''
    contexts, questions, answers, spans = read_SQUAD_data(files)
    id_tensors, span_tensors, mask_tensors = convert2tensors(questions,contexts,answers,spans)
    return QAdataset(id_tensors, span_tensors, mask_tensors)


In [15]:
train_dataset = prepare_dataset(train_files)

Processing row 10000 of 77558
Processing row 20000 of 77558
Processing row 30000 of 77558
Processing row 40000 of 77558
Processing row 50000 of 77558
Processing row 60000 of 77558
Processing row 70000 of 77558


In [16]:
assert len(train_dataset) == 77558

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
for train_ids_batch, train_span_batch, train_mask_batch in train_dataloader:
    sample_input = train_ids_batch
    sample_output = train_span_batch
    break

print('Success!')

Success!


In [17]:
sample_output

tensor([[ 23,  24],
        [204, 220],
        [ 96, 102],
        [ 53,  60],
        [ 46,  48],
        [123, 125],
        [ 81,  82],
        [231, 232],
        [ 35,  43],
        [110, 111],
        [ 72,  98],
        [135, 141],
        [ 44,  47],
        [ 21,  21],
        [ 62,  64],
        [ 25,  26]])

## BERT Training 


In [18]:
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this mode

In [19]:
loss_function = nn.CrossEntropyLoss()
EPOCHS = 1
LEARNING_RATE = 0.00003
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
model = model.to(device)

In [None]:
#torch.cuda.empty_cache()

In [20]:
count = 0
for epoch in range(EPOCHS):
    tot_loss = 0
    count = 0
    for train_ids_batch, train_span_batch, train_mask_batch in train_dataloader:
        count +=1
        if count % 100 == 0:
          print("Processing row %s of %s" % (count, len(train_dataloader)),end="\n")

        model.zero_grad()
        # column 0 holds the gold start index, column 1 the gold end index
        targets_start = train_span_batch[:,0]
        targets_end = train_span_batch[:,1]
        # each item is stored with shape (1, 512), so squeeze the batch to (batch_size, 512)
        output = model(train_ids_batch.squeeze(1).to(device),train_mask_batch.squeeze(1).to(device))
        try:
          loss_start = loss_function(output.start_logits.cpu(), targets_start)
          loss_end = loss_function(output.end_logits.cpu(), targets_end)

          # total loss is the sum of the start-position and end-position losses
          loss = loss_start + loss_end
          loss.backward()
          optimizer.step()
          tot_loss += loss.item()
        except Exception as e:
          # e.g. a target index outside the valid range; report it and skip the batch
          print("EXCEPTION", e)
          print("output.start_logits",output.start_logits)
          print("output.end_logits",output.end_logits)
    print()
    avg_loss = tot_loss/len(train_dataloader)
    print("EPOCH %u: AVG LOSS PER BATCH: %.5f" % (epoch+1,avg_loss))


Processing row 100 of 4848
Processing row 200 of 4848
Processing row 300 of 4848
Processing row 400 of 4848
Processing row 500 of 4848
Processing row 600 of 4848
Processing row 700 of 4848
Processing row 800 of 4848
Processing row 900 of 4848
Processing row 1000 of 4848
Processing row 1100 of 4848
Processing row 1200 of 4848
Processing row 1300 of 4848
Processing row 1400 of 4848
Processing row 1500 of 4848
Processing row 1600 of 4848
Processing row 1700 of 4848
Processing row 1800 of 4848
Processing row 1900 of 4848
Processing row 2000 of 4848
Processing row 2100 of 4848
Processing row 2200 of 4848
Processing row 2300 of 4848
Processing row 2400 of 4848
Processing row 2500 of 4848
Processing row 2600 of 4848
Processing row 2700 of 4848
Processing row 2800 of 4848
Processing row 2900 of 4848
Processing row 3000 of 4848
Processing row 3100 of 4848
Processing row 3200 of 4848
Processing row 3300 of 4848
Processing row 3400 of 4848
Processing row 3500 of 4848
Processing row 3600 of 4848
P
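
The training objective treats the start position and the end position as two separate classification problems over the token positions and adds their cross-entropy losses. A minimal worked sketch for a single example with four positions, in plain Python so the arithmetic is visible; the logits and gold indices are made-up values, and `cross_entropy` is a hypothetical helper written here for illustration.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target position for one example."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# Made-up scores over four token positions.
start_logits = [0.2, 2.0, 0.1, -1.0]
end_logits = [0.0, 0.3, 1.5, -0.5]
gold_start, gold_end = 1, 2

# Span loss = start loss + end loss, mirroring loss_start + loss_end in the loop above.
loss = cross_entropy(start_logits, gold_start) + cross_entropy(end_logits, gold_end)
print(round(loss, 4))
```

`nn.CrossEntropyLoss` computes the same per-example quantity and, with its default reduction, averages it over the batch.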

In [21]:
model.save_pretrained("/content/gdrive/MyDrive/Colab Notebooks/563_Lab_3/finetuned_1new")
tokenizer.save_pretrained("/content/gdrive/MyDrive/Colab Notebooks/563_Lab_3/finetuned_1new")

('/content/gdrive/MyDrive/Colab Notebooks/563_Lab_3/finetuned_1new/tokenizer_config.json',
 '/content/gdrive/MyDrive/Colab Notebooks/563_Lab_3/finetuned_1new/special_tokens_map.json',
 '/content/gdrive/MyDrive/Colab Notebooks/563_Lab_3/finetuned_1new/vocab.txt',
 '/content/gdrive/MyDrive/Colab Notebooks/563_Lab_3/finetuned_1new/added_tokens.json')

In [22]:
#### Only Run to load the model #####

model = DistilBertForQuestionAnswering.from_pretrained('/content/gdrive/MyDrive/Colab Notebooks/563_Lab_3/finetuned_1new')
#model.load_state_dict(torch.load('/content/gdrive/MyDrive/Colab Notebooks/data/ckpt/563lab3_1.pt')['state_dict'])

### Accuracy on Dev

In [23]:
dev_dataset = prepare_dataset(dev_files)
dev_dataloader = DataLoader(dev_dataset, batch_size=batch_size, shuffle=False)

In [24]:
all_dev_pred_start = []
all_dev_start = []
count = 0
all_dev_pred_end = []
all_dev_end = []
model.eval()
model = model.to(device)

with torch.no_grad():
  for dev_ids_batch, dev_span_batch, dev_mask_batch in dev_dataloader:
    count +=1
    if count % 100 == 0:
      print("Processing row %s of %s" % (count, len(dev_dataloader)),end="\n")

    dev_spans = model(dev_ids_batch.squeeze(1).to(device),dev_mask_batch.squeeze(1).to(device))

    # the predicted index is the position with the highest logit
    pred_start = tuple(torch.argmax(dev_spans.start_logits.cpu(),dim=1).numpy().flatten())
    pred_end = tuple(torch.argmax(dev_spans.end_logits.cpu(),dim=1).numpy().flatten())

    all_dev_pred_start.extend(pred_start)
    all_dev_pred_end.extend(pred_end)
    all_dev_start.extend(dev_span_batch[:,0].numpy().flatten())
    all_dev_end.extend(dev_span_batch[:,1].numpy().flatten())
      


Processing row 100 of 366
Processing row 200 of 366
Processing row 300 of 366


In [25]:
print(len(all_dev_pred_start))
print(len(all_dev_pred_end))
print(len(all_dev_start))
print(len(all_dev_end))

5854
5854
5854
5854


In [26]:
from sklearn.metrics import accuracy_score

start_accuracy = accuracy_score(all_dev_start, all_dev_pred_start)
end_accuracy = accuracy_score(all_dev_end, all_dev_pred_end)

print(start_accuracy)
print(end_accuracy)

0.6194055346771439
0.6569866757772463
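
Start and end accuracy are scored independently above, so neither says how often the whole span is right. A minimal sketch of a span-level exact-match score over the same four lists; `span_exact_match` is a hypothetical helper and the numbers below are toy values, not results from this model.

```python
def span_exact_match(gold_starts, gold_ends, pred_starts, pred_ends):
    """Fraction of examples where both the start and the end index are correct."""
    total = len(gold_starts)
    if total == 0:
        return 0.0
    hits = sum(
        1
        for gs, ge, ps, pe in zip(gold_starts, gold_ends, pred_starts, pred_ends)
        if gs == ps and ge == pe
    )
    return hits / total


# Toy example: start accuracy is 2/3 and end accuracy is 2/3,
# but only one of the three spans is fully correct.
score = span_exact_match([3, 10, 0], [5, 10, 0], [3, 10, 2], [5, 11, 0])
print(score)
```

With the lists built in the evaluation loop, the call would be `span_exact_match(all_dev_start, all_dev_end, all_dev_pred_start, all_dev_pred_end)`.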


In [49]:
def select_best_answer_span(start_probs, end_probs, distance):
    ''' returns a list of spans corresponding to the highest probability QA solution which satisfy the restriction that the end index must
    be within distance after the start index'''
    output_spans = []
    for start,end in zip(start_probs,end_probs):
        output_spans.append(select_best(start,end,distance))
    return output_spans

def select_best(start_probs, end_probs, distance):
    '''returns the (start, end) pair with the highest summed score such that
    start <= end <= start + distance'''

    start_ind_sorted = np.flip(np.argsort(start_probs))
    end_ind_sorted = np.flip(np.argsort(end_probs))
    start = 0
    end = 0
    # start below any possible score so that raw (possibly negative) logits also work
    max_prob = -np.inf

    # consider every candidate pair, including the lowest-ranked indices
    for i in range(len(start_ind_sorted)):
        for j in range(len(end_ind_sorted)):
            if start_ind_sorted[i] <= end_ind_sorted[j] and end_ind_sorted[j] - start_ind_sorted[i] <=distance:
                temp_prob = start_probs[start_ind_sorted[i]] + end_probs[end_ind_sorted[j]]
                if temp_prob > max_prob:
                    max_prob = temp_prob
                    start = start_ind_sorted[i]
                    end = end_ind_sorted[j]

    return (start,end,)


In [50]:
test_starts = np.array([[0.1,0.5,0.2,0.1,0.1], [0.3,0.2,0.2,0.1,0.1]])
test_ends = np.array([[0.4,0.1,0.3,0.1,0.1], [0.1,0.1,0.1,0.1,0.6]])
assert select_best_answer_span(test_starts,test_ends,2) == [(1,2),(2,4)]
print("Success!")

Success!
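
The search compares every start with every end, which is quadratic in the sequence length. Since only ends within `distance` of the start are valid, it is enough to scan that window for each start. A minimal sketch of this windowed search; `best_span_windowed` is a hypothetical name, and the best score is initialised to negative infinity on the assumption that the inputs may be raw logits rather than probabilities.

```python
def best_span_windowed(start_scores, end_scores, distance):
    """Return the (start, end) pair maximising start_scores[start] + end_scores[end]
    subject to start <= end <= start + distance."""
    n = len(start_scores)
    best, best_score = (0, 0), float("-inf")
    for i in range(n):
        # Only ends inside the allowed window after the start are considered.
        for j in range(i, min(n, i + distance + 1)):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best


test_starts = [[0.1, 0.5, 0.2, 0.1, 0.1], [0.3, 0.2, 0.2, 0.1, 0.1]]
test_ends = [[0.4, 0.1, 0.3, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1, 0.6]]
spans = [best_span_windowed(s, e, 2) for s, e in zip(test_starts, test_ends)]
print(spans)  # [(1, 2), (2, 4)]
```

For 512 positions and `distance = 20` this evaluates roughly 512 × 21 pairs instead of 512 × 512.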


## Predict on Test set 

In [28]:
test_dataset = prepare_dataset(test_files)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
distance = 20

In [52]:
count = 0
model.eval()
model = model.to(device)
pred_answers = []
start_spans = []
batch_ids = []  # index of the batch each answer came from
with torch.no_grad():
  for test_ids_batch, test_span_batch, test_mask_batch in test_dataloader:
    count +=1
    if count % 100 == 0:
      print("Processing row %s of %s" % (count, len(test_dataloader)),end="\n")

    test_spans = model(test_ids_batch.squeeze(1).to(device),test_mask_batch.squeeze(1).to(device))

    # these are raw logits rather than normalised probabilities
    pred_start_probs = test_spans.start_logits.cpu().detach().numpy()
    pred_end_probs = test_spans.end_logits.cpu().detach().numpy()
    pred_spans = select_best_answer_span(pred_start_probs, pred_end_probs, distance)

    for ids, span in zip(test_ids_batch,pred_spans):
      # decode the tokens between the predicted start and end (inclusive)
      ans = tokenizer.decode(ids[0][span[0]:span[1]+1],clean_up_tokenization_spaces=True)
      start_spans.append(span[0])
      pred_answers.append(ans)
      batch_ids.append(str(count))
      


Processing row 100 of 500
Processing row 200 of 500
Processing row 300 of 500
Processing row 400 of 500
Processing row 500 of 500


In [30]:
len(pred_answers)

8000

In [31]:
pred_answers[:10]

['1964',
 'our own consciousness',
 'mars',
 "mayor's office for policing and crime",
 'jorge carcavallo',
 'steve jobs',
 'encased inside a vertical cylinder',
 '15 – 23 december',
 'orlando international airport',
 '400, 000']

In [33]:
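# normalise spacing around punctuation in the decoded answers;
# note that only the first matching rule is applied to each answer
pred_answers_new = []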
for ans in pred_answers:
  if " – " in ans:
    new = ans.replace(" – ","–")
    pred_answers_new.append(new.lower())
  elif "'s" in ans:
    new = ans.replace("'s"," 's")
    pred_answers_new.append(new.lower())
  elif ". " in ans:
    new = ans.replace(". ",".")
    pred_answers_new.append(new.lower())
  elif ", " in ans:
    new = ans.replace(", ",",")
    pred_answers_new.append(new.lower())
  else:
    pred_answers_new.append(ans.lower())

In [34]:
len(pred_answers_new)

8000

In [35]:
##### Please note: I had to remove extra columns from the csv after opening it in Google Sheets
##### and added the column names Id and Predicted

from pandas import DataFrame
cols = ['Id','Predicted']
df = DataFrame(pred_answers_new)
df.to_csv("/content/gdrive/MyDrive/Colab Notebooks/563_Lab_3/out.csv")
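
The manual clean-up described in the comments could be avoided by writing the two named columns directly. A minimal sketch using the standard `csv` module; `write_predictions` is a hypothetical helper, and using the 0-based row index as `Id` is an assumption, since the expected id format is not stated here.

```python
import csv
import io


def write_predictions(answers, handle):
    """Write an Id,Predicted csv to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["Id", "Predicted"])
    for row_id, answer in enumerate(answers):
        # Assumption: the row index serves as the Id.
        writer.writerow([row_id, answer])


# An in-memory buffer stands in for the output file.
buffer = io.StringIO()
write_predictions(["1964", "400,000"], buffer)
print(buffer.getvalue())
```

Answers that contain a comma are quoted automatically, so they stay in a single `Predicted` column.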

