# **IE7500 NLP Final Project**
# *SQuADv2.0 Answer Prediction Model using KNN and BERT models*

---




# **Importing Libraries**

Installing the transformers library

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Importing other necessary libraries

In [2]:
import json
import os
import random

import numpy as np
import pandas as pd
from transformers import AutoTokenizer


# **Question Answering using BERT Model**

Reading the input from the JSON file and storing the contexts, questions, and answers in lists



In [3]:
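def read_input(doc: str) -> tuple:
    """Read a SQuAD-format JSON file and return parallel lists of contexts, questions, and answers."""
    path = os.path.join(os.getcwd(), doc)
    with open(path, "rb") as json_file:
        input_dictionary = json.load(json_file)
    contexts = list()
    questions = list()
    answers = list()
    for article in input_dictionary['data']:
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                question = qa['question']
                # Use 'plausible_answers' when the question provides that key, otherwise 'answers'
                access = "plausible_answers" if "plausible_answers" in qa.keys() else 'answers'
                # Emit one (context, question, answer) triple per annotated answer
                for answer in qa[access]:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers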
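
The nesting that `read_input` walks through can be illustrated on a tiny in-memory dictionary. This is only a sketch: the `sample` content and the `flatten` helper are hypothetical and are not part of the dataset or the notebook. It assumes the usual SQuAD v2.0 layout (`data` → `paragraphs` → `qas`), where unanswerable questions have an empty `answers` list and a `plausible_answers` list.

```python
# Hypothetical miniature of the SQuAD v2.0 layout: data -> paragraphs -> qas.
sample = {
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Paris is the capital of France.",
            "qas": [
                {   # answerable question: spans live under "answers"
                    "question": "What is the capital of France?",
                    "is_impossible": False,
                    "answers": [{"text": "Paris", "answer_start": 0}],
                },
                {   # unanswerable question: spans live under "plausible_answers"
                    "question": "What is the capital of Spain?",
                    "is_impossible": True,
                    "answers": [],
                    "plausible_answers": [{"text": "Paris", "answer_start": 0}],
                },
            ],
        }],
    }]
}


def flatten(squad: dict) -> list:
    """Return one (context, question, answer) triple per annotated answer."""
    rows = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                key = "plausible_answers" if "plausible_answers" in qa else "answers"
                for answer in qa[key]:
                    rows.append((paragraph["context"], qa["question"], answer))
    return rows


rows = flatten(sample)
for context, question, answer in rows:
    print(question, "->", answer["text"])
```

Note that with this selection rule the plausible answer of an unanswerable question is treated like a regular answer span, so the model is never trained to predict "no answer".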



Reading the training and validation data

In [4]:
train_contexts, train_questions, train_answers = read_input('train-v2.0.json')
valid_contexts, valid_questions, valid_answers = read_input('dev-v2.0.json')

Printing 5 random samples to check that the data is loaded and processed correctly

In [5]:
ind = random.sample(range(0, len(train_answers)), 5)
for index in ind:
    print('Q: ',train_questions[index],'\n')
    print("Context:\n")
    print(train_contexts[index])
    print(f"\nAnswer:[{train_answers[index]}]\n")
    print("*" * 100)

Q:  For how many seasons was American Idol the most watched show in the US? 

Context:

In 2001, Fuller, Cowell, and TV producer Simon Jones attempted to sell the Pop Idol format to the United States, but the idea was met with poor response from United States television networks. However, Rupert Murdoch, head of Fox's parent company, was persuaded to buy the show by his daughter Elisabeth, who was a fan of the British show. The show was renamed American Idol: The Search for a Superstar and debuted in the summer of 2002. Cowell was initially offered the job as showrunner but refused; Lythgoe then took over that position. Much to Cowell's surprise, it became one of the hit shows for the summer that year. The show, with the personal engagement of the viewers with the contestants through voting, and the presence of the acid-tongued Cowell as a judge, grew into a phenomenon. By 2004, it had become the most-watched show in the U.S., a position it then held on for seven consecutive seasons.



BERT models require the end position of each answer, so we compute the end index for every answer

In [6]:
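def end_point(answers: list, contexts: list) -> list:
    # Copy each answer dict so the caller's dicts are not mutated
    _answers = [dict(answer) for answer in answers]
    for answer, context in zip(_answers, contexts):
        answer_text = answer['text']
        start_idx = answer['answer_start']
        # Exclusive end index of the answer span within the context
        answer['answer_end'] = start_idx + len(answer_text)
    return _answers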

In [7]:
train_answers = end_point(train_answers, train_contexts)
valid_answers = end_point(valid_answers, valid_contexts)
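
As a quick sanity check of the end-index arithmetic, here is a minimal sketch on a made-up sentence (the `context` and `text` values are hypothetical, not taken from the dataset). It shows that `answer_start + len(text)` is an exclusive end index, so slicing the context with the two indices recovers the answer.

```python
# Hypothetical context and answer text, used only to illustrate the index arithmetic.
context = "The show debuted in the summer of 2002."
text = "summer of 2002"

answer = {"text": text, "answer_start": context.index(text)}
# Same rule as end_point: the end index is start + length (exclusive).
answer["answer_end"] = answer["answer_start"] + len(answer["text"])

# Slicing with an exclusive end recovers the answer text exactly.
recovered = context[answer["answer_start"]:answer["answer_end"]]
print(repr(recovered))

# The last character of the answer sits one position before answer_end.
last_char = context[answer["answer_end"] - 1]
print(repr(last_char))
```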

Here we fine-tune a pre-trained model; we use the ALBERT model (`albert-base-v2`)

In [8]:
tokenizer = AutoTokenizer.from_pretrained('albert-base-v2', use_fast=True)

Encoding the data to train the model

In [9]:
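def encode_data(contexts: list, questions: list, answers: list) -> dict:
    """Tokenize (context, question) pairs and convert character-level answer spans to token positions."""
    encodings = tokenizer(contexts, questions, truncation=True, padding=True, return_tensors="pt")
    start_pos, end_pos = list(), list()

    for index in range(len(answers)):
        # Map the character offsets of the answer to token indices
        start_value = encodings.char_to_token(index, answers[index]['answer_start'])
        end_value   = encodings.char_to_token(index, answers[index]['answer_end'])
        # No token found for the start (e.g. the answer was truncated away): use a sentinel position
        if start_value is None:
            start_value = tokenizer.model_max_length

        # answer_end is exclusive, so it may not fall inside a token;
        # step back one character at a time until a token is found
        shift = 1
        while end_value is None:
            end_value = encodings.char_to_token(index, answers[index]['answer_end'] - shift)
            shift += 1

        start_pos.append(start_value)
        end_pos.append(end_value)

    encodings.update({
        'start_positions': start_pos, 'end_positions': end_pos
    })
    return encodings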
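
To see why the end position may need to be stepped back, the sketch below mimics the character-to-token lookup with a plain whitespace tokenizer. Both helpers (`whitespace_offsets` and `char_to_token`) are hypothetical stand-ins written for this illustration; they are not the Hugging Face implementation, and ALBERT's SentencePiece tokens will split text differently.

```python
def whitespace_offsets(text: str) -> list:
    """Return (start, end) character offsets for each whitespace-separated token."""
    offsets, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        offsets.append((start, start + len(token)))
        pos = start + len(token)
    return offsets


def char_to_token(offsets: list, char_index: int):
    """Return the index of the token covering char_index, or None if no token does."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


# Hypothetical context and answer, used only for illustration.
context = "held every four years in summer"
offsets = whitespace_offsets(context)
answer_start = context.index("four years")
answer_end = answer_start + len("four years")  # exclusive end index

start_token = char_to_token(offsets, answer_start)
end_token = char_to_token(offsets, answer_end)            # points at the space after "years"
end_token_fixed = char_to_token(offsets, answer_end - 1)  # last character of "years"
print(start_token, end_token, end_token_fixed)
```

Because the end index is exclusive, it points at the character after the answer, which here is a space that belongs to no token. Looking up `answer_end - 1` lands on the last token of the answer.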

We train on a small sample of the data, since training on the full dataset would take a long time and is not feasible with our hardware constraints

In [10]:
train_encodings = encode_data(train_contexts[0:100], train_questions[0:100], train_answers[0:100])
valid_encodings = encode_data(valid_contexts[0:500], valid_questions[0:500], valid_answers[0:500])

Deleting variables that are no longer needed to free memory and prevent crashes

In [11]:
del train_contexts, train_questions, train_answers
del valid_contexts, valid_questions, valid_answers

In [12]:
import torch
class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings: dict) -> None:
        self.encodings = encodings

    def __getitem__(self, index: int) -> dict:
        return {key: torch.tensor(val[index]) for key, val in self.encodings.items()}
    
    def __len__(self):
        return len(self.encodings['input_ids'])

In [13]:
train_ds = SquadDataset(train_encodings)
valid_ds = SquadDataset(valid_encodings)

In [14]:
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained('albert-base-v2')
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
model.train()

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertForQuestionAnswering: ['predictions.bias', 'predictions.LayerNorm.bias', 'predictions.decoder.weight', 'predictions.dense.bias', 'predictions.decoder.bias', 'predictions.LayerNorm.weight', 'predictions.dense.weight']
- This IS expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForQuestionAnswering were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN t

AlbertForQuestionAnswering(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(30000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias

In [15]:
"""
This cell is adapted from `https://github.com/michaelrzhang/lookahead/blob/master/lookahead_pytorch.py`, which is the
source code of the `Lookahead Optimizer: k steps forward, 1 step back` paper (https://arxiv.org/abs/1907.08610).
"""

from collections import defaultdict

import torch
from torch.optim.optimizer import Optimizer


class LookaheadWrapper(Optimizer):
    r"""PyTorch implementation of the lookahead wrapper.
    Lookahead Optimizer: https://arxiv.org/abs/1907.08610
    """

    def __init__(self, optimizer, la_steps=5, la_alpha=0.8, pullback_momentum="none"):
        """optimizer: inner optimizer
        la_steps (int): number of lookahead steps
        la_alpha (float): linear interpolation factor. 1.0 recovers the inner optimizer.
        pullback_momentum (str): change to inner optimizer momentum on interpolation update
        """
        self.optimizer = optimizer
        self._la_step = 0  # counter for inner optimizer
        self.la_alpha = la_alpha
        self._total_la_steps = la_steps
        pullback_momentum = pullback_momentum.lower()
        assert pullback_momentum in ["reset", "pullback", "none"]
        self.pullback_momentum = pullback_momentum

        self.state = defaultdict(dict)

        # Cache the current optimizer parameters
        for group in optimizer.param_groups:
            for param in group['params']:
                param_state = self.state[param]
                param_state['cached_params'] = torch.zeros_like(param.data)
                param_state['cached_params'].copy_(param.data)
                if self.pullback_momentum == "pullback":
                    param_state['cached_mom'] = torch.zeros_like(param.data)

    def __getstate__(self):
        return {
            'state': self.state,
            'optimizer': self.optimizer,
            'la_alpha': self.la_alpha,
            '_la_step': self._la_step,
            '_total_la_steps': self._total_la_steps,
            'pullback_momentum': self.pullback_momentum
        }

    def zero_grad(self):
        self.optimizer.zero_grad()

    def get_la_step(self):
        return self._la_step

    def state_dict(self):
        return self.optimizer.state_dict()

    def load_state_dict(self, state_dict):
        self.optimizer.load_state_dict(state_dict)

    def _backup_and_load_cache(self):
        """Useful for performing evaluation on the slow weights (which typically generalize better)
        """
        for group in self.optimizer.param_groups:
            for param in group['params']:
                param_state = self.state[param]
                param_state['backup_params'] = torch.zeros_like(param.data)
                param_state['backup_params'].copy_(param.data)
                param.data.copy_(param_state['cached_params'])

    def _clear_and_load_backup(self):
        for group in self.optimizer.param_groups:
            for param in group['params']:
                param_state = self.state[param]
                param.data.copy_(param_state['backup_params'])
                del param_state['backup_params']

    @property
    def param_groups(self):
        return self.optimizer.param_groups

    def step(self, closure=None):
        """Performs a single Lookahead optimization step.
        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = self.optimizer.step(closure)
        self._la_step += 1

        if self._la_step >= self._total_la_steps:
            self._la_step = 0
            # Lookahead and cache the current optimizer parameters
            for group in self.optimizer.param_groups:
                for param in group['params']:
                    param_state = self.state[param]
                    param.data.mul_(self.la_alpha).add_(param_state['cached_params'], alpha=1.0 - self.la_alpha)  # crucial line
                    param_state['cached_params'].copy_(param.data)
                    if self.pullback_momentum == "pullback":
                        internal_momentum = self.optimizer.state[param]["momentum_buffer"]
                        self.optimizer.state[param]["momentum_buffer"] = internal_momentum.mul_(self.la_alpha).add_(
                            param_state["cached_mom"], alpha=1.0 - self.la_alpha)
                        param_state["cached_mom"] = self.optimizer.state[param]["momentum_buffer"]
                    elif self.pullback_momentum == "reset":
                        self.optimizer.state[param]["momentum_buffer"] = torch.zeros_like(param.data)

        return loss
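
The interpolation performed every `la_steps` steps can be followed on a single scalar weight. This is a minimal sketch under simplifying assumptions: the inner optimizer is plain gradient descent on the toy objective f(w) = w² rather than AdamW, and `lookahead_sync` is a hypothetical helper that mirrors the line marked "crucial line" in the wrapper.

```python
def lookahead_sync(slow: float, fast: float, alpha: float) -> float:
    """Interpolate between slow (cached) and fast (inner optimizer) weights."""
    return alpha * fast + (1.0 - alpha) * slow


slow = fast = 1.0                # both start from the same initial weight
lr, k, alpha = 0.1, 5, 0.8       # k and alpha match the wrapper's defaults

for step in range(1, 11):
    grad = 2.0 * fast            # gradient of f(w) = w^2
    fast -= lr * grad            # one inner-optimizer step on the fast weight
    if step % k == 0:
        # Every k steps: pull the weight toward the cached value, then cache the result
        slow = lookahead_sync(slow, fast, alpha)
        fast = slow
        print(f"step {step}: synced weight = {slow:.4f}")
```

With `alpha = 1.0` the interpolation returns the fast weight unchanged, which is why the docstring notes that 1.0 recovers the inner optimizer.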



**Model training**

In [16]:
# Initialize the AdamW optimizer (Adam with weight decay, to reduce overfitting) and wrap it in Lookahead

from transformers import AdamW

base  = AdamW(model.parameters(), lr=1e-4)
optim = LookaheadWrapper(base)



In [17]:
from torch.utils.data import DataLoader
from tqdm import tqdm

import warnings
warnings.simplefilter("ignore")


# Initialize data loader for training data

train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)


for epoch in range(6):
    model.train()
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                        start_positions=start_positions, end_positions=end_positions)
        loss = outputs[0]
        loss.backward()
        optim.step()
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

Epoch 0: 100%|██████████| 7/7 [00:07<00:00,  1.02s/it, loss=4.26]
Epoch 1: 100%|██████████| 7/7 [00:06<00:00,  1.13it/s, loss=3.35]
Epoch 2: 100%|██████████| 7/7 [00:06<00:00,  1.12it/s, loss=1.46]
Epoch 3: 100%|██████████| 7/7 [00:06<00:00,  1.13it/s, loss=0.779]
Epoch 4: 100%|██████████| 7/7 [00:06<00:00,  1.12it/s, loss=0.131]
Epoch 5: 100%|██████████| 7/7 [00:06<00:00,  1.11it/s, loss=0.425]


In [18]:
# Saving the model and tokenizer in a local directory

MODEL_DIR = "./model"
os.makedirs(MODEL_DIR, exist_ok=True)
tokenizer.save_pretrained(MODEL_DIR)
model.save_pretrained(MODEL_DIR)

# Reloading the saved model and tokenizer from the local directory
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
MODEL_DIR = "./model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=True)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_DIR)

**Model Evaluation**

In [20]:
from torch.utils.data import DataLoader
from sklearn.metrics import f1_score
# Switch the model to evaluation mode
model.eval()
model = model.to(device)

validation_loader = DataLoader(valid_ds, batch_size=16)
accuracy_scores = list()
start_true_all, end_true_all = [], []
start_pred_all, end_pred_all = [], []

for batch in validation_loader:
    with torch.no_grad():
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_true = batch['start_positions'].to(device)
        end_true = batch['end_positions'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        start_pred = torch.argmax(outputs['start_logits'], dim=1)
        end_pred = torch.argmax(outputs['end_logits'], dim=1)
        accuracy_scores.append(((start_pred == start_true).sum() / len(start_pred)).item())
        accuracy_scores.append(((end_pred == end_true).sum() / len(end_pred)).item())
        start_true_all.extend(start_true.tolist())
        end_true_all.extend(end_true.tolist())
        start_pred_all.extend(start_pred.tolist())
        end_pred_all.extend(end_pred.tolist())


# Average the per-batch start- and end-position accuracies (a position-level proxy for Exact Match)
average_accuracy = sum(accuracy_scores) / len(accuracy_scores)
print(f"Model's score based on Exact Match: {average_accuracy}")


# Calculating macro F1 over the predicted start and end token positions, then averaging the two

start_f1 = f1_score(start_true_all, start_pred_all, average='macro')
end_f1 = f1_score(end_true_all, end_pred_all, average='macro')
overall_f1 = (start_f1 + end_f1) / 2

print(f"F1 score of the model: {overall_f1:.3f}")


Model's score based on Exact Match: 0.39453125
F1 score of the model: 0.269
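
The scores above are computed on token positions. SQuAD results are more commonly reported on the answer text itself, using Exact Match and token-overlap F1 between the predicted and reference strings. The sketch below is a simplified version of that idea and not the official evaluation script; the helper names and the reference answers in the usage lines are assumptions made for illustration.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(truth))


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction/reference pairs.
print(exact_match("every four years,", "every four years"))
print(token_f1("200", "more than 200"))
```

A prediction that overlaps the reference only partially gets an Exact Match of 0 but a non-zero F1, which is why the two metrics are usually reported together.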


In [21]:
# Returns answers to a given set of questions about one context
def get_answers_from_context(input_text: str, input_questions: list) -> list:
    # Pair the same context with every question
    encoded_inputs = tokenizer([input_text]*len(input_questions), input_questions, truncation=True, padding=True, return_tensors="pt")
    encoded_inputs = encoded_inputs.to(device)
    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(**encoded_inputs)
    # Most likely start and end token for each question
    start_positions = torch.argmax(outputs['start_logits'], dim=1)
    end_positions = torch.argmax(outputs['end_logits'], dim=1)
    answers = list()
    for i, (start_index, end_index) in enumerate(zip(start_positions, end_positions)):
        # Decode the tokens between the predicted start and end (inclusive)
        tokens = tokenizer.convert_ids_to_tokens(encoded_inputs['input_ids'][i][start_index:end_index+1])
        answers.append(tokenizer.convert_tokens_to_string(tokens))
    print("Context:")
    print(input_text)
    print()
    for question, answer in zip(input_questions, answers):
        print(f"Q:  {question}")
        print(f"A:  {answer}")
        print("-"*60)
    return answers
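
Taking the argmax of the start and end logits independently can yield an end position that comes before the start, which decodes to an empty answer. One common remedy is to search for the highest-scoring pair with start ≤ end. The sketch below shows this on plain Python lists; `best_span`, `max_answer_len`, and the logit values are hypothetical and are not used by the function above.

```python
def best_span(start_logits: list, end_logits: list, max_answer_len: int = 30) -> tuple:
    """Return the (start, end) pair with the highest summed logit such that start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, start_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_score + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


# Made-up logits where the independent argmaxes disagree on order.
start_logits = [0.1, 0.2, 3.0, 0.1]
end_logits = [0.3, 4.0, 0.2, 2.5]

naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
print("independent argmax:", naive)
print("constrained search:", best_span(start_logits, end_logits))
```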


**Sample answers from the model**

In [22]:
context = "The modern Olympic Games or Olympics (French: Jeux olympiques)[1][2] are leading international sporting events featuring summer and winter sports competitions in which thousands of athletes from around the world participate in a variety of competitions. The Olympic Games are considered the world's foremost sports competition with more than 200 nations participating.[3] The Olympic Games are normally held every four years, alternating between the Summer and Winter Olympics every two years in the four-year period."
questions = [
    "How often do the Olympic games hold?",
    "How many nations do participate in each Olympic?","what is olympics in french called as?"
]

_ = get_answers_from_context(context, questions)

Context:
The modern Olympic Games or Olympics (French: Jeux olympiques)[1][2] are leading international sporting events featuring summer and winter sports competitions in which thousands of athletes from around the world participate in a variety of competitions. The Olympic Games are considered the world's foremost sports competition with more than 200 nations participating.[3] The Olympic Games are normally held every four years, alternating between the Summer and Winter Olympics every two years in the four-year period.

Q:  How often do the Olympic games hold?
A:  every four years,
------------------------------------------------------------
Q:  How many nations do participate in each Olympic?
A:  200
------------------------------------------------------------
Q:  what is olympics in french called as?
A:  olympics
------------------------------------------------------------


In [23]:
context = "Vikings is the modern name given to seafaring people primarily from Scandinavia (present-day Denmark, Norway and Sweden), who from the late 8th to the late 11th centuries raided, pirated, traded and settled throughout parts of Europe. They also voyaged as far as the Mediterranean, North Africa, the Middle East, and North America. In some of the countries they raided and settled in, this period is popularly known as the Viking Age, and the term \"Viking\" also commonly includes the inhabitants of the Scandinavian homelands as a collective whole. The Vikings had a profound impact on the Early medieval history of Scandinavia, the British Isles, France, Estonia, and Kievan Rus'."
questions = [
    "When vikings started raided?","who are vikings?","Vikings had impact on which period?"
]

_ = get_answers_from_context(context, questions)

Context:
Vikings is the modern name given to seafaring people primarily from Scandinavia (present-day Denmark, Norway and Sweden), who from the late 8th to the late 11th centuries raided, pirated, traded and settled throughout parts of Europe. They also voyaged as far as the Mediterranean, North Africa, the Middle East, and North America. In some of the countries they raided and settled in, this period is popularly known as the Viking Age, and the term "Viking" also commonly includes the inhabitants of the Scandinavian homelands as a collective whole. The Vikings had a profound impact on the Early medieval history of Scandinavia, the British Isles, France, Estonia, and Kievan Rus'.

Q:  When vikings started raided?
A:  late 8th to the late 11th centuries raided,
------------------------------------------------------------
Q:  who are vikings?
A:  vikings is the modern name given to seafaring people primarily from scandinavia (present-day denmark, norway and sweden), who from the late 8