<a href="https://colab.research.google.com/github/utd-hltri/nlp/blob/main/hw2/neural_question_answering_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title MIT License
#
# Copyright (c) 2022 Maxwell Weinzierl
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

# Question Answering with Bidirectional Encoder Representations from Transformers (BERT)

This notebook uses BERT, a state-of-the-art (SOTA) neural transformer model pre-trained on masked language modeling and next sentence prediction.
The paper that introduces BERT can be found here:
https://arxiv.org/pdf/1810.04805.pdf

We also use RoBERTa (A Robustly Optimized BERT Pretraining Approach), a BERT model with a better-optimized pre-training procedure: https://arxiv.org/pdf/1907.11692.pdf

Finally, we use a distilled RoBERTa model. Knowledge distillation is a technique that takes a much larger model and distills its knowledge into a smaller model: https://arxiv.org/pdf/1503.02531.pdf
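
To make the idea of distillation more concrete, here is a minimal, framework-free sketch of the soft-target loss described in the linked paper: both the teacher's and the student's logits are softened with a temperature `T`, and the student is penalized by its cross-entropy against the teacher's softened distribution. The function names, logits, and temperature below are made up for illustration, and this is not necessarily how the distilled checkpoint used later in this notebook was trained.

```python
import math

def softmax(logits, temperature=1.0):
    # divide by the temperature first: larger values give a "softer" distribution
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # cross-entropy between the softened teacher and student distributions
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    cross_entropy = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    # the paper scales the soft-target term by T^2 to keep gradient magnitudes comparable
    return cross_entropy * temperature ** 2

# hypothetical logits over three classes
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # disagrees with the teacher

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The student whose logits resemble the teacher's receives the lower loss, which is the signal that pushes the small model toward the large model's behavior.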

# Packages and Libraries
We will use the deep learning library PyTorch this time instead of TensorFlow. PyTorch (https://pytorch.org/) has become the most popular deep learning library for research to date: http://horace.io/pytorch-vs-tensorflow/

![](https://www.assemblyai.com/blog/content/images/2021/12/Fraction-of-Papers-Using-PyTorch-vs.-TensorFlow.png)

In [None]:
import torch
print(torch.__version__)
print('CUDA Enabled: ', torch.cuda.is_available())
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
  print(f'  {device} - ' + torch.cuda.get_device_name(0))
else:
  print(f'  {device}')

The above cell should print a torch version containing "+cu..." to denote that PyTorch is installed with CUDA capabilities, and CUDA should be enabled with at least one device. I typically get a Tesla K80 GPU on Google Colab, but others may be assigned as resources become available. If you are unable to reserve a GPU instance, the device will be "cpu" and the code will still work, only much more slowly.

## HuggingFace Transformers

Next we will install the `transformers` library, built by HuggingFace. This library makes it extremely easy to use SOTA neural NLP models with PyTorch. See the HuggingFace website to browse all the publicly available models: https://huggingface.co/models

In [None]:
!pip install transformers

In [None]:
import transformers
print(transformers.__version__)

## HuggingFace Datasets
HuggingFace also provides a library called `datasets` for downloading and utilizing common NLP datasets: https://huggingface.co/datasets

In [None]:
!pip install datasets

In [None]:
import datasets
print(datasets.__version__)

## Model Summary

TorchInfo is a nice little library to provide a summary of model sizes and layers. We install it below to visualize the size of our models.

In [None]:
!pip install torchinfo

In [None]:
from torchinfo import summary

# Neural Question Answering Models

Below we load our neural QA model. We load both the model and the tokenizer identified by `model_name` from HuggingFace. The library will automatically download all required model weights, config files, and tokenizers.

We then move the model to `device` (`cuda:0`, our GPU, when one is available) and turn on eval mode to avoid dropout randomness.

Finally, we print a summary of our model.

In [None]:
#@title Model

from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = 'deepset/bert-base-cased-squad2' #@param ["deepset/bert-base-cased-squad2", "deepset/roberta-base-squad2", "deepset/roberta-base-squad2-distilled"]

model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# move model to GPU device
model.to(device)
# turn on EVAL mode so drop-out layers do not randomize outputs
model.eval()
# create model summary
summary(model)

# Question Answering Datasets
One of the largest and most widely used question answering datasets is the Stanford Question Answering Dataset (SQuAD): https://rajpurkar.github.io/SQuAD-explorer/

A 2.0 version of SQuAD has been released, which we will use below. Feel free to play around with the other QA datasets, or find your own on HuggingFace Datasets: https://huggingface.co/datasets?task_categories=task_categories:question-answering&sort=downloads

In [None]:
from datasets import load_dataset
# imported to help inspect dataset
from textwrap import wrap

#@title Dataset

dataset = 'squad_v2' #@param ["squad_v2", "squad", "adversarial_qa"]
data = load_dataset(dataset)
ds = data['validation']
data_size = len(ds)
print(ds)

## Inspecting the Dataset

We can look at individual examples in the validation split of SQuAD v2 to get a feeling for the types of questions and answers.

In [None]:
#@title Example { run: "auto" }
example_index = 0 #@param {type:"slider", min:0, max:11872, step:1}
example = ds[example_index]
print('Question: ')
for line in wrap(example['question'], 50):
  print(f'  {line}')
print('Context: ')
for line in wrap(example['context'], 50):
  print(f'  {line}')
answer = 'No Answer Provided' if len(example['answers']['text']) == 0 else example['answers']['text'][0]
print(f'Answer: ')
for line in wrap(answer, 50):
  print(f'  {line}')
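
The cell above relies on the record layout of the HuggingFace SQuAD-style datasets: each example is a dictionary with a `question`, a `context`, and an `answers` dictionary whose `text` list is empty when the question is unanswerable. As a rough sketch, the printing logic could be wrapped in a helper like the hypothetical `format_example` below. The two records are invented for illustration and are not taken from the dataset.

```python
from textwrap import wrap

def first_answer(example):
    # SQuAD v2 marks unanswerable questions with an empty list of answer texts
    texts = example['answers']['text']
    return texts[0] if texts else 'No Answer Provided'

def format_example(example, width=50):
    # build the same Question / Context / Answer layout used in this notebook
    lines = []
    for label, text in [('Question', example['question']),
                        ('Context', example['context']),
                        ('Answer', first_answer(example))]:
        lines.append(f'{label}: ')
        lines.extend(f'  {line}' for line in wrap(text, width))
    return '\n'.join(lines)

# made-up records that only mimic the dataset's field names
toy_example = {
    'question': 'Who proposed the imitation game?',
    'context': 'Alan Turing proposed the imitation game in 1950 as a way '
               'to think about whether machines can exhibit intelligence.',
    'answers': {'text': ['Alan Turing'], 'answer_start': [0]},
}
toy_unanswerable = {
    'question': 'Who won the imitation game in 1950?',
    'context': toy_example['context'],
    'answers': {'text': [], 'answer_start': []},
}

print(format_example(toy_example))
print(format_example(toy_unanswerable))
```

With the real dataset, the same helper would be called as `print(format_example(ds[example_index]))`.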

# Specific Example
We will use the example below to follow the model's prediction process.

In [None]:
example_index = 339
example = ds[example_index]
print('Question: ')
for line in wrap(example['question'], 50):
  print(f'  {line}')
print('Context: ')
for line in wrap(example['context'], 50):
  print(f'  {line}')
answer = 'No Answer Provided' if len(example['answers']['text']) == 0 else example['answers']['text'][0]
print(f'Answer: ')
for line in wrap(answer, 50):
  print(f'  {line}')

## Tokenization

We will tokenize the above example using the HuggingFace tokenizer:

In [None]:
# we will tokenize a single example question and context,
# and we will move these tensors to the GPU device:
inputs = tokenizer(example['question'], example['context'], return_tensors="pt").to(device)

print('Inputs to model: ')
print(f'  {inputs.keys()}')

In [None]:
# the inputs to the model will contain a few tensors, but the most
# important tensor is the "input_ids":
input_ids = inputs['input_ids'][0]
print(input_ids)

In [None]:
# these are the token ids of the input. We can convert back to text tokens like so:
input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
for line in wrap(str(input_tokens), 50):
  print(line)

# Notice that (with the BERT model) we have added a [CLS] token to denote the start of the sequence,
# [SEP] tokens between the question and context and at the end of the sequence,
# and the WordPiece tokenizer has split some words up into pieces, such as
# 'Turin', '##g' from Turing and
# 'de', '##via', '##tes' from deviates.
# The RoBERTa models instead use <s> and </s> special tokens and a byte-level BPE tokenizer.
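
The word splitting noted above comes from WordPiece's greedy longest-match-first lookup: starting at the beginning of a word, the tokenizer takes the longest prefix found in its vocabulary, then repeats on the remainder, marking continuation pieces with `##`. The sketch below uses a tiny, made-up vocabulary (the real BERT vocabulary has tens of thousands of entries), so it only illustrates the mechanism and is not the library's actual implementation.

```python
def wordpiece(word, vocab, unk_token='[UNK]'):
    # greedy longest-match-first segmentation of a single word
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # no piece matches: the whole word becomes the unknown token
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces

# hypothetical vocabulary chosen to reproduce the splits seen above
toy_vocab = {'Turin', '##g', 'de', '##via', '##tes', 'the'}

print(wordpiece('Turing', toy_vocab))    # ['Turin', '##g']
print(wordpiece('deviates', toy_vocab))  # ['de', '##via', '##tes']
print(wordpiece('xyz', toy_vocab))       # ['[UNK]']
```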

## Running Model

Next we will run the model on the above example

In [None]:

# the outputs will contain logits (unnormalized scores) for the start and the end of the answer sequence.
# torch.no_grad() skips gradient tracking, which we do not need for inference.
with torch.no_grad():
  outputs = model(**inputs)
print(outputs)


In [None]:

# we select the most likely start of the answer by taking the index of the maximum start logit
answer_start = torch.argmax(outputs['start_logits'])

# we also select the most likely end of the answer by taking the index of the maximum end logit
answer_end = torch.argmax(outputs['end_logits'])

print(f'Answer Token Span: {answer_start} to {answer_end}')
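
Logits are unnormalized scores rather than probabilities. If actual probabilities are wanted, a softmax over the token positions converts them, and the position with the largest logit is also the position with the largest probability, which is why `argmax` on the raw logits is enough for picking the span. The sketch below shows this with plain Python lists and made-up logits. With the tensors in this notebook, the equivalent call would presumably be `torch.softmax(outputs['start_logits'], dim=-1)`.

```python
import math

def softmax(logits):
    # subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    # index of the largest value
    return max(range(len(values)), key=values.__getitem__)

# hypothetical start logits for a five-token input
toy_start_logits = [-1.2, 0.3, 5.1, 0.8, -2.0]
start_probs = softmax(toy_start_logits)

print([round(p, 3) for p in start_probs])
print('sum of probabilities:', round(sum(start_probs), 6))
print('argmax of logits:', argmax(toy_start_logits))
print('argmax of probabilities:', argmax(start_probs))
```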

In [None]:

# we can now retrieve the most likely answer to the question from the input:
answer_ids = input_ids[answer_start:answer_end+1]
print(answer_ids)


In [None]:
# we convert these token ids back to tokens: 
answer_tokens = tokenizer.convert_ids_to_tokens(answer_ids)
print(answer_tokens)

In [None]:
# we can then transform these tokens to a normal string:
answer = tokenizer.convert_tokens_to_string(answer_tokens)
print(f'Answer: {answer}')

# Examine QA Predictions
Now we will perform the above process for a few examples. We will first define a `run_model` function that does all of the above for a single example.

In [None]:
# Re-run this cell when you swap models
def run_model(example):
  # we will tokenize a single example question and context,
  # and we will move these tensors to the GPU device:
  inputs = tokenizer(example['question'], example['context'], return_tensors="pt").to(device)
  # the inputs to the model will contain a few tensors, but the most
  # important tensor is the "input_ids":
  input_ids = inputs['input_ids'][0]
  # these are the token ids of the input. We can convert back to text tokens like so:
  input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
  # the outputs will contain logits (unnormalized scores) for the start and the end of the answer sequence.
  # torch.no_grad() skips gradient tracking, which we do not need for inference.
  with torch.no_grad():
    outputs = model(**inputs)
  # we select the most likely start of the answer by taking the index of the maximum start logit
  answer_start = torch.argmax(outputs['start_logits'])

  # we also select the most likely end of the answer by taking the index of the maximum end logit
  answer_end = torch.argmax(outputs['end_logits'])

  # we can now retrieve the most likely answer to the question from the input:
  answer_ids = input_ids[answer_start:answer_end+1]

  # we convert these token ids back to tokens: 
  answer_tokens = tokenizer.convert_ids_to_tokens(answer_ids)
  # we can then transform these tokens to a normal string:
  answer = tokenizer.convert_tokens_to_string(answer_tokens)
  return answer.strip()
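
One caveat of `run_model`: the start and end positions are chosen independently, so the predicted end can fall before the predicted start, in which case the slice is empty and the function returns an empty string. A common refinement is to score every pair with `start <= end` (up to some maximum length) by the sum of its two logits and keep the best pair. The sketch below does this on plain Python lists. `best_span`, `max_answer_len`, and the toy logits are illustrative assumptions and not part of the assignment.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    # exhaustively score all spans with start <= end and bounded length
    best = (0, 0)
    best_score = float('-inf')
    for s, s_logit in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best, best_score

# made-up logits where the independent argmaxes give an invalid span
toy_start = [0.1, 0.2, 4.0, 0.3, 3.5]
toy_end = [0.0, 5.0, 0.1, 3.0, 0.2]

naive_start = max(range(len(toy_start)), key=toy_start.__getitem__)
naive_end = max(range(len(toy_end)), key=toy_end.__getitem__)
print('independent argmax:', (naive_start, naive_end))  # end comes before start

span, score = best_span(toy_start, toy_end)
print('constrained best span:', span, 'score:', score)
```

Applied to the model, the two lists would come from something like `outputs['start_logits'][0].tolist()` and `outputs['end_logits'][0].tolist()`. The search is quadratic in the sequence length in the worst case, which is why a length cap is used.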


## Evaluation

Change the example index and view the model's predictions below for 10 different examples, then record the results in your report. Discuss how accurately the model predicted answers and whether the predictions lined up with the judged answers of the examples.

IMPORTANT: keep track of the example indices you evaluate; you will need them when evaluating new models!

In [None]:
#@title Example { run: "auto" }
example_index = 5465 #@param {type:"slider", min:0, max:11872, step:1}
example = ds[example_index]
print('Question: ')
for line in wrap(example['question'], 50):
  print(f'  {line}')
print('Context: ')
for line in wrap(example['context'], 50):
  print(f'  {line}')
answer = 'No Answer Provided' if len(example['answers']['text']) == 0 else example['answers']['text'][0]
print(f'Answer: ')
for line in wrap(answer, 50):
  print(f'  {line}')

p_answer = run_model(example)

print(f'Predicted Answer: ')
for line in wrap(p_answer, 50):
  print(f'  {line}')
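
When comparing a prediction with the judged answers, small surface differences (case, punctuation, articles) usually should not count as errors. The sketch below follows the spirit of the normalization used for SQuAD-style exact match and checks a prediction against every judged answer. It is only an aid for your own judgement, and the `SPECIAL_TOKENS` list and helper names are assumptions made for this illustration.

```python
import re
import string

# special tokens that a model may emit when it predicts "no answer"
SPECIAL_TOKENS = ('<s>', '</s>', '[CLS]', '[SEP]')

def normalize_answer(text):
    # drop special tokens, lowercase, remove punctuation and articles, squeeze spaces
    for token in SPECIAL_TOKENS:
        text = text.replace(token, ' ')
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def exact_match(prediction, gold_answers):
    prediction = normalize_answer(prediction)
    if not gold_answers:
        # unanswerable question: only an empty prediction counts as a match
        return prediction == ''
    return any(prediction == normalize_answer(gold) for gold in gold_answers)

# made-up predictions and judged answers
print(exact_match('The Normans.', ['Normans', 'the Normans']))
print(exact_match('<s>', []))
print(exact_match('France', ['Normandy']))
```

With this notebook's variables, a call could look like `exact_match(p_answer, example['answers']['text'])`. A partial overlap that this check rejects may still deserve credit under your own judgement.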

## Additional Models and Report

Go back to the cell in which you loaded the neural QA model and repeat the evaluation above with the other two neural models. Compare the outputs of each model across the same example indices and include the results in your report. Make sure you re-run the cell that defines `run_model` after you change the model. You do not need to re-run the cells in the Question Answering Datasets or Specific Example sections.

You should have the following in your report:

| Model      | Accuracy |
| ----------- | ----------- |
| bert-base-cased-squad2      | ...       |
| roberta-base-squad2   | ...        |
| roberta-base-squad2-distilled   | ...        |


Calculate the accuracy of each model by counting the examples it answered correctly (by your own judgement) and dividing by 10 (the total number of examples you should evaluate). If no judged answer exists and the model outputs nothing or only a special token such as `<s>` (or `[CLS]` for the BERT model), consider that correct. Otherwise use your own judgement.
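
As a small worked example of this calculation, the sketch below turns a list of ten correct/incorrect judgements per model into an accuracy and prints rows in the format of the table above. The model names and judgements are placeholders, not real results. Replace them with your own.

```python
def accuracy(judgements):
    # fraction of examples judged correct
    if not judgements:
        raise ValueError('at least one judgement is required')
    return sum(judgements) / len(judgements)

# placeholder judgements for the same ten example indices (True = judged correct)
placeholder_judgements = {
    'model-a': [True, True, False, True, True, False, True, True, True, False],
    'model-b': [True, True, True, True, False, False, True, True, True, True],
}

print('| Model | Accuracy |')
print('| ----------- | ----------- |')
for name, judged in placeholder_judgements.items():
    print(f'| {name} | {accuracy(judged):.1f} |')
```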

Also include an example prediction that has a judged answer and compare it to the predictions by each system. Try to find an example where the systems differ in their predictions.