# **Question-Answering agent**




### **Introduction**

We build a reading comprehension question-and-answer agent.

We start from a pretrained model. We chose a BERT-type model: DistilBERT, a smaller cousin of BERT. The two share the same architecture, except that DistilBERT has fewer layers. See this [paper](https://arxiv.org/pdf/1910.01108.pdf) for more info.

We need to do the following:

1. Download a pretrained DistilBERT.
2. Add a task-specific readout for Q-and-A. In this case, a linear readout.
3. Finetune both DistilBERT and readout for the Q-and-A task.
4. When prompted to generate answers, use a sampling algorithm.
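To make step 2 concrete, here is a minimal sketch of what a linear readout computes: each token's hidden vector is mapped to two numbers, a start logit and an end logit. The function name `linear_readout` and all numbers are hypothetical and only for illustration (DistilBERT's real hidden size is 768, and the actual model does this with a PyTorch linear layer).

```python
def linear_readout(hidden_states, weight, bias):
    """Map each token's hidden vector to a (start_logit, end_logit) pair.

    hidden_states: list of seq_len vectors, each of length d
    weight: two rows of length d (row 0 -> start logit, row 1 -> end logit)
    bias: two floats
    """
    start_logits, end_logits = [], []
    for h in hidden_states:
        start_logits.append(sum(w * x for w, x in zip(weight[0], h)) + bias[0])
        end_logits.append(sum(w * x for w, x in zip(weight[1], h)) + bias[1])
    return start_logits, end_logits


# Toy example: 4 tokens with hidden size 3 (hypothetical values)
hidden = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
W = [[2.0, 0.0, -1.0],  # start-logit weights
     [0.0, 1.0, 3.0]]   # end-logit weights
b = [0.0, 0.5]

start_logits, end_logits = linear_readout(hidden, W, b)
print(start_logits)  # [2.0, 0.0, -1.0, 2.0]
print(end_logits)    # [0.5, 1.5, 3.5, 1.5]
```

Fine-tuning (step 3) would update both `W`, `b` and the encoder that produces `hidden`.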

Note that the actual fine-tuning has already been done; we'll load the fine-tuned model and start from there.

You will be asked to understand what the model outputs, and generate human-readable answers to questions.

**Since this is the last homework, there are places where little information is given; you are encouraged to learn about the classes and objects by exploring them or looking up documentation online. Still, feel free to post to Piazza if there's something you don't understand**.



### **Setup**

Dataset - The dataset we will be using is the [Stanford Question and Answer Dataset (SQuAD) v1.1](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/).


The dataset will be downloaded as `dev-v1.1.json`. This is the evaluation dataset; since you won't be doing the actual finetuning, the training dataset is not needed.

(Pretrained) model - We will use Hugging Face's [transformers](https://huggingface.co/transformers/) package. This package provides a wide variety of pretrained encoder models.
You will download the finetuned model separately.


Important: Using Google Drive
It is highly recommended that you mount your Google Drive to Colab. The code provided to you assumes that you've already done that. Create a folder named `6864_hw4` in your Google Drive root directory and use the code below to mount it. The code should save everything (dataset, feature-ized data, trained models etc.) in the `6864_hw4` folder in your drive.

**Nothing for you to code for now, but please understand the lines and get ready to answer some questions.**

In [0]:
# Logistics #1: mount google drive
from google.colab import drive
drive.mount('/content/gdrive')


%%bash
# Logistics #2: install the transformers package, create a folder, download the dataset and a patch
pip -q install transformers

# remove the directory if necessary
# rm -rf "/content/gdrive/My Drive/6864_hw4/"

mkdir -p "/content/gdrive/My Drive/6864_hw4/"
cd "/content/gdrive/My Drive/6864_hw4/"
wget -nv -c https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
wget -nv -c https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py

# fixing an incompatibility between the huggingface package and colab
wget -nv -c https://raw.githubusercontent.com/hzshan/mit6864/master/processor.py

# download the finetuned model
wget -nv -c https://raw.githubusercontent.com/hzshan/mit6864/master/config.json
wget -nv -c https://raw.githubusercontent.com/hzshan/mit6864/master/vocab.txt
wget -nv -c --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1V43WEqmkcH4VP7CDdMkYzITwc_lcSjvP' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1V43WEqmkcH4VP7CDdMkYzITwc_lcSjvP" -O pytorch_model.bin && rm -rf /tmp/cookies.txt


import glob, logging, os, random, sys, timeit
import numpy as np
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, SubsetRandomSampler
from tqdm.notebook import tqdm, trange


from transformers import squad_convert_examples_to_features, AutoModelForQuestionAnswering, AutoTokenizer

from transformers.data.metrics.squad_metrics import compute_predictions_logits, squad_evaluate

from transformers.data.processors.squad import SquadResult

sys.path.append('/content/gdrive/My Drive/6864_hw4')
from processor import SquadV1Processor

# Make sure you are using GPU as a hardware accelerator for this notebook
assert torch.cuda.is_available()

### **Utility functions**

The transformers package provides most of the useful utility functions we need to preprocess the data.

We will set up the model, which already combines DistilBERT with a Q-and-A specific decoder. 

In [0]:
DIR = "/content/gdrive/My Drive/6864_hw4/"

device = torch.device("cuda")
# Load a trained model and vocabulary that has been fine-tuned by the TAs

model = AutoModelForQuestionAnswering.from_pretrained(DIR)
tokenizer = AutoTokenizer.from_pretrained(DIR, do_lower_case=True)
model=model.to(device)


# Define some parameters
max_seq_length = 384 # The maximum total input sequence length after WordPiece tokenization.
doc_stride = 128 # The stride (in tokens) between consecutive windows when a long document is split into chunks.
max_query_length = 64 # The maximum number of tokens for the question.
batch_size = 8
predict_file = 'dev-v1.1.json' # name of the evaluation dataset file

### **Convert the dataset to features**

Convert the dataset (the .json file contains human-readable texts) to features.

If you are doing this for the first time, it may take a few minutes. Afterwards, it will save the featurized dataset in the directory.
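Part of the conversion is splitting paragraphs that do not fit into one input sequence into overlapping windows, which is what `doc_stride` controls. The sketch below is a simplified illustration of that idea, not the library's implementation. The function `sliding_windows` and the value `max_tokens_for_doc=300` are hypothetical; in practice the room left for the document is `max_seq_length` minus the question tokens and special tokens.

```python
def sliding_windows(num_doc_tokens, max_tokens_for_doc, doc_stride):
    """Return (start, length) pairs of overlapping windows that cover the document."""
    windows = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        windows.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the document
        start += min(length, doc_stride)
    return windows


# A hypothetical 700-token paragraph, 300 tokens of room per window, stride 128
windows = sliding_windows(num_doc_tokens=700, max_tokens_for_doc=300, doc_stride=128)
print(windows)
# [(0, 300), (128, 300), (256, 300), (384, 300), (512, 188)]
```

Because windows overlap, the same document token can appear in several features. This is why one example can map to more than one feature later on.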

In [0]:
def load_and_cache_examples(tokenizer, evaluate=False, output_examples=False):

    # Load data features from cache or dataset file
    input_dir = DIR
    cached_features_file = os.path.join(
        input_dir,
        "cached_{}_{}_{}".format(
            "dev" if evaluate else "train",
            list(filter(None, 'distilbert-base-uncased'.split("/"))).pop(),
            str(max_seq_length),
        ),
    )

    if os.path.exists(cached_features_file):
        print('Existing cached file found.')

        print('Loading cached features. This may take up to a few minutes.')
        features_and_dataset = torch.load(cached_features_file)
        features, dataset, examples = (
            features_and_dataset["features"],
            features_and_dataset["dataset"],
            features_and_dataset["examples"],
        )
    else:

        processor = SquadV1Processor()
        if evaluate:
            examples = processor.get_dev_examples(DIR, filename=predict_file)
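        else:
            # NOTE: `train_file` is not defined in this notebook; only the evaluation path (evaluate=True) is used.
            examples = processor.get_train_examples(DIR, filename=train_file)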

        features, dataset = squad_convert_examples_to_features(
            examples=examples,
            tokenizer=tokenizer,
            max_seq_length=max_seq_length,
            doc_stride=doc_stride,
            max_query_length=max_query_length,
            is_training=not evaluate,
            return_dataset="pt",
            threads=1,
        )

        torch.save({"features": features, "dataset": dataset, "examples": examples}, cached_features_file)

    if output_examples:
        return dataset, examples, features
    return dataset
  

dataset, examples, features = load_and_cache_examples(tokenizer, evaluate=True, output_examples=True)

Processing input data


convert squad examples to features: 100%|██████████| 10570/10570 [01:15<00:00, 139.13it/s]
add example index and unique id: 100%|██████████| 10570/10570 [00:00<00:00, 759712.68it/s]


Here, examples contains text examples of the task; features are the representations fed to the transformers. Explore the various objects and **answer the following questions**.

* 3.1. In the Q-and-A model, the encoder is DistilBERT. What is the decoder? (Hint: look at the architecture of the `model` object)

* 3.2. What type of Q-and-A task is in the SQuAD dataset? What is provided, what are the queries, and what are the expected answers? (e.g., is it info retrieval or extraction? Are questions related to a text or abstractive? Does the model need to figure out where to look for answers? Does the model need to perform reasoning in order to answer?)
* 3.3. In class, we discussed various strategies to build a Q-and-A agent. For example, one strategy to answer questions about a paragraph is to append each question to the paragraph, and ask the model to predict the words that come after the question (as in language modeling). 

    (1) What is the strategy used here? 

    (2) Why does it suit the SQuAD dataset? 

    (3) Describe a Q-and-A task where this wouldn't work.


### **Sampling from a trained model**

**Once you've answered these questions, get ready to code!**

You will code a sampling algorithm that returns the k-best answers to each question. To do that, a bare-bones structure has been provided.

Before you proceed, you need to understand the outputs of the network. We assume that the correct answer is a contiguous span within an input sequence of 384 tokens (`max_seq_length`), so all we need to do is find where the answer begins and ends. For each input sequence, `outputs` of `model` is a tuple of two arrays of length 384. The first holds, for each token, a logit (an unnormalized log-likelihood) that the token is the start of the answer; the second does the same for the end.
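As a small illustration of how to read such outputs, the sketch below turns toy logits into probabilities with a softmax and picks the single most likely start and end positions. The logit values are made up and much shorter than 384; `softmax` is a plain-Python helper written for this example.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a 5-token sequence
start_logits = [0.1, 4.0, 0.3, -1.0, 0.2]
end_logits = [-0.5, 0.2, 0.1, 3.5, 0.0]

start_probs = softmax(start_logits)
end_probs = softmax(end_logits)

# The most likely start and end positions are the argmax of each array
best_start = max(range(len(start_logits)), key=lambda i: start_logits[i])
best_end = max(range(len(end_logits)), key=lambda i: end_logits[i])
print(best_start, best_end)  # 1 3
```

Since softmax preserves ordering, ranking positions by logit or by probability gives the same result.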

In addition, it is good to know that each question in the SQuAD dataset has a unique ID. The code provided below goes through every question in the evaluation dataset and combines the question's ID, start logits, and end logits into a `SquadResult` object.


In [0]:
# [NOTHING TO CODE IN THIS CELL]

def to_list(tensor):
    return tensor.detach().cpu().tolist()

eval_sampler = SequentialSampler(dataset)
eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=batch_size)

all_results = []

# evaluate the data in batches
for batch in eval_dataloader:

    # set the model in eval mode
    model.eval()

    # move batch to GPU
    batch = tuple(t.to(device) for t in batch)

    with torch.no_grad():
        inputs = {"input_ids": batch[0], "attention_mask": batch[1]}

        example_indices = batch[3]

        outputs = model(**inputs)

    for i, example_index in enumerate(example_indices):
        eval_feature = features[example_index.item()]
        unique_id = int(eval_feature.unique_id)

        output = [to_list(output[i]) for output in outputs]

        start_logits, end_logits = output
        result = SquadResult(unique_id, start_logits, end_logits)

        all_results.append(result)

**You will now need to code the following**. 
* For each question, find the 10 spans that are most likely the answer. 
* After that, convert them back into texts. 
* Give a few examples of questions and their corresponding answers. 

HINT: start by exploring the attributes of `SquadResult`.
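One possible way to organize the search, shown on toy logits: take the k highest-scoring start and end positions, try every combination, drop invalid spans, and rank the rest by the sum of the two logits. This is a simplified sketch under stated assumptions, not the reference solution. The names `top_k_indices` and `best_spans` and the default `max_answer_length=30` are hypothetical, and the real task needs further validity checks against the feature objects.

```python
def top_k_indices(logits, k):
    """Indices of the k largest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def best_spans(start_logits, end_logits, k=10, max_answer_length=30):
    """Return up to k (score, start, end) tuples, best first."""
    candidates = []
    for s in top_k_indices(start_logits, k):
        for e in top_k_indices(end_logits, k):
            if e < s:
                continue  # the end cannot come before the start
            if e - s + 1 > max_answer_length:
                continue  # span is too long
            candidates.append((start_logits[s] + end_logits[e], s, e))
    candidates.sort(reverse=True)
    return candidates[:k]


# Made-up logits for a 5-token sequence
start_logits = [0.0, 5.0, 1.0, 0.5, 2.0]
end_logits = [0.0, 1.0, 4.0, 0.5, 3.0]

spans = best_spans(start_logits, end_logits, k=3, max_answer_length=3)
print(spans[0])  # (9.0, 1, 2)
```

Summing the logits corresponds to multiplying the start and end probabilities, up to a normalizing constant.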

A note about converting indices to text:
A function has been provided below to aid you. Basically, we can tokenize the text in two ways. For example, from "John Smith's", we can produce either whitespace tokens ("John", "Smith's"; i.e. `orig_text` below) or WordPiece tokens ("john smith"; `tok_text` below). The model is trained using the second tokenizer, which returns text that differs from the original. We would like to align the two answers to get the best text ("John Smith"). The `get_final_text` function does exactly that.
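To see why the alignment is needed, here is a toy sketch of the bookkeeping behind `token_to_orig_map`. The splitter `toy_subword_split` is a hypothetical stand-in for WordPiece (it only lowercases and splits off a trailing `'s`); the real tokenizer behaves differently in detail.

```python
def toy_subword_split(word):
    """Hypothetical stand-in for WordPiece: lowercase and split off a trailing 's."""
    word = word.lower()
    if word.endswith("'s"):
        return [word[:-2], "'", "s"]
    return [word]


doc_tokens = "John Smith's dog barked".split()  # whitespace tokens

tok_tokens = []          # sub-word tokens seen by the model
token_to_orig_map = {}   # sub-word index -> whitespace-token index
for orig_index, word in enumerate(doc_tokens):
    for piece in toy_subword_split(word):
        token_to_orig_map[len(tok_tokens)] = orig_index
        tok_tokens.append(piece)

# Suppose the model predicts the span covering sub-word tokens 0..1
start, end = 0, 1
tok_text = " ".join(tok_tokens[start:end + 1])
orig_text = " ".join(doc_tokens[token_to_orig_map[start]:token_to_orig_map[end] + 1])

print(tok_text)   # john smith
print(orig_text)  # John Smith's
```

The two strings disagree on casing and on the trailing `'s`, which is the mismatch that `get_final_text` is meant to resolve.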

In [0]:
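from transformers.data.metrics.squad_metrics import get_final_text, _get_best_indexes
import collections

# Parameters used by the span search below
n_best_size = 10  # number of candidate start/end indices to keep (we want the 10 best spans)
max_answer_length = 30  # assumed cap on answer length in tokens; adjust as needed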

def indices_to_text(START_IND, END_IND, feature, example, tokenizer):

  """feature is the featurized representation of each text, whereas example contains the text in text form"""

  tok_tokens = feature.tokens[START_IND : (END_IND + 1)]
  orig_doc_start = feature.token_to_orig_map[START_IND]
  orig_doc_end = feature.token_to_orig_map[END_IND]
  orig_tokens = example.doc_tokens[orig_doc_start : (orig_doc_end + 1)]

  tok_text = tokenizer.convert_tokens_to_string(tok_tokens)

  # Clean whitespace
  tok_text = tok_text.strip()
  tok_text = " ".join(tok_text.split())
  orig_text = " ".join(orig_tokens)

  text = get_final_text(tok_text, orig_text, do_lower_case=True)
  return text


# associate question features to results
example_index_to_features = collections.defaultdict(list)
for feature in features:
    example_index_to_features[feature.example_index].append(feature)

# associate question IDs to results
unique_id_to_result = {}
for result in all_results:
    unique_id_to_result[result.unique_id] = result

for (example_index, example) in enumerate(examples):
    # use a separate name so the global `features` list is not overwritten
    example_features = example_index_to_features[example_index]

    prelim_predictions = []

    for (feature_index, feature) in enumerate(example_features):

        # feature.unique_id contains the unique_id associated with the question. Use it to find the corresponding output in `unique_id_to_result`
        #[YOUR CODE HERE]

        # From outputs of the network, find the start indices and end indices with the highest likelihood
        #[YOUR CODE HERE]

        result = unique_id_to_result[feature.unique_id]
        start_indexes = _get_best_indexes(result.start_logits, n_best_size)
        end_indexes = _get_best_indexes(result.end_logits, n_best_size)

        # Search through all possible combinations between the high-likelihood start indices and end indices.
        # Remember to exclude invalid ones (e.g. if the end index is smaller than the start). What are other cases where
        # an answer is invalid?
  
        for start_index in start_indexes:
            for end_index in end_indexes:
                # Some exclusions are written for you. More are needed
                if start_index not in feature.token_to_orig_map:
                    continue
                if end_index not in feature.token_to_orig_map:
                    continue
                if not feature.token_is_max_context.get(start_index, False):
                    continue

                # [YOUR CODE HERE] to exclude other invalid cases

                length = end_index - start_index + 1
                if length > max_answer_length:
                    continue
                
                # If the answer is valid, store the start and end indices in order to retrieve the text.
                # [YOUR CODE HERE]

                # Once you've found the start and end indices that correspond to the most likely answers, retrieve the span in text
                # [YOUR CODE HERE]
