# Question Answering with BERT and SQuAD Spanish (using Hugging Face)

Author: Vladimir Araujo

Based on: https://www.spark64.com/post/machine-comprehension



## 1.0 Introduction

Question Answering (QA) is a challenging task in NLP. The aim is to automatically provide answers to queries expressed in natural language (Hovy, Gerber, Hermjakob, Junk, and Lin 2000). For instance, given the following context:

> Quito, oficialmente San Francisco de Quito, es la capital de la República del Ecuador, de la Provincia de Pichincha y la capital más antigua de Sudamérica. Es la ciudad más poblada del Ecuador,​ con 2 millones de habitantes en el área urbana, y aproximadamente 3 millones en todo el Área metropolitana.

We ask the question

> ¿Cuál es la población de Quito?

We expect the QA system to respond with something like this:

> 2 millones

Since 2017, transformer models have been shown to outperform earlier approaches for this task. Many pretrained transformer models now exist, including BERT, GPT-2, and XLNet.

This tutorial shows how you can fine-tune BERT for the task of QA and use it for inference. We will use the Transformers library built by [Hugging Face](https://huggingface.co/), which provides implementations of transformer models in both TensorFlow and PyTorch. If you do not want to train a model yourself, you can use an already fine-tuned one from their [model hub](https://huggingface.co/models).

This tutorial is for educational purposes: we will learn how to fine-tune a BERT model and use it with our own data.

## Using a BERT-based model for QA

<figure>
<center>
<img src='https://miro.medium.com/max/1840/1*QhIXsDBEnANLXMA0yONxxA.png' width="500" />
</center>
</figure>

*   Input is the $Question$ tokens and the $Paragraph$ tokens separated by the special token $[SEP]$. 
*   The final hidden vector of BERT for token $i$ is $T_i$.
*   New parameters learned during fine-tuning are a start vector $S$ and an end vector $E$.
*   The probability of word $i$ being the start/end of the answer span is computed as a dot product between $T_{i}$ and $S$ or $E$ followed by a softmax.
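The span scoring described in the list above can be sketched in a few lines of plain Python. This is only an illustration: the hidden vectors `T` and the vectors `S` and `E` below are made-up toy values (hidden size 3, four tokens), whereas in the real model they come from BERT and from fine-tuning.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Toy final hidden vectors T_i for a 4-token input (values are made up).
T = [
    [0.1, 0.3, -0.2],
    [1.2, -0.4, 0.5],
    [0.0, 0.9, 0.7],
    [-0.6, 0.2, 0.1],
]
S = [1.0, 0.0, 0.5]  # toy start vector (learned during fine-tuning in practice)
E = [0.0, 1.0, 1.0]  # toy end vector (learned during fine-tuning in practice)

# Probability of each token being the start / end of the answer span.
start_probs = softmax([dot(t, S) for t in T])
end_probs = softmax([dot(t, E) for t in T])

start_index = max(range(len(T)), key=start_probs.__getitem__)
end_index = max(range(len(T)), key=end_probs.__getitem__)
print("most likely start token:", start_index)
print("most likely end token:", end_index)
```

Taking the two argmax positions independently is the simplest possible decoding; it does not check that the end comes after the start.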

## 2.0 Setup

First, we clone and install the Hugging Face Transformers library from GitHub.

In [None]:
!git clone https://github.com/huggingface/transformers \
&& cd transformers \
&& git checkout a3085020ed0d81d4903c50967687192e3101e770 

In [None]:
!pip install ./transformers

## 3.0 Train Model

This is where we can train our own model.

### 3.1 Get Training and Evaluation Data

[SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. In this tutorial we will use a Spanish version of this dataset.

Read more about this dataset here: https://github.com/ccasimiro88/TranslateAlignRetrieve
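To make the format concrete, here is a small hand-written record in the nested SQuAD v2.0 JSON layout, together with a hypothetical `flatten` helper that turns it into one row per question. The record is invented for illustration and is not taken from the dataset, and `flatten` is only a stand-in sketch, not the helper used later in this tutorial.

```python
# A hand-written record in the SQuAD v2.0 layout (not from the real dataset).
squad_like = {
    "version": "v2.0",
    "data": [
        {
            "title": "Quito",
            "paragraphs": [
                {
                    "context": "Quito es la capital de Ecuador.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "¿De qué país es capital Quito?",
                            "is_impossible": False,
                            "answers": [{"text": "Ecuador", "answer_start": 23}],
                        },
                        {
                            "id": "q2",
                            "question": "¿Cuántos ríos cruzan Quito?",
                            "is_impossible": True,  # unanswerable from the context
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ],
}

def flatten(squad):
    """Hypothetical helper: one dict per question, with the first answer if any."""
    rows = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = qa["answers"]
                rows.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": paragraph["context"],
                    "text": answers[0]["text"] if answers else None,
                    "ans_start": answers[0]["answer_start"] if answers else None,
                    "is_impossible": qa.get("is_impossible", False),
                })
    return rows

rows = flatten(squad_like)
for row in rows:
    print(row["id"], row["is_impossible"], row["text"], row["ans_start"])
```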

Now get the SQuAD V2.0 dataset. `train-v2.0-es_small.json` is for training and `dev-v2.0-es_small.json` is for evaluation to see how well your model trained. Note that we use the small version for convenience.



In [None]:
!mkdir dataset \
&& cd dataset \
&& wget https://github.com/ccasimiro88/TranslateAlignRetrieve/raw/master/SQuAD-es-v2.0/train-v2.0-es_small.json \
&& wget https://github.com/ccasimiro88/TranslateAlignRetrieve/raw/master/SQuAD-es-v2.0/dev-v2.0-es_small.json

### 3.2 Dataset Exploration

Let's explore an example from the dataset. We first download a helper module, `squad_utils.py`, which provides a function for loading the JSON file of the SQuAD dataset.

In [None]:
!wget https://gist.githubusercontent.com/vgaraujov/fd17b0c151657fbce73189a98617f1c6/raw/f677ae68aaa9cb6dc274cdbf44e60b12653c6cea/squad_utils.py

Now we can load a split of the dataset and look at one instance. Change the `id` variable if you want to see a different example.

In [None]:
import squad_utils

dev_data = squad_utils.json_to_dataframe('dataset/dev-v2.0-es_small.json')

In [None]:
id = 100

In [None]:
dev_data.iloc[id]['context']

In [None]:
dev_data.iloc[id]['question']

In [None]:
dev_data.iloc[id]['text']

In [None]:
dev_data.iloc[id]['ans_start']
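The fields inspected above fit together as follows, assuming `ans_start` is a character offset into `context`, as in the original SQuAD format. The row below is a toy one built from the Quito example, with the offset derived in code rather than read from the dataset, and `span_matches` is a hypothetical helper name.

```python
# Toy row built from the Quito example (not read from the dataset).
context = (
    "Quito, oficialmente San Francisco de Quito, es la capital de la "
    "República del Ecuador. Es la ciudad más poblada del Ecuador, con "
    "2 millones de habitantes en el área urbana."
)
text = "2 millones"

# In the dataset this offset is stored with each answer; here we derive it.
ans_start = context.index(text)

def span_matches(context, text, ans_start):
    """True if slicing the context at ans_start reproduces the answer text."""
    return context[ans_start:ans_start + len(text)] == text

print(ans_start, span_matches(context, text, ans_start))
```

Running the same check over real rows is a quick way to spot misaligned answers in a translated dataset.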

### 3.3 Run training (Optional)

We can now train the model with the training set. 

**Notes about parameters:**

`per_gpu_train_batch_size` specifies the number of training examples per iteration per GPU.

`save_steps` specifies the number of training steps between checkpoint files. I've increased it to save disk space.

`num_train_epochs` sets the number of epochs. Two epochs are recommended, and the command below uses 2.0; lower it to 1.0 if you are short on time.

`version_2_with_negative` is required for SQuAD V2.0. If training with V1.1, take out this flag.

NOTE: it takes about 1 hour to train an epoch! If you don't want to wait this long, feel free to skip this step and use a pretrained model!
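As a rough sketch of how these parameters interact, the arithmetic below estimates the number of optimizer steps and intermediate checkpoints. It assumes no gradient accumulation, and the feature count of 60,000 is a made-up number for illustration, not the size of the Spanish dataset; `training_schedule` is a hypothetical helper.

```python
import math

def training_schedule(num_features, per_gpu_train_batch_size, num_gpus,
                      num_train_epochs, save_steps):
    """Estimate steps per epoch, total steps, and intermediate checkpoints.

    Assumes gradient_accumulation_steps == 1.
    """
    batch_size = per_gpu_train_batch_size * max(1, num_gpus)
    steps_per_epoch = math.ceil(num_features / batch_size)
    total_steps = int(steps_per_epoch * num_train_epochs)
    checkpoints = total_steps // save_steps
    return steps_per_epoch, total_steps, checkpoints

# 60_000 training features is an invented figure for illustration.
steps_per_epoch, total_steps, checkpoints = training_schedule(
    num_features=60_000,
    per_gpu_train_batch_size=12,
    num_gpus=1,
    num_train_epochs=2.0,
    save_steps=5000,
)
print(steps_per_epoch, total_steps, checkpoints)
```

With these toy numbers, a larger `save_steps` directly reduces how many checkpoint directories are written during training.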

In [None]:
!export SQUAD_DIR=/content/dataset \
&& python transformers/examples/run_squad.py \
  --model_type bert \
  --model_name_or_path dccuchile/bert-base-spanish-wwm-cased \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v2.0-es_small.json \
  --predict_file $SQUAD_DIR/dev-v2.0-es_small.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /content/model_output \
  --save_steps 5000 \
  --threads 4 \
  --version_2_with_negative 

## 4.0 Setup prediction code

Now we can use the Hugging Face library to make predictions using our model. Note that a lot of the code is pulled from `run_squad.py` in the Hugging Face repository, with all the training parts removed.


In [None]:
import os
import torch
import time
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

from transformers import (
    AutoTokenizer, 
    AutoModelForQuestionAnswering,
    squad_convert_examples_to_features
)

from transformers.data.processors.squad import SquadResult, SquadV2Processor, SquadExample
from transformers.data.metrics.squad_metrics import compute_predictions_logits

If you have trained your own model, set the flag `use_own_model` to `True`. If you want to use a pre-trained model from the hub instead, set `use_own_model` to `False` and define the model in the variable `model_name_or_path`.

In this tutorial, we will use a model already fine-tuned on the Spanish SQuAD called [`mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es`](https://huggingface.co/mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es).

In [None]:
# READER NOTE: Set this flag to use own model, or use pretrained model in the Hugging Face repository
use_own_model = False

if use_own_model:
  model_name_or_path = "/content/model_output"
else:
  model_name_or_path = "mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es"

output_dir = ""

# Config
n_best_size = 1
max_answer_length = 30
do_lower_case = True
null_score_diff_threshold = 0.0

def to_list(tensor):
    return tensor.detach().cpu().tolist()

# Setup model
model_class, tokenizer_class = (AutoModelForQuestionAnswering, AutoTokenizer)
tokenizer = tokenizer_class.from_pretrained(
    model_name_or_path, do_lower_case=True)
model = model_class.from_pretrained(model_name_or_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

processor = SquadV2Processor()

This function is borrowed from `run_squad.py`. The modified code lets us run predictions on questions and a context passed in directly as strings, rather than in the .json format used by the training/test set.

In [None]:
def run_prediction(question_texts, context_text):
    """Setup function to compute predictions"""
    examples = []

    for i, question_text in enumerate(question_texts):
        example = SquadExample(
            qas_id=str(i),
            question_text=question_text,
            context_text=context_text,
            answer_text=None,
            start_position_character=None,
            title="Predict",
            is_impossible=False,
            answers=None,
        )

        examples.append(example)

    features, dataset = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=384,
        doc_stride=128,
        max_query_length=64,
        is_training=False,
        return_dataset="pt",
        threads=1,
    )

    eval_sampler = SequentialSampler(dataset)
    eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=10)

    all_results = []

    for batch in eval_dataloader:
        model.eval()
        batch = tuple(t.to(device) for t in batch)

        with torch.no_grad():
            inputs = {
                "input_ids": batch[0],
                "attention_mask": batch[1],
                "token_type_ids": batch[2],
            }

            example_indices = batch[3]

            outputs = model(**inputs)

            for i, example_index in enumerate(example_indices):
                eval_feature = features[example_index.item()]
                unique_id = int(eval_feature.unique_id)

                output = [to_list(output[i]) for output in outputs]

                start_logits, end_logits = output
                result = SquadResult(unique_id, start_logits, end_logits)
                all_results.append(result)

    output_prediction_file = "predictions.json"
    output_nbest_file = "nbest_predictions.json"
    output_null_log_odds_file = "null_predictions.json"

    predictions = compute_predictions_logits(
        examples,
        features,
        all_results,
        n_best_size,
        max_answer_length,
        do_lower_case,
        output_prediction_file,
        output_nbest_file,
        output_null_log_odds_file,
        False,  # verbose_logging
        True,  # version_2_with_negative
        null_score_diff_threshold,
        tokenizer,
    )

    return predictions
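The settings `max_answer_length` and `null_score_diff_threshold` defined earlier control how a final answer is chosen from the start and end logits. The sketch below is a simplified stand-in for that post-processing, not the library's implementation: it works on token indices only, ignores the n-best list, and assumes position 0 is the `[CLS]` token whose score represents "no answer". The helper name `best_span` and the logits are invented for illustration.

```python
def best_span(start_logits, end_logits, max_answer_length=30,
              null_score_diff_threshold=0.0):
    """Return (start, end) token indices of the best span, or None for 'no answer'."""
    # Score of the "no answer" prediction: both pointers on position 0 ([CLS]).
    null_score = start_logits[0] + end_logits[0]

    best_score, best = float("-inf"), None
    for start in range(1, len(start_logits)):
        # Only consider spans of at most max_answer_length tokens.
        last = min(start + max_answer_length, len(end_logits))
        for end in range(start, last):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best_score, best = score, (start, end)

    # Predict "no answer" when the null score beats the best span by the threshold.
    if best is None or null_score - best_score > null_score_diff_threshold:
        return None
    return best

# Toy logits for a 5-token input (made-up values).
answerable = best_span([0.1, 0.2, 3.0, 0.5, 0.1], [0.2, 0.1, 0.4, 2.5, 0.3])
unanswerable = best_span([5.0, 0.1, 0.2], [4.0, 0.3, 0.1])
print(answerable, unanswerable)
```

Raising `null_score_diff_threshold` makes the sketch less willing to answer "no answer", and lowering `max_answer_length` forces shorter spans.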

## 5.0 Run predictions

Now for the fun part: testing the model on different inputs. The example below is rudimentary, but `run_prediction` accepts any context and any list of questions.

In [None]:
context = "Quito, oficialmente San Francisco de Quito, es la capital de la República del Ecuador, de la Provincia de Pichincha y la capital más antigua de Sudamérica. Es la ciudad más poblada del Ecuador,​ con 2 millones de habitantes en el área urbana, y aproximadamente 3 millones en todo el Área metropolitana."
questions = ["¿Cuál es la población de Quito?", 
             "¿En qué provincia esta ubicado Quito?"]

# Run method
predictions = run_prediction(questions, context)

# Print results
print("Results:")
for i, key in enumerate(predictions.keys()):
  print(questions[i],predictions[key])

## 6.0 Activity

Now it is your turn. Use the `run_prediction` function from Section 4.0, as in the previous section, to generate your own predictions. To do that, change the `context` and `questions` variables.

In [None]:
# Your code here

---

Based on this tutorial and the class, indicate whether the following statements are `True` or `False`.


In [None]:
#@title The SQuAD dataset is a reading comprehension task
answer = None #@param ["None","False", "True"] {type:"raw"}

In [None]:
#@title The BERT model is trained from scratch for the QA task
answer = None #@param ["None","False", "True"] {type:"raw"}

In [None]:
#@title This model generates the response word by word (generative approach)
answer = None #@param ["None","False", "True"] {type:"raw"}

## 7.0 Bonus: Attention Visualization

The attention heads of the Transformer capture the relationships between the tokens. We can explore them to understand which tokens contribute the most to the prediction.

*BertViz* is a tool for visualizing attention in the Transformer model, supporting all models from the Hugging Face library. First, we clone and install the library from GitHub.

In [None]:
import sys
!test -d bertviz_repo && echo "FYI: bertviz_repo directory already exists, to pull latest version uncomment this line: !rm -r bertviz_repo"
# !rm -r bertviz_repo # Uncomment if you need a clean pull from repo
!test -d bertviz_repo || git clone https://github.com/jessevig/bertviz bertviz_repo
if 'bertviz_repo' not in sys.path:
  sys.path += ['bertviz_repo']
!pip install regex

In [None]:
def call_html():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))

Next, we load the model fine-tuned on SQuAD.

In [None]:
from bertviz import head_view
from transformers import BertTokenizer, BertModel

In [None]:
do_lower_case = True
model = BertModel.from_pretrained(model_name_or_path, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_name_or_path, do_lower_case=do_lower_case)

To visualize the attention weights, we need to define `sentence_a` and `sentence_b`. Remember that the input for BERT QA is `[question, context]`, so `sentence_a` is the question and `sentence_b` is the context.

Note: You need to set the `sentence_b_start` parameter of the `head_view` function correctly. It is the index of the first token of `sentence_b`, so it depends on the number of tokens in your question.

In [None]:
sentence_b = "Quito, oficialmente San Francisco de Quito, es la capital de la República del Ecuador, de la Provincia de Pichincha y la capital más antigua de Sudamérica."
sentence_a = "¿En qué provincia esta ubicado Quito?"

inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
token_type_ids = inputs['token_type_ids']
input_ids = inputs['input_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
call_html()

head_view(attention, tokens, sentence_b_start = 10)
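Instead of hard-coding `sentence_b_start`, it can be derived from the segment ids, since BERT marks question tokens with 0 and context tokens with 1. The sketch below uses a hand-written list of segment ids so that it runs on its own; `find_sentence_b_start` is a hypothetical helper name. In the notebook, the equivalent would be to apply it to `token_type_ids[0].tolist()`.

```python
def find_sentence_b_start(segment_ids):
    """Index of the first token with segment id 1, i.e. where sentence B begins."""
    return segment_ids.index(1)

# Hand-written segment ids for illustration:
# [CLS] + 5 question tokens + [SEP] (segment 0), then 4 context tokens + [SEP] (segment 1).
example_segment_ids = [0] * 7 + [1] * 5

sentence_b_start = find_sentence_b_start(example_segment_ids)
print(sentence_b_start)
```

This keeps the visualization correct when the question changes length or the tokenizer splits a word into several pieces.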