# First QA Model

For our first QA model, we will set up a simple question-answering pipeline using Hugging Face Transformers and a pretrained BERT model. We will test it on our SQuAD data, so let's install `transformers` and then load that data.

In [3]:
!pip install transformers


In [12]:
import json

with open('/content/dev.json', 'r') as f:
    squad = json.load(f)

As usual, we initialize our transformer tokenizer and model. This time, we will use a BERT model that has already been fine-tuned for question answering on the SQuAD dataset. Because the model has presumably already seen the SQuAD training split, we evaluate on the validation split instead.

In [13]:
from transformers import BertTokenizer, BertForQuestionAnswering

modelname = 'deepset/bert-base-cased-squad2'

tokenizer = BertTokenizer.from_pretrained(modelname)
model = BertForQuestionAnswering.from_pretrained(modelname)

Transformers comes with a useful helper called [`pipeline`](https://huggingface.co/transformers/main_classes/pipelines.html), which lets us set up easy-to-use pipelines for common tasks.

One of those pipelines is the `question-answering` pipeline, which takes a dictionary containing a `'question'` and a `'context'` and returns an answer. We initialize it like so:

In [14]:
from transformers import pipeline

qa = pipeline('question-answering', model=model, tokenizer=tokenizer)

Now we can begin asking questions. Let's take a few examples from our `squad` data.

In [15]:
squad[:2]

[{'answer': 'France',
  'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.',
  'question': 'In what country is Normandy located?'},
 {'answer': 'in the 10th and 11th centuries',
  'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 1

In [16]:
# we will initialize a list for answers
answers = []

for pair in squad[:5]:
    # pass in our question and context to return an answer
    ans = qa({
        'question': pair['question'],
        'context': pair['context']
    })
    # append predicted answer and real to answers list
    answers.append({
        'predicted': ans['answer'],
        'true': pair['answer']
    })

In [17]:
answers

[{'predicted': 'France.', 'true': 'France'},
 {'predicted': '10th and 11th centuries',
  'true': 'in the 10th and 11th centuries'},
 {'predicted': '10th and 11th centuries', 'true': '10th and 11th centuries'},
 {'predicted': 'Denmark, Iceland and Norway',
  'true': 'Denmark, Iceland and Norway'},
 {'predicted': 'Rollo,', 'true': 'Rollo'}]

We can see that the predictions are close to the true answers. Where they differ, the cause is trailing punctuation (`France.`, `Rollo,`) or a dropped leading phrase (`in the`). Next, we'll look at how to quantify these results.

To do that, we build a list of predicted answers `model_out` and a list of true answers `reference`, then calculate the ROUGE score between them.

In [18]:
from tqdm import tqdm

model_out = []
reference = []

for pair in tqdm(squad[0:20], leave=True):
    ans = qa({
        'question': pair['question'],
        'context': pair['context']
    })
    # append the prediction and reference to the respective lists
    model_out.append(ans['answer'])
    reference.append(pair['answer'])

100%|██████████| 20/20 [00:36<00:00,  1.84s/it]
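Since the same question/context loop appears twice above, it could be factored into a small helper. The sketch below is only an illustration: `collect_answers` and `fake_qa` are hypothetical names, and `fake_qa` is a stand-in stub so the example runs without downloading a model. It assumes the callable accepts a dictionary with `'question'` and `'context'` and returns a dictionary with an `'answer'` key, as the pipeline does in the cells above.

```python
def collect_answers(qa_fn, pairs):
    """Run qa_fn over each pair and return (predictions, references)."""
    model_out, reference = [], []
    for pair in pairs:
        # same call shape as the notebook's qa({...}) usage
        ans = qa_fn({
            'question': pair['question'],
            'context': pair['context']
        })
        model_out.append(ans['answer'])
        reference.append(pair['answer'])
    return model_out, reference


def fake_qa(inputs):
    # hypothetical stub: "answers" with the first word of the context
    return {'answer': inputs['context'].split()[0]}


pairs = [{
    'question': 'Who led the Normans?',
    'context': 'Rollo led the Normans.',
    'answer': 'Rollo'
}]
preds, refs = collect_answers(fake_qa, pairs)
print(preds, refs)
```

With the real pipeline, the equivalent call would be `model_out, reference = collect_answers(qa, squad[:20])`.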


In [20]:
!pip install rouge

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [21]:
from rouge import Rouge

# initialize
rouge = Rouge()

# get scores
rouge.get_scores(model_out, reference, avg=True)

{'rouge-1': {'f': 0.5153968224170924, 'p': 0.5654761904761905, 'r': 0.525},
 'rouge-2': {'f': 0.31249999826562497, 'p': 0.325, 'r': 0.305},
 'rouge-l': {'f': 0.5153968224170924, 'p': 0.5654761904761905, 'r': 0.525}}

That isn't scoring as high as we would expect. If we print some of the individual results, we can see why:

In [22]:
# recalculate individual scores
scores = rouge.get_scores(model_out, reference)

print(model_out[4], ' | ', reference[4], ' | ', scores[4]['rouge-1']['f'])
print(model_out[12], ' | ', reference[12], ' | ', scores[12]['rouge-1']['f'])

Rollo,  |  Rollo  |  0.0
William the Conqueror,  |  William the Conqueror  |  0.6666666616666668
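To see where numbers like these can come from, here is a simplified unigram-overlap F-score. It is a sketch, not the `rouge` package's own code: the function name `unigram_f1`, the split on single spaces, the use of sets of unique tokens, and the small `eps` in the denominator are all assumptions chosen so that the toy values line up with the output printed above.

```python
def unigram_f1(hypothesis, reference, eps=1e-8):
    """Simplified ROUGE-1-style F-score over unique whitespace tokens."""
    # assumption: tokens are whatever sits between single spaces,
    # so 'Rollo,' and 'Rollo' are different tokens
    hyp = set(hypothesis.split(' '))
    ref = set(reference.split(' '))
    overlap = len(hyp & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)  # share of predicted tokens that match
    recall = overlap / len(ref)     # share of reference tokens recovered
    # eps avoids division by zero and keeps a perfect match just below 1.0
    return 2.0 * precision * recall / (precision + recall + eps)


print(unigram_f1('Rollo,', 'Rollo'))
print(unigram_f1('William the Conqueror,', 'William the Conqueror'))
print(unigram_f1('Rollo', 'Rollo'))
```

In the second example only `William` and `the` match, so precision and recall are both 2/3, which gives an F-score of roughly 0.67.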


Clearly, the punctuation differences cause ROUGE to treat these words as non-matching: `Rollo,` and `Rollo` are compared as different tokens. To fix this, we'll import `re` and remove any characters that are not spaces, letters, or numbers.

In [23]:
import re

clean = re.compile('(?i)[^0-9a-z ]')

# apply this to both lists
model_out = [clean.sub('', text) for text in model_out]
reference = [clean.sub('', text) for text in reference]
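As a quick check of what this pattern keeps and drops, the sketch below applies it to a few strings. The sample strings are made up for illustration and are not taken from the dataset. Note that the pattern only keeps ASCII letters, so hyphens and accented characters are removed too, which may or may not be what you want for your own data.

```python
import re

# same pattern as above: case-insensitive, keep only 0-9, a-z, and spaces
clean = re.compile('(?i)[^0-9a-z ]')

# illustrative samples (hypothetical, not from the SQuAD file)
samples = ['Rollo,', 'France.', '10th-century', 'Orléans']
cleaned = [clean.sub('', text) for text in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```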

In [24]:
# recalculate individual scores
scores = rouge.get_scores(model_out, reference)

print(model_out[4], ' | ', reference[4], ' | ', scores[4]['rouge-1']['f'])
print(model_out[12], ' | ', reference[12], ' | ', scores[12]['rouge-1']['f'])

Rollo  |  Rollo  |  0.999999995
William the Conqueror  |  William the Conqueror  |  0.999999995


These scores look better now. Let's calculate the average again:

In [25]:
rouge.get_scores(model_out, reference, avg=True)

{'rouge-1': {'f': 0.5931745999448702,
  'p': 0.6392857142857143,
  'r': 0.6166666666666666},
 'rouge-2': {'f': 0.34999999815624994, 'p': 0.3571428571428571, 'r': 0.38},
 'rouge-l': {'f': 0.5931745999448702,
  'p': 0.6392857142857143,
  'r': 0.6166666666666666}}
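As a complementary check alongside ROUGE, one could also count how many predictions match the reference exactly after light normalization. The sketch below is an assumption-laden illustration, not part of the original notebook: `normalize` and `exact_match` are hypothetical helpers, and the normalization (lowercasing, dropping non-alphanumeric characters, collapsing spaces) is one reasonable choice among several. It is applied to the five prediction/reference pairs printed earlier.

```python
import re


def normalize(text):
    """Lowercase, drop anything but ASCII letters, digits, and spaces."""
    text = re.sub(r'[^0-9a-z ]', '', text.lower())
    # collapse repeated spaces left behind by removed characters
    return ' '.join(text.split())


def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after normalize()."""
    if not predictions:
        return 0.0
    hits = sum(
        normalize(p) == normalize(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(predictions)


# the five pairs shown in the `answers` output above
predicted = ['France.', '10th and 11th centuries', '10th and 11th centuries',
             'Denmark, Iceland and Norway', 'Rollo,']
true = ['France', 'in the 10th and 11th centuries', '10th and 11th centuries',
        'Denmark, Iceland and Norway', 'Rollo']

em = exact_match(predicted, true)
print(em)
```

Four of the five pairs match after normalization. The remaining one differs by the leading `in the`, which is the kind of partial overlap that ROUGE rewards and exact match does not.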