# SQuAD-Question-Answering

## Install dependencies

We will use the Transformers library from Hugging Face. It provides APIs to quickly download pretrained models, use them on a given text, and fine-tune them on your own datasets. At the same time, each Python module defining an architecture is fully standalone and can be modified to enable quick research experiments.

Transformers is backed by the three most popular deep learning libraries (JAX, PyTorch and TensorFlow), with seamless integration between them. We will be using TensorFlow.

In [1]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.1-py3-none-any.whl (4.9 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.22.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)

## Import dependencies

In [2]:
import json
import transformers
import pandas as pd
import numpy as np
from pathlib import Path
import tensorflow as tf
from datasets import Dataset
import collections
from transformers import AutoTokenizer
from transformers import DefaultDataCollator
from transformers import create_optimizer
from transformers import TFAutoModelForQuestionAnswering
from transformers import AutoConfig, TFAutoModel
from tqdm.auto import tqdm


### Google Drive

In [3]:
# Libraries for accessing files stored in Google Drive
from pydrive.auth import GoogleAuth
from google.colab import drive
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [4]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

file_id = '1BcEgcjOvTt6CsbycmBJhAFK9r3MkOnYO' # <-- Put the ID of your Google Drive file here


download = drive.CreateFile({'id': file_id})

In [22]:
file_pred = '1VibrBqGBHoTeSo4TE2OoHju50Vb_Zh-U'

download2 = drive.CreateFile({'id': file_pred})

In [6]:
tf.__version__

'2.8.2'

## Load data


We use Drive to load the JSON files containing the saved predictions and the dataset, and then check the dataset's version and length.

In [36]:
json_file_pred = 'predictions.json' # File name
file_preds = download2.GetContentFile(json_file_pred)

In [37]:
with open(json_file_pred) as json_file:
    data = json.load(json_file)
    print(data)
    dbert_unc_preds = data

{'572efa9ecb0c0d14000f16ba': 'Hyderabad', '572efa9ecb0c0d14000f16bb': '250', '572efa9ecb0c0d14000f16bc': 'Musi River', '572efa9ecb0c0d14000f16bd': '6.7 million', '572efa9ecb0c0d14000f16be': '542 metres (1,778', '572efb57dfa6aa1500f8d517': '1591', '572efb57dfa6aa1500f8d518': 'Muhammad Quli Qutb Shah', '572efb57dfa6aa1500f8d519': 'Qutb Shahi dynasty', '572efb57dfa6aa1500f8d51a': 'Asif Jah I', '572efb57dfa6aa1500f8d51b': 'Nizams of Hyderabad', '572efd6403f9891900756b2d': 'Muhammad Quli Qutb Shah', '572efd6403f9891900756b2e': 'mid-19th century', '572efd6403f9891900756b2f': 'The Qutb Shahis and Nizams', '572efd6403f9891900756b30': 'Mughlai', '572efd6403f9891900756b31': 'motion pictures', '572efe44dfa6aa1500f8d52b': 'pearl and diamond', '572efe44dfa6aa1500f8d52c': 'City of Pearls', '572efe44dfa6aa1500f8d52d': 'Laad Bazaar, Begum Bazaar and Sultan Bazaar', '572efe44dfa6aa1500f8d52e': 'US$74 billion', '572efe44dfa6aa1500f8d52f': 'fifth-largest', '572f6358a23a5019007fc5b9': 'Haydar\'s city"', '
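
The predictions file maps each question id to the predicted answer string, as the truncated output above shows. Since the later scoring loop looks predictions up by id, it can help to confirm first that every question to be evaluated has an entry, because a plain dictionary lookup raises `KeyError` on a missing id. A minimal sketch, where the helper `missing_prediction_ids` and the toy ids are hypothetical and not part of the notebook:

```python
def missing_prediction_ids(examples, predictions):
    """Return the ids of examples that have no entry in the predictions dict."""
    return [ex["id"] for ex in examples if ex["id"] not in predictions]

# Toy data in the same shape as the notebook's records and predictions
toy_examples = [{"id": "q1"}, {"id": "q2"}, {"id": "q3"}]
toy_predictions = {"q1": "Hyderabad", "q3": "Musi River"}

missing = missing_prediction_ids(toy_examples, toy_predictions)
print(missing)
```

An empty list means the evaluation loop can index the predictions safely.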

In [14]:
json_file_input = 'training_set.json' # File name
file_input = download.GetContentFile(json_file_input)
input_data = pd.read_json(json_file_input)

print(f'The input dataset is SQuAD version {input_data["version"][0]}')
print(f'Length of input dataset: {len(input_data["data"])}')

The input dataset is SQuAD version 1.1
Length of input dataset: 442


Below is an example of the structure of the JSON file. Each article has a **title** and a list of **paragraphs**. Each paragraph has a **context** and a **qas** field. Every entry in qas holds a **question**, its **id**, and a list of **answers**, each with the answer **text** and the character index where it starts in the context (**answer_start**).

In [35]:
input_data["data"][0]

{'title': 'University_of_Notre_Dame',
 'paragraphs': [{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
   'qas': [{'answers': [{'answer_start': 515,
       'text': 'Saint Bernadette Soubirous'}],
     'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
     'id': '5733be284776f41900661182'},
    {'ans
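
The nesting described above can be walked with three loops, and `answer_start` can be sanity-checked by slicing the context. A minimal sketch on a made-up article (the content of `toy_article` and the helper `iter_qas` are illustrative assumptions, not taken from the dataset):

```python
toy_article = {
    "title": "Toy_Article",
    "paragraphs": [
        {
            "context": "The Musi River flows through the city.",
            "qas": [
                {
                    "id": "toy-0001",
                    "question": "Which river flows through the city?",
                    "answers": [{"answer_start": 4, "text": "Musi River"}],
                }
            ],
        }
    ],
}

def iter_qas(article):
    """Yield (context, question, id, answer) for every answer in an article."""
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                yield paragraph["context"], qa["question"], qa["id"], answer

rows = list(iter_qas(toy_article))
for context, question, id_, answer in rows:
    start = answer["answer_start"]
    end = start + len(answer["text"])
    # answer_start is a character offset into the context
    assert context[start:end] == answer["text"]
    print(id_, question, "->", context[start:end])
```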

## Splitting based on the title

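We split the articles into training, validation and test sets, so that all questions about one title end up in the same set. With `split = 0.25`, about 75% of the articles go to training and the remaining 25% is divided roughly in half between validation and test. A loop visits every article, paragraph by paragraph, and appends each question with its title, context and id to the matching set. For convenience we also store the text of the first annotated answer alongside its start index.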

In [38]:
# Splitting the dataset into training and validation
split = 0.25 # Fraction of articles reserved for validation + test
len_training = len(input_data['data']) * (1 - split)
len_valid = (len(input_data['data']) - len_training)//2.1

data_training = []
data_validation = []
data_test = []

# Splitting as suggested based on the title
for i, article in enumerate(input_data['data']):
    # article is a dictionary with keys: title, paragraphs
    title = article['title'].strip()

    for paragraph in article['paragraphs']:
        # paragraph is a dictionary with keys: context, qas
        context = paragraph['context'].strip()

        for qa in paragraph["qas"]:
            # qa is a dictionary with keys: answers, question, id
            question = qa["question"].strip()
            id_ = qa["id"]

            answer_starts = [answer["answer_start"] for answer in qa["answers"]]
            answers = [answer["text"].strip() for answer in qa["answers"]]

            # Keep only the first annotated answer for each question
            record = {'title': title,
                      'context': context,
                      'question': question,
                      'id': id_,
                      "answer_start": answer_starts[0],
                      "answer_text": answers[0]
                      }

            # Assign the whole article to one set, based on its index
            if i <= len_training:
                data_training.append(record)
            elif i < (len_training + len_valid):
                data_validation.append(record)
            else:
                data_test.append(record)

In [39]:
print(f"Length training: {len(data_training)}")
print(f"Length validation: {len(data_validation)}")
print(f"Length test: {len(data_test)}")

Length training: 66230
Length validation: 9369
Length test: 12000
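
Because articles are assigned by index, the number of articles in each set follows directly from the thresholds in the splitting cell. A small sketch reproducing that arithmetic for the 442 articles reported earlier (the helper name `count_articles_per_split` is hypothetical):

```python
def count_articles_per_split(n_articles, split=0.25, divisor=2.1):
    """Count how many article indices fall in each set, using the same thresholds."""
    len_training = n_articles * (1 - split)
    len_valid = (n_articles - len_training) // divisor
    counts = {"training": 0, "validation": 0, "test": 0}
    for i in range(n_articles):
        if i <= len_training:
            counts["training"] += 1
        elif i < len_training + len_valid:
            counts["validation"] += 1
        else:
            counts["test"] += 1
    return counts

counts = count_articles_per_split(442)
print(counts)
```

Under these thresholds the test set receives slightly more articles than the validation set, because the divisor is 2.1 rather than 2.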


## Error Analysis

In [40]:
import re
import string

def normalize_answer(s):
  """Lower text and remove punctuation, articles and extra whitespace."""
  def remove_articles(text):
    regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
    return re.sub(regex, ' ', text)
  def white_space_fix(text):
    return ' '.join(text.split())
  def remove_punc(text):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in text if ch not in exclude)
  def lower(text):
    return text.lower()
  return white_space_fix(remove_articles(remove_punc(lower(s))))

def get_tokens(s):
  if not s: return []
  return normalize_answer(s).split()

def compute_exact(a_gold, a_pred):
  return int(normalize_answer(a_gold) == normalize_answer(a_pred))

def compute_f1(a_gold, a_pred):
  gold_toks = get_tokens(a_gold)
  pred_toks = get_tokens(a_pred)
  common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
  num_same = sum(common.values())
  if len(gold_toks) == 0 or len(pred_toks) == 0:
    # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
    return int(gold_toks == pred_toks)
  if num_same == 0:
    return 0
  precision = 1.0 * num_same / len(pred_toks)
  recall = 1.0 * num_same / len(gold_toks)
  f1 = (2 * precision * recall) / (precision + recall)
  return f1
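
To make the two metrics concrete, here is a self-contained worked example on made-up strings. The compact helpers `normalize` and `exact_and_f1` are a condensed restatement written for this illustration and are assumed to behave like the functions above on these inputs.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)

def normalize(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in PUNCTUATION)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_and_f1(gold, pred):
    """Return (exact match, token-level F1) for one gold/prediction pair."""
    gold_norm, pred_norm = normalize(gold), normalize(pred)
    exact = int(gold_norm == pred_norm)
    gold_toks, pred_toks = gold_norm.split(), pred_norm.split()
    if not gold_toks or not pred_toks:
        # If either side is empty, F1 is 1 only when both are empty
        return exact, float(gold_toks == pred_toks)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return exact, 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return exact, 2 * precision * recall / (precision + recall)

# Gold has 3 tokens after normalization, the prediction covers 2 of them:
# precision = 2/2, recall = 2/3, so F1 = 0.8 and there is no exact match.
exact, f1 = exact_and_f1("The quick brown fox", "quick brown")
print(exact, round(f1, 3))

# Articles and punctuation are ignored, so this pair counts as an exact match.
exact_same, f1_same = exact_and_f1("the Musi River", "Musi River!")
print(exact_same, f1_same)
```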

In [41]:
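cont=0    # answers sharing no substring with the gold answer ("bad" answers)
m=0       # number of exact matches
tits={}   # number of bad answers per title
allt={}   # total number of questions per title
totf1=0   # running sum of F1 scores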
for i in data_test:
  pred = dbert_unc_preds[i['id']]
  gold = i['answer_text']
  em = compute_exact(gold, pred)
  if em == 0:
    em = 'NO'
  else:
    em = 'YES'
  f1 = compute_f1(gold, pred)
  
  if(i['title'] in allt):
      allt[i['title']]+=1
  else:
      allt[i['title']]=1

  def info():
    print('Title: '+i['title'])
    print('Question: '+i['question'])
    print('Gold: '+gold)
    print('Pred: '+pred)
    print('Exact Match: {} /// F1: {}'.format(em, round(f1, 3)))
    print('='*160)

  if(not((gold in pred) or (pred in gold))): # neither string contains the other: likely a bad answer (drop the `not` to select possibly GOOD answers)
    #info()
    cont+=1
    if(i['title'] in tits):
      tits[i['title']]+=1
    else:
      tits[i['title']]=1
  totf1+=f1
  if(em=='YES'):
    m+=1

print('Percentage of possible good answers: '+str((len(data_test)-cont)/len(data_test)*100))
print('Percentage of EMs: '+str(m/len(data_test)*100))
print('Average F1: '+str(totf1/len(data_test)*100))

Percentage of possible good answers: 85.5
Percentage of EMs: 61.625
Average F1: 76.9028916062104


### Percentage of bad answers for each title

In [43]:
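cands=[]
for i in tits:
  perc = round(tits[i]/allt[i]*100, 2)        # % of this title's questions answered badly
  rel = round(allt[i]/len(data_test)*100, 2)  # % of the test set belonging to this title
  imp = round(perc*rel, 2)                    # importance: error rate weighted by title size
  if(imp > 34): # arbitrary threshold
    cands.append(i)
    print(i + ' & '+ str(perc) + ' & ' + str(rel) + ' & ' + str(imp) +'\n')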

Tucson,_Arizona & 14.05 & 3.08 & 43.27

Bacteria & 28.7 & 1.8 & 51.66

Premier_League & 13.13 & 2.98 & 39.13

Roman_Republic & 18.11 & 3.27 & 59.22

Pacific_War & 10.49 & 3.26 & 34.2

Richmond,_Virginia & 12.65 & 2.7 & 34.16

Tuvalu & 17.73 & 2.49 & 44.15

Immaculate_Conception & 53.85 & 0.87 & 46.85

United_States_Air_Force & 18.67 & 2.01 & 37.53

Qing_dynasty & 20.68 & 2.7 & 55.84

Religion_in_ancient_Rome & 23.46 & 3.38 & 79.29

The_Bronx & 18.82 & 2.26 & 42.53



In [44]:
len(cands), len(tits)

(12, 58)
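
The importance score multiplies two percentages, `bad/total*100` and `total/N*100`, so the title's own total cancels out: up to rounding, the score equals `bad/N * 10^4`, that is, it ranks titles by how many bad answers they contribute to the whole test set. A sketch with made-up counts (the titles, the numbers and the helper `importance_table` are illustrative assumptions, not the notebook's data):

```python
def importance_table(bad_per_title, total_per_title, n_test, threshold=34):
    """Return (title, perc, rel, imp) rows above the threshold, highest imp first."""
    rows = []
    for title, bad in bad_per_title.items():
        total = total_per_title[title]
        perc = round(bad / total * 100, 2)     # % of the title's questions answered badly
        rel = round(total / n_test * 100, 2)   # % of the test set covered by the title
        imp = round(perc * rel, 2)             # combined importance score
        if imp > threshold:
            rows.append((title, perc, rel, imp))
    return sorted(rows, key=lambda row: row[3], reverse=True)

toy_bad = {"Title_A": 50, "Title_B": 10, "Title_C": 45}
toy_total = {"Title_A": 200, "Title_B": 400, "Title_C": 100}

table = importance_table(toy_bad, toy_total, n_test=10000)
for title, perc, rel, imp in table:
    print(f"{title} & {perc} & {rel} & {imp}")
```

Sorting by the score is an addition for readability; the notebook prints titles in dictionary order.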