# Initial exploration

- Gaining familiarity with Hugging Face
- Gaining familiarity with the SQuAD dataset
- Training simple models on SQuAD
- Simple error analysis

## Set-up

In [1]:
from datasets import load_dataset
import pandas as pd

## Load SQuAD

In [2]:
squad_train = load_dataset("squad", split='train')
squad_test = load_dataset("squad", split='validation')  # Use validation set as test holdout

Reusing dataset squad (/Users/stevengeorge/.cache/huggingface/datasets/squad/plain_text/1.0.0/4c81550d83a2ac7c7ce23783bd8ff36642800e6633c1f18417fb58c3ff50cdd7)
Reusing dataset squad (/Users/stevengeorge/.cache/huggingface/datasets/squad/plain_text/1.0.0/4c81550d83a2ac7c7ce23783bd8ff36642800e6633c1f18417fb58c3ff50cdd7)


In [3]:
len(squad_test) / (len(squad_test) + len(squad_train))

0.10767146451527468

In [4]:
type(squad_train)

datasets.arrow_dataset.Dataset

In [5]:
squad_train_df = squad_train.data.to_pandas()
print(squad_train_df.shape)
squad_train_df.head()

(87599, 5)


| | answers | context | id | question | title |
|---|---|---|---|---|---|
| 0 | {'answer_start': [515], 'text': ['Saint Bernad... | Architecturally, the school has a Catholic cha... | 5733be284776f41900661182 | To whom did the Virgin Mary allegedly appear i... | University_of_Notre_Dame |
| 1 | {'answer_start': [188], 'text': ['a copper sta... | Architecturally, the school has a Catholic cha... | 5733be284776f4190066117f | What is in front of the Notre Dame Main Building? | University_of_Notre_Dame |
| 2 | {'answer_start': [279], 'text': ['the Main Bui... | Architecturally, the school has a Catholic cha... | 5733be284776f41900661180 | The Basilica of the Sacred heart at Notre Dame... | University_of_Notre_Dame |
| 3 | {'answer_start': [381], 'text': ['a Marian pla... | Architecturally, the school has a Catholic cha... | 5733be284776f41900661181 | What is the Grotto at Notre Dame? | University_of_Notre_Dame |
| 4 | {'answer_start': [92], 'text': ['a golden stat... | Architecturally, the school has a Catholic cha... | 5733be284776f4190066117e | What sits on top of the Main Building at Notre... | University_of_Notre_Dame |


In [6]:
squad_train_df.loc[0]['answers']['text']

array(['Saint Bernadette Soubirous'], dtype=object)

In [7]:
squad_train[0]['answers']

{'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}

In [8]:
squad_train[0]['context'][515:]

'Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

`answer_start` is the character index in `context` at which the answer begins.

In [9]:
print(squad_train_df.loc[0]['answers']['text'].item())
len(squad_train_df.loc[0]['answers']['text'].item())

Saint Bernadette Soubirous


26

In [10]:
squad_train[0]['context'][515 + (26-1):]

's in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

`answer_end` = `answer_start` + (`len(text)` - 1), i.e. the index of the answer's last character (inclusive).
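
A quick sanity check of this relation: because `answer_end` is inclusive, slicing `context[answer_start:answer_end + 1]` should return the answer text exactly. The sketch below uses a hard-coded excerpt rather than the loaded dataset, and `inclusive_answer_end` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def inclusive_answer_end(answer_start, answer_text):
    """Return the index of the last character of the answer (inclusive)."""
    return answer_start + len(answer_text) - 1


# Hard-coded excerpt for illustration; in the notebook this would be
# squad_train[0]['context'].
context = "the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858."
answer_text = "Saint Bernadette Soubirous"
answer_start = context.index(answer_text)

answer_end = inclusive_answer_end(answer_start, answer_text)

# Python slices exclude the stop index, so add 1 to include the final character.
recovered = context[answer_start:answer_end + 1]
print(recovered == answer_text)
```

If an exclusive end index is preferred (so that `context[start:end]` works directly), dropping the `- 1` would be the only change.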

## Preprocessing
- Calculate `answer_end` for each example
- Reference: https://huggingface.co/docs/datasets/processing.html

In [20]:
def add_answer_end(example):
    # Inclusive index of the last character of the first answer
    start = example['answers']['answer_start'][0]
    text = example['answers']['text'][0]
    example['answer_end'] = start + len(text) - 1
    return example

In [21]:
squad_train = squad_train.map(add_answer_end)


In [24]:
squad_train[0]

{'answer_end': 540,
 'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}
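
`add_answer_end` reads only the first entry of `answer_start` and `text`. In the SQuAD validation split, questions typically carry several reference answers, so a variant that keeps one end index per answer may be needed for `squad_test`. A minimal sketch, using the hypothetical name `add_answer_ends` and a hand-written record in place of a real one:

```python
def add_answer_ends(example):
    """Add an inclusive end index for every reference answer in the example."""
    answers = example["answers"]
    example["answer_ends"] = [
        start + len(text) - 1
        for start, text in zip(answers["answer_start"], answers["text"])
    ]
    return example


# Hand-written record shaped like a SQuAD example (values are illustrative only).
example = {
    "context": "The quick brown fox jumps over the lazy dog.",
    "answers": {"answer_start": [4, 10], "text": ["quick brown fox", "brown fox"]},
}
result = add_answer_ends(example)
print(result["answer_ends"])
```

This could presumably be applied with `squad_test.map(add_answer_ends)`, in the same way as `add_answer_end`.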

## TODO
- Approaches:
    - Character-level tokens: predict the start and end character of the answer span. Requires a character tokenizer and potentially a large embedding.
    - Word-level tokens: predict the start and end word of the answer span. Requires a word tokenizer, but can use a pre-trained embedding (GloVe).
- Tokenize the text, then build the vocabulary from the tokens
- Load GloVe embeddings
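
For the word-level approach, the character offsets in `answers` would need to be converted to token indices. The sketch below is one possible way to do this, not a settled design: the regex tokenizer, the lower-casing, the `<pad>`/`<unk>` entries, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize_with_offsets(text):
    """Split text into word and punctuation tokens, keeping character offsets.

    Each item is (token, start, end) with an inclusive end index.
    """
    pattern = r"\w+|[^\w\s]"
    return [(m.group(), m.start(), m.end() - 1) for m in re.finditer(pattern, text)]


def char_span_to_token_span(tokens, answer_start, answer_end):
    """Map an inclusive character span to an inclusive token index span."""
    start_tok = end_tok = None
    for i, (_, tok_start, tok_end) in enumerate(tokens):
        # First token that ends at or after the answer's first character
        if start_tok is None and tok_end >= answer_start:
            start_tok = i
        # Last token that begins at or before the answer's last character
        if tok_start <= answer_end:
            end_tok = i
    return start_tok, end_tok


def build_vocab(token_lists, min_count=1):
    """Assign an integer id to each lower-cased token seen at least min_count times."""
    counts = Counter(tok.lower() for toks in token_lists for tok, _, _ in toks)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


context = "It is a replica of the grotto at Lourdes, France."
tokens = tokenize_with_offsets(context)

answer = "Lourdes, France"
start = context.index(answer)
end = start + len(answer) - 1

tok_span = char_span_to_token_span(tokens, start, end)
vocab = build_vocab([tokens])
print(tok_span)
```

Keeping the offsets alongside each token means a predicted token span can later be mapped back to a character span in the original context.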

## Questions
- Are contexts in validation set also present in training set?
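
One way to answer this is a set intersection over the context strings. The sketch below uses toy lists so that it runs without downloading the dataset; `context_overlap` is a hypothetical helper, and it only detects exact string matches.

```python
def context_overlap(train_contexts, test_contexts):
    """Return the set of context strings that occur in both splits."""
    return set(train_contexts) & set(test_contexts)


# Toy stand-ins; in the notebook these would be squad_train['context']
# and squad_test['context'].
train_contexts = ["Paragraph A.", "Paragraph B.", "Paragraph A."]
test_contexts = ["Paragraph C.", "Paragraph B."]

shared = context_overlap(train_contexts, test_contexts)
n_unique_test = len(set(test_contexts))
print(f"{len(shared)} of {n_unique_test} unique test contexts also appear in train")
```

The same comparison could be run on the `title` column to check whether the two splits draw on the same articles.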

In [31]:
squad_train[0]['answers']['answer_start'][0] + len(squad_train[0]['answers']['text'][0]) - 1

540