# Raw data exploration and processing

This notebook tests how to process the raw SQuAD data to build the data set. The final code actually used is in `src/data/SQuAD.py`.

In [1]:
import json
import pandas as pd

In [2]:
data_path = '../data/'
train_set_path = data_path + 'SQuAD-train-v2.0.json'
validation_set_path = data_path + 'SQuAD-dev-v2.0.json'

In [3]:
# the context manager closes the file even if json.load raises
with open(validation_set_path) as f:
    data = json.load(f)

In [4]:
data=data.get('data')
len(data)

35

In [5]:
df_questions = pd.json_normalize(data, ['paragraphs', 'qas'], ['title', ['paragraphs', 'context']])
print(len(df_questions))
df_questions.head(3)

11873


,question,id,answers,is_impossible,plausible_answers,title,paragraphs.context
0,In what country is Normandy located?,56ddde6b9a695914005b9628,"[{'text': 'France', 'answer_start': 159}, {'te...",False,,Normans,The Normans (Norman: Nourmands; French: Norman...
1,When were the Normans in Normandy?,56ddde6b9a695914005b9629,"[{'text': '10th and 11th centuries', 'answer_s...",False,,Normans,The Normans (Norman: Nourmands; French: Norman...
2,From which countries did the Norse originate?,56ddde6b9a695914005b962a,"[{'text': 'Denmark, Iceland and Norway', 'answ...",False,,Normans,The Normans (Norman: Nourmands; French: Norman...
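
To make the `record_path` / `meta` arguments of the call above easier to follow, here is a minimal sketch on a made-up record shaped like one entry of the SQuAD `data` list. The titles, contexts and ids are invented for illustration; only the nesting mirrors the real file.

```python
import pandas as pd

# Toy record with the same nesting as SQuAD: title -> paragraphs -> qas.
# All values are placeholders, not taken from the data set.
toy = [
    {
        "title": "T1",
        "paragraphs": [
            {
                "context": "ctx A",
                "qas": [
                    {"id": "q1", "question": "Q1?"},
                    {"id": "q2", "question": "Q2?"},
                ],
            },
            {
                "context": "ctx B",
                "qas": [{"id": "q3", "question": "Q3?"}],
            },
        ],
    }
]

# record_path walks down to the list that becomes the rows (one row per question);
# meta copies fields from the enclosing levels onto every row.
toy_df = pd.json_normalize(
    toy,
    record_path=["paragraphs", "qas"],
    meta=["title", ["paragraphs", "context"]],
)

print(toy_df.columns.tolist())
print(len(toy_df))
```

Each question becomes one row, and the nested meta field shows up under the dotted name `paragraphs.context`, which is why that column is renamed later in the notebook.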


In [6]:
df_questions=df_questions.set_index('id')\
                         .drop(columns=['answers', 'is_impossible', 'plausible_answers'])\
                         .rename(columns={'paragraphs.context':'context', 'title':'q_title'})
df_questions['q_title'] = pd.factorize(df_questions['q_title'])[0]
df_questions['q_context'] = pd.factorize(df_questions['context'])[0]
df_questions.sample(3)

id,question,q_title,context,q_context
5711648850c2381900b54ac6,What is the approximate turbine entry temperat...,6,One of the principal advantages the Rankine cy...,258
5ad3cd43604f3c001a3ff186,In what constituent country of the United King...,6,The first full-scale working railway steam loc...,219
5acf808877cf76001a685006,Why did Hutchins eliminate hospitals from the ...,23,"In 1929, the university's fifth president, Rob...",721
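
As a small illustration of what `pd.factorize` does in the cell above, the sketch below encodes a toy series of labels. The labels are placeholders; the point is only that equal values receive the same integer code, numbered in order of first appearance.

```python
import pandas as pd

# Placeholder labels standing in for article titles or context strings.
labels = pd.Series(["A", "A", "B", "A", "C"])

# factorize returns (codes, uniques); the notebook keeps only the codes via [0].
codes, uniques = pd.factorize(labels)

print(codes.tolist())
print(list(uniques))
```

Because codes follow order of first appearance, the first title and first context in the file get id 0.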


In [7]:
# check for duplicated rows (same question, title and context; the id index is ignored)
print(df_questions.duplicated().any())
print(len(df_questions[df_questions.duplicated()]))
df_questions[df_questions.duplicated(keep=False)].head(6)

True
7


id,question,q_title,context,q_context
571153422419e3140095557e,Who designed Salamanca?,6,Trevithick continued his own experiments using...,237
5ad3d2aa604f3c001a3ff262,Who designed Salamanca?,6,Trevithick continued his own experiments using...,237
5711669550c2381900b54ae0,Where does heat rejection occur in the Rankine...,6,The Rankine cycle is sometimes referred to as ...,262
5ad414dd604f3c001a4002c8,Where does heat rejection occur in the Rankine...,6,The Rankine cycle is sometimes referred to as ...,262
5725b7f389a1e219009abd5e,What are the main sources of primary law?,9,European Union law is a body of treaties and l...,330
57268b43dd62a815002e88f1,What are the main sources of primary law?,9,European Union law is a body of treaties and l...,330


In [8]:
# keep=False drops every copy of a duplicated row, not only the repeats
df_questions = df_questions.drop_duplicates(keep=False)
df_questions.duplicated().any()

False
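
The row counts in this notebook (11873 before, 11859 after, with 7 rows flagged by `duplicated()`) are consistent with `keep=False` removing both members of each duplicated pair. A minimal sketch on a toy frame, with made-up values, shows the difference from the default `keep='first'`:

```python
import pandas as pd

# Toy frame: rows id1 and id2 have identical column values (the index is ignored).
toy_q = pd.DataFrame(
    {"question": ["a", "a", "b"], "q_context": [1, 1, 2]},
    index=["id1", "id2", "id3"],
)

n_flagged = int(toy_q.duplicated().sum())           # repeats only (default keep='first')
n_keep_first = len(toy_q.drop_duplicates())          # one copy of each row survives
n_keep_none = len(toy_q.drop_duplicates(keep=False)) # all copies of duplicated rows are dropped

print(n_flagged, n_keep_first, n_keep_none)
```

With `keep=False` the number of removed rows is twice the number flagged when every duplicate occurs exactly twice, which matches 14 removed rows for 7 flagged ones.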

In [9]:
# create separate dataframe for contexts
df_context = df_questions[['context', 'q_context', 'q_title']].copy()\
             .rename(columns={'q_context':'context_id', 'q_title':'c_title'})\
             .set_index('context_id')
df_context = df_context.drop_duplicates()
df_context = df_context.sort_index()

# remove contexts from questions dataframe
df_questions = df_questions.drop(columns=['context'])

In [10]:
print(df_questions.duplicated().any())
print(len(df_questions))
df_questions.head(3)

False
11859


id,question,q_title,q_context
56ddde6b9a695914005b9628,In what country is Normandy located?,0,0
56ddde6b9a695914005b9629,When were the Normans in Normandy?,0,0
56ddde6b9a695914005b962a,From which countries did the Norse originate?,0,0


In [11]:
print(df_context.duplicated().any())
print(len(df_context))
df_context.head(3)

False
1204


context_id,context,c_title
0,The Normans (Norman: Nourmands; French: Norman...,0
1,"The Norman dynasty had a major political, cult...",0
2,"The English name ""Normans"" comes from the Fren...",0
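
Once questions and contexts live in separate frames, the `q_context` column acts as a foreign key into the `context_id` index. The sketch below shows one way the two could be recombined; the frames are toy stand-ins with invented values, and `DataFrame.join` is just one option, not necessarily what `src/data/SQuAD.py` does.

```python
import pandas as pd

# Toy stand-ins for df_questions and df_context (values are invented).
toy_questions = pd.DataFrame(
    {"question": ["Q1?", "Q2?", "Q3?"], "q_context": [0, 0, 1]},
    index=pd.Index(["q1", "q2", "q3"], name="id"),
)
toy_context = pd.DataFrame(
    {"context": ["ctx A", "ctx B"]},
    index=pd.Index([0, 1], name="context_id"),
)

# join(on=...) matches the q_context column against the index of toy_context.
toy_joined = toy_questions.join(toy_context, on="q_context")

print(toy_joined)
```

The joined frame keeps the question ids as its index, so each question can still be looked up by `id` with its context text attached.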


In [24]:
print(len(df_context['context']))

1204


In [27]:
print(len(df_context['context'].tolist()))
df_context['context'].tolist()[0]

1204


'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.'