# Raw data exploration and processing

This notebook explores how to process the SQuAD data to create the data set. The final code actually used is in `src/data/SQuAD.py`.

In [1]:
import json
import pandas as pd

In [2]:
data_path = '../data/'
train_set_path = data_path + 'SQuAD-train-v2.0.json'
validation_set_path = data_path + 'SQuAD-dev-v2.0.json'

In [3]:
# the context manager closes the file even if parsing fails
with open(validation_set_path, encoding='utf-8') as f:
    data = json.load(f)

In [4]:
# the top-level 'data' key holds the list of articles
data = data.get('data')
len(data)

35

In [5]:
df_questions = pd.json_normalize(data, ['paragraphs', 'qas'], ['title', ['paragraphs', 'context']])
print(len(df_questions))
df_questions.head(3)

11873


,question,id,answers,is_impossible,plausible_answers,title,paragraphs.context
0,In what country is Normandy located?,56ddde6b9a695914005b9628,"[{'text': 'France', 'answer_start': 159}, {'te...",False,,Normans,The Normans (Norman: Nourmands; French: Norman...
1,When were the Normans in Normandy?,56ddde6b9a695914005b9629,"[{'text': '10th and 11th centuries', 'answer_s...",False,,Normans,The Normans (Norman: Nourmands; French: Norman...
2,From which countries did the Norse originate?,56ddde6b9a695914005b962a,"[{'text': 'Denmark, Iceland and Norway', 'answ...",False,,Normans,The Normans (Norman: Nourmands; French: Norman...
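
To make the `pd.json_normalize` call above easier to follow, here is a minimal sketch on a hand-made dictionary that mimics the nesting of the SQuAD file (articles, then paragraphs, then `qas`). The toy titles, contexts and ids are invented for illustration and are not taken from the data set.

```python
import pandas as pd

# Hypothetical miniature of the SQuAD layout: articles -> paragraphs -> qas
toy = [
    {
        "title": "A",
        "paragraphs": [
            {
                "context": "ctx 1",
                "qas": [
                    {"id": "q1", "question": "Q one?"},
                    {"id": "q2", "question": "Q two?"},
                ],
            },
            {
                "context": "ctx 2",
                "qas": [{"id": "q3", "question": "Q three?"}],
            },
        ],
    }
]

# record_path walks down to the list of questions (one output row each);
# meta copies fields from the enclosing article and paragraph onto each row
toy_df = pd.json_normalize(
    toy,
    record_path=["paragraphs", "qas"],
    meta=["title", ["paragraphs", "context"]],
)
print(toy_df)
```

Each question becomes one row, and the nested meta field shows up as the dotted column `paragraphs.context`, which is why the notebook renames it later.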


In [6]:
df_questions=df_questions.set_index('id')\
                         .drop(columns=['answers', 'is_impossible', 'plausible_answers'])\
                         .rename(columns={'paragraphs.context':'context', 'title':'q_title'})
# replace title and context strings with integer codes (numbered in order of first appearance)
df_questions['q_title'] = pd.factorize(df_questions['q_title'])[0]
df_questions['q_context'] = pd.factorize(df_questions['context'])[0]
df_questions.sample(3)

id,question,q_title,context,q_context
572659ea5951b619008f7053,How are the combs spaced?,11,There are eight rows of combs that run from ne...,406
56e1cbe2cd28a01900c67bac,What is the most frequently employed type of r...,1,The most commonly used reduction is a polynomi...,69
5a679db8f038b7001ab0c350,How are school fees in the rest of the world c...,19,"In Ireland, private schools (Irish: scoil phrí...",606
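
As a small illustration of what `pd.factorize` returns in the cell above, the sketch below encodes a toy series of titles. The values are made up for the example; only the behaviour of `factorize` is the point.

```python
import pandas as pd

# Toy titles (illustrative values only)
titles = pd.Series(["Normans", "Normans", "Steam_engine", "Normans"])

# factorize returns (codes, uniques): codes are integers assigned in order
# of first appearance, uniques holds the original values
codes, uniques = pd.factorize(titles)
print(codes)
print(list(uniques))
```

Because codes follow the order of first appearance, the first article and first paragraph in the file get id 0, which matches the `q_title` and `q_context` values seen for the Normans questions.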


In [7]:
# check for rows with identical question, title and context (only the ids differ)
print(df_questions.duplicated().any())
print(len(df_questions[df_questions.duplicated()]))
df_questions[df_questions.duplicated(keep=False)].head(6)

True
7


id,question,q_title,context,q_context
571153422419e3140095557e,Who designed Salamanca?,6,Trevithick continued his own experiments using...,237
5ad3d2aa604f3c001a3ff262,Who designed Salamanca?,6,Trevithick continued his own experiments using...,237
5711669550c2381900b54ae0,Where does heat rejection occur in the Rankine...,6,The Rankine cycle is sometimes referred to as ...,262
5ad414dd604f3c001a4002c8,Where does heat rejection occur in the Rankine...,6,The Rankine cycle is sometimes referred to as ...,262
5725b7f389a1e219009abd5e,What are the main sources of primary law?,9,European Union law is a body of treaties and l...,330
57268b43dd62a815002e88f1,What are the main sources of primary law?,9,European Union law is a body of treaties and l...,330


In [8]:
# keep=False drops every copy of a duplicated row, not only the repeats
df_questions = df_questions.drop_duplicates(keep=False)
df_questions.duplicated().any()

False
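
The choice of `keep=False` matters for the row counts. The sketch below, on a toy frame with invented ids, contrasts it with the default `keep='first'`. It is consistent with the numbers in this notebook: 11873 rows minus 7 duplicated pairs (14 rows) leaves 11859.

```python
import pandas as pd

# Toy frame: the first two rows are identical apart from their index
toy = pd.DataFrame(
    {"question": ["a", "a", "b"], "q_context": [0, 0, 1]},
    index=["id1", "id2", "id3"],
)

# keep=False removes both copies of the duplicated row
dropped_all = toy.drop_duplicates(keep=False)

# the default (keep='first') would retain one copy
kept_first = toy.drop_duplicates()

print(len(dropped_all), len(kept_first))
```

If losing the duplicated questions entirely is not intended, the default `keep='first'` would be the alternative to consider.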

In [9]:
# create separate dataframe for contexts
df_context = df_questions[['context', 'q_context', 'q_title']].copy()\
             .rename(columns={'q_context':'context_id', 'q_title':'c_title'})\
             .set_index('context_id')
df_context = df_context.drop_duplicates()

# remove contexts from questions dataframe
df_questions = df_questions.drop(columns=['context'])

In [10]:
print(df_questions.duplicated().any())
print(len(df_questions))
df_questions.head(3)

False
11859


id,question,q_title,q_context
56ddde6b9a695914005b9628,In what country is Normandy located?,0,0
56ddde6b9a695914005b9629,When were the Normans in Normandy?,0,0
56ddde6b9a695914005b962a,From which countries did the Norse originate?,0,0


In [11]:
print(df_context.duplicated().any())
print(len(df_context))
df_context.head(3)

False
1204


context_id,context,c_title
0,The Normans (Norman: Nourmands; French: Norman...,0
1,"The Norman dynasty had a major political, cult...",0
2,"The English name ""Normans"" comes from the Fren...",0


In [12]:
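# a cross merge discards the index, so keep both ids as regular columns first
df_questions['qid'] = df_questions.index
df_context['cid'] = df_context.index
# pair every question with every context (Cartesian product)
df_final = df_questions.merge(df_context, how='cross')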

In [13]:
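# True when the context comes from the same article as the question
df_final['context_in_title'] = df_final['q_title'] == df_final['c_title']
# True only for the context the question was originally asked about
df_final['context_corresponds'] = df_final['q_context'] == df_final['cid']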

In [14]:
print(len(df_final))
df_final.sample(10)

14278236


,question,q_title,q_context,qid,context,c_title,cid,context_in_title,context_corresponds
6071583,What do scientists use to determine the proces...,15,510,5a5808ef770dc0001aeeff66,After al-Nimeiry was overthrown in 1985 the pa...,30,1015,False,False
3310458,Who was the worlds second largest wheat export...,8,310,5a38b07ca4b263001a8c189e,"Much of the city's tax base dissipated, leadin...",21,662,False,False
7460886,At what level are 13% of children in Scottish ...,19,616,5a67d87ef038b7001ab0c48c,"In particular, this norm gets smaller when a n...",27,902,False,False
542658,What expression does not usually contain DTIME...,1,66,5ad561c85b96ef001a10ad3e,"In addition to climate assessment reports, the...",26,858,False,False
9892262,What people are least vulnerable to infection?,25,822,5ad4d7395b96ef001a10a2f4,"After this, Huguenots (with estimates ranging ...",5,198,False,False
2798332,What inventor built on to the findings of Phil...,7,267,571a4ead10f8ca1400304fdd,Steam engines can be said to have been the mov...,6,236,False,False
7004378,What do global firms report on for the constru...,18,576,5a25ade0ef59cd001a623c45,"Firstly, certain costs are difficult to avoid ...",22,710,False,False
148093,What did Sybilla of Normandy introduce to Scot...,0,20,5ad3f8d2604f3c001a3ffa8e,"The Norman dynasty had a major political, cult...",0,1,True,False
6409599,What is the secondary reason consulting pharma...,16,534,5a6ce6164eec6b001a80a69d,While acknowledging the central role economic ...,22,707,False,False
11997925,Where did Maududi exert the least impact?,30,998,5acfe95877cf76001a686443,Other important complexity classes include BPP...,1,65,False,False
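
The cross merge and the two label columns can be sanity-checked on a tiny example. The sketch below uses invented ids and is only meant to show the expected shape: one row per (question, context) pair, and exactly one matching context per question. In the notebook this gives 11859 × 1204 = 14278236 rows.

```python
import pandas as pd

# Toy questions and contexts (illustrative ids only)
questions = pd.DataFrame(
    {"qid": ["q1", "q2", "q3"], "q_title": [0, 0, 1], "q_context": [0, 1, 2]}
)
contexts = pd.DataFrame({"cid": [0, 1, 2], "c_title": [0, 0, 1]})

# Cartesian product: every question paired with every context
pairs = questions.merge(contexts, how="cross")
pairs["context_in_title"] = pairs["q_title"] == pairs["c_title"]
pairs["context_corresponds"] = pairs["q_context"] == pairs["cid"]

n_pairs = len(pairs)
n_matching = int(pairs["context_corresponds"].sum())
n_same_title = int(pairs["context_in_title"].sum())
print(n_pairs, n_matching, n_same_title)
```

The same two checks (row count equals the product of the two lengths, number of `context_corresponds` rows equals the number of questions) could be run on `df_final`.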


In [15]:
df_final = df_final[['question', 'context', 'context_in_title', 'context_corresponds']]
df_final.sample(5)

,question,context,context_in_title,context_corresponds
9467214,What was the Yuan's unofficial state religion?,Rail transport in Victoria is provided by seve...,False,False
13152103,What South Korean car manufacturer purchased t...,Neutrophils and macrophages are phagocytes tha...,False,False
12428256,Terra Nullius is a Latin expression meaning wh...,Some civil disobedience defendants choose to m...,False,False
9140123,Which tribes did Genghis Khan fight against?,Civil disobedients have chosen a variety of di...,False,False
5306145,What is level two of the connection-orientated...,Downtown San Diego is the central business dis...,False,False


In [16]:
# convert to a list of (question, context, context_in_title, context_corresponds) records
data = df_final.to_records(index=False)
data = list(data)

In [17]:
print(len(data))
data[0]

14278236


('In what country is Normandy located?', 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.', True, True)
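
Building a Python list of about 14 million records keeps every pair in memory at once. As one possible alternative, and only as a sketch, the rows could be yielded lazily with `DataFrame.itertuples`. The helper name `iter_pairs` and the toy frame are hypothetical and are not part of `src/data/SQuAD.py`.

```python
import pandas as pd

# Toy stand-in for df_final (illustrative values only)
toy_final = pd.DataFrame(
    {
        "question": ["Q1?", "Q2?"],
        "context": ["c1", "c2"],
        "context_in_title": [True, False],
        "context_corresponds": [True, False],
    }
)


def iter_pairs(df):
    """Hypothetical helper: yield one plain tuple per row instead of building a list."""
    # name=None makes itertuples return regular tuples rather than namedtuples
    yield from df.itertuples(index=False, name=None)


first = next(iter_pairs(toy_final))
print(first)
```

Whether this is preferable depends on how the downstream code consumes the pairs; random access, for instance, still needs an indexable structure.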