# Converting SQuAD from JSON to a Tidy Data Pandas DataFrame

In this notebook I share a script to convert the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/) from JSON format to a [Tidy Data](https://vita.had.co.nz/papers/tidy-data.pdf) Pandas DataFrame.

**About the SQuAD dataset**

The dataset is used in natural language processing (NLP) research on machine reading comprehension. It consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. From version 2.0 onward a question might also be unanswerable; the v1.1 files used in this notebook contain answerable questions only.
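To make the "span" idea concrete, here is a minimal sketch of how an answer maps back onto its passage. It assumes `answer_start` is a character offset into `context`; the passage and the helper name `extract_span` are made up for illustration and are not part of the dataset.

```python
# Hand-made record in the SQuAD v1.1 layout (illustrative, not taken from the dataset).
context = "The quick brown fox jumps over the lazy dog near the river bank."
answer = {"answer_start": 4, "text": "quick brown fox"}


def extract_span(context, answer):
    """Return the slice of `context` that the answer offsets point to."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return context[start:end]


span = extract_span(context, answer)
# If the offsets are consistent, the slice equals the stored answer text.
print(span == answer["text"])
```

A check like this can be handy as a sanity test after any conversion step that keeps the full passage.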



In [1]:
import requests
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated

### Load data

In [2]:
# dev url
url = "https://raw.githubusercontent.com/aswalin/SQuAD/master/data/dev-v1.1.json"

# train url
# url = "https://raw.githubusercontent.com/aswalin/SQuAD/master/data/train-v1.1.json"

r = requests.get(url, timeout=30)
r.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
json_dict = r.json()

### Explore JSON file

In [3]:
# Nested Keys
print('top-level-keys: {}'.format(list(json_dict.keys())))
print('data keys: {}'.format(list(json_dict['data'][0].keys())))
print('paragraphs keys: {}'.format(list(json_dict['data'][0]['paragraphs'][0].keys())))
print('qas keys: {}'.format(list(json_dict['data'][0]['paragraphs'][0]['qas'][0].keys())))
print('answers keys: {}'.format(list(json_dict['data'][0]['paragraphs'][0]['qas'][0]['answers'][0].keys())))

top-level-keys: ['data', 'version']
data keys: ['title', 'paragraphs']
paragraphs keys: ['context', 'qas']
qas keys: ['answers', 'question', 'id']
answers keys: ['answer_start', 'text']


In [4]:
# Count Corpora
print('Nbr Corpora: {}'.format(len(json_dict['data'])))

Nbr Corpora: 48
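Beyond the number of corpora, it can help to count how many paragraphs, questions and answers sit under them. The following is a small sketch that walks the nesting shown above; the function name `count_squad_items` and the sample dictionary are my own illustrations, not part of the original notebook.

```python
def count_squad_items(json_dict):
    """Count articles, paragraphs, questions and answers in a SQuAD-style dict."""
    counts = {"articles": 0, "paragraphs": 0, "questions": 0, "answers": 0}
    for article in json_dict["data"]:
        counts["articles"] += 1
        for paragraph in article["paragraphs"]:
            counts["paragraphs"] += 1
            for qa in paragraph["qas"]:
                counts["questions"] += 1
                counts["answers"] += len(qa["answers"])
    return counts


# Tiny hand-made sample with the same keys as the real file (illustrative only).
sample = {
    "version": "1.1",
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of France?",
                            "answers": [
                                {"answer_start": 0, "text": "Paris"},
                                {"answer_start": 0, "text": "Paris"},
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}

sample_counts = count_squad_items(sample)
print(sample_counts)
```

With the real data loaded, `count_squad_items(json_dict)` should work the same way.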


In [5]:
# Print Corpora Titles
# print(list(json_normalize(json_dict,'data')['title']))

### Convert to Tidy DF

In [6]:
def convert_squad_to_tidy_df(json_dict, corpus):
    """Convert one article of the SQuAD JSON data to a Tidy Data Pandas DataFrame.

    :param dict json_dict: squad json data
    :param str corpus: title of the squad corpus to select from the json object

    :returns: one row per unique (context, question, id, answer_start, text)
    :rtype: pandas.DataFrame

    """
    # Select the article whose title matches; raises IndexError if there is none
    data = [c for c in json_dict['data'] if c['title'] == corpus][0]
    columns = ['context', 'question', 'id', 'answer_start', 'text']
    rows = []
    for paragraph in data['paragraphs']:
        for qa in paragraph['qas']:
            for answer in qa['answers']:
                rows.append((paragraph['context'][:50],  # first 50 characters only
                             qa['question'],
                             qa['id'],
                             answer['answer_start'],
                             answer['text']
                            ))
    # Build the frame once instead of concatenating inside the loop
    df = pd.DataFrame.from_records(rows, columns=columns)
    # Several reference answers to a question can be identical, so drop exact duplicates
    return df.drop_duplicates().reset_index(drop=True)
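As an alternative to the explicit loops, the same flattening can be sketched with `pd.json_normalize`, using a nested `record_path` and `meta` paths. This is only a sketch under assumptions: it needs pandas 1.0 or later, the sample dictionary is hand-made, the column renaming is my own choice, and unlike the function above it keeps the full `context` and the article `title`.

```python
import pandas as pd

# Hand-made sample in the SQuAD v1.1 layout (illustrative only).
sample = {
    "version": "1.1",
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of France?",
                            "answers": [
                                {"answer_start": 0, "text": "Paris"},
                                {"answer_start": 0, "text": "Paris"},
                            ],
                        },
                        {
                            "id": "q2",
                            "question": "Which country is Paris the capital of?",
                            "answers": [{"answer_start": 24, "text": "France"}],
                        },
                    ],
                }
            ],
        }
    ],
}

# One row per answer; fields from the enclosing levels are repeated via `meta`.
flat = pd.json_normalize(
    sample["data"],
    record_path=["paragraphs", "qas", "answers"],
    meta=[
        "title",
        ["paragraphs", "context"],
        ["paragraphs", "qas", "question"],
        ["paragraphs", "qas", "id"],
    ],
)

# Meta columns are named by joining the path with "."; shorten them.
flat = flat.rename(columns={
    "paragraphs.context": "context",
    "paragraphs.qas.question": "question",
    "paragraphs.qas.id": "id",
})

tidy = (
    flat[["title", "context", "question", "id", "answer_start", "text"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(tidy)
```

Filtering on `tidy["title"]` would then play the role of the `corpus` argument.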

In [7]:
corpus = 'Super_Bowl_50' # only in dev dataset
# corpus = 'Culture'
df = convert_squad_to_tidy_df(json_dict, corpus)
print(len(df))
df.head()

1370


| | context | question | id | answer_start | text |
|---|---|---|---|---|---|
| 0 | Super Bowl 50 was an American football game to... | Which NFL team represented the AFC at Super Bo... | 56be4db0acb8001400a502ec | 177 | Denver Broncos |
| 1 | Super Bowl 50 was an American football game to... | Which NFL team represented the NFC at Super Bo... | 56be4db0acb8001400a502ed | 249 | Carolina Panthers |
| 2 | Super Bowl 50 was an American football game to... | Where did Super Bowl 50 take place? | 56be4db0acb8001400a502ee | 403 | Santa Clara, California |
| 3 | Super Bowl 50 was an American football game to... | Where did Super Bowl 50 take place? | 56be4db0acb8001400a502ee | 355 | Levi's Stadium |
| 4 | Super Bowl 50 was an American football game to... | Where did Super Bowl 50 take place? | 56be4db0acb8001400a502ee | 355 | Levi's Stadium in the San Francisco Bay Area a... |
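In the output above, several rows share one question `id`, because a question can have more than one distinct reference answer. A possible way to collect them per question is sketched below. The small frame reuses the `id` and `text` values of rows 0 to 3 (row 4 is left out because its text is truncated in the display), and the name `answers_per_question` is my own.

```python
import pandas as pd

# Values copied from rows 0 to 3 of the displayed output; other columns omitted.
sample_df = pd.DataFrame.from_records(
    [
        ("56be4db0acb8001400a502ec", 177, "Denver Broncos"),
        ("56be4db0acb8001400a502ed", 249, "Carolina Panthers"),
        ("56be4db0acb8001400a502ee", 403, "Santa Clara, California"),
        ("56be4db0acb8001400a502ee", 355, "Levi's Stadium"),
    ],
    columns=["id", "answer_start", "text"],
)

# Collect all answer texts of a question into one list, keeping first-seen order.
answers_per_question = (
    sample_df.groupby("id", sort=False)["text"]
    .agg(list)
    .reset_index(name="answers")
)
print(answers_per_question)
```

The same `groupby` should apply unchanged to the full `df` produced by `convert_squad_to_tidy_df`.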


#### Some useful links

* https://github.com/aswalin/SQuAD.git
* https://github.com/priya-dwivedi/cs224n-Squad-Project
* https://mindtrove.info/flatten-nested-json-with-pandas/