# Exploring and Transforming JSON Schemas

## Introduction

In this lesson, you'll formalize how to explore a JSON file whose structure and schema are unknown to you. This often happens in practice when you are handed a file, or stumble upon one, with little documentation.

## Objectives
You will be able to:
* Use the `json` module to load and parse JSON documents
* Load and explore unknown JSON schemas
* Convert JSON to a pandas DataFrame

## Loading the JSON file

Load the data from the file `disease_data.json`.

In [26]:
import json
# Open the file in a context manager so it is closed after parsing
with open('disease_data.json') as f:
    data = json.load(f)

## Explore the first and second levels of the schema hierarchy

In [27]:
type(data)

dict

In [28]:
data.keys()

dict_keys(['meta', 'data'])

In [29]:
type(data['meta'])

dict

In [30]:
data['meta'].keys()

dict_keys(['view'])

In [31]:
type(data['data'])

list

In [32]:
len(data['data'])

60266

In [33]:
type(data['data'][0])

list

In [34]:
len(data['data'][0])

42

In [35]:
data['data'][0][0]

1

In [36]:
data['data'][0][8]

'2016'
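
The cell-by-cell checks above (type, keys, length, first element) can be folded into one recursive helper. The following is a minimal sketch, not part of the original lab: the function name `describe`, its `max_depth` parameter, and the tiny sample document are all hypothetical, and the helper assumes lists are homogeneous, so it inspects only the first element of each list.

```python
import json

def describe(node, name="root", depth=0, max_depth=3):
    """Return lines summarizing the type and size of each level of parsed JSON."""
    indent = "  " * depth
    lines = []
    if isinstance(node, dict):
        lines.append(f"{indent}{name}: dict, {len(node)} key(s)")
        if depth < max_depth:
            for key, value in node.items():
                lines.extend(describe(value, repr(key), depth + 1, max_depth))
    elif isinstance(node, list):
        lines.append(f"{indent}{name}: list, {len(node)} item(s)")
        # Assumption: the list is homogeneous, so the first element is representative.
        if node and depth < max_depth:
            lines.extend(describe(node[0], "[0]", depth + 1, max_depth))
    else:
        lines.append(f"{indent}{name}: {type(node).__name__}")
    return lines

# Tiny made-up document shaped like the file in this lesson (not real data)
sample = json.loads(
    '{"meta": {"view": {"columns": [{"name": "sid"}]}}, "data": [[1, "2016"]]}'
)
summary = describe(sample)
print("\n".join(summary))
```

Calling `describe(data)` on the object loaded earlier should print a similar outline for the real file. Raise `max_depth` to look further into nested levels.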

## Convert to a DataFrame

Create a DataFrame from the JSON file. Be sure to retrieve the column names for the DataFrame. (Search within the 'meta' key of the master dictionary.) The DataFrame should include all 42 columns.

In [37]:
import pandas as pd

In [38]:
df = pd.DataFrame(data['data'])
print(df.shape)
df.columns = [item['name'] for item in data['meta']['view']['columns']]
print(df.columns)
df.head(10)

(60266, 42)
Index(['sid', 'id', 'position', 'created_at', 'created_meta', 'updated_at',
       'updated_meta', 'meta', 'YearStart', 'YearEnd', 'LocationAbbr',
       'LocationDesc', 'DataSource', 'Topic', 'Question', 'Response',
       'DataValueUnit', 'DataValueType', 'DataValue', 'DataValueAlt',
       'DataValueFootnoteSymbol', 'DatavalueFootnote', 'LowConfidenceLimit',
       'HighConfidenceLimit', 'StratificationCategory1', 'Stratification1',
       'StratificationCategory2', 'Stratification2', 'StratificationCategory3',
       'Stratification3', 'GeoLocation', 'ResponseID', 'LocationID', 'TopicID',
       'QuestionID', 'DataValueTypeID', 'StratificationCategoryID1',
       'StratificationID1', 'StratificationCategoryID2', 'StratificationID2',
       'StratificationCategoryID3', 'StratificationID3'],
      dtype='object')


| | sid | id | position | created_at | created_meta | updated_at | updated_meta | meta | YearStart | YearEnd | ... | LocationID | TopicID | QuestionID | DataValueTypeID | StratificationCategoryID1 | StratificationID1 | StratificationCategoryID2 | StratificationID2 | StratificationCategoryID3 | StratificationID3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | FF49C41F-CE8D-46C4-9164-653B1227CF6F | 1 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 59 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
| 1 | 2 | F4468C3D-340A-4CD2-84A3-DF554DFF065E | 2 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 1 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
| 2 | 3 | 65609156-A343-4869-B03F-2BA62E96AC19 | 3 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 2 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
| 3 | 4 | 0DB09B00-EFEB-4AC0-9467-A7CBD2B57BF3 | 4 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 4 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
| 4 | 5 | D98DA5BA-6FD6-40F5-A9B1-ABD45E44967B | 5 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 5 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
| 5 | 6 | 49758545-682D-46D8-A9F8-0F98EFDDE64A | 6 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 6 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
| 6 | 7 | AEB36999-5746-463F-B921-97E404EEF234 | 7 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 8 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
| 7 | 8 | FEBA783D-B277-4DAD-A93B-BF333F9B582D | 8 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 9 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
| 8 | 9 | 85670BEF-2891-4372-A5AE-A4B7867CEEE9 | 9 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 10 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
| 9 | 10 | E3B61235-D0C5-40F3-8B5F-E7D4A9509654 | 10 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 11 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
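
Assigning `df.columns` after the fact works, but a mismatch between the number of names and the row width only surfaces as an error at that assignment. The sketch below wraps the same idea in a hypothetical helper, `to_dataframe`, that checks the widths first and passes the names through the `columns` argument of `pd.DataFrame`. It assumes the document follows the same `meta` → `view` → `columns` layout seen in this lesson, and the toy input is made up for illustration.

```python
import pandas as pd

def to_dataframe(doc):
    """Build a DataFrame from a dict with 'meta' -> 'view' -> 'columns' and 'data'."""
    names = [col["name"] for col in doc["meta"]["view"]["columns"]]
    rows = doc["data"]
    # Fail early if any row's width does not match the number of column names.
    widths = {len(row) for row in rows}
    if widths - {len(names)}:
        raise ValueError(
            f"expected {len(names)} values per row, found widths {sorted(widths)}"
        )
    return pd.DataFrame(rows, columns=names)

# Made-up document with the same layout as the lesson's file (not real data)
toy = {
    "meta": {"view": {"columns": [{"name": "sid"}, {"name": "YearStart"}]}},
    "data": [[1, "2016"], [2, "2016"]],
}
toy_df = to_dataframe(toy)
print(toy_df.shape)
```

With the file loaded in this lesson, `to_dataframe(data)` should give the same 42-column frame as the cell above.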


## Level-Up
## Create a bar graph of states with the highest asthma rates for adults age 18+

In [39]:
# First attempt: filtering on TopicID matches nothing, because TopicID holds codes such as 'AST'
df[df.TopicID == 'Asthma'].Question.value_counts(normalize=True).cumsum()[:10]

Series([], Name: Question, dtype: float64)

In [40]:
df[df.Topic == 'Asthma'].Question.value_counts(normalize=True).cumsum()[:10]

Pneumococcal vaccination among noninstitutionalized adults aged >= 65 years with asthma    0.186096
Pneumococcal vaccination among noninstitutionalized adults aged 18-64 years with asthma    0.372193
Influenza vaccination among noninstitutionalized adults aged 18-64 years with asthma       0.558289
Current asthma prevalence among adults aged >= 18 years                                    0.744385
Influenza vaccination among noninstitutionalized adults aged >= 65 years with asthma       0.930481
Asthma prevalence among women aged 18-44 years                                             1.000000
Name: Question, dtype: float64

In [45]:
cols = ['LocationAbbr', 'LocationDesc', 'DataSource','Topic', 'Question', 'YearStart', 'YearEnd', 'DataValue']
view = df[df.Question == 'Current asthma prevalence among adults aged >= 18 years'][cols]
view.head(10)

| | LocationAbbr | LocationDesc | DataSource | Topic | Question | YearStart | YearEnd | DataValue |
|---|---|---|---|---|---|---|---|---|
| 4725 | IL | Illinois | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 6.5 |
| 5529 | IN | Indiana | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 6.7 |
| 5632 | IA | Iowa | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 5.6 |
| 6777 | KS | Kansas | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 6.1 |
| 7034 | KY | Kentucky | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 6.9 |
| 7337 | LA | Louisiana | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 5.4 |
| 7428 | ME | Maine | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 9.4 |
| 7499 | MD | Maryland | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 7.2 |
| 7966 | VT | Vermont | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 7.2 |
| 8114 | VA | Virginia | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 5.3 |


In [47]:
view.sort_values(by='LocationAbbr').head(10)

| | LocationAbbr | LocationDesc | DataSource | Topic | Question | YearStart | YearEnd | DataValue |
|---|---|---|---|---|---|---|---|---|
| 9797 | AK | Alaska | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | |
| 10013 | AK | Alaska | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 10.3 |
| 9427 | AK | Alaska | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 9.0 |
| 9959 | AK | Alaska | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | |
| 9905 | AK | Alaska | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | |
| 9851 | AK | Alaska | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | |
| 10121 | AK | Alaska | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | |
| 9482 | AK | Alaska | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 5.7 |
| 10176 | AK | Alaska | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | |
| 9372 | AK | Alaska | BRFSS | Asthma | Current asthma prevalence among adults aged >=... | 2016 | 2016 | 8.8 |


In [52]:
df.StratificationCategoryID1.value_counts(normalize=True)

RACE       0.631534
GENDER     0.231673
OVERALL    0.136794
Name: StratificationCategoryID1, dtype: float64

In [54]:
view = df[(df.Question == 'Current asthma prevalence among adults aged >= 18 years')
         & (df.StratificationCategoryID1 == 'OVERALL')]
view = view.sort_values(by='LocationAbbr')
print(view.shape)
view.head(10)

(110, 42)


| | sid | id | position | created_at | created_meta | updated_at | updated_meta | meta | YearStart | YearEnd | ... | LocationID | TopicID | QuestionID | DataValueTypeID | StratificationCategoryID1 | StratificationID1 | StratificationCategoryID2 | StratificationID2 | StratificationCategoryID3 | StratificationID3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9372 | 9370 | 5D6EDDA9-B241-4498-A262-ED20AB78C44C | 9370 | 1527194523 | 959778 | 1527194523 | 959778 | | 2016 | 2016 | ... | 2 | AST | AST1_1 | CRDPREV | OVERALL | OVR | | | | |
| 9427 | 9425 | 332B0889-ED65-4080-9373-D92FE918CD1D | 9425 | 1527194523 | 959778 | 1527194523 | 959778 | | 2016 | 2016 | ... | 2 | AST | AST1_1 | AGEADJPREV | OVERALL | OVR | | | | |
| 9426 | 9424 | CD846EC4-617B-4D38-B287-88DCF9BA8751 | 9424 | 1527194523 | 959778 | 1527194523 | 959778 | | 2016 | 2016 | ... | 1 | AST | AST1_1 | AGEADJPREV | OVERALL | OVR | | | | |
| 9371 | 9369 | 6BEC61D0-E04B-44BA-8170-F7D6A4C40A09 | 9369 | 1527194523 | 959778 | 1527194523 | 959778 | | 2016 | 2016 | ... | 1 | AST | AST1_1 | CRDPREV | OVERALL | OVR | | | | |
| 9374 | 9372 | 68F151CE-3084-402C-B672-78A43FBDE287 | 9372 | 1527194523 | 959778 | 1527194523 | 959778 | | 2016 | 2016 | ... | 5 | AST | AST1_1 | CRDPREV | OVERALL | OVR | | | | |
| 9429 | 9427 | 7DD2D8A6-F34C-476F-A597-B4DC666D959D | 9427 | 1527194523 | 959778 | 1527194523 | 959778 | | 2016 | 2016 | ... | 5 | AST | AST1_1 | AGEADJPREV | OVERALL | OVR | | | | |
| 9373 | 9371 | 5FCE0D49-11FD-4545-B9E7-14F503123105 | 9371 | 1527194523 | 959778 | 1527194523 | 959778 | | 2016 | 2016 | ... | 4 | AST | AST1_1 | CRDPREV | OVERALL | OVR | | | | |
| 9428 | 9426 | BF430518-45D1-48E5-A9AC-34DB0E4715BE | 9426 | 1527194523 | 959778 | 1527194523 | 959778 | | 2016 | 2016 | ... | 4 | AST | AST1_1 | AGEADJPREV | OVERALL | OVR | | | | |
| 9430 | 9428 | CD1718CB-7515-4340-BF97-1FCA8FE928E4 | 9428 | 1527194523 | 959778 | 1527194523 | 959778 | | 2016 | 2016 | ... | 6 | AST | AST1_1 | AGEADJPREV | OVERALL | OVR | | | | |
| 9375 | 9373 | D3F00ED2-A069-4E40-B42B-5A2528A91B6F | 9373 | 1527194523 | 959778 | 1527194523 | 959778 | | 2016 | 2016 | ... | 6 | AST | AST1_1 | CRDPREV | OVERALL | OVR | | | | |
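
The filtered view still holds two rows per location (`CRDPREV` and `AGEADJPREV`), and `DataValue` may be stored as text or be missing, so one more step is needed before a bar graph makes sense. The sketch below is one possible way to finish the task, not the lab's official solution. The helper names `top_locations` and `plot_top_locations` are hypothetical, the code assumes `CRDPREV` denotes crude prevalence, and the toy frame contains made-up rows rather than real results.

```python
import pandas as pd

def top_locations(frame, question, n=10):
    """Return the n highest CRDPREV values for a question, indexed by location."""
    mask = (
        (frame["Question"] == question)
        & (frame["StratificationCategoryID1"] == "OVERALL")
        & (frame["DataValueTypeID"] == "CRDPREV")  # assumed to mean crude prevalence
    )
    subset = frame.loc[mask, ["LocationAbbr", "DataValue"]].copy()
    # Values may be strings or missing: coerce to numbers, then drop what cannot be parsed.
    subset["DataValue"] = pd.to_numeric(subset["DataValue"], errors="coerce")
    subset = subset.dropna(subset=["DataValue"])
    ranked = subset.set_index("LocationAbbr")["DataValue"]
    return ranked.sort_values(ascending=False).head(n)

def plot_top_locations(series, title):
    """Draw a bar graph; matplotlib is imported here so the filtering works without it."""
    import matplotlib.pyplot as plt
    ax = series.plot(kind="bar", figsize=(10, 4), title=title)
    ax.set_xlabel("Location")
    ax.set_ylabel("DataValue")
    plt.tight_layout()
    return ax

# Made-up rows for illustration only (not real results)
question = "Current asthma prevalence among adults aged >= 18 years"
toy = pd.DataFrame({
    "LocationAbbr": ["ME", "MD", "IA", "AK", "ME"],
    "Question": [question] * 5,
    "StratificationCategoryID1": ["OVERALL", "OVERALL", "OVERALL", "OVERALL", "GENDER"],
    "DataValueTypeID": ["CRDPREV"] * 5,
    "DataValue": ["9.4", "7.2", "5.6", None, "12.0"],
})
top = top_locations(toy, question, n=2)
print(top)
```

On the lesson's DataFrame, `plot_top_locations(top_locations(df, question), 'Adult asthma prevalence by location')` would draw the graph. The title string is only a suggestion.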


## Summary

Well done! In this lab you got some extended practice exploring the structure of JSON files, converting JSON files to pandas DataFrames, and filtering the data in preparation for a bar graph.