# Exploring and Transforming JSON Schemas

## Introduction

In this lesson, you'll formalize how to explore a JSON file whose structure and schema are unknown to you. This often happens in practice when you are handed a file, or stumble upon one, with little documentation.

## Objectives
You will be able to:
* Explore unknown JSON schemas
* Access and manipulate data inside a JSON file
* Convert JSON to alternative data formats

## Loading the JSON file

Load the data from the file disease_data.json.

In [1]:
# Your code here
import json

# Open the file in a context manager so it is closed once loading finishes
with open('disease_data.json') as f:
    data = json.load(f)
print(type(data))

<class 'dict'>


## Explore the first and second levels of the schema hierarchy

In [2]:
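# Your code here

# Top-level keys (not displayed: a cell only shows its last expression)
data.keys()
#type(data['meta'])
# Keys nested under data['meta']['view']
data['meta']['view'].keys()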

dict_keys(['id', 'name', 'attribution', 'attributionLink', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'indexUpdatedAt', 'licenseId', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowClass', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'columns', 'grants', 'license', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])

In [3]:
#type(data['data'])
len(data['data'])

60266
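The manual checks above (calling `.keys()`, `type()` and `len()` one level at a time) can be bundled into a small helper. The following is a minimal sketch, not part of the original lab: the names `describe` and `summarize_levels` are hypothetical, and `sample` is a tiny stand-in object shaped like the file rather than the real data.

```python
def describe(value):
    """Return the type name, plus the item count for dicts and lists."""
    if isinstance(value, (dict, list)):
        return f"{type(value).__name__} ({len(value)} items)"
    return type(value).__name__


def summarize_levels(obj, depth=2, indent=0):
    """List the keys of nested dictionaries down to `depth` levels."""
    lines = []
    if isinstance(obj, dict) and depth > 0:
        for key, value in obj.items():
            lines.append(f"{' ' * indent}{key}: {describe(value)}")
            # Recurse one level deeper, indenting by four spaces
            lines.extend(summarize_levels(value, depth - 1, indent + 4))
    return lines


# Hypothetical stand-in for the loaded JSON object
sample = {
    'meta': {'view': {'id': 'abc', 'name': 'demo'}},
    'data': [[1, 'x'], [2, 'y']],
}
print('\n'.join(summarize_levels(sample)))
```

Calling `summarize_levels(data)` on the loaded file should give a comparable two-level overview without inspecting each key by hand.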

## Convert to a DataFrame

Create a DataFrame from the JSON file. Be sure to retrieve the column names for the DataFrame. (Search within the 'meta' key of the master dictionary.) The DataFrame should include all 42 columns.

In [4]:
#Your code here
import pandas as pd

#df_meta = pd.DataFrame(data['meta']['view'])

df = pd.DataFrame(data['data'])

In [5]:
df2 = pd.DataFrame(data['meta'])

In [6]:
df2.shape

(40, 1)

In [7]:
df2.head()

Unnamed: 0,view
attribution,"Centers for Disease Control and Prevention, Na..."
attributionLink,http://www.cdc.gov/nccdphp/dph/
averageRating,0
category,Chronic Disease Indicators
columns,"[{'id': -1, 'name': 'sid', 'dataTypeName': 'me..."


In [8]:
df.shape

(60266, 42)

In [9]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,41
0,1,FF49C41F-CE8D-46C4-9164-653B1227CF6F,1,1527194521,959778,1527194521,959778,,2016,2016,...,59,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
1,2,F4468C3D-340A-4CD2-84A3-DF554DFF065E,2,1527194521,959778,1527194521,959778,,2016,2016,...,1,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
2,3,65609156-A343-4869-B03F-2BA62E96AC19,3,1527194521,959778,1527194521,959778,,2016,2016,...,2,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
3,4,0DB09B00-EFEB-4AC0-9467-A7CBD2B57BF3,4,1527194521,959778,1527194521,959778,,2016,2016,...,4,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
4,5,D98DA5BA-6FD6-40F5-A9B1-ABD45E44967B,5,1527194521,959778,1527194521,959778,,2016,2016,...,5,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,


In [10]:
df.columns = [item['name'] for item in data['meta']['view']['columns']]
print(df.columns)
df.head()

Index(['sid', 'id', 'position', 'created_at', 'created_meta', 'updated_at',
       'updated_meta', 'meta', 'YearStart', 'YearEnd', 'LocationAbbr',
       'LocationDesc', 'DataSource', 'Topic', 'Question', 'Response',
       'DataValueUnit', 'DataValueType', 'DataValue', 'DataValueAlt',
       'DataValueFootnoteSymbol', 'DatavalueFootnote', 'LowConfidenceLimit',
       'HighConfidenceLimit', 'StratificationCategory1', 'Stratification1',
       'StratificationCategory2', 'Stratification2', 'StratificationCategory3',
       'Stratification3', 'GeoLocation', 'ResponseID', 'LocationID', 'TopicID',
       'QuestionID', 'DataValueTypeID', 'StratificationCategoryID1',
       'StratificationID1', 'StratificationCategoryID2', 'StratificationID2',
       'StratificationCategoryID3', 'StratificationID3'],
      dtype='object')


Unnamed: 0,sid,id,position,created_at,created_meta,updated_at,updated_meta,meta,YearStart,YearEnd,...,LocationID,TopicID,QuestionID,DataValueTypeID,StratificationCategoryID1,StratificationID1,StratificationCategoryID2,StratificationID2,StratificationCategoryID3,StratificationID3
0,1,FF49C41F-CE8D-46C4-9164-653B1227CF6F,1,1527194521,959778,1527194521,959778,,2016,2016,...,59,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
1,2,F4468C3D-340A-4CD2-84A3-DF554DFF065E,2,1527194521,959778,1527194521,959778,,2016,2016,...,1,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
2,3,65609156-A343-4869-B03F-2BA62E96AC19,3,1527194521,959778,1527194521,959778,,2016,2016,...,2,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
3,4,0DB09B00-EFEB-4AC0-9467-A7CBD2B57BF3,4,1527194521,959778,1527194521,959778,,2016,2016,...,4,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
4,5,D98DA5BA-6FD6-40F5-A9B1-ABD45E44967B,5,1527194521,959778,1527194521,959778,,2016,2016,...,5,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
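The conversion can also be done in a single step by passing the column names to the DataFrame constructor. This is a minimal sketch under the assumption that the file follows the layout explored above (rows under 'data', column descriptions under 'meta' > 'view' > 'columns'); `json_to_dataframe` is a hypothetical helper name, and `sample` is made-up stand-in data.

```python
import pandas as pd


def json_to_dataframe(obj):
    """Build a DataFrame from the rows in obj['data'], named via the metadata."""
    column_names = [col['name'] for col in obj['meta']['view']['columns']]
    return pd.DataFrame(obj['data'], columns=column_names)


# Hypothetical stand-in with the same nesting as the real file
sample = {
    'meta': {'view': {'columns': [{'name': 'sid'},
                                  {'name': 'LocationAbbr'},
                                  {'name': 'DataValue'}]}},
    'data': [[1, 'IL', '6.5'], [2, 'IN', '6.7']],
}
df_sample = json_to_dataframe(sample)
print(df_sample)
```

Passing `columns=` at construction time raises an error if the number of names does not match the row width, which makes a mismatch between metadata and data visible early.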


## Level-Up
## Create a bar graph of states with the highest asthma rates for adults age 18+

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60266 entries, 0 to 60265
Data columns (total 42 columns):
sid                          60266 non-null int64
id                           60266 non-null object
position                     60266 non-null int64
created_at                   60266 non-null int64
created_meta                 60266 non-null object
updated_at                   60266 non-null int64
updated_meta                 60266 non-null object
meta                         0 non-null object
YearStart                    60266 non-null object
YearEnd                      60266 non-null object
LocationAbbr                 60266 non-null object
LocationDesc                 60266 non-null object
DataSource                   60266 non-null object
Topic                        60266 non-null object
Question                     60266 non-null object
Response                     0 non-null object
DataValueUnit                60158 non-null object
DataValueType                60266 n

In [12]:
df[df.Topic == 'Asthma'].Question.value_counts(normalize=True).cumsum()[:10]

Current asthma prevalence among adults aged >= 18 years                                    0.186096
Pneumococcal vaccination among noninstitutionalized adults aged 18-64 years with asthma    0.372193
Influenza vaccination among noninstitutionalized adults aged >= 65 years with asthma       0.558289
Influenza vaccination among noninstitutionalized adults aged 18-64 years with asthma       0.744385
Pneumococcal vaccination among noninstitutionalized adults aged >= 65 years with asthma    0.930481
Asthma prevalence among women aged 18-44 years                                             1.000000
Name: Question, dtype: float64

In [13]:
cols = ['LocationAbbr', 'LocationDesc', 'DataSource','Topic', 'Question', 'YearStart', 'YearEnd', 'DataValue']
view = df[df.Question == 'Current asthma prevalence among adults aged >= 18 years'][cols]
view.head()

Unnamed: 0,LocationAbbr,LocationDesc,DataSource,Topic,Question,YearStart,YearEnd,DataValue
4725,IL,Illinois,BRFSS,Asthma,Current asthma prevalence among adults aged >=...,2016,2016,6.5
5529,IN,Indiana,BRFSS,Asthma,Current asthma prevalence among adults aged >=...,2016,2016,6.7
5632,IA,Iowa,BRFSS,Asthma,Current asthma prevalence among adults aged >=...,2016,2016,5.6
6777,KS,Kansas,BRFSS,Asthma,Current asthma prevalence among adults aged >=...,2016,2016,6.1
7034,KY,Kentucky,BRFSS,Asthma,Current asthma prevalence among adults aged >=...,2016,2016,6.9


In [14]:
view.sort_values(by='LocationAbbr').head()

Unnamed: 0,LocationAbbr,LocationDesc,DataSource,Topic,Question,YearStart,YearEnd,DataValue
9797,AK,Alaska,BRFSS,Asthma,Current asthma prevalence among adults aged >=...,2016,2016,
10013,AK,Alaska,BRFSS,Asthma,Current asthma prevalence among adults aged >=...,2016,2016,10.3
9427,AK,Alaska,BRFSS,Asthma,Current asthma prevalence among adults aged >=...,2016,2016,9.0
9959,AK,Alaska,BRFSS,Asthma,Current asthma prevalence among adults aged >=...,2016,2016,
9905,AK,Alaska,BRFSS,Asthma,Current asthma prevalence among adults aged >=...,2016,2016,


In [15]:
df.StratificationCategoryID1.value_counts(normalize=True)

RACE       0.631534
GENDER     0.231673
OVERALL    0.136794
Name: StratificationCategoryID1, dtype: float64

In [16]:
view = df[(df.Question == 'Current asthma prevalence among adults aged >= 18 years')
         & (df.StratificationCategoryID1 == 'OVERALL')]
view = view.sort_values(by='LocationAbbr')
print(view.shape)
view.head()

(110, 42)


Unnamed: 0,sid,id,position,created_at,created_meta,updated_at,updated_meta,meta,YearStart,YearEnd,...,LocationID,TopicID,QuestionID,DataValueTypeID,StratificationCategoryID1,StratificationID1,StratificationCategoryID2,StratificationID2,StratificationCategoryID3,StratificationID3
9372,9370,5D6EDDA9-B241-4498-A262-ED20AB78C44C,9370,1527194523,959778,1527194523,959778,,2016,2016,...,2,AST,AST1_1,CRDPREV,OVERALL,OVR,,,,
9427,9425,332B0889-ED65-4080-9373-D92FE918CD1D,9425,1527194523,959778,1527194523,959778,,2016,2016,...,2,AST,AST1_1,AGEADJPREV,OVERALL,OVR,,,,
9426,9424,CD846EC4-617B-4D38-B287-88DCF9BA8751,9424,1527194523,959778,1527194523,959778,,2016,2016,...,1,AST,AST1_1,AGEADJPREV,OVERALL,OVR,,,,
9371,9369,6BEC61D0-E04B-44BA-8170-F7D6A4C40A09,9369,1527194523,959778,1527194523,959778,,2016,2016,...,1,AST,AST1_1,CRDPREV,OVERALL,OVR,,,,
9374,9372,68F151CE-3084-402C-B672-78A43FBDE287,9372,1527194523,959778,1527194523,959778,,2016,2016,...,5,AST,AST1_1,CRDPREV,OVERALL,OVR,,,,
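The filtered view still contains two rows per location, because `DataValueTypeID` takes both `CRDPREV` and `AGEADJPREV`, and `DataValue` is stored as text. One possible way to finish the bar graph is sketched below. It assumes the column names shown above, picks `CRDPREV` as the default value type, and assumes each location then appears once; the function names are hypothetical, and plotting through pandas requires matplotlib to be installed.

```python
import pandas as pd

QUESTION = 'Current asthma prevalence among adults aged >= 18 years'


def top_asthma_locations(df, n=10, value_type='CRDPREV'):
    """Return the n locations with the highest overall adult asthma values."""
    mask = (
        (df['Question'] == QUESTION)
        & (df['StratificationCategoryID1'] == 'OVERALL')
        & (df['DataValueTypeID'] == value_type)
    )
    subset = df.loc[mask, ['LocationDesc', 'DataValue']].copy()
    # DataValue is an object column, so convert it and drop missing entries
    subset['DataValue'] = pd.to_numeric(subset['DataValue'], errors='coerce')
    subset = subset.dropna(subset=['DataValue'])
    return (subset.set_index('LocationDesc')['DataValue']
                  .sort_values(ascending=False)
                  .head(n))


def plot_top_asthma_locations(df, n=10):
    """Draw a bar chart of the locations returned by top_asthma_locations."""
    top = top_asthma_locations(df, n=n)
    ax = top.plot(kind='bar', figsize=(10, 5))
    ax.set_xlabel('Location')
    ax.set_ylabel('DataValue (crude prevalence)')
    ax.set_title(f'Top {n} locations: current asthma prevalence, adults 18+')
    ax.figure.tight_layout()
    return ax
```

In the notebook, `plot_top_asthma_locations(df)` would draw the chart. Note that the result may include non-state locations if the dataset contains them, so an extra filter on `LocationAbbr` might be needed.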


## Level-Up!
## Create a function (or class) that returns an outline of the schema structure like this: 
<img src="images/outline.jpg" width="350">

Rules:
* Your outline should follow the numbering outline above (I, A, 1, a, i).
* Your outline should be properly indented! (Four spaces or one tab per indentation level.)
* Your function should go to a depth of at least 5. (Level-up: create a parameter so that the user can specify this.)
* If an entry is a dictionary, list its keys as the subheadings
* After listing a key name (where applicable) include a space, a dash and the data type of the entry
* If an entry is a dict or list put in parentheses how many items are in the entry
* Lists will not have key names for their entries (they're just indexed).
* For subheadings of a list, state their data types.
* If a dictionary or list is more than 5 items long, only show the first 5 (we want to limit our previews); make an arbitrary order choice for dictionaries. (Level-up: parallel to the above, allow the user to specify the number of items to preview for large subheading collections.)

In [None]:
# Your code here; you will probably want to define subfunctions.
def print_obj_outline(json_obj):
    return outline

In [22]:
outline = print_obj_outline(data)

In [23]:
print(outline) #Your function should produce the following output for this json object (and work for all json files!)

I. root - <class 'dict'> (2 items)
    A. meta <class 'dict'> (1 items)
        1. view <class 'dict'> (40 items)
            a. id <class 'str'> 
            b. name <class 'str'> 
            c. attribution <class 'str'> 
            d. attributionLink <class 'str'> 
            e. averageRating <class 'int'> 
    B. data <class 'list'> (60266 items)
        1. <class 'list'> (42 items)
            a. <class 'int'> 
            b. <class 'str'> 
            c. <class 'int'> 
            d. <class 'int'> 
            e. <class 'str'> 
        2. <class 'list'> (42 items)
            a. <class 'int'> 
            b. <class 'str'> 
            c. <class 'int'> 
            d. <class 'int'> 
            e. <class 'str'> 
        3. <class 'list'> (42 items)
            a. <class 'int'> 
            b. <class 'str'> 
            c. <class 'int'> 
            d. <class 'int'> 
            e. <class 'str'> 
        4. <class 'list'> (42 items)
            a. <class 'int'> 
            b. <c

## Summary

Well done! In this lab you got some extended practice exploring the structure of JSON files and writing a recursive generalized function for outlining a JSON file's schema! 