# JSON examples and exercise
****
+ get familiar with packages for dealing with JSON
+ study examples with JSON strings and files 
+ work on exercise to be completed and submitted 
****
+ reference: http://pandas.pydata.org/pandas-docs/stable/io.html#io-json-reader
+ data source: http://jsonstudio.com/resources/
****

In [None]:
import pandas as pd

## imports for Python, Pandas

In [None]:
import json
# pandas >= 1.0 exposes json_normalize at the top level; the pandas.io.json import path is deprecated
from pandas import json_normalize

## JSON example, with string

+ demonstrates creation of normalized dataframes (tables) from nested json string
+ source: http://pandas.pydata.org/pandas-docs/stable/io.html#normalization

In [None]:
# define nested data (a list of dicts, the form JSON takes once parsed)
data = [{'state': 'Florida', 
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]

In [None]:
# use normalization to create tables from nested element
json_normalize(data, 'counties')

In [None]:
# further populate tables created from nested element
json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
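
The `data` object above is already a Python list of dictionaries rather than a JSON string. As a minimal sketch, assuming pandas 1.0 or later (where the function is available as `pd.json_normalize`), the following parses an actual JSON string with `json.loads` first and then flattens it. The string `raw` and the names `records` and `counties` are made up for illustration.

```python
import json
import pandas as pd

# Hypothetical JSON text: a trimmed-down version of the example above.
raw = '''
[{"state": "Florida", "shortname": "FL",
  "info": {"governor": "Rick Scott"},
  "counties": [{"name": "Dade", "population": 12345},
               {"name": "Broward", "population": 40000}]},
 {"state": "Ohio", "shortname": "OH",
  "info": {"governor": "John Kasich"},
  "counties": [{"name": "Summit", "population": 1234}]}]
'''

# json.loads turns the JSON text into a list of dicts.
records = json.loads(raw)

# One row per county; 'state' and the nested governor are carried along as metadata.
counties = pd.json_normalize(records, record_path='counties',
                             meta=['state', ['info', 'governor']])
print(counties)
```

The nested metadata path `['info', 'governor']` ends up in a column named `info.governor`.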

****
## JSON example, with file

+ demonstrates reading in a json file as Python objects and as a table
+ uses small sample file containing data about projects funded by the World Bank 
+ data source: http://jsonstudio.com/resources/

In [None]:
# load json as Python objects; the with-block closes the file afterwards
with open('data/world_bank_projects_less.json') as f:
    sample_json = json.load(f)
sample_json

In [None]:
# load as Pandas dataframe
sample_json_df = pd.read_json('data/world_bank_projects_less.json')
sample_json_df
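
The difference between the two loading routes can be sketched without the World Bank file. The example below is only an illustration: it writes two made-up records to a temporary file, then reads them back once with `json.load` and once with `pd.read_json`. The file name `sample.json` and the record values are assumptions, not part of the data set.

```python
import json
import os
import tempfile
import pandas as pd

# Made-up records standing in for the World Bank sample file.
records = [{"countryname": "Ethiopia", "totalamt": 130000000},
           {"countryname": "Tunisia", "totalamt": 0}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w") as f:
        json.dump(records, f)

    # json.load returns plain Python objects (here, a list of dicts).
    with open(path) as f:
        as_objects = json.load(f)

    # pd.read_json parses the same file straight into a DataFrame.
    as_frame = pd.read_json(path)

print(type(as_objects).__name__)
print(as_frame.shape)
```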

****
## JSON exercise

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,
1. Find the 10 countries with most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

In [None]:
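# import numpy (used later for np.nan)
import numpy as np

# note: the name `dict` shadows Python's built-in dict type; it is kept because later cells use it
with open('data/world_bank_projects.json') as f:
    dict = json.load(f)
df = pd.DataFrame(dict)
df.head()
df.columns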

# 1. Find the 10 countries with most projects

Steps:
1. Create a dataframe from the parsed JSON (dict) using the pd.DataFrame() constructor.
2. A dataframe is a two-dimensional data structure with labeled rows and columns; operations can be performed using those row and column labels.
3. df.head(n) displays the first 'n' rows of the dataframe (df); the default is the first five rows.
4. df.columns gives the column names of the dataframe.
5. Columns can be selected by name. To identify the countries with the most projects, select df.countryname, which holds the country name of each project.
6. value_counts() returns the count of each distinct value in the series df.countryname, sorted in descending order. The countries with the highest counts have the most projects.
7. Calling head(10) on that result gives the 10 countries with the most projects.
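
The counting steps above can be sketched on a toy dataframe. The country values below are made up and the names `toy` and `top` are only for illustration.

```python
import pandas as pd

# Made-up project list; only the 'countryname' column matters here.
toy = pd.DataFrame({"countryname": ["India", "China", "India",
                                    "Nepal", "China", "India"]})

# value_counts() counts each distinct value and sorts by count, largest first;
# head(2) keeps the two most frequent countries.
top = toy["countryname"].value_counts().head(2)
print(top)
```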

In [None]:
# value_counts() gives the counts of country name 
df.countryname.value_counts().head(10)

# 2. Find the top 10 major project themes (using column 'mjtheme_namecode')


Steps:
1. To identify the 10 major project themes, use the column 'mjtheme_namecode'.
2. Each entry of the mjtheme_namecode column is a list of dictionaries.
3. json_normalize() flattens these lists of dicts into a dataframe, mjp_df.
4. mjp_df has two columns, code and name, but some values in the 'name' column are not filled in.
5. mjp_df.replace() replaces the empty strings with missing values (numpy.nan).
6. value_counts() gives the count of each project theme name. By default it excludes NaN values.
7. Missing names are not relevant to this count, so dropna() is called first to drop them explicitly; it returns a series. Note that rows with a missing name are therefore left out of the counts.
8. mjp_ser is the series holding each project theme name and its count, in descending order. head(10) displays the top 10; the themes with the highest counts are the major project themes.
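
As a small illustration of how empty strings and NaN behave in these steps, the sketch below uses a made-up four-row table. The theme names are only sample values and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for mjp_df: two rows have an empty name.
toy = pd.DataFrame({
    "code": ["8", "11", "8", "11"],
    "name": ["Human development", "", "",
             "Environment and natural resources management"],
})

# Empty strings are not treated as missing until they are replaced with NaN.
toy = toy.replace("", np.nan)

# value_counts() skips NaN by default; dropna=False includes it in the result.
counts = toy["name"].value_counts()
with_missing = toy["name"].value_counts(dropna=False)

print(counts)
print(with_missing)
```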
   

In [None]:
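# json_normalize flattens the nested 'mjtheme_namecode' records into a flat table
mjp_df = json_normalize(dict, 'mjtheme_namecode')
# treat empty names as missing values
mjp_df = mjp_df.replace('', np.nan)
# count each theme name, leaving out missing names
mjp_ser = mjp_df.name.dropna().value_counts()
mjp_ser.head(10)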




Alternatively, group by both code and name:

In [None]:
# group rows by code and name and count the rows in each group (rows whose name is NaN are skipped)
# sort_values(ascending=False) sorts the counts in descending order, largest first
# head(10) keeps the first 10 entries of the sorted series
mjp_df.groupby(['code','name']).size().sort_values(ascending=False).head(10)
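
The same grouping can be tried on a toy table to see what the result looks like. This is a sketch with made-up rows; one row has a missing name to show that groupby leaves it out by default.

```python
import pandas as pd

# Toy stand-in for mjp_df; the last row has no name.
toy = pd.DataFrame({
    "code": ["8", "11", "8", "8"],
    "name": ["Human development", "Environment and natural resources management",
             "Human development", None],
})

# size() counts rows per (code, name) group; the result is a series
# indexed by (code, name) pairs, sorted here with the largest count first.
sizes = toy.groupby(["code", "name"]).size().sort_values(ascending=False)
print(sizes)
```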

# 3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

Steps:
1. The shape attribute gives the number of rows and columns of the dataframe.
2. To count the NaN values in the name column, isnull() and sum() are used. isnull() returns a boolean series, True where the value is missing and False otherwise. sum() then counts the True values, because True is treated as 1 and False as 0.
3. To count the non-missing values, notnull() and sum() are used.
4. To fill in the missing names, use the 'code' of each row to look up its name. Each code corresponds to one theme name.
5. First, build the mapping between code and name with groupby(), which groups the rows by the columns passed as arguments. After splitting, size() counts the rows in each group; rows with a missing name are excluded.
6. ser is the series holding the result of groupby and size: one entry per (code, name) pair with its count. It is converted to a dataframe, mjp_name_df, with pd.DataFrame().
7. reset_index(inplace=True) turns the index levels code and name into regular columns. inplace=True modifies the dataframe itself instead of returning a new one; the default is False.
8. To fill the missing values, merge mjp_df[['code']] with mjp_name_df on the column 'code' with how='left'. Every row of the left dataframe is kept, and the matching name from the right dataframe is attached to it.
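
The lookup-and-merge idea in the steps above can be sketched on a toy table. The rows are made up, and `lookup` and `filled` are illustrative names; the sketch uses `reset_index(name="count")` on the series as a shortcut for the DataFrame conversion.

```python
import numpy as np
import pandas as pd

# Toy stand-in for mjp_df: each code appears once with and once without a name.
toy = pd.DataFrame({
    "code": ["8", "11", "8", "11"],
    "name": ["Human development", np.nan, np.nan,
             "Environment and natural resources management"],
})

# Lookup table: one row per (code, name) pair; rows with a missing name are skipped by groupby.
lookup = toy.groupby(["code", "name"]).size().reset_index(name="count")

# Left merge keeps every row of the left frame and attaches the name for its code.
filled = pd.merge(toy[["code"]], lookup[["code", "name"]], on="code", how="left")
print(filled)
```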

In [None]:
mjp_df.shape

In [None]:
mjp_df.name.isnull().sum()

In [None]:
mjp_df.name.notnull().sum()
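
A quick sketch of why summing a boolean series counts the missing values, using a made-up series:

```python
import numpy as np
import pandas as pd

names = pd.Series(["Human development", np.nan, "Rural development", np.nan])

# isnull() marks missing entries as True; sum() adds them up as 1s.
missing = names.isnull().sum()
# notnull() is the complement, so the two counts add up to the length.
present = names.notnull().sum()

print(missing, present)
```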

In [None]:
ser=mjp_df.groupby(['code','name']).size()
ser

In [None]:
mjp_name_df=pd.DataFrame(ser)
mjp_name_df.reset_index(inplace=True)
mjp_name_df.dtypes
mjp_name_df


In [None]:
# left merge: keep every row of mjp_df[['code']] and attach the matching name from mjp_name_df
mjp_filled_df = pd.merge(mjp_df[['code']], mjp_name_df[['code','name']], on='code', how='left')
mjp_filled_df
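
As an alternative sketch, not the approach used above, the missing names can also be filled without a merge by mapping each code to its name. This assumes each code has exactly one name; the toy rows and the name `code_to_name` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for mjp_df with two missing names.
toy = pd.DataFrame({
    "code": ["8", "11", "8", "11"],
    "name": ["Human development", np.nan, np.nan,
             "Environment and natural resources management"],
})

# Build a code -> name series from the rows that do have a name.
code_to_name = (toy.dropna(subset=["name"])
                   .drop_duplicates("code")
                   .set_index("code")["name"])

# Fill only the missing names, looking each one up by its code.
toy["name"] = toy["name"].fillna(toy["code"].map(code_to_name))
print(toy)
```

Unlike the merge, this keeps the original index and any other columns of the dataframe in place.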
        