# JSON examples and exercise
****
+ get familiar with packages for dealing with JSON
+ study examples with JSON strings and files 
+ work on exercise to be completed and submitted 
****
+ reference: http://pandas.pydata.org/pandas-docs/stable/io.html#io-json-reader
+ data source: http://jsonstudio.com/resources/
****
## JSON exercise

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,
1. Find the 10 countries with most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

In [94]:
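import json
import pandas as pd
from pandas import json_normalize  # pandas < 1.0: from pandas.io.json import json_normalize
# load json as plain Python objects (a list of dicts)
with open('world_bank_projects.json') as f:
    d = json.load(f)
# load json as a dataframe
d_df = pd.read_json('world_bank_projects.json')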
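The two loaders return differently shaped data, which matters for the questions below: the list of dicts feeds `json_normalize`, while the dataframe is convenient for column-wise counts. A minimal sketch on a made-up two-record sample (the records and the names `sample`, `records` and `frame` are illustrative assumptions, not taken from the World Bank file):

```python
import io
import json

import pandas as pd

# Hypothetical sample mimicking the structure of the project file
sample = """[
  {"countryname": "Nepal",
   "mjtheme_namecode": [{"code": "8", "name": "Human development"},
                        {"code": "11", "name": ""}]},
  {"countryname": "Republic of India",
   "mjtheme_namecode": [{"code": "8", "name": ""}]}
]"""

# json.loads gives plain Python objects: here, a list of dicts
records = json.loads(sample)

# pd.read_json gives a DataFrame; each nested list stays inside a single cell
frame = pd.read_json(io.StringIO(sample))

print(type(records).__name__, len(records))
print(frame.shape)
print(type(frame.loc[0, "mjtheme_namecode"]).__name__)
```

Because the nested `mjtheme_namecode` lists remain packed inside one column of the dataframe, Q2 and Q3 work from the plain Python objects instead.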

### Q1: Find the 10 countries with most projects

In [95]:
d_df.countryname.value_counts().head(10)

Republic of Indonesia              19
People's Republic of China         19
Socialist Republic of Vietnam      17
Republic of India                  16
Republic of Yemen                  13
Nepal                              12
Kingdom of Morocco                 12
People's Republic of Bangladesh    12
Africa                             11
Republic of Mozambique             11
Name: countryname, dtype: int64

### Q2: Find the top 10 major project themes (using column 'mjtheme_namecode')

In [96]:
#json_normalize mjtheme_namecode and get counts by name; extract top 10 only
json_normalize(d, 'mjtheme_namecode').name.value_counts()[:10]

Environment and natural resources management    223
Rural development                               202
Human development                               197
Public sector governance                        184
Social protection and risk management           158
Financial and private sector development        130
                                                122
Social dev/gender/inclusion                     119
Trade and integration                            72
Urban development                                47
Name: name, dtype: int64

In [97]:
#json_normalize mjtheme_namecode and get counts by code; extract top 10 only
json_normalize(d, 'mjtheme_namecode').code.value_counts()[:10]

11    250
10    216
8     210
2     199
6     168
4     146
7     130
5      77
9      50
1      38
Name: code, dtype: int64

In [49]:
# Note: the top-ten counts by code do not match the counts by name because some entries have an empty name. This is fixed in Q3 below.
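To see why the two counts diverge, it may help to flatten a tiny hand-made record list. The sketch below assumes pandas 1.0 or later (where `pd.json_normalize` is available at top level); the three records are hypothetical and only mimic the shape of the real data:

```python
import pandas as pd

# Hypothetical records; some theme entries carry a code but an empty name
records = [
    {"id": "P1", "mjtheme_namecode": [
        {"code": "8", "name": "Human development"},
        {"code": "11", "name": ""}]},
    {"id": "P2", "mjtheme_namecode": [
        {"code": "8", "name": ""},
        {"code": "11", "name": "Environment and natural resources management"}]},
    {"id": "P3", "mjtheme_namecode": [
        {"code": "8", "name": "Human development"}]},
]

# One output row per entry of each project's 'mjtheme_namecode' list
themes = pd.json_normalize(records, record_path="mjtheme_namecode")

by_code = themes["code"].value_counts()
by_name = themes["name"].value_counts()

print(by_code)
print(by_name)
```

Counting by code groups every entry under its theme, whereas counting by name puts the empty strings into a separate bucket of their own, which is the unlabeled row in the by-name output above.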

### Q3: In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

In [98]:
# normalise for 'mjtheme_namecode' and obtain list of names for a later merge
th = json_normalize(d, 'mjtheme_namecode')
th_name = th[th.name!=""].drop_duplicates()
th_name.head(10)

|    | code | name |
|----|------|------|
| 0  | 8    | Human development |
| 2  | 1    | Economic management |
| 3  | 6    | Social protection and risk management |
| 4  | 5    | Trade and integration |
| 5  | 2    | Public sector governance |
| 6  | 11   | Environment and natural resources management |
| 8  | 7    | Social dev/gender/inclusion |
| 11 | 4    | Financial and private sector development |
| 18 | 10   | Rural development |
| 53 | 9    | Urban development |


In [99]:
# merge on the 'code' column so that every row picks up the non-empty name for its code
th = th.merge(th_name,on='code')

In [100]:
# drop name column with missing values and rename new name column as 'name'
th = th.drop(['name_x'],axis=1)
th = th.rename(columns={'name_y':'name'})

In [101]:
th.name.value_counts()[:10]
# this should now produce a top-ten list of names that matches the top-ten codes in Q2

Environment and natural resources management    250
Rural development                               216
Human development                               210
Public sector governance                        199
Social protection and risk management           168
Financial and private sector development        146
Social dev/gender/inclusion                     130
Trade and integration                            77
Urban development                                50
Economic management                              38
Name: name, dtype: int64
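As an alternative to the merge, drop and rename steps, the same fill can be sketched with a code-to-name lookup and `Series.map`. This is only one possible approach, shown on hypothetical rows; the names `lookup` and `filled` are introduced here for illustration:

```python
import pandas as pd

# Hypothetical theme rows with some names left empty
th = pd.DataFrame({
    "code": ["8", "11", "8", "11", "1"],
    "name": ["Human development", "", "",
             "Environment and natural resources management",
             "Economic management"],
})

# Build a code -> name lookup from the rows that do have a name
lookup = (th[th["name"] != ""]
          .drop_duplicates()
          .set_index("code")["name"])

# Replace every name by looking up its code
filled = th.assign(name=th["code"].map(lookup))

print(filled["name"].value_counts())
```

Compared with the merge, this keeps the original row index and avoids the `name_x`/`name_y` clean-up. It assumes each code has exactly one non-empty name; a code that never appears with a name would come out as `NaN`.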