# JSON examples and exercise
****
+ get familiar with packages for dealing with JSON
+ study examples with JSON strings and files 
+ work on exercise to be completed and submitted 
****
+ reference: http://pandas.pydata.org/pandas-docs/stable/io.html#io-json-reader
+ data source: http://jsonstudio.com/resources/

## JSON exercise

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,
1. Find the 10 countries with the most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

## Q1: Find the 10 countries with the most projects

In [1]:
# Import necessary modules
import pandas as pd
import json
# json_normalize is exposed at the top level in pandas 1.0 and later
from pandas import json_normalize

In [2]:
# Load the JSON file into Python objects (a list of dicts, one per project)
with open('/Users/scoredy/Downloads/data_wrangling_json/data/world_bank_projects.json') as f:
    data = json.load(f)

In [3]:
# Read in json file as a Pandas df
json_df = pd.read_json('/Users/scoredy/Downloads/data_wrangling_json/data/world_bank_projects.json')
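The two loading calls above return different things, which matters later when the nested theme lists are flattened. A minimal sketch of the difference, using a made-up two-record sample (the names `sample_text`, `records`, and `sample_df` are hypothetical and not part of the exercise data):

```python
import io
import json

import pandas as pd

# Hypothetical two-record sample shaped like the World Bank file
sample_text = json.dumps([
    {"countryshortname": "China",
     "mjtheme_namecode": [{"code": "8", "name": "Human development"},
                          {"code": "11", "name": ""}]},
    {"countryshortname": "Nepal",
     "mjtheme_namecode": [{"code": "1", "name": "Economic management"}]},
])

# json.loads (like json.load on a file) gives plain Python objects
records = json.loads(sample_text)

# pd.read_json gives a DataFrame; nested lists stay as Python objects in the cells
sample_df = pd.read_json(io.StringIO(sample_text))

print(type(records).__name__, type(records[0]).__name__)
print(sample_df.shape)
```

The plain Python structure is what `json_normalize` expects when flattening a nested field, while the DataFrame is convenient for column-wise counting.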

In [4]:
# Inspect dataframe
json_df.columns

Index(['_id', 'approvalfy', 'board_approval_month', 'boardapprovaldate',
       'borrower', 'closingdate', 'country_namecode', 'countrycode',
       'countryname', 'countryshortname', 'docty', 'envassesmentcategorycode',
       'grantamt', 'ibrdcommamt', 'id', 'idacommamt', 'impagency',
       'lendinginstr', 'lendinginstrtype', 'lendprojectcost',
       'majorsector_percent', 'mjsector_namecode', 'mjtheme',
       'mjtheme_namecode', 'mjthemecode', 'prodline', 'prodlinetext',
       'productlinetype', 'project_abstract', 'project_name', 'projectdocs',
       'projectfinancialtype', 'projectstatusdisplay', 'regionname', 'sector',
       'sector1', 'sector2', 'sector3', 'sector4', 'sector_namecode',
       'sectorcode', 'source', 'status', 'supplementprojectflg', 'theme1',
       'theme_namecode', 'themecode', 'totalamt', 'totalcommamt', 'url'],
      dtype='object')

In [5]:
json_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 50 columns):
_id                         500 non-null object
approvalfy                  500 non-null int64
board_approval_month        500 non-null object
boardapprovaldate           500 non-null object
borrower                    485 non-null object
closingdate                 370 non-null object
country_namecode            500 non-null object
countrycode                 500 non-null object
countryname                 500 non-null object
countryshortname            500 non-null object
docty                       446 non-null object
envassesmentcategorycode    430 non-null object
grantamt                    500 non-null int64
ibrdcommamt                 500 non-null int64
id                          500 non-null object
idacommamt                  500 non-null int64
impagency                   472 non-null object
lendinginstr                495 non-null object
lendinginstrtype            495 non

In [6]:
# How many projects does each country have?
json_df.countryname.value_counts(dropna=False)

Republic of Indonesia                       19
People's Republic of China                  19
Socialist Republic of Vietnam               17
Republic of India                           16
Republic of Yemen                           13
Kingdom of Morocco                          12
Nepal                                       12
People's Republic of Bangladesh             12
Africa                                      11
Republic of Mozambique                      11
Federative Republic of Brazil                9
Islamic Republic of Pakistan                 9
Burkina Faso                                 9
United Republic of Tanzania                  8
Republic of Tajikistan                       8
Republic of Armenia                          8
Lao People's Democratic Republic             7
Kyrgyz Republic                              7
Hashemite Kingdom of Jordan                  7
Federal Republic of Nigeria                  7
West Bank and Gaza                           6
Republic of N

In [7]:
# Does any project name occur more than once?
json_df.project_name.value_counts(dropna=False)

JO-Badia Ecosystem and Livelihoods                                                                                                         1
MNXTA: Enhancing Microfinance Amongst Women and Youth in MENA                                                                              1
Social Inclusion and Improvement of Livelihoods of Youth, Vulnerable Women and Handicapped in Post Conflict Western Cote d&#8217;Ivoire    1
Tajikistan JSDF Nutrition Grant Scale Up                                                                                                   1
Second Rural Transport Improvement Project                                                                                                 1
Tajikistan PDPG6                                                                                                                           1
LB Supporting Innovation in SMEs Project                                                                                                   1
Afghanistan A

Every one of the 500 project names is unique, while country names repeat, so the number of rows per country is that country's number of projects. Note that "Africa" is a regional grouping rather than a single country.
<br/>
Let's create a top 10 list of those countries (using the shorter `countryshortname` labels)

In [8]:
top_10_countries = pd.DataFrame(json_df['countryshortname'].value_counts().head(10))
top_10_countries

| | countryshortname |
|---|---|
| China | 19 |
| Indonesia | 19 |
| Vietnam | 17 |
| India | 16 |
| Yemen, Republic of | 13 |
| Bangladesh | 12 |
| Morocco | 12 |
| Nepal | 12 |
| Mozambique | 11 |
| Africa | 11 |
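The counting step can be sketched on a small made-up Series. The sample values and the column labels `country` and `projects` are assumptions chosen for readability; they are not part of the original notebook:

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be json_df['countryshortname']
countries = pd.Series(
    ["China", "China", "China", "Nepal", "Nepal", "India"],
    name="countryshortname",
)

top_2 = (
    countries.value_counts()        # counts per value, sorted in descending order
    .head(2)                        # keep the two most frequent values
    .rename_axis("country")         # label the index before turning it into a column
    .reset_index(name="projects")   # two explicit columns instead of an unnamed index
)
print(top_2)
```

Naming both columns explicitly avoids a count column that is itself called `countryshortname`, which is easy to misread.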


## Q2: Find the top 10 major project themes (using column 'mjtheme_namecode')

We will return to this after answering Q3.

## Q3: In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

In [9]:
# Normalize 'mjtheme_namecode' within data to see what we're working with
norm = json_normalize(data, 'mjtheme_namecode')
norm

| | code | name |
|---|---|---|
| 0 | 8 | Human development |
| 1 | 11 | |
| 2 | 1 | Economic management |
| 3 | 6 | Social protection and risk management |
| 4 | 5 | Trade and integration |
| 5 | 2 | Public sector governance |
| 6 | 11 | Environment and natural resources management |
| 7 | 6 | Social protection and risk management |
| 8 | 7 | Social dev/gender/inclusion |
| 9 | 7 | Social dev/gender/inclusion |
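To make the flattening step concrete, here is a minimal sketch on two made-up records, assuming pandas 1.0 or later, where the function is available as `pd.json_normalize`. The `meta` argument is an optional extra not used in the notebook; it carries a parent field along with each flattened row:

```python
import pandas as pd

# Hypothetical records shaped like entries in world_bank_projects.json
records = [
    {"countryshortname": "China",
     "mjtheme_namecode": [{"code": "8", "name": "Human development"},
                          {"code": "11", "name": ""}]},
    {"countryshortname": "Nepal",
     "mjtheme_namecode": [{"code": "1", "name": "Economic management"}]},
]

# One output row per dict in each project's 'mjtheme_namecode' list
flat = pd.json_normalize(
    records,
    record_path="mjtheme_namecode",
    meta=["countryshortname"],
)
print(flat)
```

Each project contributes as many rows as it has theme entries, which is why the flattened table is longer than the list of projects.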


Looks like there are many missing values. Let's clean this up and find our top 10 major project themes.

In [10]:
# Sort norm: normalized
normalized = norm.sort_values(['name', 'code'])
normalized

| | code | name |
|---|---|---|
| 212 | 1 | |
| 363 | 1 | |
| 1024 | 1 | |
| 1114 | 1 | |
| 1437 | 1 | |
| 121 | 10 | |
| 165 | 10 | |
| 217 | 10 | |
| 275 | 10 | |
| 391 | 10 | |


In [11]:
# Create a df of the correct (code, name) pairs: theme_pairs
# Filtering out the empty names avoids relying on a hard-coded row offset
theme_pairs = normalized[normalized['name'] != ''].drop_duplicates()
theme_pairs

| | code | name |
|---|---|---|
| 2 | 1 | Economic management |
| 6 | 11 | Environment and natural resources management |
| 11 | 4 | Financial and private sector development |
| 0 | 8 | Human development |
| 5 | 2 | Public sector governance |
| 252 | 3 | Rule of law |
| 18 | 10 | Rural development |
| 8 | 7 | Social dev/gender/inclusion |
| 3 | 6 | Social protection and risk management |
| 4 | 5 | Trade and integration |


In [12]:
# Left-join theme_pairs onto the original norm by 'code', so every row gets its full name
new_normalized = pd.merge(norm, theme_pairs, on='code', how='left', suffixes=('_old_name', '_new_name'))
new_normalized

| | code | name_old_name | name_new_name |
|---|---|---|---|
| 0 | 8 | Human development | Human development |
| 1 | 11 | | Environment and natural resources management |
| 2 | 1 | Economic management | Economic management |
| 3 | 6 | Social protection and risk management | Social protection and risk management |
| 4 | 5 | Trade and integration | Trade and integration |
| 5 | 2 | Public sector governance | Public sector governance |
| 6 | 11 | Environment and natural resources management | Environment and natural resources management |
| 7 | 6 | Social protection and risk management | Social protection and risk management |
| 8 | 7 | Social dev/gender/inclusion | Social dev/gender/inclusion |
| 9 | 7 | Social dev/gender/inclusion | Social dev/gender/inclusion |


In [13]:
# With the missing values filled, drop the old name column, which still contains the blanks
new_normalized.pop('name_old_name')
new_normalized

| | code | name_new_name |
|---|---|---|
| 0 | 8 | Human development |
| 1 | 11 | Environment and natural resources management |
| 2 | 1 | Economic management |
| 3 | 6 | Social protection and risk management |
| 4 | 5 | Trade and integration |
| 5 | 2 | Public sector governance |
| 6 | 11 | Environment and natural resources management |
| 7 | 6 | Social protection and risk management |
| 8 | 7 | Social dev/gender/inclusion |
| 9 | 7 | Social dev/gender/inclusion |
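As an alternative to merging and then removing a column, the same fill can be sketched with a code-to-name lookup and `Series.map`. This is a different technique from the merge used above, shown on a made-up miniature table; the names `themes`, `code_to_name`, and `filled` are hypothetical:

```python
import pandas as pd

# Hypothetical miniature of the normalized theme table; '' marks a missing name
themes = pd.DataFrame({
    "code": ["8", "11", "1", "11", "8"],
    "name": ["Human development", "", "Economic management",
             "Environment and natural resources management", ""],
})

# Build a code -> name lookup from the rows that do have a name
named = themes[themes["name"] != ""].drop_duplicates("code")
code_to_name = named.set_index("code")["name"]

# Look up every row's name from its code, overwriting the blanks
filled = themes.assign(name=themes["code"].map(code_to_name))
print(filled)
```

This keeps a single `name` column throughout, at the cost of assuming that each code corresponds to exactly one name.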


Returning to Q2, let's see what the top 10 major project themes are.

In [14]:
# Count how often each project theme name occurs, as in Q1
top_10_themes = pd.DataFrame(new_normalized['name_new_name'].value_counts().head(10))
top_10_themes

| | name_new_name |
|---|---|
| Environment and natural resources management | 250 |
| Rural development | 216 |
| Human development | 210 |
| Public sector governance | 199 |
| Social protection and risk management | 168 |
| Financial and private sector development | 146 |
| Social dev/gender/inclusion | 130 |
| Trade and integration | 77 |
| Urban development | 50 |
| Economic management | 38 |
