# JSON Data Wrangling

#### Problem Statement
Given World Bank project data in JSON format:
1. Find the 10 countries with the most projects.
2. Find the top 10 major project themes (using column 'mjtheme_namecode').
3. In 2. above, some entries have only the code, and the name is missing. Create a dataframe with the missing names filled in.

#### Dataset

+ A JSON file containing data about projects funded by the World Bank: 'data/world_bank_projects.json'

*****
## Problem 1 - Find the 10 countries with the most projects

In [1]:
# Import packages
import json
import pandas as pd
from collections import Counter

# load the JSON file as a Pandas dataframe
json_df = pd.read_json('data/world_bank_projects.json')

# Display the count of values in the countryname column, for the top 10
json_df.countryname.value_counts().head(10)

People's Republic of China         19
Republic of Indonesia              19
Socialist Republic of Vietnam      17
Republic of India                  16
Republic of Yemen                  13
Nepal                              12
People's Republic of Bangladesh    12
Kingdom of Morocco                 12
Republic of Mozambique             11
Africa                             11
Name: countryname, dtype: int64

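These results show that the People's Republic of China and the Republic of Indonesia tie for the most World Bank projects, with 19 each. Note that "Africa" in the tenth row is a regional grouping rather than a single country.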
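
Because `head(10)` cuts the list at exactly ten rows, any country tied with the tenth entry would be dropped without notice. The sketch below uses made-up data (the names and counts are hypothetical, not taken from the World Bank file) to show how `Series.nlargest` with `keep='all'` retains ties at the cutoff.

```python
import pandas as pd

# Hypothetical toy data: one row per project
toy = pd.DataFrame({'countryname': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D']})

counts = toy['countryname'].value_counts()

# head(2) keeps exactly two rows, even though B and C are tied at 2 projects
top_fixed = counts.head(2)

# nlargest(2, keep='all') also keeps every row tied with the last kept value
top_with_ties = counts.nlargest(2, keep='all')

print(top_with_ties)
```

Applied to the real data, this would be `json_df.countryname.value_counts().nlargest(10, keep='all')`, which may return more than ten rows if there is a tie at the tenth position.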

*****
## Problem 2 - Find the Top 10 Major Project Themes

In [2]:
# Show the column where the Major Project Themes are stored
json_df.mjtheme_namecode.head()

0    [{'code': '8', 'name': 'Human development'}, {...
1    [{'code': '1', 'name': 'Economic management'},...
2    [{'code': '5', 'name': 'Trade and integration'...
3    [{'code': '7', 'name': 'Social dev/gender/incl...
4    [{'code': '5', 'name': 'Trade and integration'...
Name: mjtheme_namecode, dtype: object

We see that the Major Project Themes are in a list of dictionaries in the column mjtheme_namecode.

To count the Major Project Themes, we must loop through each dictionary in each row.

In [3]:
# Create a counter for projects by each major theme
projects_by_mjtheme = Counter()

# Loop through each instance of mjtheme_namecode
for major_theme_nc in json_df.mjtheme_namecode:
    major_proj_theme = pd.json_normalize(major_theme_nc, max_level=1)
    
    # Loop through each theme name in this mjtheme_namecode
    for n in major_proj_theme['name']:
        projects_by_mjtheme[n] += 1

# Print the counter 
projects_by_mjtheme


Counter({'Human development': 197,
         '': 122,
         'Economic management': 33,
         'Social protection and risk management': 158,
         'Trade and integration': 72,
         'Public sector governance': 184,
         'Environment and natural resources management': 223,
         'Social dev/gender/inclusion': 119,
         'Financial and private sector development': 130,
         'Rural development': 202,
         'Urban development': 47,
         'Rule of law': 12})

In [4]:
# Convert the Counter dictionary to a dataframe df_major_theme_count for sorting and display.
df_major_theme_count = pd.DataFrame.from_dict(projects_by_mjtheme, orient='index', columns=['Count'])
df_major_theme_count.index.name = 'Major Project Theme'

# Display the count of the 10 major project themes
df_major_theme_count.sort_values('Count', ascending=False).head(10)

| Major Project Theme | Count |
|---|---|
| Environment and natural resources management | 223 |
| Rural development | 202 |
| Human development | 197 |
| Public sector governance | 184 |
| Social protection and risk management | 158 |
| Financial and private sector development | 130 |
| `''` | 122 |
| Social dev/gender/inclusion | 119 |
| Trade and integration | 72 |
| Urban development | 47 |
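
If only the ranking is needed, the conversion to a dataframe is optional: `Counter.most_common` from the standard library returns the highest counts directly. A minimal sketch with a hypothetical toy tally (not the real counts):

```python
from collections import Counter

# Hypothetical toy tally, not the real World Bank counts
toy_counts = Counter({'Theme A': 5, 'Theme B': 9, '': 3, 'Theme C': 7})

# most_common(n) returns the n highest counts as (key, count) pairs, largest first
top_two = toy_counts.most_common(2)
print(top_two)  # [('Theme B', 9), ('Theme C', 7)]
```

With the notebook's counter this would be `projects_by_mjtheme.most_common(10)`. The dataframe route is still useful when a labelled table is wanted for display.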


The most common major theme for World Bank projects is "Environment and natural resources management", which appears 223 times.

However, 122 theme entries have a blank name (the empty-string key `''` above), so these counts are incomplete.
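
The explicit loop in In [3] can also be written with pandas operations alone. The sketch below is one possible alternative, shown on hypothetical toy rows that mimic the shape of `mjtheme_namecode`. It assumes every row holds at least one dictionary, because an empty list would explode to `NaN` and break `json_normalize`.

```python
import pandas as pd

# Hypothetical toy rows mimicking the shape of mjtheme_namecode
toy = pd.DataFrame({
    'mjtheme_namecode': [
        [{'code': '8', 'name': 'Human development'}, {'code': '11', 'name': ''}],
        [{'code': '1', 'name': 'Economic management'}],
        [{'code': '8', 'name': 'Human development'}],
    ]
})

# explode() yields one element per dictionary;
# json_normalize() then expands the dictionaries into 'code' and 'name' columns
flat = pd.json_normalize(toy['mjtheme_namecode'].explode().tolist())

# Count how often each name occurs, blank names included
name_counts = flat['name'].value_counts()
print(name_counts)
```

This calls `json_normalize` once for the whole column rather than once per row, which should matter more as the dataset grows.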

*****
## Problem 3 - Handling Missing Values
As shown in Problem 2, the name of the major project theme is missing in 122 instances. We can fill in the missing values based on other data using these steps:
1. Create a dictionary of major project theme codes and names.
2. Populate the missing values from this new dictionary.
3. Rerun the analysis to find the top 10 major project themes in the corrected dataset.


In [5]:
# Create a dictionary of major project theme codes and their associated names
mjtheme_dict = {}

# Loop through each instance of mjtheme_namecode
for mjtheme_nc in json_df.mjtheme_namecode:
    
    #Loop through each dictionary in this mjtheme_namecode
    for d in mjtheme_nc:
        
        # If the name is not blank, add the code and name pair to the dictionary
        if d['name'] != '':
            mjtheme_dict[d['code']] = d['name']

# Sort and display the dictionary
print(sorted(mjtheme_dict.items()))

[('1', 'Economic management'), ('10', 'Rural development'), ('11', 'Environment and natural resources management'), ('2', 'Public sector governance'), ('3', 'Rule of law'), ('4', 'Financial and private sector development'), ('5', 'Trade and integration'), ('6', 'Social protection and risk management'), ('7', 'Social dev/gender/inclusion'), ('8', 'Human development'), ('9', 'Urban development')]


We see that major projects are classified under 11 distinct themes. 
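
The loop in In [5] overwrites the dictionary entry on every assignment, so if one code ever appeared with two different names, the last one seen would win without warning. The following sketch shows one way to check for that by collecting the names per code in a set. The toy rows and the names `names_by_code` and `conflicts` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical toy rows in the same shape as mjtheme_namecode
toy_rows = [
    [{'code': '8', 'name': 'Human development'}, {'code': '11', 'name': ''}],
    [{'code': '8', 'name': 'Human development'}],
    [{'code': '11', 'name': 'Environment and natural resources management'}],
]

# Collect every non-blank name seen for each code
names_by_code = defaultdict(set)
for row in toy_rows:
    for d in row:
        if d['name'] != '':
            names_by_code[d['code']].add(d['name'])

# Any code with more than one distinct name is ambiguous
conflicts = {code: names for code, names in names_by_code.items() if len(names) > 1}
print(conflicts)
```

An empty `conflicts` dictionary means each code maps to exactly one name, so filling blanks from the lookup is unambiguous.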

Now, create a copy of the dataframe with the missing values filled in from the dictionary. 

In [6]:
# Create an independent copy of the JSON dataframe.
# Plain assignment (df_corrected = json_df) would only alias the original, and
# DataFrame.copy() does not copy the nested lists of dicts, so deep-copy that column.
import copy
df_corrected = json_df.copy()
df_corrected['mjtheme_namecode'] = df_corrected['mjtheme_namecode'].apply(copy.deepcopy)

# Set counter
replacements_made = 0

# Loop through each instance of mjtheme_namecode
for mjtheme_nc in df_corrected.mjtheme_namecode:

    # Loop through each dictionary in this mjtheme_namecode
    for d in mjtheme_nc:

        # If the name *is* blank, replace it in place with the dictionary value
        if d['name'] == '':
            d['name'] = mjtheme_dict[d['code']]
            replacements_made += 1

# Display the counter to check how many blank names were replaced     
print(replacements_made)

122


As expected, 122 blank names were replaced.

Now, rerun the count on the corrected dataframe to show the Top 10 Major Project Themes.
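
Because the theme entries are Python dictionaries nested inside lists, whether the original dataframe stays untouched depends on how the copy is made. The sketch below uses a hypothetical one-row dataframe (not the World Bank file) to compare a `DataFrame.copy()` with a copy whose nested column has been passed through `copy.deepcopy`.

```python
import copy
import pandas as pd

# Hypothetical one-row dataframe with a nested list of dicts
toy = pd.DataFrame({'themes': [[{'code': '8', 'name': ''}]]})

alias = toy                # plain assignment: the very same object
shallow = toy.copy()       # new dataframe, but the nested lists/dicts are shared

# Independent copy: deep-copy the nested Python objects explicitly
independent = toy.copy()
independent['themes'] = independent['themes'].apply(copy.deepcopy)

# Editing the independent copy leaves the original untouched
independent.at[0, 'themes'][0]['name'] = 'Human development'
original_after_independent = toy.at[0, 'themes'][0]['name']

# Editing through the shared copy changes the original as well
shallow.at[0, 'themes'][0]['name'] = 'Changed via shared copy'
original_after_shallow = toy.at[0, 'themes'][0]['name']

print(repr(original_after_independent), repr(original_after_shallow))
```

The pandas documentation notes that `copy(deep=True)` copies the data but not Python objects stored inside it, which is why the explicit `copy.deepcopy` step is needed for columns like `mjtheme_namecode`.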

In [7]:
# Reset the counter for projects by each major project theme
theme_count = Counter()

# Loop through each instance of mjtheme_namecode in the corrected dataframe
for major_theme_nc in df_corrected.mjtheme_namecode:
    major_proj_theme = pd.json_normalize(major_theme_nc, max_level=1)
    
    # Loop through each theme name in this mjtheme_namecode
    for n in major_proj_theme['name']:
        theme_count[n] += 1

# Convert the Counter dictionary to a dataframe df_major_theme_count for sorting and display.
df_major_theme_count = pd.DataFrame.from_dict(theme_count, orient='index', columns=['Count'])
df_major_theme_count.index.name = 'Major Project Theme'

# Display the count of the 10 major project themes
df_major_theme_count.sort_values('Count', ascending=False).head(10)

| Major Project Theme | Count |
|---|---|
| Environment and natural resources management | 250 |
| Rural development | 216 |
| Human development | 210 |
| Public sector governance | 199 |
| Social protection and risk management | 168 |
| Financial and private sector development | 146 |
| Social dev/gender/inclusion | 130 |
| Trade and integration | 77 |
| Urban development | 50 |
| Economic management | 38 |
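
The fill-in step can also be done on a flat table of code and name pairs, which gives a dataframe with the missing names filled in as a direct result. This is a sketch of an alternative, not the method used above. The toy table and the names `flat`, `code_to_name`, and `filled` are hypothetical.

```python
import pandas as pd

# Hypothetical flat table: one row per theme entry
flat = pd.DataFrame({
    'code': ['8', '11', '8', '11'],
    'name': ['Human development', '', '',
             'Environment and natural resources management'],
})

# Build the code -> name lookup from the rows whose name is present
code_to_name = (
    flat[flat['name'] != '']
    .drop_duplicates('code')
    .set_index('code')['name']
)

# Keep existing names; where the name is blank, look it up by code
filled = flat.copy()
filled['name'] = filled['name'].where(
    filled['name'] != '', filled['code'].map(code_to_name)
)

print(filled['name'].value_counts())
```

A code that never appears with a name would map to `NaN` here, so it stays visibly missing rather than raising a `KeyError`.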
