# Springboard Data Science - JSON Assignment 1



Using data in file 'data/world_bank_projects.json' and the techniques demonstrated in the example notebook,
1. Find the 10 countries with most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

## 1. Top 10 Countries for Projects
### Task: Find the 10 countries with most projects

In [1]:
import pandas as pd
import data_wrangling_json

First we generate a **`pandas.DataFrame`** by reading in the **"world_bank_projects.json"** file using the **`pandas.read_json`** function.
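As a rough illustration of what `pandas.read_json` does, the sketch below reads a small in-memory JSON array instead of the real file. The two records and the `sample_json` / `df_sample` names are made up for this example; with the actual data, the call would presumably be `pd.read_json('data/world_bank_projects.json')`.

```python
import io

import pandas as pd

# Hypothetical two-record stand-in for 'data/world_bank_projects.json'.
sample_json = io.StringIO(
    '[{"countryshortname": "Ethiopia", "totalamt": 130000000},'
    ' {"countryshortname": "Tunisia", "totalamt": 0}]'
)

# A JSON array of objects becomes one row per object,
# with one column per key.
df_sample = pd.read_json(sample_json)
print(df_sample)
```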

In [2]:
data_wrangling_json.df_world_bank_projects.head(3)

Unnamed: 0,_id,approvalfy,board_approval_month,boardapprovaldate,borrower,closingdate,country_namecode,countrycode,countryname,countryshortname,...,sectorcode,source,status,supplementprojectflg,theme1,theme_namecode,themecode,totalamt,totalcommamt,url
0,{'$oid': '52b213b38594d8a2be17c780'},1999,November,2013-11-12T00:00:00Z,FEDERAL DEMOCRATIC REPUBLIC OF ETHIOPIA,2018-07-07T00:00:00Z,Federal Democratic Republic of Ethiopia!$!ET,ET,Federal Democratic Republic of Ethiopia,Ethiopia,...,"ET,BS,ES,EP",IBRD,Active,N,"{'Name': 'Education for all', 'Percent': 100}","[{'code': '65', 'name': 'Education for all'}]",65,130000000,130000000,http://www.worldbank.org/projects/P129828/ethi...
1,{'$oid': '52b213b38594d8a2be17c781'},2015,November,2013-11-04T00:00:00Z,GOVERNMENT OF TUNISIA,,Republic of Tunisia!$!TN,TN,Republic of Tunisia,Tunisia,...,"BZ,BS",IBRD,Active,N,"{'Name': 'Other economic management', 'Percent...","[{'code': '24', 'name': 'Other economic manage...",5424,0,4700000,http://www.worldbank.org/projects/P144674?lang=en
2,{'$oid': '52b213b38594d8a2be17c782'},2014,November,2013-11-01T00:00:00Z,MINISTRY OF FINANCE AND ECONOMIC DEVEL,,Tuvalu!$!TV,TV,Tuvalu,Tuvalu,...,TI,IBRD,Active,Y,"{'Name': 'Regional integration', 'Percent': 46}","[{'code': '47', 'name': 'Regional integration'...",52812547,6060000,6060000,http://www.worldbank.org/projects/P145310?lang=en


Because the DataFrame has one row per project, the **`countryshortname`** column gives the country associated with each project. The pandas method **`value_counts`** counts how many times each value appears in a Series.

We apply this function to the **`countryshortname`** column to tally up how many projects each country has. 

Next we keep only the top 10 using **`head(10)`** and reset the index to push the country names into their own column. Finally, we clean up the presentation by renaming the columns and setting the index values to correspond to the ranking of each country.
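The counting and ranking steps can be sketched on a toy column. The `countries` Series below is invented, and `head(2)` stands in for `head(10)`. This sketch builds the result frame explicitly rather than calling `reset_index` and renaming, because the column names produced by `reset_index` after `value_counts` differ between pandas versions. It is not necessarily how `data_wrangling_json` does it.

```python
import pandas as pd

# Hypothetical miniature of the `countryshortname` column.
countries = pd.Series(
    ["China", "India", "China", "Nepal", "India", "China"],
    name="countryshortname",
)

# value_counts returns counts sorted in descending order;
# head(2) keeps the two most frequent values.
top = countries.value_counts().head(2)

# Move the country names into their own column and use
# a 1-based index so that it reads as a rank.
top_df = pd.DataFrame({"country": top.index, "projectcount": top.values})
top_df.index = range(1, len(top_df) + 1)
print(top_df)
```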

In [3]:
# The top 10 countries for world bank projects
data_wrangling_json.top_10_project_countries

Unnamed: 0,country,projectcount
1,China,19
2,Indonesia,19
3,Vietnam,17
4,India,16
5,"Yemen, Republic of",13
6,Morocco,12
7,Nepal,12
8,Bangladesh,12
9,Mozambique,11
10,Africa,11


## 2. Top 10 Major Project Themes
### Task: Find the top 10 major project themes (using column 'mjtheme_namecode')

Each project can have multiple themes. The data is stored in the JSON file as a list of dictionaries under each project's 'mjtheme_namecode' attribute, with one dictionary (a `code` and a `name`) per theme.

We can pull these dictionaries out into a DataFrame using **`json_normalize`**.
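A minimal sketch of that flattening step, assuming the nested structure looks like the hypothetical `projects` list below (the `id` key and its values are invented for illustration). It uses `pandas.json_normalize` with `record_path` pointing at the nested list.

```python
import pandas as pd

# Hypothetical records shaped like projects in the JSON file.
projects = [
    {
        "id": "P1",
        "mjtheme_namecode": [
            {"code": "8", "name": "Human development"},
            {"code": "11", "name": ""},
        ],
    },
    {
        "id": "P2",
        "mjtheme_namecode": [
            {"code": "1", "name": "Economic management"},
        ],
    },
]

# record_path names the nested list; every inner dictionary
# becomes one row, so a project with two themes yields two rows.
themes = pd.json_normalize(projects, record_path="mjtheme_namecode")
print(themes)
```

With the real file, the list of records would first have to be loaded, for example with `json.load`, before being passed to `json_normalize`.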

In [4]:
data_wrangling_json.df_world_bank_project_themes.head(10)

Unnamed: 0,code,name
0,8,Human development
1,11,
2,1,Economic management
3,6,Social protection and risk management
4,5,Trade and integration
5,2,Public sector governance
6,11,Environment and natural resources management
7,6,Social protection and risk management
8,7,Social dev/gender/inclusion
9,7,Social dev/gender/inclusion


Since each row contains a code representing one theme of a project, we can use **`value_counts`** again to tally up the number of times each code (or theme) is used. We also apply the same Series-to-DataFrame manipulations as in the previous question to make our results more presentable.

In [5]:
data_wrangling_json.top_10_project_themes

Unnamed: 0,code,count
1,11,250
2,10,216
3,8,210
4,2,199
5,6,168
6,4,146
7,7,130
8,5,77
9,9,50
10,1,38


However, it would be better to list the project theme titles as opposed to the codes. Codes are not very descriptive. 

Code 11 appears 250 times. Great! But what does code 11 mean?

We will come up with a way to map those codes to their respective names in the next question.

## 3. Find All Names for Project Themes
### Task: Create a dataframe with the missing names filled in.
In 2. above you will notice that some entries have only the code and the name is missing. 

To fill in the missing names we will create a dictionary that has the theme codes for keys and theme names for values.

* Subset the existing code/codename DataFrame, removing any row that has a blank name field
* Set the new DataFrame's index to the `code` column 
* Convert new DataFrame to a dictionary

Because the index is set to the theme code, the dictionary is keyed by code, and repeated codes collapse into a single entry. As long as every non-blank name for a given code is identical, it does not matter which of the duplicate rows supplies the value. Removing the blank values first ensures that the resulting dictionary won't have a blank value for one of the theme code keys.
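The three steps can be sketched on a toy frame. The `themes` data is invented, and the `drop_duplicates` call is an addition of this sketch (it makes the index unique before conversion); the original module may not use it.

```python
import pandas as pd

# Hypothetical code/name rows, including one blank name.
themes = pd.DataFrame(
    {
        "code": ["8", "11", "1", "11", "8"],
        "name": [
            "Human development",
            "",
            "Economic management",
            "Environment and natural resources management",
            "Human development",
        ],
    }
)

# 1. Keep only the rows whose name is not blank.
named = themes[themes["name"] != ""]

# Assumption of this sketch: collapse repeated code/name pairs
# so that each code appears once.
named = named.drop_duplicates()

# 2. Index by code, 3. convert to a nested dictionary of the
# form {'name': {code: name}}.
namecodes = named.set_index("code").to_dict()
print(namecodes)
```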

In [6]:
data_wrangling_json.namecodes

{'name': {'1': 'Economic management',
  '10': 'Rural development',
  '11': 'Environment and natural resources management',
  '2': 'Public sector governance',
  '3': 'Rule of law',
  '4': 'Financial and private sector development',
  '5': 'Trade and integration',
  '6': 'Social protection and risk management',
  '7': 'Social dev/gender/inclusion',
  '8': 'Human development',
  '9': 'Urban development'}}

Next, create a copy of the project theme dataframe and map the `namecodes` dictionary to the `code` column, assigning the result to the `name` column. This will fill in any blanks in the dataframe with the correct theme name.
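A minimal sketch of the mapping step, using invented data and a two-entry `namecodes` dictionary in the same nested shape as above. Note that `Series.map` needs the inner `{code: name}` dictionary, and that any code missing from the dictionary would come out as `NaN`.

```python
import pandas as pd

# Hypothetical lookup in the same nested shape as `namecodes`.
namecodes = {
    "name": {
        "8": "Human development",
        "11": "Environment and natural resources management",
    }
}

# Hypothetical theme rows with one blank name.
themes = pd.DataFrame(
    {
        "code": ["8", "11", "11"],
        "name": [
            "Human development",
            "",
            "Environment and natural resources management",
        ],
    }
)

# Work on a copy so the original frame keeps its blanks.
complete = themes.copy()

# Look up every code in the inner dictionary and overwrite `name`.
complete["name"] = complete["code"].map(namecodes["name"])
print(complete)
```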

In [7]:
data_wrangling_json.df_complete_wbp_themes.head(10)

Unnamed: 0,code,name
0,8,Human development
1,11,Environment and natural resources management
2,1,Economic management
3,6,Social protection and risk management
4,5,Trade and integration
5,2,Public sector governance
6,11,Environment and natural resources management
7,6,Social protection and risk management
8,7,Social dev/gender/inclusion
9,7,Social dev/gender/inclusion


Using this same dictionary we can put names to the codes in our answer to question 2.

In [8]:
data_wrangling_json.top_10_project_themes_with_names

Unnamed: 0,theme_name,count
1,Environment and natural resources management,250
2,Rural development,216
3,Human development,210
4,Public sector governance,199
5,Social protection and risk management,168
6,Financial and private sector development,146
7,Social dev/gender/inclusion,130
8,Trade and integration,77
9,Urban development,50
10,Economic management,38
