# Cleaning Data

Cleaning data is a big topic. Every data set has its own issues, such as missing values, duplicated entries, or data entry mistakes. In this exercise, you'll do some data cleaning on the World Bank projects and World Bank indicators data sets.

Currently, the projects data and the indicators data have different values for country names. Your task in this exercise is to clean both data sets so that they have consistent country names. This will allow you to join the two data sets together. Cleaning data, unfortunately, can be tedious and take a lot of your time as a data scientist.

Why might you want to join these data sets together? What if, for example, you wanted to run a linear regression to predict project costs from indicator data? Or you might want to compare the types of projects that get approved against the indicator data. For example, do countries with low rates of rural electrification have more rural-themed projects?

# Part 1 - Explore the Data

Run the code cells below to import the data sets. The cells that follow output the unique country names in each data set.

In [4]:
pd.set_option("display.max_columns", None)

In [1]:
import pandas as pd

# read in the population data and drop the final column
df_indicator = pd.read_csv('../data/population_data.csv', skiprows=4)
df_indicator.drop(['Unnamed: 62'], axis=1, inplace=True)

# read in the projects data set with all columns type string
df_projects = pd.read_csv('../data/projects_data.csv', dtype=str)
df_projects.drop(['Unnamed: 56'], axis=1, inplace=True)

The next code cell outputs the unique country names and ISO abbreviations in the population indicator data set. You'll notice a few values that represent world regions such as 'East Asia & Pacific' and 'East Asia & Pacific (excluding high income)'.

In [6]:
df_indicator[['Country Name', 'Country Code']].drop_duplicates()

Unnamed: 0,Country Name,Country Code
0,Aruba,ABW
1,Afghanistan,AFG
2,Angola,AGO
3,Albania,ALB
4,Andorra,AND
...,...,...
259,Kosovo,XKX
260,"Yemen, Rep.",YEM
261,South Africa,ZAF
262,Zambia,ZMB


In [9]:
set(df_indicator['Country Name'])

{'Afghanistan',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Antigua and Barbuda',
 'Arab World',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas, The',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'British Virgin Islands',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cabo Verde',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Caribbean small states',
 'Cayman Islands',
 'Central African Republic',
 'Central Europe and the Baltics',
 'Chad',
 'Channel Islands',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Congo, Dem. Rep.',
 'Congo, Rep.',
 'Costa Rica',
 "Cote d'Ivoire",
 'Croatia',
 'Cuba',
 'Curacao',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Early-demographic dividend',
 'East Asia & Pacific',
 'East Asia & Pacific (IDA & IBRD coun

Run the next code cell to see the unique country names in the projects data set. Notice that the projects data has two columns for country name. One is called 'countryname' and the other is called 'Country'. The 'Country' column is mostly NaN; where it is filled in, it holds repeated two-letter codes such as 'NG;NG;NG'.

Another thing of note: It would've been easier to join the two data sets together if the projects data had the [ISO country abbreviations](https://en.wikipedia.org/wiki/ISO_3166-1) like the indicator data has. Unfortunately, the projects data does not have the ISO country abbreviations. To join these two data sets together, you essentially have two choices:
* add a column of ISO alpha-3 codes to the projects data set
* find the differences between the country names in the projects data and in the indicator data, then clean the data so that the names match.
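
The second option starts with finding which names appear in only one of the two data sets. A minimal sketch with Python sets, using small made-up name lists rather than the real columns:

```python
# Toy name lists for illustration; the real values would come from
# the projects and indicator data frames.
projects_names = {"Kingdom of Spain", "Nepal", "World"}
indicator_names = {"Spain", "Nepal", "World"}

# Names that appear in the projects data but not in the indicator data
only_in_projects = sorted(projects_names - indicator_names)

# Names that appear in the indicator data but not in the projects data
only_in_indicator = sorted(indicator_names - projects_names)

print(only_in_projects)
print(only_in_indicator)
```

Every name in `only_in_projects` would need either a renaming rule or a code lookup before the two data sets could be joined.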

Run the code cell below to see what the project countries look like:

In [14]:
df_projects[['countryname', 'Country']].drop_duplicates()

Unnamed: 0,countryname,Country
0,World;World,
1,Democratic Republic of the Congo;Democratic Re...,
2,People's Republic of Bangladesh;People's Repub...,
3,Islamic Republic of Afghanistan;Islamic Repu...,
4,Federal Republic of Nigeria;Federal Republic o...,NG;NG;NG;NG;NG;NG
...,...,...
17917,Commonwealth of Australia;Commonwealth of Aust...,
18069,Kingdom of Belgium;Kingdom of Belgium,
18080,Kingdom of the Netherlands;Kingdom of the Neth...,
18244,Grand Duchy of Luxembourg;Grand Duchy of Luxem...,


In [11]:
set(df_projects['countryname'])

{'Africa;Africa',
 'American Samoa;American Samoa',
 'Andean Countries;Andean Countries',
 'Antigua and Barbuda;Antigua and Barbuda',
 'Arab Republic of Egypt;Arab Republic of Egypt',
 'Aral Sea;Aral Sea',
 'Argentine Republic;Argentine Republic',
 'Asia;Asia',
 'Barbados;Barbados',
 'Belize;Belize',
 'Bosnia and Herzegovina;Bosnia and Herzegovina',
 'Burkina Faso;Burkina Faso',
 'Caribbean;Caribbean',
 'Caucasus;Caucasus',
 'Central Africa;Central Africa',
 'Central African Republic;Central African Republic',
 'Central America;Central America',
 'Central Asia;Central Asia',
 'Co-operative Republic of Guyana;Co-operative Republic of Guyana',
 'Commonwealth of Australia;Commonwealth of Australia',
 'Commonwealth of Dominica;Commonwealth of Dominica',
 'Commonwealth of The Bahamas;Commonwealth of The Bahamas',
 'Czech Republic;Czech Republic',
 'Democratic Republic of Sao Tome and Prin;Democratic Republic of Sao Tome and Prin',
 'Democratic Republic of Timor-Leste;Democratic Republic of 

In [10]:
df_projects['countryname'].unique()

array(['World;World',
       'Democratic Republic of the Congo;Democratic Republic of the Congo',
       "People's Republic of Bangladesh;People's Republic of Bangladesh",
       'Islamic  Republic of Afghanistan;Islamic  Republic of Afghanistan',
       'Federal Republic of Nigeria;Federal Republic of Nigeria',
       'Republic of Tunisia;Republic of Tunisia',
       'Lebanese Republic;Lebanese Republic',
       'Democratic Socialist Republic of Sri Lan;Democratic Socialist Republic of Sri Lan',
       'Nepal;Nepal', 'Kyrgyz Republic;Kyrgyz Republic',
       'Hashemite Kingdom of Jordan;Hashemite Kingdom of Jordan',
       'Republic of the Union of Myanmar;Republic of the Union of Myanmar',
       'Arab Republic of Egypt;Arab Republic of Egypt',
       'United Republic of Tanzania;United Republic of Tanzania',
       'Federal Democratic Republic of Ethiopia;Federal Democratic Republic of Ethiopia',
       'Burkina Faso;Burkina Faso',
       'Republic of Uzbekistan;Republic of Uzbekist

# Part 2 - Use the Pycountry library

Did you notice a pattern in the projects data country names? Each name is repeated, with the two copies separated by a semicolon, like this:
```text
'Kingdom of Spain;Kingdom of Spain'
'New Zealand;New Zealand'
```

The first step is to clean the country name column so that each name appears only once, without the semicolon. Do that below:

In [28]:
### 
#
# TODO: In the df_projects dataframe, create a new column called 'Official Country Name' so that the country name only appears once. 
# For example, `Republic of Malta;Republic of Malta` should be `Republic of Malta`.
#
# HINT: use the split() method - see https://pandas.pydata.org/pandas-docs/stable/text.html for examples
# HINT: with pandas, you can do all of this with just one line of code
###

df_projects['Official Country Name'] = df_projects['countryname'].str.split(pat=';').str[0]

In [29]:
df_projects['Official Country Name']

0                                   World
1        Democratic Republic of the Congo
2         People's Republic of Bangladesh
3        Islamic  Republic of Afghanistan
4             Federal Republic of Nigeria
                       ...               
18243                   Republic of Chile
18244           Grand Duchy of Luxembourg
18245                  Kingdom of Denmark
18246          Kingdom of the Netherlands
18247                     French Republic
Name: Official Country Name, Length: 18248, dtype: object
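
Keeping only the first part is safe only if every part of an entry is the same name. A quick sanity check, sketched here on a toy Series (the variable names are illustrative, not part of the exercise):

```python
import pandas as pd

# Toy stand-in for df_projects['countryname']
countryname = pd.Series([
    "Kingdom of Spain;Kingdom of Spain",
    "New Zealand;New Zealand",
    "World;World",
])

# Split each entry into one column per part
parts = countryname.str.split(";", expand=True)

# A row is a pure repeat if it contains exactly one distinct value
is_repeat = parts.nunique(axis=1) == 1

print(is_repeat.all())
```

If `is_repeat.all()` were False on the real column, the rows where `is_repeat` is False would be worth inspecting before discarding everything after the first semicolon.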

It looks like the projects data set has official country names. Hence, this data set has an entry like "Kingdom of Spain" whereas the indicators data has just "Spain".

Luckily, someone has developed a Python library called [pycountry](https://pypi.org/project/pycountry/). This library has country names, ISO abbreviations, and official country names. While you might not be able to clean all of the data with the help of this Python library, it will probably help. 

Run the code cells below to install the pycountry library and see how it works.

In [30]:
# Run this code cell to install and import the pycountry library
!pip install pycountry
from pycountry import countries

Collecting pycountry
  Downloading https://files.pythonhosted.org/packages/16/b6/154fe93072051d8ce7bf197690957b6d0ac9a21d51c9a1d05bd7c6fdb16f/pycountry-19.8.18.tar.gz (10.0MB)
Building wheels for collected packages: pycountry
  Building wheel for pycountry (setup.py): started
  Building wheel for pycountry (setup.py): finished with status 'done'
  Created wheel for pycountry: filename=pycountry-19.8.18-py2.py3-none-any.whl size=10627366 sha256=905770478e054337b43828733e4173a94ef00803d7bc730d8bf76f70c86a9509
  Stored in directory: C:\Users\DJ\AppData\Local\pip\Cache\wheels\a2\98\bf\f0fa1c6bf8cf2cbdb750d583f84be51c2cd8272460b8b36bd3
Successfully built pycountry
Installing collected packages: pycountry
Successfully installed pycountry-19.8.18


In [31]:
# Run this code cell to see an example of how the library works
countries.get(name='Spain')

Country(alpha_2='ES', alpha_3='ESP', name='Spain', numeric='724', official_name='Kingdom of Spain')

In [32]:
# Run this code cell to see how you can also look up countries without specifying the key
countries.lookup('Kingdom of Spain')

Country(alpha_2='ES', alpha_3='ESP', name='Spain', numeric='724', official_name='Kingdom of Spain')

Using the pycountry library, your task is to clean the projects data so that the country names match the indicators data country names. Iterate through the unique countries in df_projects['Official Country Name'].

1. **Create a dictionary mapping the 'Official Country Name' to the alpha_3 ISO abbreviations.**

   The dictionary should look like:
   `{'Kingdom of Spain':'ESP'}`

2. If a country name cannot be found in the pycountry library, add it to a list called `country_not_found`.
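
The two steps above can be sketched as a small helper. To keep the sketch runnable without pycountry, it takes the lookup function as an argument; `build_code_mapping` and `toy_lookup` are hypothetical names used only for illustration.

```python
def build_code_mapping(names, lookup):
    """Map each name to a code; collect names the lookup cannot resolve."""
    mapping = {}
    not_found = []
    for name in names:
        try:
            mapping[name] = lookup(name)
        except LookupError:
            not_found.append(name)
    return mapping, not_found


# Hypothetical stand-in for a real country database
_TOY_CODES = {"Kingdom of Spain": "ESP", "French Republic": "FRA"}


def toy_lookup(name):
    # A missing key raises KeyError, which is a subclass of LookupError
    return _TOY_CODES[name]


mapping, not_found = build_code_mapping(
    ["Kingdom of Spain", "Africa", "French Republic"], toy_lookup
)
print(mapping)
print(not_found)
```

With pycountry installed, a lookup such as `lambda name: countries.lookup(name).alpha_3` could be passed in place of `toy_lookup`.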

In [35]:
df_projects['Official Country Name'].head(3)

0                               World
1    Democratic Republic of the Congo
2     People's Republic of Bangladesh
Name: Official Country Name, dtype: object

In [33]:
df_indicator.head(3)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Aruba,ABW,"Population, total",SP.POP.TOTL,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,57715.0,58055.0,58386.0,58726.0,59063.0,59440.0,59840.0,60243.0,60528.0,60657.0,60586.0,60366.0,60103.0,59980.0,60096.0,60567.0,61345.0,62201.0,62836.0,63026.0,62644.0,61833.0,61079.0,61032.0,62149.0,64622.0,68235.0,72504.0,76700.0,80324.0,83200.0,85451.0,87277.0,89005.0,90853.0,92898.0,94992.0,97017.0,98737.0,100031.0,100832.0,101220.0,101353.0,101453.0,101669.0,102053.0,102577.0,103187.0,103795.0,104341.0,104822.0,105264.0
1,Afghanistan,AFG,"Population, total",SP.POP.TOTL,8996351.0,9166764.0,9345868.0,9533954.0,9731361.0,9938414.0,10152331.0,10372630.0,10604346.0,10854428.0,11126123.0,11417825.0,11721940.0,12027822.0,12321541.0,12590286.0,12840299.0,13067538.0,13237734.0,13306695.0,13248370.0,13053954.0,12749645.0,12389269.0,12047115.0,11783050.0,11601041.0,11502761.0,11540888.0,11777609.0,12249114.0,12993657.0,13981231.0,15095099.0,16172719.0,17099541.0,17822884.0,18381605.0,18863999.0,19403676.0,20093756.0,20966463.0,21979923.0,23064851.0,24118979.0,25070798.0,25893450.0,26616792.0,27294031.0,28004331.0,28803167.0,29708599.0,30696958.0,31731688.0,32758020.0,33736494.0,34656032.0,35530081.0
2,Angola,AGO,"Population, total",SP.POP.TOTL,5643182.0,5753024.0,5866061.0,5980417.0,6093321.0,6203299.0,6309770.0,6414995.0,6523791.0,6642632.0,6776381.0,6927269.0,7094834.0,7277960.0,7474338.0,7682479.0,7900997.0,8130988.0,8376147.0,8641521.0,8929900.0,9244507.0,9582156.0,9931562.0,10277321.0,10609042.0,10921037.0,11218268.0,11513968.0,11827237.0,12171441.0,12553446.0,12968345.0,13403734.0,13841301.0,14268994.0,14682284.0,15088981.0,15504318.0,15949766.0,16440924.0,16983266.0,17572649.0,18203369.0,18865716.0,19552542.0,20262399.0,20997687.0,21759420.0,22549547.0,23369131.0,24218565.0,25096150.0,25998340.0,26920466.0,27859305.0,28813463.0,29784193.0


In [68]:
country_not_found = [] # stores countries not found in the pycountry library

# set up the libraries and variables
from collections import defaultdict
project_country_abbrev_dict = defaultdict(str) # missing keys default to the empty string ''

# TODO: iterate through the country names in df_projects. 
# Create a dictionary mapping the country name to the alpha_3 ISO code

for country in df_projects['Official Country Name'].drop_duplicates().sort_values():
    try: 
        # TODO: look up the country name in the pycountry library
        # store the country name as the dictionary key and the ISO-3 code as the value
        project_country_abbrev_dict[country] = countries.lookup(country).alpha_3
        # Note: use bracket indexing here; attribute access such as
        # project_country_abbrev_dict.country does not work on a dictionary

    except LookupError:
        # countries.lookup raises LookupError when it cannot find a match.
        # Print out the country name and store it in the country_not_found list
        print(country, ' not found')
        country_not_found.append(country)

Africa  not found
Andean Countries  not found
Aral Sea  not found
Asia  not found
Caribbean  not found
Caucasus  not found
Central Africa  not found
Central America  not found
Central Asia  not found
Co-operative Republic of Guyana  not found
Commonwealth of Australia  not found
Democratic Republic of Sao Tome and Prin  not found
Democratic Republic of the Congo  not found
Democratic Socialist Republic of Sri Lan  not found
EU Accession Countries  not found
East Asia and Pacific  not found
Eastern Africa  not found
Europe and Central Asia  not found
Islamic  Republic of Afghanistan  not found
Kingdom of Swaziland  not found
Latin America  not found
Macedonia  not found
Mekong  not found
Mercosur  not found
Middle East and North Africa  not found
Multi-Regional  not found
Organization of Eastern Caribbean States  not found
Oriental Republic of Uruguay  not found
Pacific Islands  not found
Red Sea and Gulf of Aden  not found
Republic of Congo  not found
Republic of Cote d'Ivoire  not fou

In [69]:
len(country_not_found)

55

In [70]:
project_country_abbrev_dict

defaultdict(str,
            {'American Samoa': 'ASM',
             'Antigua and Barbuda': 'ATG',
             'Arab Republic of Egypt': 'EGY',
             'Argentine Republic': 'ARG',
             'Barbados': 'BRB',
             'Belize': 'BLZ',
             'Bosnia and Herzegovina': 'BIH',
             'Burkina Faso': 'BFA',
             'Central African Republic': 'CAF',
             'Commonwealth of Dominica': 'DMA',
             'Commonwealth of The Bahamas': 'BHS',
             'Czech Republic': 'CZE',
             'Democratic Republic of Timor-Leste': 'TLS',
             'Dominican Republic': 'DOM',
             'Federal Democratic Republic of Ethiopia': 'ETH',
             'Federal Republic of Nigeria': 'NGA',
             'Federated States of Micronesia': 'FSM',
             'Federative Republic of Brazil': 'BRA',
             'French Republic': 'FRA',
             'Gabonese Republic': 'GAB',
             'Georgia': 'GEO',
             'Grand Duchy of Luxembourg': 'LUX',
    

In [71]:
len(project_country_abbrev_dict)

151

Quite a few country names were not in the pycountry library. Some of these are regions like "South Asia" or "Southern Africa", so it makes sense that these would not show up in the pycountry library.

# Part 3 - Making a Manual Mapping

Perhaps some of these missing df_projects countries are already in the indicators data set. In the next cell, check if any of the countries in the country_not_found list are in the indicator list of countries.

In [73]:
# Run this code cell to build a table of the unique country names and codes
# in the df_indicator data set, sorted by country name
indicator_countries = df_indicator[['Country Name', 'Country Code']].drop_duplicates().sort_values(by='Country Name')

indicator_countries

Unnamed: 0,Country Name,Country Code
1,Afghanistan,AFG
3,Albania,ALB
58,Algeria,DZA
9,American Samoa,ASM
4,Andorra,AND
...,...,...
194,West Bank and Gaza,PSE
257,World,WLD
260,"Yemen, Rep.",YEM
262,Zambia,ZMB


In [74]:
for country in country_not_found:
    if country in indicator_countries['Country Name'].tolist():
        print(country)

South Asia
St. Kitts and Nevis
St. Lucia
St. Vincent and the Grenadines
West Bank and Gaza
World


Unfortunately, there aren't too many country names that match between df_indicator and df_projects. This is where data cleaning becomes especially tedious, but in this case, we've done a lot of the work for you.

We've manually created a dictionary that maps the countries in country_not_found to their ISO alpha-3 codes. You **could** try to do this programmatically using some sophisticated string matching rules. That might be worth your time for a larger data set. But in this case, it's probably faster to type out the dictionary.
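
For a sense of what programmatic matching might look like, here is a minimal sketch using `difflib` from the standard library. It is not the approach taken in this exercise; `suggest_matches`, the toy name list, and the 0.8 cutoff are all assumptions for illustration.

```python
import difflib


def suggest_matches(name, candidates, cutoff=0.8):
    """Return up to three candidates whose spelling is close to `name`."""
    cleaned = " ".join(name.split())  # collapse repeated whitespace
    return difflib.get_close_matches(cleaned, candidates, n=3, cutoff=cutoff)


# Toy list of indicator-style country names
indicator_names = ["Afghanistan", "Albania", "Spain", "St. Lucia"]

print(suggest_matches("Afganistan", indicator_names))
print(suggest_matches("Mekong", indicator_names))
```

Spelling similarity catches typos, but it would not link "Kingdom of Spain" to "Spain" reliably, which is part of why a hand-written dictionary is the pragmatic choice here.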

In [75]:
# run this code cell to load the dictionary

country_not_found_mapping = {
    'Co-operative Republic of Guyana': 'GUY',
    'Commonwealth of Australia': 'AUS',
    'Democratic Republic of Sao Tome and Prin': 'STP',
    'Democratic Republic of the Congo': 'COD',
    'Democratic Socialist Republic of Sri Lan': 'LKA',
    'East Asia and Pacific': 'EAS',
    'Europe and Central Asia': 'ECS',
    'Islamic  Republic of Afghanistan': 'AFG',
    'Latin America': 'LCN',
    'Caribbean': 'LCN',
    'Macedonia': 'MKD',
    'Middle East and North Africa': 'MEA',
    'Oriental Republic of Uruguay': 'URY',
    'Republic of Congo': 'COG',
    "Republic of Cote d'Ivoire": 'CIV',
    'Republic of Korea': 'KOR',
    'Republic of Niger': 'NER',
    'Republic of Kosovo': 'XKX',
    'Republic of Rwanda': 'RWA',
    'Republic of The Gambia': 'GMB',
    'Republic of Togo': 'TGO',
    'Republic of the Union of Myanmar': 'MMR',
    'Republica Bolivariana de Venezuela': 'VEN',
    'Sint Maarten': 'SXM',
    "Socialist People's Libyan Arab Jamahiriy": 'LBY',
    'Socialist Republic of Vietnam': 'VNM',
    'Somali Democratic Republic': 'SOM',
    'South Asia': 'SAS',
    'St. Kitts and Nevis': 'KNA',
    'St. Lucia': 'LCA',
    'St. Vincent and the Grenadines': 'VCT',
    'State of Eritrea': 'ERI',
    'The Independent State of Papua New Guine': 'PNG',
    'West Bank and Gaza': 'PSE',
    'World': 'WLD',
}

In [76]:
len(country_not_found_mapping)

35

Next, update the project_country_abbrev_dict variable with these new values.

In [77]:
# TODO: Update the project_country_abbrev_dict with the country_not_found_mapping dictionary
# HINT: This is relatively straightforward. Python dictionaries have a method called update(), which merges
# another dictionary into this one, overwriting the values of any keys that already exist

project_country_abbrev_dict.update(country_not_found_mapping)

In [78]:
len(project_country_abbrev_dict)

185
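
Because `update()` overwrites existing keys rather than duplicating them, the merged length can be smaller than the sum of the two lengths. A small sketch with made-up entries:

```python
from collections import defaultdict

# Toy dictionaries for illustration; 'Macedonia' appears in both
codes = defaultdict(str, {"Kingdom of Spain": "ESP", "Macedonia": ""})
manual = {"Macedonia": "MKD", "World": "WLD"}

codes.update(manual)

print(len(codes))          # one entry per distinct key
print(codes["Macedonia"])  # the value from `manual` wins
```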

In [79]:
project_country_abbrev_dict

defaultdict(str,
            {'American Samoa': 'ASM',
             'Antigua and Barbuda': 'ATG',
             'Arab Republic of Egypt': 'EGY',
             'Argentine Republic': 'ARG',
             'Barbados': 'BRB',
             'Belize': 'BLZ',
             'Bosnia and Herzegovina': 'BIH',
             'Burkina Faso': 'BFA',
             'Central African Republic': 'CAF',
             'Commonwealth of Dominica': 'DMA',
             'Commonwealth of The Bahamas': 'BHS',
             'Czech Republic': 'CZE',
             'Democratic Republic of Timor-Leste': 'TLS',
             'Dominican Republic': 'DOM',
             'Federal Democratic Republic of Ethiopia': 'ETH',
             'Federal Republic of Nigeria': 'NGA',
             'Federated States of Micronesia': 'FSM',
             'Federative Republic of Brazil': 'BRA',
             'French Republic': 'FRA',
             'Gabonese Republic': 'GAB',
             'Georgia': 'GEO',
             'Grand Duchy of Luxembourg': 'LUX',
    

In [80]:
project_country_abbrev_dict['Georgia']

'GEO'

# Part 4 - Make a 'Country Code' Column

Next, create a 'Country Code' column in the df_projects data frame. Use the project_country_abbrev_dict and the df_projects['Official Country Name'] column to create a new column called 'Country Code'.

In [81]:
# TODO: Use the project_country_abbrev_dict and the df_projects['Official Country Name'] column to make a new column
# of the alpha-3 country codes. This new column should be called 'Country Code'.

# HINT: Use the apply method and a lambda function
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

###
df_projects['Country Code'] = df_projects['Official Country Name'].apply(lambda x: project_country_abbrev_dict[x])

In [87]:
# Run this code cell to see which projects in the df_projects data frame still have no country code abbreviation.
# In other words, these projects do not have a matching population value in the df_indicator data frame.
set(df_projects[df_projects['Country Code'] == '']['Official Country Name'].tolist())

{'Africa',
 'Andean Countries',
 'Aral Sea',
 'Asia',
 'Caucasus',
 'Central Africa',
 'Central America',
 'Central Asia',
 'EU Accession Countries',
 'Eastern Africa',
 'Kingdom of Swaziland',
 'Mekong',
 'Mercosur',
 'Multi-Regional',
 'Organization of Eastern Caribbean States',
 'Pacific Islands',
 'Red Sea and Gulf of Aden',
 'Socialist Federal Republic of Yugoslavia',
 'Southern Africa',
 'Western Africa',
 'Western Balkans'}
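
As an alternative to `apply` with a lambda, `Series.map` accepts a dictionary directly. The sketch below uses a plain `dict` and toy data, so unmatched names come back as NaN instead of the empty string that the `defaultdict` above produces:

```python
import pandas as pd

# Toy stand-ins for the 'Official Country Name' column and the code dictionary
names = pd.Series(["Kingdom of Spain", "Africa", "French Republic"])
code_by_name = {"Kingdom of Spain": "ESP", "French Republic": "FRA"}

# Names missing from a plain dict map to NaN
codes = names.map(code_by_name)

# Select the names that did not receive a code
unmatched = names[codes.isna()]

print(codes.tolist())
print(unmatched.tolist())
```

Using NaN for missing codes lets `isna()` find the gaps, at the cost of differing from the `== ''` filter used in this notebook.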

In [89]:
df_projects[df_projects['Country Code'] == '']

Unnamed: 0,id,regionname,countryname,prodline,lendinginstr,lendinginstrtype,envassesmentcategorycode,supplementprojectflg,productlinetype,projectstatusdisplay,status,project_name,boardapprovaldate,board_approval_month,closingdate,lendprojectcost,ibrdcommamt,idacommamt,totalamt,grantamt,borrower,impagency,url,projectdoc,majorsector_percent,sector1,sector2,sector3,sector4,sector5,sector,mjsector1,mjsector2,mjsector3,mjsector4,mjsector5,mjsector,theme1,theme2,theme3,theme4,theme5,theme,goal,financier,mjtheme1name,mjtheme2name,mjtheme3name,mjtheme4name,mjtheme5name,location,GeoLocID,GeoLocName,Latitude,Longitude,Country,Official Country Name,Country Code
31,P166648,Africa,Central Africa;Central Africa,RE,Investment Project Financing,IN,B,N,L,Active,Active,Strengthening DRM Capacity in ECCAS,2018-06-22T00:00:00Z,June,,1270000,0,0,0,1270000,,,http://projects.worldbank.org/P166648?lang=en,,,!$!0,,,,,,,,,,,,!$!0,,,,,,,,,,,,,,,,,,,Central Africa,
39,P163752,Africa,Africa;Africa,PE,Investment Project Financing,IN,A,N,L,Active,Active,AFCC2/RI-3A Tanzania-Zambia Transmission Inter...,2018-06-18T00:00:00Z,June,2024-06-28T00:00:00Z,605000000,0,465000000,465000000,0,,,http://projects.worldbank.org/P163752?lang=en,,,!$!0,,,,,,,,,,,,!$!0,,,,,,,,,,,,,,,,,,,Africa,
58,P164728,Africa,Africa;Africa,PE,Investment Project Financing,IN,,Y,L,Active,Active,Africa Region - Improved Investment Climate wi...,2018-06-08T00:00:00Z,June,,15000000,0,15000000,15000000,0,,,http://projects.worldbank.org/P164728?lang=en,,,!$!0,,,,,,,,,,,,!$!0,,,,,,,,,,,,,,,,,,,Africa,
69,P161329,Africa,Western Africa;Western Africa,PE,Investment Project Financing,IN,B,N,L,Active,Active,West Africa Unique Identification for Regional...,2018-06-05T00:00:00Z,June,2024-07-03T00:00:00Z,122100000,0,122100000,122100000,0,,,http://projects.worldbank.org/P161329?lang=en,,,!$!0,,,,,,,,,,,,!$!0,,,,,,,,,,,,,,,,,,,Western Africa,
103,P164468,East Asia and Pacific,Pacific Islands;Pacific Islands,PE,Investment Project Financing,IN,,Y,L,Active,Active,Pacific Aviation Safety Office Reform Project ...,2018-05-23T00:00:00Z,May,,3550000,0,3550000,3550000,0,,,http://projects.worldbank.org/P164468?lang=en,,,Public Administration - Transportation!$!75!$!TF,Aviation!$!14!$!TV,ICT Infrastructure!$!11!$!CI,,,Public Administration - Transportation;Public ...,,,,,,Transportation;Transportation;Transportation;I...,!$!0,,,,,,,,,,,,,0002134431!$!Republic of Vanuatu!$!-16!$!167!$...,0002134431;0002135171,Republic of Vanuatu;Port-Vila,-16;-17.73381,167;168.32188,VU;VU,Pacific Islands,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17963,P009136,Europe and Central Asia,Socialist Federal Republic of Yugoslavia;Socia...,PE,Specific Investment Loan,IN,,N,L,Closed,Closed,Electric Power Project,1961-02-23T00:00:00Z,February,1968-01-31T00:00:00Z,30000000,30000000,0,30000000,0,,,http://projects.worldbank.org/P009136/electric...,,,(Historic)Hydro!$!100!$!PH,,,,,(Historic)Hydro;(Historic)Hydro,,,,,,(Historic)Electric Power & Other Energy;(Histo...,!$!0,,,,,,,,,,,,,,,,,,,Socialist Federal Republic of Yugoslavia,
18134,P000620,Africa,Africa;Africa,PE,Specific Investment Loan,IN,C,N,L,Closed,Closed,Railways and Harbours Project,1955-03-15T00:00:00Z,March,1957-06-30T00:00:00Z,24000000,24000000,0,24000000,0,,,http://projects.worldbank.org/P000620/railways...,,,Railways!$!100!$!TW,,,,,Railways;Railways,,,,,,Transportation;Transportation,!$!0,,,,,,,,,,,,,,,,,,,Africa,
18175,P009135,Europe and Central Asia,Socialist Federal Republic of Yugoslavia;Socia...,PE,Structural Adjustment Loan,AD,,N,L,Closed,Closed,Power Mining Industry Project,1953-02-11T00:00:00Z,February,1957-12-31T00:00:00Z,30000000,30000000,0,30000000,0,,,http://projects.worldbank.org/P009135/power-mi...,,,(Historic)Economic management!$!100!$!ME,,,,,(Historic)Economic management;(Historic)Econom...,,,,,,(Historic)Multisector;(Historic)Multisector,!$!0,,,,,,,,,,,,,,,,,,,Socialist Federal Republic of Yugoslavia,
18197,P009134,Europe and Central Asia,Socialist Federal Republic of Yugoslavia;Socia...,PE,Structural Adjustment Loan,AD,,N,L,Closed,Closed,Power Mining Industry Project,1951-10-11T00:00:00Z,October,1957-12-31T00:00:00Z,28000000,28000000,0,28000000,0,,,http://projects.worldbank.org/P009134/power-mi...,,,(Historic)Economic management!$!100!$!ME,,,,,(Historic)Economic management;(Historic)Econom...,,,,,,(Historic)Multisector;(Historic)Multisector,!$!0,,,,,,,,,,,,,,,,,,,Socialist Federal Republic of Yugoslavia,


In [105]:
df_projects.iloc[0:5, -2:]

Unnamed: 0,Official Country Name,Country Code
0,World,WLD
1,Democratic Republic of the Congo,COD
2,People's Republic of Bangladesh,BGD
3,Islamic Republic of Afghanistan,AFG
4,Federal Republic of Nigeria,NGA


You'll notice that there are still a few entries without country abbreviations. This includes projects that were labeled as "Africa" rather than a specific country. It also includes "Yugoslavia", which is a country that ceased to exist in the 1990s.

# Conclusion

Now the df_projects dataframe and the df_indicator dataframe have a matching column called 'Country Code'. 

But these two data frames *can't be merged quite yet*. Each project in the df_projects dataframe also has a **date** associated with it. 

The idea would be to merge the df_projects dataframe with the df_indicator dataframe so that each project also had a **population value** associated with it. There are still more data transformations to do in order for that to be possible. 

In fact, the **challenge** problem from the previous exercise on merging data would help quite a bit. In that exercise, the indicator data was transformed from a [wide format to a long format](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html).

You could then **merge the df_projects dataframe and the df_indicator dataframe using the alpha-3 country abbreviation and the project or indicator year**. You can start to see how data transformations become a series of processes that pipeline data from one format into a different format.
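
A minimal sketch of that final merge, assuming toy data frames with made-up population values. The column names mirror the ones in this exercise, but `indicator_wide`, `projects`, and the `year` column are illustrative.

```python
import pandas as pd

# Toy indicator data in wide format: one column per year (values are made up)
indicator_wide = pd.DataFrame({
    "Country Name": ["Spain", "Nigeria"],
    "Country Code": ["ESP", "NGA"],
    "2016": [100.0, 200.0],
    "2017": [110.0, 210.0],
})

# Toy projects data with a country code and an approval date
projects = pd.DataFrame({
    "id": ["P1", "P2"],
    "Country Code": ["NGA", "ESP"],
    "boardapprovaldate": ["2017-06-22T00:00:00Z", "2016-03-01T00:00:00Z"],
})

# Wide to long: one row per country and year
indicator_long = indicator_wide.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="year",
    value_name="population",
)

# The first four characters of the date string are the year
projects["year"] = projects["boardapprovaldate"].str[:4]

# Attach the population for each project's country and approval year
merged = projects.merge(indicator_long, on=["Country Code", "year"], how="left")
print(merged[["id", "Country Code", "year", "population"]])
```

A left merge keeps every project, so projects without a country code or with a year outside the indicator data would end up with NaN in the population column.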
