*(Note: this is file 2 of 3 submitted for evaluation)*

## Investigate Datasets for Inflation, Life Expectancy at Birth, and Forest Coverage - Part 2 ##

## Data Wrangling (continued) - Comparing cleaned datasets

I have three cleaned datasets stored in the following files:
1. forest_cleaned.csv
2. inflation_cleaned.csv
3. life_expectancy_cleaned.csv

Here I'll compare the three datasets and see if there are any changes required for analysis.

In [1]:
import pandas as pd
import numpy as np

df_forest = pd.read_csv('forest_cleaned.csv')
df_inflation = pd.read_csv('inflation_cleaned.csv')
df_life = pd.read_csv('life_expectancy_cleaned.csv')

In [2]:
df_forest = df_forest.set_index('country')
df_inflation = df_inflation.set_index('country')
df_life = df_life.set_index('country')

## Comparing inflation dataset with life expectancy dataset ##

I need to make sure that all the datasets cover the same countries and the same time period, so I can compare them. Here I noticed the following issues.
1. The inflation dataset has data between 1961 and 2011, while the life expectancy dataset has data points from 1800 onwards. The time period needs to be consistent across both datasets.
2. The inflation dataset has data for 204 countries, while the life expectancy dataset has data for only 201. We need to check which countries are missing from either dataset.
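
Both issues can be checked programmatically before trimming anything. The following is a minimal sketch on toy frames, not the actual data: `inflation_demo`, `life_demo`, and the helper `summarize` are hypothetical names, and it assumes every column label is a year stored as a string.

```python
import pandas as pd

# Toy stand-ins for the real frames (values are made up for illustration).
inflation_demo = pd.DataFrame(
    {"1961": [1.2, 3.4], "1962": [1.5, 3.1]},
    index=pd.Index(["Albania", "Algeria"], name="country"),
)
life_demo = pd.DataFrame(
    {"1800": [35.4, 28.8], "1961": [63.9, 48.0], "1962": [64.8, 48.6]},
    index=pd.Index(["Albania", "Angola"], name="country"),
)


def summarize(df):
    """Return (first year, last year, number of countries) for a wide frame."""
    years = df.columns.astype(int)  # assumes all column labels are year strings
    return int(years.min()), int(years.max()), len(df.index)


inflation_summary = summarize(inflation_demo)
life_summary = summarize(life_demo)
print(inflation_summary)
print(life_summary)
```

Comparing the two tuples side by side shows at a glance whether the year ranges or the country counts differ.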

In [3]:
df_inflation.shape

(204, 51)

In [4]:
df_life.shape

(201, 217)

In [5]:
# Get the location of the column for year 1960
df_life.columns.get_loc('1960')

160

In [6]:
# Keep only the columns for 1961 through 2011 (positions 161 to 211) to match the inflation dataset
df_life = df_life.iloc[:,161:212]
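
As an alternative to positional slicing, the same trim could be expressed with column labels, which avoids working out positions by hand. This is a small sketch, assuming the year columns are unique strings as in the cell above; `life_demo` is a made-up one-row frame, not the real data.

```python
import pandas as pd

# Toy frame with one column per year from 1958 to 2014 (values are placeholders).
years = [str(y) for y in range(1958, 2015)]
life_demo = pd.DataFrame(
    [[float(i) for i in range(len(years))]],
    columns=years,
    index=pd.Index(["Albania"], name="country"),
)

# Label-based slice: unlike iloc, both endpoints are included.
trimmed = life_demo.loc[:, "1961":"2011"]

# Equivalent positional slice, with positions looked up rather than hard-coded.
start = life_demo.columns.get_loc("1961")
stop = life_demo.columns.get_loc("2011") + 1  # iloc excludes the stop position
by_position = life_demo.iloc[:, start:stop]

print(trimmed.shape)
```

Either form keeps 51 yearly columns; the label-based one keeps working if columns are added or removed before 1961.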

In [7]:
df_life.head()

country,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,...,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011
Afghanistan,32.47,33.01,33.53,34.07,34.6,35.13,35.66,36.17,36.69,37.2,...,51.0,51.4,51.8,52.0,52.1,52.4,52.8,53.3,53.6,54.0
Albania,63.92,64.84,65.6,66.18,66.59,66.88,67.11,67.32,67.55,67.83,...,75.5,75.7,75.9,76.2,76.4,76.6,76.8,77.0,77.2,77.4
Algeria,48.02,48.55,49.07,49.58,50.09,50.58,51.05,51.49,51.95,52.41,...,73.8,73.9,74.4,74.8,75.0,75.3,75.5,75.7,76.0,76.1
Angola,36.53,37.08,37.63,38.18,38.74,39.28,39.84,40.39,40.95,41.5,...,53.3,53.9,54.5,55.2,55.7,56.2,56.7,57.1,57.6,58.1
Antigua and Barbuda,63.46,63.93,64.38,64.81,65.23,65.63,66.03,66.41,66.81,67.19,...,74.3,74.5,74.6,74.9,74.9,75.3,75.5,75.7,75.8,75.9


### Removing unmatched data points in the datasets: Inflation and Life Expectancy ###

Both the 'life' and 'inflation' datasets now have the same number of columns. However, one dataset may still contain countries that the other does not. Let's find the unmatched countries and remove them.

In [8]:
# countries unique to inflation dataset
df_inflation.index.difference(df_life.index)

Index(['Andorra', 'Bermuda', 'Cayman Islands', 'Channel Islands', 'Dominica',
       'Isle of Man', 'Kosovo', 'Liechtenstein', 'Marshall Islands', 'Monaco',
       'Palau', 'San Marino', 'St. Kitts and Nevis', 'Tuvalu'],
      dtype='object', name='country')

In [9]:
# countries unique to life dataset
df_life.index.difference(df_inflation.index)

Index(['French Guiana', 'Guadeloupe', 'Guam', 'Martinique', 'Mayotte',
       'Netherlands Antilles', 'North Korea', 'Reunion', 'South Sudan',
       'Taiwan', 'Western Sahara'],
      dtype='object', name='country')

In [10]:
# Drop the countries that appear only in the inflation dataset
df_inflation.drop(['Andorra', 'Bermuda', 'Cayman Islands', 'Channel Islands', 'Dominica',
       'Isle of Man', 'Kosovo', 'Liechtenstein', 'Marshall Islands', 'Monaco',
       'Palau', 'San Marino', 'St. Kitts and Nevis', 'Tuvalu'], inplace=True)

In [11]:
df_inflation.shape

(190, 51)

In [12]:
df_life.drop(['French Guiana', 'Guadeloupe', 'Guam', 'Martinique', 'Mayotte',
       'Netherlands Antilles', 'North Korea', 'Reunion', 'South Sudan',
       'Taiwan', 'Western Sahara'], inplace=True)
df_life.shape

(190, 51)
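
The same alignment could also be done without typing out the country names, by keeping only the index labels that the two frames share. A sketch on toy data (the frames and values here are hypothetical):

```python
import pandas as pd

# Toy frames whose indexes overlap only partially.
inflation_demo = pd.DataFrame(
    {"1961": [1.0, 2.0, 3.0]},
    index=pd.Index(["Albania", "Andorra", "Angola"], name="country"),
)
life_demo = pd.DataFrame(
    {"1961": [63.9, 36.5, 60.0]},
    index=pd.Index(["Albania", "Angola", "Taiwan"], name="country"),
)

# Countries present in both frames.
common = inflation_demo.index.intersection(life_demo.index)

# Selecting by the shared labels drops everything else from each frame.
inflation_aligned = inflation_demo.loc[common]
life_aligned = life_demo.loc[common]

print(list(common))
```

This gives the same result as dropping each `difference` explicitly, and it does not need to be edited if the input files change.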

We now have data for the same time period and the same countries in both of these datasets. Now I'll move on to the third dataset.

## Comparing forest dataset with the cleaned inflation and life expectancy datasets ##

Let's check the forest dataframe and compare it with either of the two cleaned datasets. I'll remove any unmatched countries present in the forest dataset.

In [13]:
df_forest.index.difference(df_inflation.index)

Index(['American Samoa', 'Andorra', 'Anguilla', 'Bermuda',
       'British Indian Ocean Territory', 'British Virgin Islands',
       'Cayman Islands', 'Central African Rep.', 'Channel Islands',
       'Cook Islands', 'Czech Rep.', 'Dominica', 'Dominican Rep.',
       'French Guiana', 'Guadeloupe', 'Guam', 'Isle of Man',
       'Korea, Dem. Rep.', 'Korea, Rep.', 'Kyrgyzstan', 'Laos',
       'Liechtenstein', 'Martinique', 'Mayotte', 'Montserrat',
       'Netherlands Antilles', 'Niue', 'Northern Mariana Islands', 'Palau',
       'Pitcairn', 'Reunion', 'Saint Helena', 'Saint Kitts and Nevis',
       'Saint Lucia', 'Saint Vincent and the Grenadines',
       'Saint-Pierre-et-Miquelon', 'Serbia and Montenegro',
       'Turks and Caicos Islands', 'Tuvalu', 'Wallis et Futuna',
       'Western Sahara', 'Yemen, Rep.'],
      dtype='object', name='country')

In [14]:
df_forest.drop(['American Samoa', 'Andorra', 'Anguilla', 'Bermuda',
       'British Indian Ocean Territory', 'British Virgin Islands',
       'Cayman Islands', 'Central African Rep.', 'Channel Islands',
       'Cook Islands', 'Czech Rep.', 'Dominica', 'Dominican Rep.',
       'French Guiana', 'Guadeloupe', 'Guam', 'Isle of Man',
       'Korea, Dem. Rep.', 'Korea, Rep.', 'Kyrgyzstan', 'Laos',
       'Liechtenstein', 'Martinique', 'Mayotte', 'Montserrat',
       'Netherlands Antilles', 'Niue', 'Northern Mariana Islands', 'Palau',
       'Pitcairn', 'Reunion', 'Saint Helena', 'Saint Kitts and Nevis',
       'Saint Lucia', 'Saint Vincent and the Grenadines',
       'Saint-Pierre-et-Miquelon', 'Serbia and Montenegro',
       'Turks and Caicos Islands', 'Tuvalu', 'Wallis et Futuna',
       'Western Sahara', 'Yemen, Rep.'], inplace=True)
df_forest.shape

(172, 3)
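
Several names in the difference above, such as 'Czech Rep.' and 'Korea, Rep.', look like spelling variants rather than countries that are truly absent from the other datasets, so dropping them may discard usable rows. One possible way to handle this, sketched on toy data, is to rename such labels before computing the difference. The mapping, the `forest_pct` column, and the reference spellings below are assumptions for illustration; the real spellings would need to be checked against `df_inflation.index`.

```python
import pandas as pd

# Toy forest frame using abbreviated country names (values are made up).
forest_demo = pd.DataFrame(
    {"forest_pct": [34.0, 22.5, 63.0]},
    index=pd.Index(["Czech Rep.", "Albania", "Korea, Rep."], name="country"),
)

# Hypothetical spellings used by the other datasets.
reference_index = pd.Index(
    ["Albania", "Czech Republic", "South Korea"], name="country"
)

# HYPOTHETICAL mapping from forest spellings to reference spellings.
name_map = {"Czech Rep.": "Czech Republic", "Korea, Rep.": "South Korea"}

unmatched_before = forest_demo.index.difference(reference_index)
forest_renamed = forest_demo.rename(index=name_map)
unmatched_after = forest_renamed.index.difference(reference_index)

print(len(unmatched_before), len(unmatched_after))
```

Only labels that remain unmatched after renaming would then need to be dropped.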

### Going back to inflation and life_expectancy datasets to clean them ###

The life and inflation datasets still contain some countries for which we have no forest data. To compare all three datasets, those rows need to be dropped as well.

In [15]:
inflation_extra_rows = df_inflation.index.difference(df_forest.index)
df_inflation.drop(inflation_extra_rows, inplace=True)
df_inflation.shape

(172, 51)

In [16]:
life_extra_rows = df_life.index.difference(df_forest.index)
df_life.drop(life_extra_rows, inplace=True)
df_life.shape

(172, 51)
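
Matching row counts do not by themselves prove that the rows are the same countries, so a quick check on the index labels can confirm it. This is a minimal sketch, where `same_countries` is a hypothetical helper and the frames are toy stand-ins:

```python
import pandas as pd


def same_countries(*frames):
    """Return True if every frame has exactly the same set of index labels."""
    first = frames[0].index.sort_values()
    return all(first.equals(f.index.sort_values()) for f in frames[1:])


# Toy frames: a, b, c share the same countries (in different order); d does not.
a = pd.DataFrame({"x": [1, 2]}, index=pd.Index(["Albania", "Angola"], name="country"))
b = pd.DataFrame({"y": [3, 4]}, index=pd.Index(["Angola", "Albania"], name="country"))
c = pd.DataFrame({"z": [5, 6]}, index=pd.Index(["Albania", "Angola"], name="country"))
d = pd.DataFrame({"w": [7, 8]}, index=pd.Index(["Albania", "Taiwan"], name="country"))

all_match = same_countries(a, b, c)
mismatch = same_countries(a, d)
print(all_match, mismatch)
```

On the real data, the call would be `same_countries(df_forest, df_inflation, df_life)`.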

### Saving cleaned datasets ###

At this point all three datasets cover the same countries, and the 'inflation' and 'life' datasets cover the same years. Let's save the final cleaned data and start our analysis.

In [17]:
df_forest.to_csv('cleaned_forest_final.csv', index=True)
df_inflation.to_csv('cleaned_inflation_final.csv', index=True)
df_life.to_csv('cleaned_life_final.csv', index=True)
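
Because the files are written with `index=True`, the country names are stored as the first column. When loading them back, passing `index_col='country'` to `pd.read_csv` restores the index directly, without a separate `set_index` call. A small round-trip sketch using an in-memory buffer instead of a file (toy values):

```python
import io

import pandas as pd

# Toy frame indexed by country, like the cleaned datasets.
life_demo = pd.DataFrame(
    {"1961": [63.92, 48.02], "1962": [64.84, 48.55]},
    index=pd.Index(["Albania", "Algeria"], name="country"),
)

# Write to an in-memory buffer; a file path would work the same way.
buffer = io.StringIO()
life_demo.to_csv(buffer, index=True)
buffer.seek(0)

# Read back with the country column as the index.
restored = pd.read_csv(buffer, index_col="country")
print(restored.index.name, list(restored.columns))
```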