# WEEK 03
# Encounter 02 - Clean Gapminder
# Project Challenge

## Task Description

   1. Rename columns: use the method from the example above to properly name any columns that seem mislabeled in the **population** dataset. The **population** dataset was provided in the **EDA** lesson warmer.

   2. Missing data: first check which values are missing in the **population** dataset, and how many.

   3. Remove missing data: drop all observations with missing data.

   4. Filter for relevant data: filter the dataset so that it begins with the year **1950**.

   5. Make data persistent: save the dataset as a **.csv** file in your data folder, as it will be used for the week’s project.

   6. Repeat for the **life_expectancy** and **fertility_rate** datasets, which are available below.

**Hint:** one of the files is not a **.csv** and must be read in using a pandas function other than `read_csv()`

Related files

>    fertility_rate.csv (1 MB)
>    life_expectancy.xls (3 MB)
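
The hint about the non-CSV file can be sketched as a small dispatcher that picks the pandas reader from the file extension. `read_table` is a hypothetical helper name, not part of pandas or the course material, and the demo uses a throwaway CSV so that it runs without the course files. Note that `pd.read_excel` needs an Excel engine installed (for `.xls` files this is typically `xlrd`).

```python
from pathlib import Path
import tempfile

import pandas as pd


def read_table(path):
    """Hypothetical helper: choose a pandas reader from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xls", ".xlsx"}:
        # read_excel relies on an installed Excel engine (e.g. xlrd for .xls)
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix}")


# Demo with a temporary CSV (made-up values) instead of the course data
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    pd.DataFrame({"country": ["A", "B"], "year": [1950, 1951]}).to_csv(
        demo_path, index=False
    )
    demo = read_table(demo_path)

print(demo.shape)
```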

In [5]:
import pandas as pd

## Clean Gapminder: population dataset

In [7]:
# read population dataset from csv-file
population = pd.read_csv('../data/population.csv')
population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22275 entries, 0 to 22274
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Total population  22275 non-null  object 
 1   year              22275 non-null  int64  
 2   population        20176 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 522.2+ KB


In [9]:
# Rename columns: the first column holds country names but is labeled 'Total population'
population.rename(columns={'Total population': 'country'}, inplace=True)
population.columns

Index(['country', 'year', 'population'], dtype='object')

In [10]:
# Missing data: count the missing values in each column of the population dataset
population.isnull().sum()

country          0
year             0
population    2099
dtype: int64
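
Absolute counts are easier to judge next to the share of rows they represent. The sketch below shows one way to get both, using a small made-up DataFrame (`toy`) in place of the population table; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the population table (values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "year": [1800, 1800, 1950, 1950],
    "population": [np.nan, 1000.0, np.nan, 2500.0],
})

missing_count = toy.isna().sum()          # number of missing values per column
missing_share = toy.isna().mean() * 100   # percentage of rows missing per column

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary)
```

The mean of a boolean column is the fraction of `True` values, which is why `isna().mean()` gives the missing share directly.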

In [12]:
# Remove missing data: drop all observations with missing data
population.dropna(axis=0, inplace=True)
population.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20176 entries, 1 to 22261
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     20176 non-null  object 
 1   year        20176 non-null  int64  
 2   population  20176 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 630.5+ KB
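
After `dropna` the index keeps its original labels (here 1 to 22261 for 20176 rows), so it has gaps. A possible alternative, sketched on made-up data, is to drop rows without `inplace=True`, restrict the check to one column with `subset`, and renumber the rows afterwards. This is an option, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with one missing population entry
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [1950, 1950, 1951],
    "population": [np.nan, 1000.0, 2500.0],
})

cleaned = (
    toy.dropna(subset=["population"])   # drop rows where this column is missing
       .reset_index(drop=True)          # renumber rows 0..n-1 after the drop
)
print(cleaned)
```

Because nothing is modified in place, `toy` still holds all three rows and the step can be rerun safely.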


In [14]:
# Filter for relevant data: keep only rows from the year 1950 onward
mask_pop = population['year'] >= 1950
population_after_1950 = population[mask_pop]
population_after_1950

,country,year,population
4126,Afghanistan,1950,7752118.0
4127,Akrotiri and Dhekelia,1950,10661.0
4128,Albania,1950,1263171.0
4129,Algeria,1950,8872247.0
4130,American Samoa,1950,18937.0
...,...,...,...
22256,Zambia,2015,16211767.0
22257,Zimbabwe,2015,15602751.0
22259,South Sudan,2015,12339812.0
22260,Curaçao,2015,157203.0
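
The boolean-mask filter can be sketched on toy data as follows. Adding `.copy()` is an optional extra that makes the result an independent DataFrame, which avoids `SettingWithCopyWarning` if columns are modified later; `DataFrame.query` is shown as an equivalent spelling. The values are made up.

```python
import pandas as pd

# Toy data (made-up values) spanning years before and after 1950
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1949, 1950, 1949, 2015],
    "population": [10.0, 11.0, 20.0, 25.0],
})

mask = toy["year"] >= 1950          # boolean Series, True for rows to keep
from_1950 = toy[mask].copy()        # independent copy of the filtered rows
same_rows = toy.query("year >= 1950")

print(from_1950["year"].min())
```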


In [16]:
# Make data persistent: save the dataset as a .csv file in the data folder, as it will be used for the week’s project
population_after_1950.to_csv('../data/population_after_1950.csv', index=False)

In [19]:
# Check that the saved file can be read back
df_test = pd.read_csv('../data/population_after_1950.csv')
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16741 entries, 0 to 16740
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     16741 non-null  object 
 1   year        16741 non-null  int64  
 2   population  16741 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 392.5+ KB
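
Reading the file back and inspecting `info()` confirms the shape and dtypes. A stricter check, sketched here on made-up data and a temporary directory rather than the real `../data` folder, compares the reloaded frame with the original using `pd.testing.assert_frame_equal`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a non-contiguous index, as left behind by dropna and filtering
toy = pd.DataFrame(
    {"country": ["A", "B"], "year": [1950, 2015], "population": [11.0, 25.0]},
    index=[4126, 22256],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_after_1950.csv"
    toy.to_csv(path, index=False)   # index=False keeps the old row labels out of the file
    reloaded = pd.read_csv(path)

# The reloaded frame gets a fresh 0..n-1 index, so compare against a reset copy
pd.testing.assert_frame_equal(reloaded, toy.reset_index(drop=True))
```

`assert_frame_equal` raises an `AssertionError` if values, column names, or dtypes differ, and returns silently otherwise.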


## Clean Gapminder: life_expectancy dataset

In [20]:
# Read life_expectancy
life_exp = pd.read_excel('../data/life_expectancy.xls')
life_exp.head(10)

,Life expectancy,year,life expectancy
0,Abkhazia,1800,
1,Afghanistan,1800,28.21
2,Akrotiri and Dhekelia,1800,
3,Albania,1800,35.4
4,Algeria,1800,28.82
5,American Samoa,1800,
6,Andorra,1800,
7,Angola,1800,26.98
8,Anguilla,1800,
9,Antigua and Barbuda,1800,33.54


In [21]:
# Rename any columns that seem mislabeled in the dataset
life_exp.rename(columns={'Life expectancy': 'country'}, inplace=True)
life_exp.columns

Index(['country', 'year', 'life expectancy'], dtype='object')

In [22]:
# Count the missing values in each column of the dataset
life_exp.isnull().sum()

country                0
year                   0
life expectancy    12563
dtype: int64

In [23]:
life_exp.info()    # -> about 22% of rows (12563 of 56420) have missing data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56420 entries, 0 to 56419
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   country          56420 non-null  object 
 1   year             56420 non-null  int64  
 2   life expectancy  43857 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.3+ MB


In [27]:
life_exp[life_exp['life expectancy'].isnull()]['year'].unique()  # -> years with NaN values range from 1800 to 2016

array([1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810,
       1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821,
       1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832,
       1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843,
       1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854,
       1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865,
       1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876,
       1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887,
       1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898,
       1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909,
       1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920,
       1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931,
       1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942,
       1943, 1944, 1945, 1946, 1947, 1948, 1949, 19
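
A list of affected years shows where gaps occur but not how large they are. One way to count missing values per year is to group the `isna()` flags by year, sketched below on made-up data with the same column names.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with gaps in several years
toy = pd.DataFrame({
    "country": ["A", "B", "A", "B", "A", "B"],
    "year": [1800, 1800, 1950, 1950, 2016, 2016],
    "life expectancy": [np.nan, 30.0, np.nan, np.nan, 70.0, 80.0],
})

missing_per_year = (
    toy["life expectancy"].isna()   # True where the value is missing
       .groupby(toy["year"])        # group the flags by year
       .sum()                       # True counts as 1, so this counts gaps
)
print(missing_per_year)
```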

In [28]:
# drop all observations with missing data
life_exp.dropna(axis=0, inplace=True)
life_exp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43857 entries, 1 to 56419
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   country          43857 non-null  object 
 1   year             43857 non-null  int64  
 2   life expectancy  43857 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.3+ MB


In [29]:
# Filter for relevant data: keep only rows from the year 1950 onward
mask_life_exp = life_exp['year'] >= 1950
life_exp_after_1950 = life_exp[mask_life_exp]
life_exp_after_1950

,country,year,life expectancy
39001,Afghanistan,1950,26.85
39003,Albania,1950,54.48
39004,Algeria,1950,42.77
39007,Angola,1950,30.70
39009,Antigua and Barbuda,1950,57.97
...,...,...,...
56411,Virgin Islands (U.S.),2016,80.82
56414,Yemen,2016,64.92
56416,Zambia,2016,57.10
56417,Zimbabwe,2016,61.69


In [30]:
# save the dataset as a .csv file
life_exp_after_1950.to_csv('../data/life_expectancy_after_1950.csv', index=False)

In [31]:
# Check that the saved file can be read back
df_tmp = pd.read_csv('../data/life_expectancy_after_1950.csv')
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13707 entries, 0 to 13706
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   country          13707 non-null  object 
 1   year             13707 non-null  int64  
 2   life expectancy  13707 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 321.4+ KB


## Clean Gapminder: fertility_rate dataset

In [32]:
# Read fertility_rate dataset
fertility_rate = pd.read_csv('../data/fertility_rate.csv')
fertility_rate.head(10)

,Total fertility rate,year,fertility
0,Abkhazia,1800,
1,Afghanistan,1800,7.0
2,Akrotiri and Dhekelia,1800,
3,Albania,1800,4.6
4,Algeria,1800,6.99
5,American Samoa,1800,
6,Andorra,1800,
7,Angola,1800,6.93
8,Anguilla,1800,
9,Antigua and Barbuda,1800,5.0


In [33]:
# Rename any columns that seem mislabeled in the dataset
fertility_rate.rename(columns={'Total fertility rate': 'country'}, inplace=True)
fertility_rate.columns

Index(['country', 'year', 'fertility'], dtype='object')

In [34]:
# Count the missing values in each column of the dataset
fertility_rate.isnull().sum()

country          0
year             0
fertility    12747
dtype: int64

In [35]:
fertility_rate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56159 entries, 0 to 56158
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    56159 non-null  object 
 1   year       56159 non-null  int64  
 2   fertility  43412 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.3+ MB


In [36]:
# drop all observations with missing data
fertility_rate.dropna(axis=0, inplace=True)
fertility_rate.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43412 entries, 1 to 56157
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    43412 non-null  object 
 1   year       43412 non-null  int64  
 2   fertility  43412 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.3+ MB


In [37]:
# Filter for relevant data: keep only rows from the year 1950 onward
mask_fert = fertility_rate['year'] >= 1950
fertility_rate_after_1950 = fertility_rate[mask_fert]
fertility_rate_after_1950

,country,year,fertility
39001,Afghanistan,1950,7.67
39003,Albania,1950,5.80
39004,Algeria,1950,7.65
39007,Angola,1950,6.93
39009,Antigua and Barbuda,1950,4.45
...,...,...,...
56150,Vietnam,2015,1.70
56151,Virgin Islands (U.S.),2015,2.45
56154,Yemen,2015,3.83
56156,Zambia,2015,5.59


In [38]:
# save the dataset as a .csv file
fertility_rate_after_1950.to_csv('../data/fertility_rate_after_1950.csv', index=False)

In [39]:
# Check that the saved file can be read back
df_t = pd.read_csv('../data/fertility_rate_after_1950.csv')
df_t.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13262 entries, 0 to 13261
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    13262 non-null  object 
 1   year       13262 non-null  int64  
 2   fertility  13262 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 311.0+ KB
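
All three datasets went through the same rename, drop, and filter steps, so these could be bundled into one function. The sketch below is a possible refactoring, not the notebook's own code: `clean_gapminder` and its parameters are hypothetical names, the data is made up, and the added `reset_index` is a choice that the cells above do not make.

```python
import numpy as np
import pandas as pd


def clean_gapminder(df, country_col, start_year=1950):
    """Hypothetical helper bundling the rename, dropna and year-filter steps."""
    return (
        df.rename(columns={country_col: "country"})      # fix the mislabeled column
          .dropna()                                      # drop rows with missing data
          .loc[lambda d: d["year"] >= start_year]        # keep start_year onward
          .reset_index(drop=True)                        # renumber rows 0..n-1
    )


# Toy data (made-up values) shaped like the raw fertility table
raw = pd.DataFrame({
    "Total fertility rate": ["A", "B", "A", "B"],
    "year": [1800, 1800, 1950, 1950],
    "fertility": [7.0, np.nan, 6.5, np.nan],
})

tidy = clean_gapminder(raw, "Total fertility rate")
print(tidy)
```

The result can then be written out with `to_csv(..., index=False)` as in the cells above.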
