# WEEK 03
# Encounter 04 - Merging Dataframes
# Project Challenge - Combining DataFrames

## Task Description

   1. Open an empty notebook and read in the **life_expectancy** and **continents** datasets. Both are available from earlier lessons. The life expectancy dataset should be the cleaned version of the original dataset (the outcome of the Data Cleansing project milestones).
   

   2. Merge the two dataframes into one using `pandas.DataFrame.merge()`

   > See the documentation for `pandas.DataFrame.merge()`

    **Note:** Calling `merge()` on its own only renders the merged dataframe in your notebook. To run further commands on the result, assign it to a variable, e.g. `df_merged = df1.merge(df2)`.


   3. Repeat steps 1 and 2 with **population** and **total_fertility** (again, the cleaned versions) until you have a single dataframe that contains the information from all four original dataframes.

    **Tip:** The column on which the dataframes are merged must have the same data type in both dataframes. If both hold numbers, but one stores them as strings and the other as integers, the dataframes will not merge properly. Use `.astype()` to remedy this.
    

   4. Write the new dataframe to disk as `gapminder_total.csv` in this week’s data folder for use in the upcoming lessons.
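The tip in step 3 can be sketched with two toy dataframes. The names `left` and `right` are made up for illustration, and the rows only reuse a few values that appear later in this notebook. The sketch assumes the mismatch is in the `year` column, stored as strings on one side and integers on the other.

```python
import pandas as pd

# Toy frames (hypothetical names): 'year' is int64 on the left, strings on the right
left = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan"],
    "year": [1950, 1951],
    "life expectancy": [26.85, 27.13],
})
right = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan"],
    "year": ["1950", "1951"],
    "population": [7752118.0, 7839426.0],
})

# Compare the key dtypes before merging
print(left["year"].dtype, right["year"].dtype)

# Align the dtype of the merge key, then merge on both keys
right["year"] = right["year"].astype("int64")
merged = left.merge(right, on=["country", "year"])
print(merged)
```

Checking the key dtypes first is cheap and usually explains a merge that fails or returns no matching rows.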

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
# reading life_expectancy cleaned dataset

life_expectancy = pd.read_csv('../data/life_expectancy_after_1950.csv')
life_expectancy

|  | country | year | life expectancy |
| --- | --- | --- | --- |
| 0 | Afghanistan | 1950 | 26.85 |
| 1 | Albania | 1950 | 54.48 |
| 2 | Algeria | 1950 | 42.77 |
| 3 | Angola | 1950 | 30.70 |
| 4 | Antigua and Barbuda | 1950 | 57.97 |
| ... | ... | ... | ... |
| 13702 | Virgin Islands (U.S.) | 2016 | 80.82 |
| 13703 | Yemen | 2016 | 64.92 |
| 13704 | Zambia | 2016 | 57.10 |
| 13705 | Zimbabwe | 2016 | 61.69 |


### Merge 'continents' data

In [3]:
# reading continents dataset

continents = pd.read_csv('../data/continents.csv', sep=';')
continents

|  | continent | country |
| --- | --- | --- |
| 0 | Africa | Algeria |
| 1 | Africa | Angola |
| 2 | Africa | Benin |
| 3 | Africa | Botswana |
| 4 | Africa | Burkina |
| ... | ... | ... |
| 189 | South America | Paraguay |
| 190 | South America | Peru |
| 191 | South America | Suriname |
| 192 | South America | Uruguay |


In [4]:
# Merge the two dataframes into one using pandas.DataFrame.merge()

df_merged = pd.merge(left=life_expectancy, right=continents, how='outer', on='country')
df_merged

|  | country | year | life expectancy | continent |
| --- | --- | --- | --- | --- |
| 0 | Afghanistan | 1950.0 | 26.85 | Asia |
| 1 | Afghanistan | 1951.0 | 27.13 | Asia |
| 2 | Afghanistan | 1952.0 | 27.67 | Asia |
| 3 | Afghanistan | 1953.0 | 28.19 | Asia |
| 4 | Afghanistan | 1954.0 | 28.73 | Asia |
| ... | ... | ... | ... | ... |
| 13724 | Saint Vincent and the Grenadines | NaN | NaN | North America |
| 13725 | Micronesia | NaN | NaN | Australia and Oceania |
| 13726 | Nauru | NaN | NaN | Australia and Oceania |
| 13727 | Palau | NaN | NaN | Australia and Oceania |
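The last rows of this output have no `year` or `life expectancy` because, with `how='outer'`, countries that appear in only one of the two dataframes are kept and padded with missing values. That is also why `year` is now shown as a float. One way to list such unmatched countries is the `indicator=True` option of `pd.merge`, which adds a `_merge` column. The following is a minimal sketch on toy frames (`life` and `cont` are hypothetical names; the toy continents table deliberately leaves out Albania):

```python
import pandas as pd

# Toy stand-ins for the two datasets, reusing a few values shown above
life = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "year": [1950, 1950],
    "life expectancy": [26.85, 54.48],
})
cont = pd.DataFrame({
    "continent": ["Asia", "Australia and Oceania"],
    "country": ["Afghanistan", "Nauru"],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(left=life, right=cont, how="outer", on="country", indicator=True)

# Rows that found no partner in the other dataframe
unmatched = merged.loc[merged["_merge"] != "both", ["country", "_merge"]]
print(unmatched)

# The missing year for the right-only row forces the column to float64
print(merged["year"].dtype)
```

Unmatched names are often spelling variants of the same country, so this list is a reasonable starting point for further cleaning.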


### Merge 'population' data

In [5]:
# read cleaned 'population' dataset

population = pd.read_csv('../data/population_after_1950.csv')
population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16741 entries, 0 to 16740
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     16741 non-null  object 
 1   year        16741 non-null  int64  
 2   population  16741 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 392.5+ KB


In [6]:
# merging df_merged and 'population' datasets

df_merged = pd.merge(left=df_merged, right=population, how='outer', on=['country','year'])
df_merged

|  | country | year | life expectancy | continent | population |
| --- | --- | --- | --- | --- | --- |
| 0 | Afghanistan | 1950.0 | 26.85 | Asia | 7752118.0 |
| 1 | Afghanistan | 1951.0 | 27.13 | Asia | 7839426.0 |
| 2 | Afghanistan | 1952.0 | 27.67 | Asia | 7934798.0 |
| 3 | Afghanistan | 1953.0 | 28.19 | Asia | 8038312.0 |
| 4 | Afghanistan | 1954.0 | 28.73 | Asia | 8150037.0 |
| ... | ... | ... | ... | ... | ... |
| 16970 | Turks and Caicos Islands | 2015.0 | NaN | NaN | 34339.0 |
| 16971 | Tuvalu | 2015.0 | NaN | NaN | 9916.0 |
| 16972 | Wallis et Futuna | 2015.0 | NaN | NaN | 13151.0 |
| 16973 | Curaçao | 2015.0 | NaN | NaN | 157203.0 |
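This merge uses two key columns, so a row only matches when both `country` and `year` agree. If each (country, year) pair is expected to occur at most once per dataframe, the `validate` argument of `pd.merge` can check that assumption. The sketch below uses toy frames with hypothetical names; whether the course data actually satisfies the one-to-one assumption is not shown in this notebook.

```python
import pandas as pd

# Toy frames keyed by (country, year), reusing values from the output above
life = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan"],
    "year": [1950, 1951],
    "life expectancy": [26.85, 27.13],
})
pop = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan"],
    "year": [1950, 1951],
    "population": [7752118.0, 7839426.0],
})

# validate='one_to_one' raises MergeError if a key pair is duplicated on either side
merged = pd.merge(left=life, right=pop, how="outer",
                  on=["country", "year"], validate="one_to_one")
print(merged)

# A duplicated (country, year) pair on the right side is reported instead of
# silently multiplying rows
pop_dup = pd.concat([pop, pop.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left=life, right=pop_dup, how="outer",
             on=["country", "year"], validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```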


### Merge 'total_fertility' dataset

In [7]:
# read cleaned 'total_fertility' dataset

fertility = pd.read_csv('../data/fertility_rate_after_1950.csv')
fertility.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13262 entries, 0 to 13261
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    13262 non-null  object 
 1   year       13262 non-null  int64  
 2   fertility  13262 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 311.0+ KB


In [8]:
# merging df_merged and 'fertility' datasets

df_merged = pd.merge(left=df_merged, right=fertility, how='outer', on=['country','year'])
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16975 entries, 0 to 16974
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   country          16975 non-null  object 
 1   year             16953 non-null  float64
 2   life expectancy  13707 non-null  float64
 3   continent        11426 non-null  object 
 4   population       16741 non-null  float64
 5   fertility        13262 non-null  float64
dtypes: float64(4), object(2)
memory usage: 928.3+ KB
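The `Non-Null Count` column above already hints at how many values each outer merge left empty. A direct per-column tally of missing values can be obtained with `isna().sum()`. This is a small sketch on a toy frame (the name `df` and the three rows are illustrative only):

```python
import pandas as pd

# Toy frame with gaps similar to those an outer merge leaves behind
df = pd.DataFrame({
    "country": ["Afghanistan", "Nauru", "Tuvalu"],
    "year": [1950.0, float("nan"), 2015.0],
    "continent": ["Asia", "Australia and Oceania", None],
    "population": [7752118.0, float("nan"), 9916.0],
})

# Number of missing values per column
missing = df.isna().sum()
print(missing)

# Rows with at least one missing value
incomplete = df[df.isna().any(axis=1)]
print(incomplete)
```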


In [12]:
df_merged.head(10)

|  | country | year | life expectancy | continent | population | fertility |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Afghanistan | 1950.0 | 26.85 | Asia | 7752118.0 | 7.67 |
| 1 | Afghanistan | 1951.0 | 27.13 | Asia | 7839426.0 | 7.67 |
| 2 | Afghanistan | 1952.0 | 27.67 | Asia | 7934798.0 | 7.67 |
| 3 | Afghanistan | 1953.0 | 28.19 | Asia | 8038312.0 | 7.67 |
| 4 | Afghanistan | 1954.0 | 28.73 | Asia | 8150037.0 | 7.67 |
| 5 | Afghanistan | 1955.0 | 29.27 | Asia | 8270024.0 | 7.67 |
| 6 | Afghanistan | 1956.0 | 29.8 | Asia | 8398309.0 | 7.67 |
| 7 | Afghanistan | 1957.0 | 30.34 | Asia | 8534913.0 | 7.67 |
| 8 | Afghanistan | 1958.0 | 30.86 | Asia | 8679848.0 | 7.67 |
| 9 | Afghanistan | 1959.0 | 31.4 | Asia | 8833127.0 | 7.67 |


### Additional data cleaning

In [13]:
# Convert 'year' and 'population' columns from 'float64' to 'Int64'

df_merged['year'] = df_merged['year'].astype('Int64')
df_merged['population'] = df_merged['population'].astype('Int64')
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16975 entries, 0 to 16974
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   country          16975 non-null  object 
 1   year             16953 non-null  Int64  
 2   life expectancy  13707 non-null  float64
 3   continent        11426 non-null  object 
 4   population       16741 non-null  Int64  
 5   fertility        13262 non-null  float64
dtypes: Int64(2), float64(2), object(2)
memory usage: 961.5+ KB
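The capital `I` in `'Int64'` matters here: it selects pandas' nullable integer type, which can hold missing values, whereas NumPy's plain `'int64'` cannot. The sketch below shows the difference on a toy series (the name `year` and its values are illustrative only):

```python
import pandas as pd

# A float column with one missing value, as produced by the outer merges
year = pd.Series([1950.0, 2015.0, float("nan")])

# Nullable integer dtype: the missing value is kept as <NA>
nullable = year.astype("Int64")
print(nullable)

# Plain NumPy int64 cannot represent a missing value, so the cast fails
try:
    year.astype("int64")
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True
print(plain_cast_failed)
```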


### Save **df_merged** dataframe to hard drive as `gapminder_total.csv`

In [15]:
# save the DataFrame as a CSV file without its index
df_merged.to_csv('../data/gapminder_total.csv', index=False)

In [17]:
# test that the saved file can be read back

#gapminder_total = pd.read_csv('../data/gapminder_total.csv')
#gapminder_total

# Note: after reading the CSV file back, the dtype of 'year' and 'population' is 'float64' again, not 'Int64',
# because read_csv infers float64 for numeric columns that contain missing values.
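
A CSV file stores no dtype information, so the `Int64` conversion has to be repeated when the file is loaded. One option is the `dtype` argument of `pd.read_csv`. The following sketch uses an in-memory buffer and a toy frame instead of the real `gapminder_total.csv`, so the names and rows are illustrative only:

```python
import io
import pandas as pd

# Toy frame with nullable integer columns, including missing values
df = pd.DataFrame({
    "country": ["Afghanistan", "Nauru"],
    "year": pd.array([1950, None], dtype="Int64"),
    "population": pd.array([7752118, None], dtype="Int64"),
})

# Write to CSV text (stand-in for the file on disk)
csv_text = df.to_csv(index=False)

# Default read: columns with missing values are inferred as float64
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.dtypes)

# Passing dtype restores the nullable integer columns
typed = pd.read_csv(io.StringIO(csv_text),
                    dtype={"year": "Int64", "population": "Int64"})
print(typed.dtypes)
```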