In [82]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **Problem Statement: How does GDP affect the standard of living?**


The following libraries are used to address the problem statement:
> pandas: A powerful library for data manipulation and analysis. <br>
> numpy: A library for working with arrays, linear algebra, and other mathematical operations. <br>
> matplotlib: A library for creating static, animated, and interactive visualizations. <br>
> seaborn: A statistical data visualization library based on Matplotlib. <br>
> scikit-learn: A library for machine learning in Python, including various models, preprocessing tools, and evaluation metrics.

In [83]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## Cleaning the Happiness Index Datasets

In [84]:
# Read the CSV files into DataFrames
happiness_df2015 = pd.read_csv("/content/drive/MyDrive/SC1015 Mini Project/datasets/2015_happinessindex.csv")
happiness_df2016 = pd.read_csv("/content/drive/MyDrive/SC1015 Mini Project/datasets/2016_happinessindex.csv")
happiness_df2017 = pd.read_csv("/content/drive/MyDrive/SC1015 Mini Project/datasets/2017_happinessindex.csv")
happiness_df2018 = pd.read_csv("/content/drive/MyDrive/SC1015 Mini Project/datasets/2018_happinessindex.csv")
happiness_df2019 = pd.read_csv("/content/drive/MyDrive/SC1015 Mini Project/datasets/2019_happinessindex.csv")

print(happiness_df2015.columns)
print(happiness_df2016.columns)
print(happiness_df2017.columns)
print(happiness_df2018.columns)
print(happiness_df2019.columns)



Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual'],
      dtype='object')
Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Lower Confidence Interval', 'Upper Confidence Interval',
       'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
       'Freedom', 'Trust (Government Corruption)', 'Generosity',
       'Dystopia Residual'],
      dtype='object')
Index(['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high',
       'Whisker.low', 'Economy..GDP.per.Capita.', 'Family',
       'Health..Life.Expectancy.', 'Freedom', 'Generosity',
       'Trust..Government.Corruption.', 'Dystopia.Residual'],
      dtype='object')
Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Fre

__The five datasets above describe the same quantities, but several columns are named differently from year to year. To make them consistent, we rename the 2017, 2018 and 2019 columns to match the 2015 and 2016 names using `DataFrame.rename()`.__

In [85]:
#renaming 2017 dataset
happiness_df2017.rename(columns={
    'Happiness.Rank': 'Happiness Rank',
    'Happiness.Score': 'Happiness Score',
    'Economy..GDP.per.Capita.': 'Economy (GDP per Capita)',  # the trailing dot is part of the 2017 column name
    'Health..Life.Expectancy.': 'Health (Life Expectancy)',
    'Trust..Government.Corruption.': 'Trust (Government Corruption)',
    'Dystopia.Residual': 'Dystopia Residual'
}, inplace=True)

#renaming 2018 dataset
happiness_df2018.rename(columns={
    'Country or region': 'Country',
    'Score': 'Happiness Score',
    'GDP per capita': 'Economy (GDP per Capita)',
    'Social support': 'Family',
    'Healthy life expectancy': 'Health (Life Expectancy)',
    'Freedom to make life choices': 'Freedom',
    'Perceptions of corruption': 'Trust (Government Corruption)'
}, inplace=True)

#renaming 2019 dataset
happiness_df2019.rename(columns={
    'Country or region': 'Country',
    'Score': 'Happiness Score',
    'GDP per capita': 'Economy (GDP per Capita)',
    'Social support': 'Family',
    'Healthy life expectancy': 'Health (Life Expectancy)',
    'Freedom to make life choices': 'Freedom',
    'Perceptions of corruption': 'Trust (Government Corruption)'
}, inplace=True)
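
By default, `DataFrame.rename()` skips any key that does not match an existing column, so a misspelled key leaves the column unchanged without a warning. The sketch below uses a made-up two-column frame (not the real 2017 file) to show how passing `errors='raise'` surfaces such a mismatch. The names `toy_2017` and `mapping` are illustrative only.

```python
import pandas as pd

# Toy frame that mimics two of the 2017 column names (values are invented)
toy_2017 = pd.DataFrame({
    'Happiness.Score': [7.5, 7.4],
    'Economy..GDP.per.Capita.': [1.6, 1.5],
})

mapping = {
    'Happiness.Score': 'Happiness Score',
    'Economy..GDP.per.Capita': 'Economy (GDP per Capita)',  # trailing dot left out on purpose
}

# Default behaviour: the unmatched key is skipped silently
silent = toy_2017.rename(columns=mapping)
print(list(silent.columns))

# errors='raise' turns the unmatched key into a KeyError
try:
    toy_2017.rename(columns=mapping, errors='raise')
    raised = False
except KeyError as exc:
    raised = True
    print('Unmatched column:', exc)
```

Printing the columns after each rename, as the notebook does before renaming, is another simple way to confirm that every year ends up with the same names.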

## Picking out the data we want


__Our analysis uses data from 2015 to 2019, so we clean the Happiness Index, GDP and HDI datasets and filter each of them down to those years.__

In [86]:
# Find the intersection of countries in all five CSV files
common_countries = set(happiness_df2015['Country']).intersection(
    happiness_df2016['Country'], happiness_df2017['Country'], happiness_df2018['Country'], happiness_df2019['Country']
)

# Filter each file to keep only common countries
filtered_2015 = happiness_df2015[happiness_df2015['Country'].isin(common_countries)]
filtered_2016 = happiness_df2016[happiness_df2016['Country'].isin(common_countries)]
filtered_2017 = happiness_df2017[happiness_df2017['Country'].isin(common_countries)]
filtered_2018 = happiness_df2018[happiness_df2018['Country'].isin(common_countries)]
filtered_2019 = happiness_df2019[happiness_df2019['Country'].isin(common_countries)]

# Sort the rows alphabetically by country
sorted_2015 = filtered_2015.sort_values(by='Country', ascending=True)
sorted_2016 = filtered_2016.sort_values(by='Country', ascending=True)
sorted_2017 = filtered_2017.sort_values(by='Country', ascending=True)
sorted_2018 = filtered_2018.sort_values(by='Country', ascending=True)
sorted_2019 = filtered_2019.sort_values(by='Country', ascending=True)
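
The intersect-filter-sort pattern used above can be tried on a small scale. This is a minimal sketch with invented country lists and scores; `year_a`, `year_b` and `year_c` are hypothetical stand-ins for the yearly files, and the `reset_index(drop=True)` step is an optional extra that the notebook itself does not perform.

```python
import pandas as pd

# Invented miniature "years"; only the country names matter here
year_a = pd.DataFrame({'Country': ['Norway', 'Chile', 'Kenya'], 'Happiness Score': [7.5, 6.7, 4.4]})
year_b = pd.DataFrame({'Country': ['Chile', 'Norway', 'Laos'], 'Happiness Score': [6.6, 7.6, 4.9]})
year_c = pd.DataFrame({'Country': ['Norway', 'Chile'], 'Happiness Score': [7.6, 6.5]})

# set.intersection accepts any number of iterables
common = set(year_a['Country']).intersection(year_b['Country'], year_c['Country'])

# Keep only the shared countries, then sort the rows alphabetically
aligned = []
for df in (year_a, year_b, year_c):
    kept = df[df['Country'].isin(common)]
    aligned.append(kept.sort_values(by='Country').reset_index(drop=True))

print(aligned[0])
```

After this step every frame lists the same countries in the same order, so rows can be compared position by position across years.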


## Cleaning the GDP (Gross Domestic Product) datasets

In [87]:
# Load the data into Pandas dataframes

gdp = pd.read_csv('/content/drive/MyDrive/SC1015 Mini Project/datasets/gdp.csv', index_col=0)
gdp_growth = pd.read_csv('/content/drive/MyDrive/SC1015 Mini Project/datasets/gdp_growth.csv', index_col=0)
gdp_per_capita = pd.read_csv('/content/drive/MyDrive/SC1015 Mini Project/datasets/gdp_per_capita.csv', index_col=0)
gdp_per_capita_growth = pd.read_csv('/content/drive/MyDrive/SC1015 Mini Project/datasets/gdp_per_capita_growth.csv', index_col=0)
gdp_ppp = pd.read_csv('/content/drive/MyDrive/SC1015 Mini Project/datasets/gdp_ppp.csv', index_col=0)
gdp_ppp_per_capita = pd.read_csv('/content/drive/MyDrive/SC1015 Mini Project/datasets/gdp_ppp_per_capita.csv', index_col=0)

## Drop Column "Code" and Extract Only 2015-2019 Data

gdp = gdp.drop(axis = 1, labels = ['Code'])
gdp_growth = gdp_growth.drop(axis = 1, labels = ['Code'])
gdp_per_capita = gdp_per_capita.drop(axis = 1, labels = ['Code'])
gdp_per_capita_growth = gdp_per_capita_growth.drop(axis = 1, labels = ['Code'])
gdp_ppp = gdp_ppp.drop(axis = 1, labels = ['Code'])
gdp_ppp_per_capita = gdp_ppp_per_capita.drop(axis = 1, labels = ['Code'])

# Extract the data for the Respective Years (2015)
gdp_2015 = gdp['2015'].rename('GDP')
gdp_growth_2015 = gdp_growth['2015'].rename('GDP Growth')
gdp_per_capita_2015 = gdp_per_capita['2015'].rename('GDP Per Capita')
gdp_per_capita_growth_2015 = gdp_per_capita_growth['2015'].rename('GDP Per Capita Growth')
gdp_ppp_2015 = gdp_ppp['2015'].rename('GDP PPP')
gdp_ppp_per_capita_2015 = gdp_ppp_per_capita['2015'].rename('GDP PPP Per Capita')

# Concat the extracted data into one Dataframe
data_2015 = pd.concat([gdp_2015, gdp_growth_2015, gdp_per_capita_2015, gdp_per_capita_growth_2015,gdp_ppp_2015, gdp_ppp_per_capita_2015], axis=1)

# Extract the data for the Respective Years (2016)
gdp_2016 = gdp['2016'].rename('GDP')
gdp_growth_2016 = gdp_growth['2016'].rename('GDP Growth')
gdp_per_capita_2016 = gdp_per_capita['2016'].rename('GDP Per Capita')
gdp_per_capita_growth_2016 = gdp_per_capita_growth['2016'].rename('GDP Per Capita Growth')
gdp_ppp_2016 = gdp_ppp['2016'].rename('GDP PPP')
gdp_ppp_per_capita_2016 = gdp_ppp_per_capita['2016'].rename('GDP PPP Per Capita')

# Concat the extracted data into one Dataframe 
data_2016 = pd.concat([gdp_2016, gdp_growth_2016, gdp_per_capita_2016, gdp_per_capita_growth_2016,gdp_ppp_2016, gdp_ppp_per_capita_2016], axis=1)

# Extract the data for the Respective Years (2017)
gdp_2017 = gdp['2017'].rename('GDP')
gdp_growth_2017 = gdp_growth['2017'].rename('GDP Growth')
gdp_per_capita_2017 = gdp_per_capita['2017'].rename('GDP Per Capita')
gdp_per_capita_growth_2017 = gdp_per_capita_growth['2017'].rename('GDP Per Capita Growth')
gdp_ppp_2017 = gdp_ppp['2017'].rename('GDP PPP')
gdp_ppp_per_capita_2017 = gdp_ppp_per_capita['2017'].rename('GDP PPP Per Capita')

# Concat the extracted data into one Dataframe
data_2017 = pd.concat([gdp_2017, gdp_growth_2017, gdp_per_capita_2017, gdp_per_capita_growth_2017,gdp_ppp_2017, gdp_ppp_per_capita_2017], axis=1)

# Extract the data for the Respective Years  (2018)
gdp_2018 = gdp['2018'].rename('GDP')
gdp_growth_2018 = gdp_growth['2018'].rename('GDP Growth')
gdp_per_capita_2018 = gdp_per_capita['2018'].rename('GDP Per Capita')
gdp_per_capita_growth_2018 = gdp_per_capita_growth['2018'].rename('GDP Per Capita Growth')
gdp_ppp_2018 = gdp_ppp['2018'].rename('GDP PPP')
gdp_ppp_per_capita_2018 = gdp_ppp_per_capita['2018'].rename('GDP PPP Per Capita')

# Concat the extracted data into one Dataframe
data_2018 = pd.concat([gdp_2018, gdp_growth_2018, gdp_per_capita_2018, gdp_per_capita_growth_2018,gdp_ppp_2018, gdp_ppp_per_capita_2018], axis=1)

# Extract the data for the Respective Years (2019)
gdp_2019 = gdp['2019'].rename('GDP')
gdp_growth_2019 = gdp_growth['2019'].rename('GDP Growth')
gdp_per_capita_2019 = gdp_per_capita['2019'].rename('GDP Per Capita')
gdp_per_capita_growth_2019 = gdp_per_capita_growth['2019'].rename('GDP Per Capita Growth')
gdp_ppp_2019 = gdp_ppp['2019'].rename('GDP PPP')
gdp_ppp_per_capita_2019 = gdp_ppp_per_capita['2019'].rename('GDP PPP Per Capita')

# Concat the extracted data into one Dataframe
data_2019 = pd.concat([gdp_2019, gdp_growth_2019, gdp_per_capita_2019, gdp_per_capita_growth_2019,gdp_ppp_2019, gdp_ppp_per_capita_2019], axis=1)

# 'Country Name' is the index of each DataFrame, so sort by index
data_2015 = data_2015.sort_index()
data_2016 = data_2016.sort_index()
data_2017 = data_2017.sort_index()
data_2018 = data_2018.sort_index()
data_2019 = data_2019.sort_index()

# For loop to print out data from 2015-2019
for year in range(2015, 2020):

    print(f'Data for year {year}:')
    print(locals()[f'data_{year}'].head())
    print('\n')


Data for year 2015:
                                      GDP  GDP Growth  GDP Per Capita  \
Country Name                                                            
Afghanistan                  1.913421e+10    1.451315      556.007221   
Africa Eastern and Southern  9.199300e+11    2.925591     1549.037940   
Africa Western and Central   7.607297e+11    2.745937     1894.310195   
Albania                      1.138685e+10    2.218726     3952.802538   
Algeria                      1.659793e+11    3.700000     4177.889542   

                             GDP Per Capita Growth       GDP PPP  \
Country Name                                                       
Afghanistan                              -1.622857  7.183170e+10   
Africa Eastern and Southern               0.187860  2.098286e+12   
Africa Western and Central                0.007402  1.662297e+12   
Albania                                   2.516827  3.358584e+10   
Algeria                                   1.600494  4.773576
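
The per-year extraction above repeats the same six lines for each year. As a possible alternative, the sketch below builds the yearly tables in a loop and stores them in a dictionary keyed by year. It uses two invented indicator tables rather than the real CSV files, and the names `indicators` and `gdp_by_year` are illustrative assumptions.

```python
import pandas as pd

# Invented stand-ins for two indicator tables (index = country, columns = years)
toy_gdp = pd.DataFrame(
    {'2015': [10.0, 20.0], '2016': [11.0, 21.0]},
    index=pd.Index(['B-land', 'A-land'], name='Country Name'),
)
toy_growth = pd.DataFrame(
    {'2015': [1.0, 2.0], '2016': [1.5, 2.5]},
    index=pd.Index(['B-land', 'A-land'], name='Country Name'),
)

# Column label -> source table
indicators = {'GDP': toy_gdp, 'GDP Growth': toy_growth}

# One DataFrame per year: each indicator contributes one named column
gdp_by_year = {}
for year in (2015, 2016):
    columns = []
    for label, table in indicators.items():
        columns.append(table[str(year)].rename(label))
    gdp_by_year[year] = pd.concat(columns, axis=1).sort_index()

print(gdp_by_year[2015])
```

A dictionary such as `gdp_by_year` also removes the need to look variables up through `locals()` when printing each year.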

## Cleaning the Human Development Index (HDI) dataset

In [91]:
# Cleaning Human Development Index (HDI) dataset
HDI = pd.read_csv("/content/drive/MyDrive/SC1015 Mini Project/datasets/human development index.csv")
print(HDI.columns)
print("\n")

# Keep only the columns we need: 'Country' and the years 2015 to 2019
HDI = HDI[['Country', '2015','2016', '2017', '2018', '2019']]

# Create one DataFrame per year, indexed by country, with a single 'HDI' column
HDI_2015 = HDI[['Country', '2015']].set_index('Country').rename(columns={'2015': 'HDI'}).sort_index()

HDI_2016 = HDI[['Country', '2016']].set_index('Country').rename(columns={'2016': 'HDI'}).sort_index()

HDI_2017 = HDI[['Country', '2017']].set_index('Country').rename(columns={'2017': 'HDI'}).sort_index()

HDI_2018 = HDI[['Country', '2018']].set_index('Country').rename(columns={'2018': 'HDI'}).sort_index()

HDI_2019 = HDI[['Country', '2019']].set_index('Country').rename(columns={'2019': 'HDI'}).sort_index()


# For loop to print out data from 2015-2019
for year in range(2015, 2020):

    print(f'Data for year {year}:')
    print(locals()[f'HDI_{year}'].head())
    print('\n')



Index(['HDI Rank', 'Country', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019'],
      dtype='object')


Data for year 2015:
               HDI
Country           
Afghanistan    0.5
Albania      0.788
Algeria       0.74
Andorra      0.862
Angola       0.572


Data for year 2016:
               HDI
Country           
Afghanistan  0.502
Albania      0.788
Algeria      0.743
Andorra      0.866
Angola       0.578


Data for year 2017:
               HDI
Country           
Afghanistan  0.506
Albania      0.790
Algeria      0.745
Andorra      0.863
Angola       0.582


Data for year 2018:
               HDI
Country           
Afghanistan  0.509
Albania      0.792
Algeria      0.746
Andorra      0.867
Angola       0.582


Data for year 2019:
               HDI
Country           
Afghan
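
In the 2015 preview the values print as `0.5` and `0.74` rather than with a uniform number of decimals, which may mean that this column was read as text instead of floats. The cause is not visible in the output; a non-numeric placeholder for missing values is one possibility, but that is an assumption. The sketch below uses invented values and a hypothetical `..` placeholder to show how `pd.to_numeric` could convert such a column.

```python
import pandas as pd

# Invented values; '..' stands in for a hypothetical missing-value placeholder
toy_hdi = pd.DataFrame(
    {'HDI': ['0.5', '0.788', '..']},
    index=pd.Index(['A-land', 'B-land', 'C-land'], name='Country'),
)

# The column holds strings at this point, not floats
print(toy_hdi['HDI'].dtype)

# Convert to floats; anything that cannot be parsed becomes NaN
toy_hdi['HDI'] = pd.to_numeric(toy_hdi['HDI'], errors='coerce')

print(toy_hdi['HDI'].dtype)
print(toy_hdi)
```

Checking `HDI.dtypes` on the real data before modelling would show whether any year needs this conversion.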

__After filtering for 2015 to 2019, we also want the three datasets to cover the same countries. We use the 'Country' column of the Happiness Index dataset as the benchmark and keep only the GDP and HDI rows whose country appears in it.__

In [102]:
# sorted_2015 is already sorted by country; its 'Country' column is the benchmark
country_benchmark = sorted_2015['Country']

# Happiness Index (country names are in the 'Country' column, not in the index)
HI_2015 = sorted_2015[sorted_2015['Country'].isin(country_benchmark)]
HI_2016 = sorted_2016[sorted_2016['Country'].isin(country_benchmark)]
HI_2017 = sorted_2017[sorted_2017['Country'].isin(country_benchmark)]
HI_2018 = sorted_2018[sorted_2018['Country'].isin(country_benchmark)]
HI_2019 = sorted_2019[sorted_2019['Country'].isin(country_benchmark)]

# GDP
GDP_2015 = data_2015[data_2015.index.isin(country_benchmark)]
GDP_2016 = data_2016[data_2016.index.isin(country_benchmark)]
GDP_2017 = data_2017[data_2017.index.isin(country_benchmark)]
GDP_2018 = data_2018[data_2018.index.isin(country_benchmark)]
GDP_2019 = data_2019[data_2019.index.isin(country_benchmark)]

# HDI
HDI_2015 = HDI_2015[HDI_2015.index.isin(country_benchmark)]
HDI_2016 = HDI_2016[HDI_2016.index.isin(country_benchmark)]
HDI_2017 = HDI_2017[HDI_2017.index.isin(country_benchmark)]
HDI_2018 = HDI_2018[HDI_2018.index.isin(country_benchmark)]
HDI_2019 = HDI_2019[HDI_2019.index.isin(country_benchmark)]
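
Whether `isin` should be applied to the index or to a column depends on where the country names live in each table. The following sketch, built on invented data, contrasts the two cases and also lists benchmark countries that have no match, which can happen when sources spell a country differently. All names (`toy_happiness`, `toy_gdp`, `benchmark`) are illustrative.

```python
import pandas as pd

# Invented data: the happiness table keeps countries in a column,
# the GDP table keeps them in the index
toy_happiness = pd.DataFrame({
    'Country': ['Chile', 'Laos', 'Norway'],
    'Happiness Score': [6.7, 4.9, 7.5],
})
toy_gdp = pd.DataFrame(
    {'GDP': [1.0, 2.0, 3.0]},
    index=pd.Index(['Chile', 'Norway', 'World'], name='Country Name'),
)

benchmark = toy_happiness['Country']

# Countries in the index: test the index
gdp_kept = toy_gdp[toy_gdp.index.isin(benchmark)]

# Countries in a column: test the column
happiness_kept = toy_happiness[toy_happiness['Country'].isin(benchmark)]

# The default integer index contains no country names, so nothing matches
index_matches = int(toy_happiness.index.isin(benchmark).sum())

# Benchmark countries without a GDP row
missing_from_gdp = sorted(set(benchmark) - set(toy_gdp.index))

print(gdp_kept)
print('Missing from GDP table:', missing_from_gdp)
```

Listing the unmatched countries before EDA makes it easier to decide whether to rename them in one source or drop them from all three.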


## EDA (Exploratory Data Analysis)

__The datasets for Happiness Index, GDP and HDI have been cleaned and are now ready to be used for EDA (Exploratory data analysis).__