# Data Wrangling: World Happiness Report (2015 - 2019)

The World Happiness Report is a yearly report that ranks countries based on how happy and satisfied their people are.   
It looks at things like how people feel about their lives, their emotions, and factors like having friends and being able to make choices.   
The report helps us understand what makes people in different countries happy and can be used to make policies to improve people's well-being.

This **data wrangling** notebook covers the following tasks:
- Dropping columns
- Renaming columns
- Finding duplicates 
- Finding missing values
- Deriving new columns 

## INDEX 

1. [Wrangle 2015 dataset](#Wrangle-2015-dataset)
1. [Wrangle 2016 dataset](#Wrangle-2016-dataset)
1. [Wrangle 2017 dataset](#Wrangle-2017-dataset)
1. [Wrangle 2018 dataset](#Wrangling-2018-dataset)
1. [Wrangle 2019 dataset](#Wrangling-2019-dataset)
1. [Export Data](#Export-Data)

The data cleaning process is recorded here. 

In [194]:
#import necessary libraries
import pandas as pd
import numpy as np
import os

In [195]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [196]:
#assigning path and import 2015-2022 csv
path = '/Users/satoruteshima/Documents/CareerFoundry/06 Date Immersion 6/Scripts'
df_2015 = pd.read_csv(os.path.join(path, 'Raw', '2015.csv'), index_col = False)
df_2016 = pd.read_csv(os.path.join(path, 'Raw', '2016.csv'), index_col = False)
df_2017 = pd.read_csv(os.path.join(path, 'Raw', '2017.csv'), index_col = False)
df_2018 = pd.read_csv(os.path.join(path, 'Raw', '2018.csv'), index_col = False)
df_2019 = pd.read_csv(os.path.join(path, 'Raw', '2019.csv'), index_col = False)
df_2020 = pd.read_csv(os.path.join(path, 'Raw', '2020.csv'), index_col = False)
df_2021 = pd.read_csv(os.path.join(path, 'Raw', '2021.csv'), index_col = False)
df_2022 = pd.read_csv(os.path.join(path, 'Raw', '2022.csv'), index_col = False)
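The eight near-identical `read_csv` calls could also be written as a loop. The following is only a minimal sketch, assuming the same `<year>.csv` naming inside the raw folder. The `load_years` helper and the throwaway files used to demonstrate it are hypothetical and not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def load_years(raw_dir, years):
    """Read '<year>.csv' from raw_dir for each year and return {year: DataFrame}."""
    return {
        year: pd.read_csv(os.path.join(raw_dir, f"{year}.csv"), index_col=False)
        for year in years
    }


# Demonstration on throwaway files with made-up rows (not the real report data).
with tempfile.TemporaryDirectory() as raw_dir:
    for year in (2015, 2016):
        toy = pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.1, 6.5]})
        toy.to_csv(os.path.join(raw_dir, f"{year}.csv"), index=False)
    frames = load_years(raw_dir, range(2015, 2017))

print(sorted(frames))        # years that were loaded
print(frames[2015].shape)    # (rows, columns) of one year
```

Keeping the frames in a dictionary keyed by year would make it possible to run the same checks on every year in one loop, at the cost of the explicit `df_2015` ... `df_2022` names used in the rest of this notebook.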

## Wrangle 2015 dataset 

In [197]:
#Find missing values
missing_values = df_2015.isna() 
missing_values.sum()

Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Standard Error                   0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
dtype: int64

In [198]:
#Find duplicates
df_2015[df_2015.duplicated()]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
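The missing-value and duplicate checks are repeated for every year in this notebook. One way to avoid the repetition is a small helper; the sketch below assumes a hypothetical function name `quality_report` and uses made-up toy rows rather than the report data.

```python
import numpy as np
import pandas as pd


def quality_report(df):
    """Return (missing values per column, number of fully duplicated rows)."""
    missing_per_column = df.isna().sum()
    duplicate_rows = int(df.duplicated().sum())
    return missing_per_column, duplicate_rows


# Toy frame: one repeated row ("B", 6.0) and one missing score.
toy = pd.DataFrame(
    {"Country": ["A", "B", "B", "C"], "Score": [7.0, 6.0, 6.0, np.nan]}
)
missing, n_dupes = quality_report(toy)
print(missing)
print(n_dupes)
```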


In [199]:
#Drop Standard Error

df_2015 = df_2015.drop(columns = ['Standard Error'])

In [200]:
#derive a new column 'Year' and fill it with the value 2015
df_2015['Year'] = 2015

In [266]:
#EDA 
df_2015.describe()

| | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 |
| mean | 79.493671 | 5.375734 | 0.846137 | 0.991046 | 0.630259 | 0.428615 | 0.143422 | 0.237296 | 2.098977 | 2015.0 |
| std | 45.754363 | 1.14501 | 0.403121 | 0.272369 | 0.247078 | 0.150693 | 0.120034 | 0.126685 | 0.55355 | 0.0 |
| min | 1.0 | 2.839 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.32858 | 2015.0 |
| 25% | 40.25 | 4.526 | 0.545808 | 0.856823 | 0.439185 | 0.32833 | 0.061675 | 0.150553 | 1.75941 | 2015.0 |
| 50% | 79.5 | 5.2325 | 0.910245 | 1.02951 | 0.696705 | 0.435515 | 0.10722 | 0.21613 | 2.095415 | 2015.0 |
| 75% | 118.75 | 6.24375 | 1.158448 | 1.214405 | 0.811013 | 0.549092 | 0.180255 | 0.309883 | 2.462415 | 2015.0 |
| max | 158.0 | 7.587 | 1.69042 | 1.40223 | 1.02525 | 0.66973 | 0.55191 | 0.79588 | 3.60214 | 2015.0 |


## Wrangle 2016 dataset

In [202]:
#Find missing values
missing_values = df_2016.isna() 
missing_values.sum()

Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Lower Confidence Interval        0
Upper Confidence Interval        0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
dtype: int64

In [203]:
#Find duplicates
df_2016[df_2016.duplicated()]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual


In [204]:
#Drop 'Lower Confidence Interval' and 'Upper Confidence Interval'

df_2016 = df_2016.drop(columns = ['Lower Confidence Interval', 'Upper Confidence Interval'])


In [205]:
#derive a new column 'Year' and fill it with the value 2016
df_2016['Year'] = 2016

In [206]:
df_2016['Year'].value_counts()

2016    157
Name: Year, dtype: int64

In [267]:
#EDA 
df_2016.describe()

| | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 157.0 | 157.0 | 157.0 | 157.0 | 157.0 | 157.0 | 157.0 | 157.0 | 157.0 | 157.0 |
| mean | 78.980892 | 5.382185 | 0.95388 | 0.793621 | 0.557619 | 0.370994 | 0.137624 | 0.242635 | 2.325807 | 2016.0 |
| std | 45.46603 | 1.141674 | 0.412595 | 0.266706 | 0.229349 | 0.145507 | 0.111038 | 0.133756 | 0.54222 | 0.0 |
| min | 1.0 | 2.905 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.81789 | 2016.0 |
| 25% | 40.0 | 4.404 | 0.67024 | 0.64184 | 0.38291 | 0.25748 | 0.06126 | 0.15457 | 2.03171 | 2016.0 |
| 50% | 79.0 | 5.314 | 1.0278 | 0.84142 | 0.59659 | 0.39747 | 0.10547 | 0.22245 | 2.29074 | 2016.0 |
| 75% | 118.0 | 6.269 | 1.27964 | 1.02152 | 0.72993 | 0.48453 | 0.17554 | 0.31185 | 2.66465 | 2016.0 |
| max | 157.0 | 7.526 | 1.82427 | 1.18326 | 0.95277 | 0.60848 | 0.50521 | 0.81971 | 3.83772 | 2016.0 |


## Wrangle 2017 dataset

In [207]:
#Find missing values
missing_values = df_2017.isna() 
missing_values.sum()

Country                          0
Happiness.Rank                   0
Happiness.Score                  0
Whisker.high                     0
Whisker.low                      0
Economy..GDP.per.Capita.         0
Family                           0
Health..Life.Expectancy.         0
Freedom                          0
Generosity                       0
Trust..Government.Corruption.    0
Dystopia.Residual                0
dtype: int64

In [208]:
#Find duplicates
df_2017[df_2017.duplicated()]

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual


In [209]:
#check df_2017.info

df_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        155 non-null    object 
 1   Happiness.Rank                 155 non-null    int64  
 2   Happiness.Score                155 non-null    float64
 3   Whisker.high                   155 non-null    float64
 4   Whisker.low                    155 non-null    float64
 5   Economy..GDP.per.Capita.       155 non-null    float64
 6   Family                         155 non-null    float64
 7   Health..Life.Expectancy.       155 non-null    float64
 8   Freedom                        155 non-null    float64
 9   Generosity                     155 non-null    float64
 10  Trust..Government.Corruption.  155 non-null    float64
 11  Dystopia.Residual              155 non-null    float64
dtypes: float64(10), int64(1), object(1)
memory usage: 

In [210]:
#Rename columns on 2017 dataset 

df_2017 = df_2017.rename(columns = {
    'Happiness.Rank' : 'Happiness Rank',
    'Happiness.Score' : 'Happiness Score',
    'Economy..GDP.per.Capita.' : 'Economy (GDP per Capita)',
    'Health..Life.Expectancy.' : 'Health (Life Expectancy)',
    'Trust..Government.Corruption.' : 'Trust (Government Corruption)',
    'Dystopia.Residual' : 'Dystopia Residual',
})

In [211]:
#Drop 'Whisker.high' and 'Whisker.low'
df_2017 = df_2017.drop(columns = ['Whisker.high', 'Whisker.low'])

In [212]:
#See the region from 2016 dataset 
df_2016['Region'].value_counts()

Sub-Saharan Africa                 38
Central and Eastern Europe         29
Latin America and Caribbean        24
Western Europe                     21
Middle East and Northern Africa    19
Southeastern Asia                   9
Southern Asia                       7
Eastern Asia                        6
North America                       2
Australia and New Zealand           2
Name: Region, dtype: int64

In [213]:
#Putting the countries assigned to each region to list

sub_saharan_africa_list = df_2016.loc[df_2016['Region']== 'Sub-Saharan Africa']['Country'].unique().tolist()
central_and_eastern_europe_list = df_2016.loc[df_2016['Region']== 'Central and Eastern Europe']['Country'].unique().tolist()
latin_america_and_caribbean_list = df_2016.loc[df_2016['Region']== 'Latin America and Caribbean']['Country'].unique().tolist()
western_europe_list = df_2016.loc[df_2016['Region']== 'Western Europe']['Country'].unique().tolist()
middle_east_and_northern_africa_list = df_2016.loc[df_2016['Region']== 'Middle East and Northern Africa']['Country'].unique().tolist()
southeastern_asia_list = df_2016.loc[df_2016['Region']== 'Southeastern Asia']['Country'].unique().tolist()
southern_asia_list = df_2016.loc[df_2016['Region']== 'Southern Asia']['Country'].unique().tolist()
eastern_asia_list = df_2016.loc[df_2016['Region']== 'Eastern Asia']['Country'].unique().tolist()
north_america_list = df_2016.loc[df_2016['Region']== 'North America']['Country'].unique().tolist()
australia_and_new_zealand_list = df_2016.loc[df_2016['Region']== 'Australia and New Zealand']['Country'].unique().tolist()

In [214]:
#derive region from region lists 
df_2017.loc[df_2017['Country'].isin(sub_saharan_africa_list), 'Region'] = 'Sub-Saharan Africa'
df_2017.loc[df_2017['Country'].isin(central_and_eastern_europe_list), 'Region'] = 'Central and Eastern Europe'
df_2017.loc[df_2017['Country'].isin(latin_america_and_caribbean_list), 'Region'] = 'Latin America and Caribbean'
df_2017.loc[df_2017['Country'].isin(western_europe_list), 'Region'] = 'Western Europe'
df_2017.loc[df_2017['Country'].isin(middle_east_and_northern_africa_list), 'Region'] = 'Middle East and Northern Africa'
df_2017.loc[df_2017['Country'].isin(southeastern_asia_list), 'Region'] = 'Southeastern Asia'
df_2017.loc[df_2017['Country'].isin(southern_asia_list), 'Region'] = 'Southern Asia'
df_2017.loc[df_2017['Country'].isin(eastern_asia_list), 'Region'] = 'Eastern Asia'
df_2017.loc[df_2017['Country'].isin(north_america_list), 'Region'] = 'North America'
df_2017.loc[df_2017['Country'].isin(australia_and_new_zealand_list), 'Region'] = 'Australia and New Zealand'
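The ten region lists and ten `.loc` assignments amount to a country-to-region lookup. A possible alternative, sketched here on toy data, builds that lookup once and applies it with `Series.map`. The frames `ref` and `target` are hypothetical stand-ins for `df_2016` and `df_2017`, and this is an illustration rather than the method used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the reference year (with Region) and the target year (without).
ref = pd.DataFrame(
    {
        "Country": ["Denmark", "Japan", "Kenya"],
        "Region": ["Western Europe", "Eastern Asia", "Sub-Saharan Africa"],
    }
)
target = pd.DataFrame({"Country": ["Japan", "Denmark", "Lesotho"]})

# Country -> Region lookup, one row per country, indexed by country name.
region_by_country = ref.drop_duplicates(subset="Country").set_index("Country")["Region"]

# Countries absent from the lookup get NaN, like the unmatched rows
# left behind by the .loc / isin approach.
target["Region"] = target["Country"].map(region_by_country)
print(target)
```

In this toy example 'Lesotho' has no entry in the reference frame, so its region stays missing and still has to be filled in by hand.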


In [215]:
# Specify the countries where Region is Null 
df_2017[df_2017['Region'].isnull()]

| | Country | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Generosity | Trust (Government Corruption) | Dystopia Residual | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | Taiwan Province of China | 33 | 6.422 | 1.433627 | 1.384565 | 0.793984 | 0.361467 | 0.25836 | 0.063829 | 2.126607 | NaN |
| 70 | Hong Kong S.A.R., China | 71 | 5.472 | 1.551675 | 1.262791 | 0.943062 | 0.490969 | 0.374466 | 0.293934 | 0.554633 | NaN |
| 112 | Mozambique | 113 | 4.55 | 0.234306 | 0.870701 | 0.106654 | 0.480791 | 0.322228 | 0.179436 | 2.355651 | NaN |
| 138 | Lesotho | 139 | 3.808 | 0.521021 | 1.190095 | 0.0 | 0.390661 | 0.157497 | 0.119095 | 1.429835 | NaN |
| 154 | Central African Republic | 155 | 2.693 | 0.0 | 0.0 | 0.018773 | 0.270842 | 0.280876 | 0.056565 | 2.066005 | NaN |


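These five countries have no match in the 2016 country lists, so their regions are assigned manually.  
'Taiwan Province of China' and 'Hong Kong S.A.R., China' are both part of 'Eastern Asia'.  
'Mozambique', 'Lesotho' and 'Central African Republic' belong to 'Sub-Saharan Africa'.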

In [216]:
#Assign regions to the remaining countries
df_2017.loc[df_2017['Country'].isin(['Taiwan Province of China', 'Hong Kong S.A.R., China']), 'Region'] = 'Eastern Asia'
df_2017.loc[df_2017['Country'].isin(['Mozambique', 'Lesotho', 'Central African Republic']), 'Region'] = 'Sub-Saharan Africa'
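The manual assignments can also be kept in a single dictionary and applied only where `Region` is still missing. The sketch below is one possible way to do that; the name `REGION_OVERRIDES` is hypothetical, its entries are the five assignments made above, and 'Atlantis' is a made-up country added only to show what an unresolved row looks like.

```python
import numpy as np
import pandas as pd

# Manual country -> region assignments taken from the cell above.
REGION_OVERRIDES = {
    "Taiwan Province of China": "Eastern Asia",
    "Hong Kong S.A.R., China": "Eastern Asia",
    "Mozambique": "Sub-Saharan Africa",
    "Lesotho": "Sub-Saharan Africa",
    "Central African Republic": "Sub-Saharan Africa",
}

# Toy frame: one region already known, three still missing.
df = pd.DataFrame(
    {
        "Country": ["Japan", "Lesotho", "Hong Kong S.A.R., China", "Atlantis"],
        "Region": ["Eastern Asia", np.nan, np.nan, np.nan],
    }
)

# Fill only the missing regions; existing values are left untouched.
df["Region"] = df["Region"].fillna(df["Country"].map(REGION_OVERRIDES))

# Anything still missing needs a new entry in REGION_OVERRIDES.
unresolved = df.loc[df["Region"].isna(), "Country"].tolist()
print(unresolved)
```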

In [217]:
#find Null values 
df_2017['Region'].isnull().sum()

0

In [218]:
#derive a new column 'Year' and fill it with the value 2017
df_2017['Year'] = 2017

In [268]:
#EDA 
df_2017.describe()

| | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Generosity | Trust (Government Corruption) | Dystopia Residual | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 155.0 | 155.0 | 155.0 | 155.0 | 155.0 | 155.0 | 155.0 | 155.0 | 155.0 | 155.0 |
| mean | 78.0 | 5.354019 | 0.984718 | 1.188898 | 0.551341 | 0.408786 | 0.246883 | 0.12312 | 1.850238 | 2017.0 |
| std | 44.888751 | 1.13123 | 0.420793 | 0.287263 | 0.237073 | 0.149997 | 0.13478 | 0.101661 | 0.500028 | 0.0 |
| min | 1.0 | 2.693 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.377914 | 2017.0 |
| 25% | 39.5 | 4.5055 | 0.663371 | 1.042635 | 0.369866 | 0.303677 | 0.154106 | 0.057271 | 1.591291 | 2017.0 |
| 50% | 78.0 | 5.279 | 1.064578 | 1.253918 | 0.606042 | 0.437454 | 0.231538 | 0.089848 | 1.83291 | 2017.0 |
| 75% | 116.5 | 6.1015 | 1.318027 | 1.414316 | 0.723008 | 0.516561 | 0.323762 | 0.153296 | 2.144654 | 2017.0 |
| max | 155.0 | 7.537 | 1.870766 | 1.610574 | 0.949492 | 0.658249 | 0.838075 | 0.464308 | 3.117485 | 2017.0 |


## Wrangling 2018 dataset

In [219]:
#Find missing values
missing_values = df_2018.isna() 
missing_values.sum()

Overall rank                    0
Country or region               0
Score                           0
GDP per capita                  0
Social support                  0
Healthy life expectancy         0
Freedom to make life choices    0
Generosity                      0
Perceptions of corruption       1
dtype: int64

In [220]:
#Find duplicates
df_2018[df_2018.duplicated()]

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption


In [221]:
#check df_2018 info

df_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  156 non-null    int64  
 1   Country or region             156 non-null    object 
 2   Score                         156 non-null    float64
 3   GDP per capita                156 non-null    float64
 4   Social support                156 non-null    float64
 5   Healthy life expectancy       156 non-null    float64
 6   Freedom to make life choices  156 non-null    float64
 7   Generosity                    156 non-null    float64
 8   Perceptions of corruption     155 non-null    float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.1+ KB


In [222]:
#Rename columns on 2018 dataset 

df_2018 = df_2018.rename(columns = {
    'Overall rank' : 'Happiness Rank',
    'Country or region' : 'Country',
    'Score' : 'Happiness Score',
    'Healthy life expectancy' : 'Health (Life Expectancy)',
    'Perceptions of corruption' : 'Trust (Government Corruption)',
    'GDP per capita' : 'Economy (GDP per Capita)',
    'Freedom to make life choices' : 'Freedom',
})
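After renaming, it can be useful to check which column names still differ between years. The sketch below compares the 2017 and 2018 column names as they stand in this notebook after their respective renames. The empty frames `frame_2017` and `frame_2018` are hypothetical placeholders that carry only those names.

```python
import pandas as pd

# Column names after the 2017 renames and Whisker drops (before Region/Year are added).
frame_2017 = pd.DataFrame(
    columns=[
        "Country", "Happiness Rank", "Happiness Score",
        "Economy (GDP per Capita)", "Family", "Health (Life Expectancy)",
        "Freedom", "Generosity", "Trust (Government Corruption)",
        "Dystopia Residual",
    ]
)

# Column names after the 2018 renames (before Region/Year are added).
frame_2018 = pd.DataFrame(
    columns=[
        "Happiness Rank", "Country", "Happiness Score",
        "Economy (GDP per Capita)", "Social support",
        "Health (Life Expectancy)", "Freedom", "Generosity",
        "Trust (Government Corruption)",
    ]
)

# Names present in one year but not in the other.
only_2017 = sorted(set(frame_2017.columns) - set(frame_2018.columns))
only_2018 = sorted(set(frame_2018.columns) - set(frame_2017.columns))
print(only_2017)
print(only_2018)
```

The comparison shows that 'Family' and 'Social support' keep different names across the two years, and that the 2018 file has no 'Dystopia Residual' column. Whether these should be aligned depends on how the cleaned files are combined later.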



In [223]:
#derive region from region lists 
df_2018.loc[df_2018['Country'].isin(sub_saharan_africa_list), 'Region'] = 'Sub-Saharan Africa'
df_2018.loc[df_2018['Country'].isin(central_and_eastern_europe_list), 'Region'] = 'Central and Eastern Europe'
df_2018.loc[df_2018['Country'].isin(latin_america_and_caribbean_list), 'Region'] = 'Latin America and Caribbean'
df_2018.loc[df_2018['Country'].isin(western_europe_list), 'Region'] = 'Western Europe'
df_2018.loc[df_2018['Country'].isin(middle_east_and_northern_africa_list), 'Region'] = 'Middle East and Northern Africa'
df_2018.loc[df_2018['Country'].isin(southeastern_asia_list), 'Region'] = 'Southeastern Asia'
df_2018.loc[df_2018['Country'].isin(southern_asia_list), 'Region'] = 'Southern Asia'
df_2018.loc[df_2018['Country'].isin(eastern_asia_list), 'Region'] = 'Eastern Asia'
df_2018.loc[df_2018['Country'].isin(north_america_list), 'Region'] = 'North America'
df_2018.loc[df_2018['Country'].isin(australia_and_new_zealand_list), 'Region'] = 'Australia and New Zealand'


In [154]:
# Specify the countries where Region is Null 
df_2018[df_2018['Region'].isnull()]

Unnamed: 0,Happiness Rank,Country,Happiness Score,Economy (GDP per Capita),Social support,Health (Life Expectancy),Freedom,Generosity,Trust (Government Corruption),Region


'Trinidad & Tobago' belongs to 'Latin America and Caribbean'.  
The status of 'Northern Cyprus' is disputed; to keep this analysis simple, it is assigned to 'Western Europe'.  
'Mozambique', 'Lesotho' and 'Central African Republic' belong to 'Sub-Saharan Africa'.  

In [224]:
#Assign regions to the remaining countries
df_2018.loc[df_2018['Country'] == 'Trinidad & Tobago', 'Region'] = 'Latin America and Caribbean'
df_2018.loc[df_2018['Country'] == 'Northern Cyprus', 'Region'] = 'Western Europe'
df_2018.loc[df_2018['Country'].isin(['Mozambique', 'Lesotho', 'Central African Republic']), 'Region'] = 'Sub-Saharan Africa'

In [225]:
#find Null values 
df_2018['Region'].isnull().sum()

0

In [226]:
#find a missing value from 'Trust'
df_2018[df_2018['Trust (Government Corruption)'].isnull()]

| | Happiness Rank | Country | Happiness Score | Economy (GDP per Capita) | Social support | Health (Life Expectancy) | Freedom | Generosity | Trust (Government Corruption) | Region |
|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 20 | United Arab Emirates | 6.774 | 2.096 | 0.776 | 0.67 | 0.284 | 0.186 | NaN | Middle East and Northern Africa |


In [227]:
#locate the value from last year
df_2017[df_2017['Country'] == 'United Arab Emirates']


#impute the missing value with the value from last year
df_2018.loc[df_2018['Country'] == 'United Arab Emirates', 'Trust (Government Corruption)'] = 0.32449
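The imputation above types the previous year's value in by hand. A possible alternative is to look the value up programmatically, so that the same code works for any country with a gap. The sketch below assumes hypothetical frames `prev` and `curr` with made-up values, standing in for `df_2017` and `df_2018`.

```python
import numpy as np
import pandas as pd

col = "Trust (Government Corruption)"

# Toy stand-ins for last year's and this year's data (made-up numbers).
prev = pd.DataFrame({"Country": ["Country A", "Country B"], col: [0.30, 0.10]})
curr = pd.DataFrame({"Country": ["Country A", "Country B"], col: [np.nan, 0.12]})

# Last year's value per country, indexed by country name.
prev_lookup = prev.set_index("Country")[col]

# Fill only the missing entries with the matching value from last year.
curr[col] = curr[col].fillna(curr["Country"].map(prev_lookup))
print(curr)
```

Carrying a value forward from the previous year is a simple choice; countries that are also missing from the previous year would remain empty and need separate handling.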


In [228]:
#derive a new column 'Year' and fill it with the value 2018
df_2018['Year'] = 2018

In [269]:
#EDA 
df_2018.describe()

| | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Social support | Health (Life Expectancy) | Freedom | Generosity | Trust (Government Corruption) | Year |
|---|---|---|---|---|---|---|---|---|---|
| count | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 |
| mean | 78.5 | 5.375917 | 0.891449 | 1.213237 | 0.597346 | 0.454506 | 0.181006 | 0.113362 | 2018.0 |
| std | 45.177428 | 1.119506 | 0.391921 | 0.302372 | 0.247579 | 0.162424 | 0.098471 | 0.097673 | 0.0 |
| min | 1.0 | 2.905 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2018.0 |
| 25% | 39.75 | 4.45375 | 0.61625 | 1.06675 | 0.42225 | 0.356 | 0.1095 | 0.051 | 2018.0 |
| 50% | 78.5 | 5.378 | 0.9495 | 1.255 | 0.644 | 0.487 | 0.174 | 0.082 | 2018.0 |
| 75% | 117.25 | 6.1685 | 1.19775 | 1.463 | 0.77725 | 0.5785 | 0.239 | 0.139 | 2018.0 |
| max | 156.0 | 7.632 | 2.096 | 1.644 | 1.03 | 0.724 | 0.598 | 0.457 | 2018.0 |


## Wrangling 2019 dataset

In [229]:
#Find missing values
missing_values = df_2019.isna() 
missing_values.sum()

Overall rank                    0
Country or region               0
Score                           0
GDP per capita                  0
Social support                  0
Healthy life expectancy         0
Freedom to make life choices    0
Generosity                      0
Perceptions of corruption       0
dtype: int64

In [230]:
#Find duplicates
df_2019[df_2019.duplicated()]

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption


In [231]:
#check df_2019 info

df_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  156 non-null    int64  
 1   Country or region             156 non-null    object 
 2   Score                         156 non-null    float64
 3   GDP per capita                156 non-null    float64
 4   Social support                156 non-null    float64
 5   Healthy life expectancy       156 non-null    float64
 6   Freedom to make life choices  156 non-null    float64
 7   Generosity                    156 non-null    float64
 8   Perceptions of corruption     156 non-null    float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.1+ KB


In [232]:
#Rename columns on 2019 dataset 

df_2019 = df_2019.rename(columns = {
    'Overall rank' : 'Happiness Rank',
    'Country or region' : 'Country',
    'Score' : 'Happiness Score',
    'Healthy life expectancy' : 'Health (Life Expectancy)',
    'Perceptions of corruption' : 'Trust (Government Corruption)',
    'GDP per capita' : 'Economy (GDP per Capita)',
    'Freedom to make life choices' : 'Freedom',
})


In [233]:
#derive region from region lists 
df_2019.loc[df_2019['Country'].isin(sub_saharan_africa_list), 'Region'] = 'Sub-Saharan Africa'
df_2019.loc[df_2019['Country'].isin(central_and_eastern_europe_list), 'Region'] = 'Central and Eastern Europe'
df_2019.loc[df_2019['Country'].isin(latin_america_and_caribbean_list), 'Region'] = 'Latin America and Caribbean'
df_2019.loc[df_2019['Country'].isin(western_europe_list), 'Region'] = 'Western Europe'
df_2019.loc[df_2019['Country'].isin(middle_east_and_northern_africa_list), 'Region'] = 'Middle East and Northern Africa'
df_2019.loc[df_2019['Country'].isin(southeastern_asia_list), 'Region'] = 'Southeastern Asia'
df_2019.loc[df_2019['Country'].isin(southern_asia_list), 'Region'] = 'Southern Asia'
df_2019.loc[df_2019['Country'].isin(eastern_asia_list), 'Region'] = 'Eastern Asia'
df_2019.loc[df_2019['Country'].isin(north_america_list), 'Region'] = 'North America'
df_2019.loc[df_2019['Country'].isin(australia_and_new_zealand_list), 'Region'] = 'Australia and New Zealand'


In [239]:
# Specify the countries where Region is Null 
df_2019[df_2019['Region'].isnull()]

| | Happiness Rank | Country | Happiness Score | Economy (GDP per Capita) | Social support | Health (Life Expectancy) | Freedom | Generosity | Trust (Government Corruption) | Region | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 39 | Trinidad & Tobago | 6.192 | 1.231 | 1.477 | 0.713 | 0.489 | 0.185 | 0.016 | NaN | 2019 |
| 63 | 64 | Northern Cyprus | 5.718 | 1.263 | 1.252 | 1.042 | 0.417 | 0.191 | 0.162 | NaN | 2019 |
| 83 | 84 | North Macedonia | 5.274 | 0.983 | 1.294 | 0.838 | 0.345 | 0.185 | 0.034 | NaN | 2019 |
| 119 | 120 | Gambia | 4.516 | 0.308 | 0.939 | 0.428 | 0.382 | 0.269 | 0.167 | NaN | 2019 |
| 134 | 135 | Swaziland | 4.212 | 0.811 | 1.149 | 0.0 | 0.313 | 0.074 | 0.135 | NaN | 2019 |
| 143 | 144 | Lesotho | 3.802 | 0.489 | 1.169 | 0.168 | 0.359 | 0.107 | 0.093 | NaN | 2019 |


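'Trinidad & Tobago' belongs to 'Latin America and Caribbean'  
As in the 2018 dataset, 'Northern Cyprus' is assigned to 'Western Europe' for simplicity  
'North Macedonia' belongs to 'Central and Eastern Europe'  
'Gambia', 'Swaziland', 'Lesotho', 'Mozambique' and 'Central African Republic' are 'Sub-Saharan Africa'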

In [243]:
df_2018.loc[df_2018['Country'] == 'Gambia']

Unnamed: 0,Happiness Rank,Country,Happiness Score,Economy (GDP per Capita),Social support,Health (Life Expectancy),Freedom,Generosity,Trust (Government Corruption),Region,Year


In [247]:
#Assign regions to the remaining countries
df_2019.loc[df_2019['Country'].isin(['Mozambique', 'Central African Republic', 'Gambia', 'Swaziland', 'Lesotho']), 'Region'] = 'Sub-Saharan Africa'
df_2019.loc[df_2019['Country'] == 'North Macedonia', 'Region'] = 'Central and Eastern Europe'
df_2019.loc[df_2019['Country'] == 'Trinidad & Tobago', 'Region'] = 'Latin America and Caribbean'
df_2019.loc[df_2019['Country'] == 'Northern Cyprus', 'Region'] = 'Western Europe'


In [245]:
#find Null values 
df_2019['Region'].isnull().sum()

1

In [249]:
#derive a new column 'Year' and fill it with the value 2019
df_2019['Year'] = 2019

In [271]:
#EDA 
df_2019.describe()

| | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Social support | Health (Life Expectancy) | Freedom | Generosity | Trust (Government Corruption) | Year |
|---|---|---|---|---|---|---|---|---|---|
| count | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 |
| mean | 78.5 | 5.407096 | 0.905147 | 1.208814 | 0.725244 | 0.392571 | 0.184846 | 0.110603 | 2019.0 |
| std | 45.177428 | 1.11312 | 0.398389 | 0.299191 | 0.242124 | 0.143289 | 0.095254 | 0.094538 | 0.0 |
| min | 1.0 | 2.853 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2019.0 |
| 25% | 39.75 | 4.5445 | 0.60275 | 1.05575 | 0.54775 | 0.308 | 0.10875 | 0.047 | 2019.0 |
| 50% | 78.5 | 5.3795 | 0.96 | 1.2715 | 0.789 | 0.417 | 0.1775 | 0.0855 | 2019.0 |
| 75% | 117.25 | 6.1845 | 1.2325 | 1.4525 | 0.88175 | 0.50725 | 0.24825 | 0.14125 | 2019.0 |
| max | 156.0 | 7.769 | 1.684 | 1.624 | 1.141 | 0.631 | 0.566 | 0.453 | 2019.0 |


## Export Data

In [250]:
#export clean files
df_2015.to_csv(os.path.join(path, 'Clean', 'df_2015_clean.csv'))
df_2016.to_csv(os.path.join(path, 'Clean', 'df_2016_clean.csv'))
df_2017.to_csv(os.path.join(path, 'Clean', 'df_2017_clean.csv'))
df_2018.to_csv(os.path.join(path, 'Clean', 'df_2018_clean.csv'))
df_2019.to_csv(os.path.join(path, 'Clean', 'df_2019_clean.csv'))
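By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again with `read_csv`. If that extra column is not wanted, `index=False` leaves it out. The sketch below shows the difference on a made-up toy frame written to a temporary folder; the file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for one of the cleaned yearly frames.
toy = pd.DataFrame({"Country": ["A", "B"], "Year": [2015, 2015]})

with tempfile.TemporaryDirectory() as clean_dir:
    with_index = os.path.join(clean_dir, "with_index.csv")
    without_index = os.path.join(clean_dir, "without_index.csv")

    toy.to_csv(with_index)                  # default: index is written
    toy.to_csv(without_index, index=False)  # index is left out

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```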

### [Back to TOP](#INDEX)