# Introduction

## [World happiness report](https://worldhappiness.report/ed/2020/)

The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. The World Happiness Report 2020 ranks cities around the world by their subjective well-being for the first time, and digs more deeply into how the social, urban and natural environments combine to affect our happiness. I urge you to check out the report if you haven't already!

## Data

The data is available on Kaggle [here](https://www.kaggle.com/mathurinache/world-happiness-report).

## This notebook

In this notebook we explore what makes the citizens of this planet happy. We also try to predict the happiness score of hypothetical countries that are not in this dataset, using several regression models whose hyper-parameters are tuned.
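To make the idea of hyper-parameter tuning concrete, here is a minimal sketch using `GridSearchCV` with a `Ridge` regressor. The data is synthetic and the parameter grid is an assumption chosen for illustration; it is not the grid or the model selection used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the happiness data (assumption): 150 "countries", 4 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = 5.0 + X @ np.array([0.8, 0.5, 0.3, 0.0]) + rng.normal(scale=0.2, size=150)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Try several regularisation strengths with 5-fold cross-validation
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid=param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

# The refitted best model is then evaluated on data it has never seen
best_alpha = search.best_params_["alpha"]
test_r2 = search.score(X_test, y_test)
print(best_alpha, round(test_r2, 3))
```

The same pattern applies to the other regressors imported below: only the estimator and the parameter grid change.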

## Intended viewership

I invite data enthusiasts from all regions of the world to tell me something about their countries that is not reflected in the datasets.

# Table of Contents

1. [Initialisation](#Initialisation)
  * [Libraries](#Libraries)
  * [Data loading](#Data_loading)
2. [Exploratory data analysis](#Exploratory_data_analysis)
  * [Data preparation](#Data_preparation)
  * [Visualization](#Visualization)
3. [Model development](#Model_development)
  * [Model testing](#Model_testing)  
  * [Benchmarking models](#Benchmarking_models)  
  * [Feature importance](#Feature_importance)
4. [Conclusion](#Conclusion)  

# 1. Initialisation

## Libraries

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)

from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')
from cycler import cycler # for cycling through colors in a graph

from scipy.stats import skew, kurtosis

from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import svm
from sklearn import neighbors as neigh
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import metrics
import warnings; warnings.filterwarnings('ignore')

## Data_loading

In [None]:
first_yr = pd.read_csv('../input/world-happiness-report/2015.csv')
second_yr = pd.read_csv('../input/world-happiness-report/2016.csv')
third_yr = pd.read_csv('../input/world-happiness-report/2017.csv')
fourth_yr = pd.read_csv('../input/world-happiness-report/2018.csv')
fifth_yr = pd.read_csv('../input/world-happiness-report/2019.csv')
sixth_yr = pd.read_csv('../input/world-happiness-report/2020.csv')

# 2. Exploratory_data_analysis

## Data_preparation

In [None]:
#Sorting data by happiness ranks and dropping the happiness rank columns
first_yr.sort_values('Happiness Rank', inplace=True)
first_yr.drop('Happiness Rank', inplace=True, axis=1)
second_yr.sort_values('Happiness Rank', inplace=True)
#Half-width of the confidence interval, used here as a proxy for the standard error
second_yr['Standard Error'] = (second_yr['Upper Confidence Interval'] - second_yr['Lower Confidence Interval'])/2
second_yr.drop(['Happiness Rank', 'Upper Confidence Interval', 'Lower Confidence Interval'], inplace=True, axis=1)
third_yr.sort_values('Happiness.Rank', inplace=True)
#Half-width of the whisker interval, used here as a proxy for the standard error
third_yr['Standard Error'] = (third_yr['Whisker.high'] - third_yr['Whisker.low'])/2
third_yr.drop(['Happiness.Rank', 'Whisker.high', 'Whisker.low'], inplace=True, axis=1)
fourth_yr.sort_values('Overall rank', inplace=True)
fourth_yr.drop('Overall rank', inplace=True, axis=1)
fifth_yr.sort_values('Overall rank', inplace=True)
fifth_yr.drop('Overall rank', inplace=True, axis=1)

#Changing column names for consistency
fourth_yr.rename(columns={'Country or region':'Country'}, inplace=True)
fifth_yr.rename(columns={'Country or region':'Country'}, inplace=True)
sixth_yr.columns = ['Country', 'Region', 'Happiness Score',
       'Standard error of ladder score', 'upperwhisker', 'lowerwhisker',
       'Economy (GDP per Capita)', 'Social support', 'Health (Life Expectancy)',
       'Freedom', 'Generosity',
       'Trust (Government Corruption)', 'Dystopia Residual',
       'Explained by: Log GDP per capita', 'Explained by: Social support',
       'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices',
       'Explained by: Generosity', 'Explained by: Perceptions of corruption',
       'Dystopia + residual']

#Changing conflicting region names
second_yr.Region.replace({'East Asia':'Eastern Asia', 'South Asia':'Southern Asia', 'Southeast Asia':'Southeastern Asia', 'Middle East and North Africa':'Middle East and Northern Africa'}, inplace=True)
first_yr.Region.replace({'East Asia':'Eastern Asia', 'South Asia':'Southern Asia', 'Southeast Asia':'Southeastern Asia', 'Middle East and North Africa':'Middle East and Northern Africa'}, inplace=True)
sixth_yr.Region.replace({'East Asia':'Eastern Asia', 'South Asia':'Southern Asia', 'Southeast Asia':'Southeastern Asia', 'Middle East and North Africa':'Middle East and Northern Africa'}, inplace=True)


#Adding region names in 2017,2018, 2019 where they were missing
regions = pd.concat([second_yr[['Country','Region']], first_yr[['Country', 'Region']], sixth_yr[['Country','Region']]])
#drop_duplicates returns a new dataframe, so the result has to be assigned back
regions = regions.drop_duplicates(subset='Country', keep='first')
third_yr = third_yr.join(regions.set_index('Country'), on='Country')
fourth_yr = fourth_yr.join(regions.set_index('Country'), on='Country')
fifth_yr = fifth_yr.join(regions.set_index('Country'), on='Country')

#Dropping duplicate rows
third_yr.drop_duplicates(subset='Country', inplace=True, keep='first')
fourth_yr.drop_duplicates(subset='Country', inplace=True, keep='first')
fifth_yr.drop_duplicates(subset='Country', inplace=True, keep='first')


#Filling missing Region values
third_yr['Region'] = third_yr['Region'].fillna('Southeastern Asia')
#Using .loc with a boolean mask; .iloc does not accept a boolean Series and chained assignment may not write back
fourth_yr.loc[fourth_yr.Country == 'Trinidad & Tobago', 'Region'] = 'Latin America and Caribbean'
fourth_yr.loc[fourth_yr.Country == 'Northern Cyprus', 'Region'] = 'Middle East and Northern Africa'
fifth_yr.loc[fifth_yr.Country == 'Trinidad & Tobago', 'Region'] = 'Latin America and Caribbean'
fifth_yr.loc[fifth_yr.Country == 'Northern Cyprus', 'Region'] = 'Middle East and Northern Africa'
fifth_yr.loc[fifth_yr.Country == 'North Macedonia', 'Region'] = 'Western Europe'

#Filling the only missing value remaining
fourth_yr['Perceptions of corruption'] = fourth_yr['Perceptions of corruption'].bfill()

#Making column names same
third_yr.columns=['Country', 'Happiness Score', 'Economy (GDP per Capita)',
       'Family', 'Health (Life Expectancy)', 'Freedom', 'Generosity',
       'Trust (Government Corruption)', 'Dystopia Residual',
       'Standard Error','Region']
fourth_yr.columns = ['Country', 'Happiness Score', 'Economy (GDP per Capita)','Social support', 
       'Health (Life Expectancy)', 'Freedom', 'Generosity',
       'Trust (Government Corruption)', 'Region']
fifth_yr.columns = ['Country', 'Happiness Score', 'Economy (GDP per Capita)', 'Social support',
       'Health (Life Expectancy)', 'Freedom', 'Generosity',
       'Trust (Government Corruption)',
       'Region']

Storing the rank data in a column and setting the index to the country names

In [None]:
#Aggregating ranks of countries over various years
first_yr["Rank 2015"]  = first_yr.index
second_yr["Rank 2016"] = second_yr.index
third_yr["Rank 2017"]  = third_yr.index
fourth_yr["Rank 2018"] = fourth_yr.index
fifth_yr["Rank 2019"]  = fifth_yr.index
sixth_yr["Rank 2020"]  = sixth_yr.index

#Setting the index to country names
first_yr.set_index('Country', inplace=True)
second_yr.set_index('Country', inplace=True)
third_yr.set_index('Country', inplace=True)
fourth_yr.set_index('Country', inplace=True)
fifth_yr.set_index('Country', inplace=True)
sixth_yr.set_index('Country', inplace=True)

Making a Rank dataframe which stores ranks of happiness index across the years 2015-2020

In [None]:
Rank = pd.concat([first_yr["Rank 2015"], second_yr["Rank 2016"], third_yr["Rank 2017"], fourth_yr["Rank 2018"], fifth_yr["Rank 2019"], sixth_yr["Rank 2020"]], axis=1)

## Visualization

In [None]:
plt.rc('axes', prop_cycle=(cycler('color', [(0.2,0,0),(0,0,0.2),(0,0.2,0),(0,0,0.5),(0.5,0,0),(0,0.5,0),(0,0,0.9),(0,0.9,0),(0,0.5,0.9),(0.9,0.5,0),(0.5,0,0.9),(0.5,0.9,0),(0,0.9,0.5),(0.9,0,0.5),(0,0.5,0.5),(0.5,0.5,0),(0.5,0,0.5)])))

for i in range(10):  #the first 10 rows are the top 10 countries
    Rank.iloc[i].plot(figsize=(15,8))
    plt.text(2.7,Rank.iloc[i,3], s=Rank.index[i])

plt.title('Top 10 happiest countries during 2015-2020', fontsize=28)

In [None]:
#Dropping columns which just numerically add up to the happiness index
dataset = sixth_yr.drop(['Explained by: Log GDP per capita', 'Explained by: Social support',
       'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices',
       'Explained by: Generosity', 'Explained by: Perceptions of corruption',
       'Dystopia + residual','upperwhisker','lowerwhisker','Standard error of ladder score'],axis=1)

#The ladder score column was already renamed to 'Happiness Score' during data preparation, so no renaming is needed here

#Score of Dystopia is same for all countries so let's drop it
dataset.drop(['Dystopia Residual'], axis=1, inplace=True)

#Dropping the rank column as it does not contribute to the happiness index score, it is an outcome
dataset.drop(['Rank 2020'], axis=1, inplace=True)

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(22,8))
#fill=True replaces the deprecated shade=True argument of kdeplot
sns.kdeplot(first_yr["Happiness Score"], fill=True, ax=axes[0,0])
sns.kdeplot(second_yr["Happiness Score"], fill=True, ax=axes[0,1])
sns.kdeplot(third_yr["Happiness Score"], fill=True, ax=axes[0,2])
sns.kdeplot(fourth_yr["Happiness Score"], fill=True, ax=axes[1,0])
sns.kdeplot(fifth_yr["Happiness Score"], fill=True, ax=axes[1,1])
sns.kdeplot(dataset["Happiness Score"], fill=True, ax=axes[1,2])

axes[0,0].set_xlabel("Happiness score 2015")
axes[0,1].set_xlabel("Happiness score 2016")
axes[0,2].set_xlabel("Happiness score 2017")
axes[1,0].set_xlabel("Happiness score 2018")
axes[1,1].set_xlabel("Happiness score 2019")
axes[1,2].set_xlabel("Happiness score 2020")

axes[0,0].axvline(first_yr["Happiness Score"].mean(), color='black')
axes[0,1].axvline(second_yr["Happiness Score"].mean(), color='black')
axes[0,2].axvline(third_yr["Happiness Score"].mean(), color='black')
axes[1,0].axvline(fourth_yr["Happiness Score"].mean(), color='black')
axes[1,1].axvline(fifth_yr["Happiness Score"].mean(), color='black')
axes[1,2].axvline(dataset["Happiness Score"].mean(), color='black')

axes[0,0].text(first_yr["Happiness Score"].mean(),0.3, s=np.round(first_yr["Happiness Score"].mean(),2))
axes[0,1].text(second_yr["Happiness Score"].mean(),0.3, s=np.round(second_yr["Happiness Score"].mean(),2))
axes[0,2].text(third_yr["Happiness Score"].mean(),0.3, s=np.round(third_yr["Happiness Score"].mean(),2))
axes[1,0].text(fourth_yr["Happiness Score"].mean(),0.3, s=np.round(fourth_yr["Happiness Score"].mean(),2))
axes[1,1].text(fifth_yr["Happiness Score"].mean(),0.3, s=np.round(fifth_yr["Happiness Score"].mean(),2))
axes[1,2].text(dataset["Happiness Score"].mean(),0.3, s=np.round(dataset["Happiness Score"].mean(),2))


plt.suptitle("Happiness score distribution across years")

In [None]:
#One-hot encoding the region of every country and dropping the original Region column
first_yr = pd.concat([first_yr.drop(columns='Region'), pd.get_dummies(first_yr['Region'])], axis=1)
second_yr = pd.concat([second_yr.drop(columns='Region'), pd.get_dummies(second_yr['Region'])], axis=1)
third_yr = pd.concat([third_yr.drop(columns='Region'), pd.get_dummies(third_yr['Region'])], axis=1)
fourth_yr = pd.concat([fourth_yr.drop(columns='Region'), pd.get_dummies(fourth_yr['Region'])], axis=1)
fifth_yr = pd.concat([fifth_yr.drop(columns='Region'), pd.get_dummies(fifth_yr['Region'])], axis=1)
dataset = pd.concat([dataset.drop(columns='Region'), pd.get_dummies(dataset['Region'])], axis=1)



from sklearn.feature_selection import SelectKBest, mutual_info_regression
fs1 = SelectKBest(score_func=mutual_info_regression, k='all')
fs2 = SelectKBest(score_func=mutual_info_regression, k='all')
fs3 = SelectKBest(score_func=mutual_info_regression, k='all')
fs4 = SelectKBest(score_func=mutual_info_regression, k='all')
fs5 = SelectKBest(score_func=mutual_info_regression, k='all')
fs6 = SelectKBest(score_func=mutual_info_regression, k='all')

fs1.fit(first_yr.drop(["Happiness Score"],axis=1),first_yr["Happiness Score"])
fs2.fit(second_yr.drop(["Happiness Score"],axis=1),second_yr["Happiness Score"])
fs3.fit(third_yr.drop(["Happiness Score"],axis=1),third_yr["Happiness Score"])
fs4.fit(fourth_yr.drop(["Happiness Score"],axis=1),fourth_yr["Happiness Score"])
fs5.fit(fifth_yr.drop(["Happiness Score"],axis=1),fifth_yr["Happiness Score"])
fs6.fit(dataset.drop(["Happiness Score"],axis=1),dataset["Happiness Score"])

In [None]:
def score_table(df, selector, year):
    #The selector was fitted without the target, so the target is dropped here too;
    #otherwise feature names and scores would be shifted against each other
    features = df.drop(["Happiness Score"], axis=1).columns
    scores = pd.DataFrame({"features": features, "scores " + str(year): selector.scores_})
    #Standard error and rank are outcomes of the survey, not explanatory features
    scores = scores[~scores.features.isin(["Standard Error", "Rank " + str(year)])]
    #Setting index to features
    return scores.set_index('features')

f1s = score_table(first_yr, fs1, 2015)
f2s = score_table(second_yr, fs2, 2016)
f3s = score_table(third_yr, fs3, 2017)
f4s = score_table(fourth_yr, fs4, 2018)
f5s = score_table(fifth_yr, fs5, 2019)
f6s = score_table(dataset, fs6, 2020)

In [None]:
fs_scores=pd.concat([f1s,f2s,f3s,f4s,f5s,f6s],axis=1)

In [None]:
fs_scores = fs_scores[(fs_scores.index!='Dystopia Residual') & (fs_scores.index!='Trust (Government Corruption)')]
#Looping over however many features remain instead of a hard-coded count
for i in range(len(fs_scores)):
    fs_scores.iloc[i].plot(figsize=(15,8))
plt.legend(loc='lower left')  
plt.title('Feature importance across years')

### Feature importance statistically:

In [None]:
#Mean score of every feature across the years, highest first
fs_scores.transpose().describe().loc['mean'].to_frame().sort_values('mean', ascending=False).style.background_gradient(cmap='Reds')

Important features across the years:
* Social support
* Economy (GDP per capita)
* Family
* Health (Life expectancy)
* Being in Southern Asia

Freedom turns out to be much less important than one might expect.
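The scores behind this ranking are mutual information estimates between each feature and the happiness score. As a rough illustration of how such a score behaves, the sketch below compares an informative feature with a pure-noise feature. The data and variable names are synthetic assumptions, not taken from the happiness dataset.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500

# One feature that drives the target and one that is unrelated to it (both synthetic)
informative = rng.normal(size=n)
noise_feature = rng.normal(size=n)
target = 2.0 * informative + rng.normal(scale=0.1, size=n)

X = np.column_stack([informative, noise_feature])

# Higher values mean that knowing the feature tells us more about the target
mi = mutual_info_regression(X, target, random_state=0)
print(dict(zip(["informative", "noise_feature"], mi.round(3))))
```

Mutual information also picks up non-linear dependence, which is one reason to prefer it over a plain correlation coefficient when ranking features.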

# Clustering

In [None]:
second_yr = second_yr.iloc[:,:8]

In [None]:
second_yr

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score as s

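ss=[]

for i in range(2,50):
    #precompute_distances has been removed from recent scikit-learn releases, so it is not passed here
    k=KMeans(n_clusters=i)
    k.fit(second_yr)
    ss.append(s(second_yr,k.labels_))

#Converting to a dataframe only after the loop, so that ss.append keeps working on a list
ss=pd.DataFrame(ss)
ss.plot(figsize=(20,10))
plt.axvline(4)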

In [None]:
ss.sort_values(0,ascending=False)

At index 4 of `ss`, i.e. 6 clusters (index 0 corresponds to 2 clusters), the silhouette score shows its sharpest change.
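For comparison, a common alternative to eyeballing the curve is to pick the number of clusters with the highest silhouette score. The sketch below does this on synthetic, well-separated blobs; the blob centres and the range of `k` are assumptions made for illustration, and this is a different selection rule from the "sharpest change" used above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four clearly separated synthetic clusters (assumption, not the happiness data)
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

# Silhouette score for every candidate number of clusters
scores = {}
for n_clusters in range(2, 9):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    scores[n_clusters] = silhouette_score(X, labels)

# Keep the candidate whose clusters are, on average, the most compact and best separated
best_k = max(scores, key=scores.get)
print(best_k)
```

Storing the scores in a dictionary keyed by the number of clusters avoids the off-by-two bookkeeping between a list index and the cluster count.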

In [None]:
k=KMeans(n_clusters=6)
k.fit(second_yr)
second_yr['labels'] = k.labels_

In [None]:
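#Keeping only the first nine columns; the selection has to be assigned back to take effect
first_yr = first_yr.iloc[:,:9]

ss=[]

for i in range(2,50):
    #precompute_distances has been removed from recent scikit-learn releases, so it is not passed here
    k=KMeans(n_clusters=i)
    k.fit(first_yr)
    ss.append(s(first_yr,k.labels_))

#Converting to a dataframe only after the loop, so that ss.append keeps working on a list
ss=pd.DataFrame(ss)
ss.plot(figsize=(20,10))
plt.axvline(4)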