# Introduction

## [World happiness report](https://worldhappiness.report/ed/2020/)

The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. The World Happiness Report 2020 ranks cities around the world by their subjective well-being for the first time, and digs more deeply into how the social, urban and natural environments combine to affect our happiness. I urge you to check out their report if you haven't already!

## Data

The data is available on Kaggle [here](https://www.kaggle.com/mathurinache/world-happiness-report).

## This notebook

In this notebook we explore what makes the citizens of this planet happy. We also try predicting the happiness score of hypothetical countries not mentioned in this dataset. These predictions come from several regression models whose hyper-parameters are tuned with a grid search.

## Intended viewership

I invite data enthusiasts from all regions of the world to tell me something about their countries which is not reflected in the datasets.

# Table of Contents

1. [Initialisation](#Initialisation)
  * [Libraries](#Libraries)
  * [Data loading](#Data_loading)
2. [Exploratory data analysis](#Exploratory_data_analysis)
  * [Data preparation](#Data_preparation)
  * [Visualization](#Visualization)
3. [Model development](#Model_development)
  * [Model testing](#Model_testing)  
  * [Benchmarking models](#Benchmarking_models)  
  * [Feature importance](#Feature_importance)
4. [Conclusion](#Conclusion)  

# 1. Initialisation

## Libraries

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)

from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')
from cycler import cycler # for cycling through colors in a graph

from scipy.stats import skew, kurtosis

from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import svm
from sklearn import neighbors as neigh
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import metrics
import warnings; warnings.filterwarnings('ignore')

## Data_loading

In [None]:
first_yr = pd.read_csv('../input/world-happiness-report/2015.csv')
second_yr = pd.read_csv('../input/world-happiness-report/2016.csv')
third_yr = pd.read_csv('../input/world-happiness-report/2017.csv')
fourth_yr = pd.read_csv('../input/world-happiness-report/2018.csv')
fifth_yr = pd.read_csv('../input/world-happiness-report/2019.csv')
sixth_yr = pd.read_csv('../input/world-happiness-report/2020.csv')

# 2. Exploratory_data_analysis

## Data_preparation

In [None]:
#Sorting data by happiness ranks and dropping the happiness rank columns
first_yr.sort_values('Happiness Rank', inplace=True)
first_yr.drop('Happiness Rank', inplace=True, axis=1)
second_yr.sort_values('Happiness Rank', inplace=True)
second_yr['Standard Error'] = (second_yr['Upper Confidence Interval'] - second_yr['Lower Confidence Interval'])/2
second_yr.drop(['Happiness Rank', 'Upper Confidence Interval', 'Lower Confidence Interval'], inplace=True, axis=1)
third_yr.sort_values('Happiness.Rank', inplace=True)
third_yr['Standard Error'] = (third_yr['Whisker.high'] - third_yr['Whisker.low'])/2
third_yr.drop(['Happiness.Rank', 'Whisker.high', 'Whisker.low'], inplace=True, axis=1)
fourth_yr.sort_values('Overall rank', inplace=True)
fourth_yr.drop('Overall rank', inplace=True, axis=1)
fifth_yr.sort_values('Overall rank', inplace=True)
fifth_yr.drop('Overall rank', inplace=True, axis=1)

#Changing column names for consistency
fourth_yr.rename(columns={'Country or region':'Country'}, inplace=True)
fifth_yr.rename(columns={'Country or region':'Country'}, inplace=True)
sixth_yr.rename(columns={'Country name':'Country', 'Regional indicator':'Region'}, inplace=True)

#Changing conflicting region names
second_yr.Region.replace({'East Asia':'Eastern Asia', 'South Asia':'Southern Asia', 'Southeast Asia':'Southeastern Asia', 'Middle East and North Africa':'Middle East and Northern Africa'}, inplace=True)
first_yr.Region.replace({'East Asia':'Eastern Asia', 'South Asia':'Southern Asia', 'Southeast Asia':'Southeastern Asia', 'Middle East and North Africa':'Middle East and Northern Africa'}, inplace=True)
sixth_yr.Region.replace({'East Asia':'Eastern Asia', 'South Asia':'Southern Asia', 'Southeast Asia':'Southeastern Asia', 'Middle East and North Africa':'Middle East and Northern Africa'}, inplace=True)


#Adding region names in 2017,2018, 2019 where they were missing
regions = pd.concat([second_yr[['Country','Region']], first_yr[['Country', 'Region']], sixth_yr[['Country','Region']]])
regions = regions.drop_duplicates(subset='Country', keep='first') #drop_duplicates returns a new frame, so the result must be assigned
third_yr = third_yr.join(regions.set_index('Country'), on='Country')
fourth_yr = fourth_yr.join(regions.set_index('Country'), on='Country')
fifth_yr = fifth_yr.join(regions.set_index('Country'), on='Country')

#Dropping duplicate rows
third_yr.drop_duplicates(subset='Country', inplace=True, keep='first')
fourth_yr.drop_duplicates(subset='Country', inplace=True, keep='first')
fifth_yr.drop_duplicates(subset='Country', inplace=True, keep='first')


#Filling missing Region values
third_yr['Region'] = third_yr['Region'].fillna('Southeastern Asia')
#Using .loc with a boolean mask, since .iloc does not accept a boolean Series
fourth_yr.loc[fourth_yr.Country == 'Trinidad & Tobago', 'Region'] = 'Latin America and Caribbean'
fourth_yr.loc[fourth_yr.Country == 'Northern Cyprus', 'Region'] = 'Middle East and Northern Africa'
fifth_yr.loc[fifth_yr.Country == 'Trinidad & Tobago', 'Region'] = 'Latin America and Caribbean'
fifth_yr.loc[fifth_yr.Country == 'Northern Cyprus', 'Region'] = 'Middle East and Northern Africa'
fifth_yr.loc[fifth_yr.Country == 'North Macedonia', 'Region'] = 'Western Europe'

#Filling the only missing value remaining
fourth_yr['Perceptions of corruption'] = fourth_yr['Perceptions of corruption'].bfill()
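
The region back-fill used in the cell above can be tried on toy data first. This is only a minimal sketch: the frames `with_region` and `without_region`, the country "Atlantis" and all scores are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-ins (made-up values): one year that has regions, one that lacks them
with_region = pd.DataFrame({
    "Country": ["Finland", "Denmark", "Finland"],
    "Region": ["Western Europe", "Western Europe", "Western Europe"],
})
without_region = pd.DataFrame({
    "Country": ["Finland", "Denmark", "Atlantis"],  # "Atlantis" is hypothetical
    "Score": [7.0, 6.9, 5.0],                       # illustrative numbers only
})

# drop_duplicates returns a new frame, so the result has to be kept
lookup = with_region.drop_duplicates(subset="Country", keep="first")

# Left join on the country name; countries without a match get NaN
filled = without_region.join(lookup.set_index("Country"), on="Country")
print(filled)
```

Countries that still show `NaN` after the join are the ones that need to be filled in by hand, as done for Trinidad & Tobago and Northern Cyprus.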

Storing the rank data in a column and setting the index to the country names

In [None]:
#Aggregating ranks of countries over various years
#Each frame is sorted by rank and keeps its 0-based row index, so rank = index + 1
first_yr["Rank 2015"]  = first_yr.index + 1
second_yr["Rank 2016"] = second_yr.index + 1
third_yr["Rank 2017"]  = third_yr.index + 1
fourth_yr["Rank 2018"] = fourth_yr.index + 1
fifth_yr["Rank 2019"]  = fifth_yr.index + 1
sixth_yr["Rank 2020"]  = sixth_yr.index + 1

#Setting the index to country names
first_yr.set_index('Country', inplace=True)
second_yr.set_index('Country', inplace=True)
third_yr.set_index('Country', inplace=True)
fourth_yr.set_index('Country', inplace=True)
fifth_yr.set_index('Country', inplace=True)
sixth_yr.set_index('Country', inplace=True)

Making a Rank dataframe which stores ranks of happiness index across the years 2015-2020

In [None]:
Rank = pd.concat([first_yr["Rank 2015"], second_yr["Rank 2016"], third_yr["Rank 2017"], fourth_yr["Rank 2018"], fifth_yr["Rank 2019"], sixth_yr["Rank 2020"]], axis=1)
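
To see what this concatenation does, here is a small sketch with placeholder countries "A", "B" and "C" and made-up ranks. With `axis=1`, pandas aligns the series on their index, so a country missing from one year gets `NaN` for that year rather than shifting the other rows.

```python
import pandas as pd

# Placeholder countries and ranks, purely illustrative
rank_y1 = pd.Series([1, 2], index=["A", "B"], name="Rank year 1")
rank_y2 = pd.Series([1, 3], index=["A", "C"], name="Rank year 2")

# Column-wise concat aligns on the country index (outer join by default)
wide = pd.concat([rank_y1, rank_y2], axis=1)
print(wide)
```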

## Visualization

In [None]:
plt.rc('axes', prop_cycle=(cycler('color', [(0,0,1),(0,1,0),(1,0,0),(0,1,1),(1,0.6,0),(0.3,0.5,1),(0,0.5,0.5),(0.3,0.3,0),(0.3,0.3,0.5),(0,0,0),(0.7,0.7,0.2)])))

for i in np.arange(0,10,1): #first 10 rows, matching the "Top 10" in the title
    Rank.iloc[i].plot(figsize=(15,8))
    plt.text(2.7,Rank.iloc[i,3], s=Rank.index[i])

plt.title('Top 10 happiest countries during 2015-2020', fontsize=28)

### For further analysis we are only going to look at the dataset of 2020

In [None]:
#Dropping columns which just numerically add up to the happiness index
dataset = sixth_yr.drop(['Explained by: Log GDP per capita', 'Explained by: Social support',
       'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices',
       'Explained by: Generosity', 'Explained by: Perceptions of corruption',
       'Dystopia + residual','upperwhisker','lowerwhisker','Standard error of ladder score'],axis=1)

#Renaming columns
dataset.rename(columns={'Ladder score':'Happiness index'}, inplace=True)

#Score of Dystopia is same for all countries so let's drop it
dataset.drop(['Ladder score in Dystopia'], axis=1, inplace=True)

#Dropping the rank column as it does not contribute to the happiness index score, it is an outcome
dataset.drop(['Rank 2020'], axis=1, inplace=True)

plt.figure(figsize=(15,8))
sns.kdeplot(dataset["Happiness index"], fill=True) #fill replaces the deprecated shade argument
plt.title("Target variable distribution")
plt.text(6.6,0.3, s="Skew")
plt.text(7.2,0.3,s=np.round(skew(dataset["Happiness index"]), 2))
plt.text(6.6,0.28, s="Kurtosis")
plt.text(7.2,0.28,s=np.round(kurtosis(dataset["Happiness index"]),2))

In [None]:
#Normalising the target variable
dataset["Happiness index"] = (dataset["Happiness index"]-dataset["Happiness index"].mean())/dataset["Happiness index"].std()
plt.figure(figsize=(15,8))
sns.kdeplot(dataset["Happiness index"], fill=True, color='seagreen') #fill replaces the deprecated shade argument
plt.title("Normalised target variable distribution")
plt.text(1.5,0.3, s="Skew")
plt.text(2,0.3,s=np.round(skew(dataset["Happiness index"]), 2))
plt.text(1.5,0.28, s="Kurtosis")
plt.text(2,0.28,s=np.round(kurtosis(dataset["Happiness index"]),2))

Hence we can conclude that our target variable roughly follows a normal distribution, judging by its skew and kurtosis.
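
One thing worth keeping in mind when comparing the two plots: subtracting the mean and dividing by the standard deviation only shifts and rescales the variable, so the skew and kurtosis stay the same. A minimal sketch on synthetic data (the gamma sample is an arbitrary choice, not the happiness data):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
raw = rng.gamma(shape=2.0, scale=1.5, size=1000)  # synthetic, skewed sample

# z-score standardisation; ddof=1 matches the default of pandas' .std()
z = (raw - raw.mean()) / raw.std(ddof=1)

print("skew:    ", skew(raw), skew(z))
print("kurtosis:", kurtosis(raw), kurtosis(z))
```

So standardising puts the target on a mean-0, unit-variance scale, but it does not make a distribution any more or less normal.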

Let us see which factors have maximum correlation with happiness in 2020

In [None]:
#Correlation of each factor with "Happiness index" (numeric_only skips the text column Region)
dataset.corr(numeric_only=True)["Happiness index"].drop("Happiness index").to_frame().style.background_gradient(cmap="RdBu")

As is evident from the correlation matrix, the factors are moderately to highly correlated with the "Happiness index".
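
The same kind of ranking can be reproduced on a toy frame. This is a sketch under assumptions: the column names `target`, `strong`, `weak` and `label` are invented, the data is random, and `numeric_only=True` needs pandas 1.5 or newer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "target": x,
    "strong": x + rng.normal(scale=0.3, size=200),  # built to track the target
    "weak": rng.normal(size=200),                   # independent noise
    "label": ["a", "b"] * 100,                      # non-numeric column
})

# Pearson correlation of every numeric column with the target, strongest first
corr_with_target = (
    toy.corr(numeric_only=True)["target"]
    .drop("target")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Selecting the target column by name avoids relying on it being the first column of the frame.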

### Let us visualize the relationship between each feature and the "Happiness index"

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(21,25))

sns.regplot(x=dataset["Happiness index"], y=dataset["Logged GDP per capita"], color='purple',ax=axes[0,0])
sns.regplot(x=dataset["Happiness index"], y=dataset["Social support"], color='magenta',ax=axes[0,1], order=5)
sns.regplot(x=dataset["Happiness index"], y=dataset["Healthy life expectancy"], color='red',ax=axes[1,0])
sns.regplot(x=dataset["Happiness index"], y=dataset["Freedom to make life choices"], color='maroon',ax=axes[1,1])
sns.regplot(x=dataset["Happiness index"], y=dataset["Generosity"], color='seagreen',ax=axes[2,0],  order=3)
sns.regplot(x=dataset["Happiness index"], y=dataset["Perceptions of corruption"], color='green',ax=axes[2,1], order=5)


axes[0,0].set_title('Regression plot of GDP per capita vs Happiness index')
axes[0,1].set_title('Regression plot of Social support vs Happiness index')
axes[1,0].set_title('Regression plot of Healthy life expectancy vs Happiness index')
axes[1,1].set_title('Regression plot of Freedom to make life choices vs Happiness index')
axes[2,0].set_title('Regression plot of Generosity vs Happiness index')
axes[2,1].set_title('Regression plot of Perceptions of corruption vs Happiness index')

plt.suptitle('Regression plots of features with the target variable', fontsize=28)

In [None]:
plt.figure(figsize=(15,8))
#sns.countplot(dataset["Region"], orient="v")
sns.boxplot(y='Region', x="Happiness index", data=dataset)
plt.title("Region wise variance in happiness index")

# 3. Model_development

In [None]:
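#Encoding the regions
dax=dataset #Note: dax is an alias of dataset (not a copy), so it also receives the encoded regions used for feature importance analysis later
region_dummies = pd.get_dummies(dataset.Region) #dummy columns come back in sorted order
dataset[region_dummies.columns] = region_dummies #name each dummy column after the region it encodes
dataset.drop(['Region'], axis=1, inplace=True)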
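
A detail that is easy to miss when one-hot encoding: `pd.get_dummies` returns its columns in sorted order, while `Series.unique()` returns values in order of first appearance. If dummy columns are labelled by position, the two orders need to agree. A short sketch with two example region names:

```python
import pandas as pd

regions = pd.Series(["Western Europe", "East Asia", "Western Europe"], name="Region")

appearance_order = list(regions.unique())  # order of first appearance
dummies = pd.get_dummies(regions)
dummy_order = list(dummies.columns)        # sorted order

print(appearance_order)
print(dummy_order)
```

Taking the column names directly from the dummy frame sidesteps the problem.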

### Splitting the data into train and test sets

In [None]:
x_tr, x_te, y_tr, y_te = train_test_split(dataset.drop(['Happiness index'], axis=1), dataset["Happiness index"])
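
Since `train_test_split` shuffles the rows, the scores further down will vary a little from run to run. If reproducible numbers are wanted, a seed can be fixed. The sketch below uses toy arrays, and `random_state=0` is an arbitrary choice, not something the notebook prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # toy feature matrix
y = np.arange(20)                 # toy target

# Default test_size is 0.25; the same random_state gives the same split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, random_state=0)

print(len(X_tr), len(X_te))
```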

## Model_testing

We are going to try fitting the following regression models with GridSearchCV:

1. Linear Regression
2. SVM
3. Decision Trees
4. Random Forest
5. XGBoost
6. CatBoost

There could be other models which provide better performance; let me know if your model fits better!
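
All the models below follow the same pattern: wrap an estimator in `GridSearchCV`, fit it on the training split, then score the refitted best model on the held-out split. Here is a minimal sketch of that pattern on synthetic data; the coefficients, noise level and `alpha` grid are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 5-fold cross-validation over a small, illustrative alpha grid
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
test_r2 = search.score(X_te, y_te)  # R^2 of the refitted best model
print(test_r2)
```

The cross-validation only ever sees the training split, so the test score stays an honest estimate.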

### Linear Regression with/without penalization

In [None]:
lm = LinearRegression()
lm.fit(x_tr, y_tr)
lm_score = lm.score(x_te, y_te)

#Let us try with lasso regression

las = GridSearchCV(Lasso(), param_grid={"alpha":np.arange(0,10,0.1)}, cv=5, verbose=0)
las.fit(x_tr, y_tr)
las_score = las.score(x_te,y_te)

#Let us try with ridge regression
rid = GridSearchCV(Ridge(), param_grid={"alpha":np.arange(0,10,0.1),'solver':("auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga")}, cv=5, verbose=0)
rid.fit(x_tr, y_tr)
rid_score = rid.score(x_te,y_te)

#Let us try with ElasticNet
ela = GridSearchCV(ElasticNet(), param_grid={"alpha":np.arange(0,10,0.1),"l1_ratio":np.linspace(0,1,11)}, cv=5, verbose=0) #l1_ratio must lie between 0 and 1
ela.fit(x_tr, y_tr)
ela_score = ela.score(x_te,y_te)

### Support Vector Machine

In [None]:
sv = GridSearchCV(svm.SVR(),cv=5,param_grid={"kernel":["linear", "poly", "rbf", "sigmoid"], "degree":np.arange(1,10,1)}, verbose=0)
sv.fit(x_tr,y_tr)
svm_score = sv.score(x_te,y_te)

### Decision Trees

In [None]:
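#"mse" and "mae" are called "squared_error" and "absolute_error" since scikit-learn 1.0; "poisson" is left out because the standardised target has negative values
tr = GridSearchCV(DecisionTreeRegressor(),cv=5,param_grid={"criterion":["squared_error", "friedman_mse", "absolute_error"], "max_depth":np.arange(1,16,1)}, verbose=0)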
tr.fit(x_tr,y_tr)
dt_score = tr.score(x_te,y_te)

### Random Forest

In [None]:
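#"mse" and "mae" are called "squared_error" and "absolute_error" since scikit-learn 1.0
rf = GridSearchCV(RandomForestRegressor(verbose=0), cv=5, param_grid={"n_estimators":[1000], "criterion":["squared_error","absolute_error"]}, verbose=0)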
rf.fit(x_tr,y_tr)
rf_score = rf.score(x_te,y_te)

### XGBoost

In [None]:
from xgboost import XGBRegressor
xgb = GridSearchCV(XGBRegressor(), cv=5, param_grid={"booster":["gbtree"],"eta":[0.1],"max_depth":np.arange(16)}, verbose=0)
xgb.fit(x_tr,y_tr)
xgb_score = xgb.score(x_te,y_te)

### CatBoost

In [None]:
from catboost import CatBoostRegressor
ctb=GridSearchCV(CatBoostRegressor(verbose=0), cv=5,param_grid={"max_depth":np.arange(6,16,1) }, verbose=0)
ctb.fit(x_tr,y_tr)
ctb_score = ctb.score(x_te,y_te)

## Benchmarking_models

In [None]:
scores = pd.DataFrame([lm_score, las_score, rid_score, ela_score, svm_score, dt_score, rf_score, xgb_score, ctb_score])
scores.index=['Linear Reg','Lasso','Ridge','ElasticNet','SVM','Decision trees','Random Forest','XG Boost','CatBoost']
scores.columns=['Score']
scores.sort_values('Score', ascending=False)
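
The "Score" in this table is what `.score()` returns for a scikit-learn regressor, which is the coefficient of determination R² on the data passed in. A quick sketch on synthetic data to confirm that it matches `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

# Fit on the first 80 rows, evaluate on the last 20
model = LinearRegression().fit(X[:80], y[:80])

via_score = model.score(X[80:], y[80:])
via_metric = r2_score(y[80:], model.predict(X[80:]))
print(via_score, via_metric)
```

A score of 1 means perfect predictions, and the score can drop below 0 when a model does worse than always predicting the mean.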

## Feature_importance

In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_regression
features = dax.drop(['Happiness index'],axis=1) #feature matrix without the target
fs = SelectKBest(score_func=mutual_info_regression, k="all")
fs.fit(features, dax["Happiness index"])
X_n = fs.transform(features)
#Pairing every score with the feature it was computed for
score = pd.DataFrame({"feature":features.columns, "scores":fs.scores_})
score = score.sort_values("scores", ascending=False)
sns.barplot(x=score.scores, y=score.feature)
plt.title("Importance of features")
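
Mutual information scores come back as a plain array with one entry per feature column, in column order, so each score has to be labelled with the columns of the feature matrix rather than of the full frame. A small sketch on synthetic data, where the column names `informative` and `noise` are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "informative": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
toy["target"] = 2 * toy["informative"] + rng.normal(scale=0.1, size=300)

X = toy.drop(columns="target")
mi = mutual_info_regression(X, toy["target"], random_state=0)

# Index the scores by the feature columns they were computed for
ranking = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(ranking)
```

Mutual information measures how much a feature tells us about the target; it does not by itself say that the feature causes the target to change.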

# 4. Conclusion

According to our trials, the CatBoost model had the best score. According to the feature importances, GDP per capita, followed by social support, is most strongly associated with the happiness of people in a country.