## Introduction

In this notebook, we explore and analyze data from the World Happiness Report. In this dataset, countries around the world are surveyed and ranked by their happiness score. The report considers 6 factors that might affect a country's happiness level: GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption.

The aim of this project is to identify, through data analysis, the factor(s) that most strongly affect a country's happiness score. Note that the aim is not to build a good predictive model of the happiness score, so cross-validation is not strictly needed. We will use the dataset from the 2019 report.

## Importing Dataset

In [None]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df1 = pd.read_csv('../input/world-happiness/2019.csv')
df1 = df1.rename(columns={"Country or region": "Country"})
df1

The countries in this dataset are already ranked by happiness score. We can see that the top 5 happiest countries are all in Europe, while most of the 5 lowest-scoring countries are in Africa.

## Regional Analysis

Next, we will look at the mean happiness score of each region. The region of each country is not included in the 2019 report, but it is included in the 2015 report, so we can obtain it with a left join of the 2019 data onto the 2015 data. Countries in the 2019 report that have no matching name in the 2015 report end up with a missing region.

In [None]:
df2 = pd.read_csv('../input/world-happiness/2015.csv')
# Only the Region column is needed from the 2015 report; a left join keeps every 2019 country
df = pd.merge(df1, df2[['Country', 'Region']], on='Country', how='left')
# Place Region right after Country
cols = ['Overall rank','Country','Region','Score','GDP per capita','Social support','Healthy life expectancy',
        'Freedom to make life choices','Generosity','Perceptions of corruption']
df = df[cols]
df
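
A left join silently leaves `Region` empty for any country whose name is spelled differently in the two reports. The following is a minimal sketch of how such rows could be listed, using the `indicator` option of `pd.merge`. The two small tables and their values are toy stand-ins chosen for illustration, not data taken from the reports.

```python
import pandas as pd

# Toy stand-ins for the 2019 and 2015 tables (illustrative values only)
report_2019 = pd.DataFrame({
    "Country": ["Finland", "North Macedonia", "Rwanda"],
    "Score": [7.8, 5.3, 3.3],
})
report_2015 = pd.DataFrame({
    "Country": ["Finland", "Macedonia", "Rwanda"],
    "Region": ["Western Europe", "Central and Eastern Europe", "Sub-Saharan Africa"],
})

# A left join keeps every 2019 row; indicator=True records where each row matched
merged = pd.merge(report_2019, report_2015, on="Country", how="left", indicator=True)

# Rows tagged "left_only" had no match in the 2015 table, so their Region is NaN
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].tolist()
print(unmatched)
```

Countries listed this way are dropped by `groupby('Region')`, so they do not contribute to any regional mean.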

In [None]:
by_region = df.groupby('Region').agg(Mean_Score=('Score', 'mean')).reset_index()
plt.figure(figsize=(20,10))
sns.barplot(x='Mean_Score', y='Region', data=by_region.sort_values(by='Mean_Score',ascending=False))
plt.title('Mean Happiness Score in Each Region (2019)')

We can see that the Western Europe region is in 3rd place. The higher mean scores of the Australia and New Zealand region and the North America region are due to the fact that each of these regions contains only 2 countries, both with relatively high scores.

In [None]:
df[df['Region']=='Australia and New Zealand']

In [None]:
df[df['Region']=='North America']
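
Since a regional mean based on 2 countries is not directly comparable to one based on dozens, it can help to show the group size next to the mean. Below is a minimal sketch on toy data; the region labels `A` and `B` and the scores are hypothetical.

```python
import pandas as pd

# Toy data (illustrative values, not taken from the report)
toy = pd.DataFrame({
    "Region": ["A", "A", "B", "B", "B", "B"],
    "Score": [7.0, 7.4, 7.5, 6.0, 5.0, 5.5],
})

# Named aggregation: the mean score and the number of countries per region
summary = (
    toy.groupby("Region")
       .agg(Mean_Score=("Score", "mean"), Countries=("Score", "count"))
       .sort_values("Mean_Score", ascending=False)
       .reset_index()
)
print(summary)
```

The same `agg` call should work on `df` from this notebook by grouping on `'Region'` and aggregating `'Score'`.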

## Analysis and Correlation of the Six Factors

Now we will examine the relationship between the happiness score and each of the 6 factors using scatterplots. We also fit each relationship with a linear regression.

In [None]:
from sklearn.linear_model import LinearRegression

x, y = df['GDP per capita'], df['Score']
X = np.array(x).reshape(-1, 1)  # scikit-learn expects a 2-D feature array
plt.figure(figsize=(10, 7))
sns.scatterplot(x=x, y=y)  # recent seaborn versions require keyword arguments
plt.title('Relation between Happiness Score and GDP per Capita')

# Fit a simple linear regression and overlay the fitted line
lr = LinearRegression()
lr.fit(X, y)
plt.plot(x, lr.predict(X), '-', color='C1')
plt.text(1.4, 4.5, 'R-squared = %0.2f' % lr.score(X, y))
plt.show()

In [None]:
x, y = df['Social support'], df['Score']
X = np.array(x).reshape(-1, 1)  # scikit-learn expects a 2-D feature array
plt.figure(figsize=(10, 7))
sns.scatterplot(x=x, y=y)  # recent seaborn versions require keyword arguments
plt.title('Relation between Happiness Score and Social Support')

# Fit a simple linear regression and overlay the fitted line
lr = LinearRegression()
lr.fit(X, y)
plt.plot(x, lr.predict(X), '-', color='C1')
plt.text(0.3, 2.4, 'R-squared = %0.2f' % lr.score(X, y))
plt.show()

In [None]:
x, y = df['Healthy life expectancy'], df['Score']
X = np.array(x).reshape(-1, 1)  # scikit-learn expects a 2-D feature array
plt.figure(figsize=(10, 7))
sns.scatterplot(x=x, y=y)  # recent seaborn versions require keyword arguments
plt.title('Relation between Happiness Score and Healthy Life Expectancy')

# Fit a simple linear regression and overlay the fitted line
lr = LinearRegression()
lr.fit(X, y)
plt.plot(x, lr.predict(X), '-', color='C1')
plt.text(0.9, 4, 'R-squared = %0.2f' % lr.score(X, y))
plt.show()

In [None]:
x, y = df['Freedom to make life choices'], df['Score']
X = np.array(x).reshape(-1, 1)  # scikit-learn expects a 2-D feature array
plt.figure(figsize=(10, 7))
sns.scatterplot(x=x, y=y)  # recent seaborn versions require keyword arguments
plt.title('Relation between Happiness Score and Freedom to Make Life Choices')

# Fit a simple linear regression and overlay the fitted line
lr = LinearRegression()
lr.fit(X, y)
plt.plot(x, lr.predict(X), '-', color='C1')
plt.text(0, 4.2, 'R-squared = %0.2f' % lr.score(X, y))
plt.show()

In [None]:
x, y = df['Generosity'], df['Score']
X = np.array(x).reshape(-1, 1)  # scikit-learn expects a 2-D feature array
plt.figure(figsize=(10, 7))
sns.scatterplot(x=x, y=y)  # recent seaborn versions require keyword arguments
plt.title('Relation between Happiness Score and Generosity')

# Fit a simple linear regression and overlay the fitted line
lr = LinearRegression()
lr.fit(X, y)
plt.plot(x, lr.predict(X), '-', color='C1')
plt.text(0.47, 5.9, 'R-squared = %0.2f' % lr.score(X, y))
plt.show()

In [None]:
x, y = df['Perceptions of corruption'], df['Score']
X = np.array(x).reshape(-1, 1)  # scikit-learn expects a 2-D feature array
plt.figure(figsize=(10, 7))
sns.scatterplot(x=x, y=y)  # recent seaborn versions require keyword arguments
plt.title('Relation between Happiness Score and Perceptions of Corruption')

# Fit a simple linear regression and overlay the fitted line
lr = LinearRegression()
lr.fit(X, y)
plt.plot(x, lr.predict(X), '-', color='C1')
plt.text(0.31, 6, 'R-squared = %0.2f' % lr.score(X, y))
plt.show()
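
The six cells above differ only in the column being plotted, so the R-squared values can also be collected in one small table. This is a minimal sketch, not the method used in this notebook: it swaps `LinearRegression` for `np.polyfit`, which fits the same least-squares line when there is a single predictor. The helper `r_squared`, the column names, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd

def r_squared(x, y):
    """R-squared of the least-squares line y ~ slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    # With an intercept the residuals have zero mean, so this ratio equals SS_res / SS_tot
    return 1.0 - residuals.var() / y.var()

# Synthetic stand-in for the merged table (hypothetical columns and values)
rng = np.random.default_rng(0)
n = 150
strong = rng.normal(size=n)
weak = rng.normal(size=n)
toy = pd.DataFrame({
    "strong_factor": strong,
    "weak_factor": weak,
    "Score": 2.0 * strong + 0.1 * weak + rng.normal(scale=0.5, size=n),
})

# One R-squared per factor, sorted from strongest to weakest
factors = ["strong_factor", "weak_factor"]
table = pd.Series({f: r_squared(toy[f], toy["Score"]) for f in factors})
print(table.sort_values(ascending=False).round(2))
```

Applied to `df`, the list `factors` would hold the six factor column names and `toy["Score"]` would become `df["Score"]`.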

From the above results, we can see that GDP per capita, social support, and healthy life expectancy are more strongly correlated with the happiness score than the other factors. From this, we can expect that a country with a strong economy, good social support, and a high-quality health system tends to be 'happier'.

Now we will explore the pairwise correlations among all the variables using a heatmap of correlation coefficients. Here, we use the Pearson correlation coefficient.

In [None]:
six_vars = df[['Score','GDP per capita','Social support','Healthy life expectancy','Freedom to make life choices','Perceptions of corruption','Generosity']]
plt.figure(figsize=(10,8))
sns.heatmap(six_vars.corr(method='pearson'), cmap = 'RdBu_r', annot = True)
plt.show()

Strong positive correlations between variables are shown in red. We can see that the happiness score is strongly correlated with GDP per capita, social support, and healthy life expectancy, in agreement with the previous analysis. Also note that GDP per capita is strongly correlated with healthy life expectancy (as can be seen in the following scatterplot). This suggests that countries with a high GDP per capita tend to have a good-quality health system.

In [None]:
x, y = df['GDP per capita'], df['Healthy life expectancy']
X = np.array(x).reshape(-1, 1)  # scikit-learn expects a 2-D feature array
plt.figure(figsize=(10, 7))
sns.scatterplot(x=x, y=y)  # recent seaborn versions require keyword arguments
plt.title('Relation between Healthy Life Expectancy and GDP per Capita')

# Fit a simple linear regression and overlay the fitted line
lr = LinearRegression()
lr.fit(X, y)
plt.plot(x, lr.predict(X), '-', color='C1')
plt.text(1.3, 0.5, 'R-squared = %0.2f' % lr.score(X, y))
plt.show()
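
The R-squared values printed on the scatterplots and the Pearson coefficients in the heatmap are two views of the same quantity: for a linear regression with one predictor and an intercept, R-squared equals the square of the Pearson coefficient. The sketch below checks this numerically on synthetic data; the data and variable names are illustrative only.

```python
import numpy as np

# Synthetic data with a noisy linear relationship (illustrative values)
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.6, size=200)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# R-squared of the least-squares line, computed from its definition
slope, intercept = np.polyfit(x, y, deg=1)
ss_res = np.sum((y - (slope * x + intercept)) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(np.isclose(r ** 2, r_squared))
```

This is why the scatterplots and the heatmap rank the factors in the same order.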

## Feature Importance

We can also identify the important factors by using feature selection. Here, we use multiple linear regression as our model, with the happiness score as the dependent variable and the other 6 factors as the independent variables, and we report its R-squared score. By using SelectKBest from scikit-learn with the f_regression score function, we obtain a score for each independent variable/feature. Note that f_regression scores each feature on its own with a univariate F-test, rather than deriving the score from the fitted multiple regression.

In [None]:
X = (six_vars.dropna().iloc[:,1:7])
y = (six_vars.dropna().iloc[:,0])
lr = LinearRegression()
lr.fit(X,y)
print("multiple linear regression R-squared: %f" % (lr.score(X,y)))

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
fs = SelectKBest(score_func=f_regression, k='all')
fs.fit(X,y)
indices = np.argsort(fs.scores_)[::-1]
for f in range(X.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, six_vars.columns[indices[f]+1], fs.scores_[indices[f]]))

cc = DataFrame({'feature score':Series(fs.scores_),'features':Series(X.columns)})    
plt.figure(figsize=(17,7))
sns.barplot(y='feature score',x='features',data=cc.sort_values(by='feature score',ascending=False))

We can see that GDP per capita, healthy life expectancy, and social support have much higher scores than the other features. Moreover, the scores of these 3 factors are roughly equal in magnitude. Once again, this confirms our previous results.

For the last approach, we use feature selection with Extra Trees (a variant of tree ensemble methods) as the learning model. By using ExtraTreesRegressor from scikit-learn, we can obtain a feature importance score for each feature.

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
forest = ExtraTreesRegressor(n_estimators=100,
                              random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, six_vars.columns[indices[f]+1], importances[indices[f]]))

cc = DataFrame({'feature score':Series(importances),'features':Series(X.columns)})    
plt.figure(figsize=(17,7))
sns.barplot(y='feature score',x='features',data=cc.sort_values(by='feature score',ascending=False))
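
When printing a ranking, each feature name has to stay paired with its own importance value. One way to make this hard to get wrong is to put the importances in a `Series` indexed by feature name and sort that. This is a minimal sketch; the feature names and importance values are hypothetical.

```python
import numpy as np
import pandas as pd

feature_names = ["feature_a", "feature_b", "feature_c"]  # hypothetical names
importances = np.array([0.2, 0.7, 0.1])                  # illustrative values

# Indexing by name keeps each score attached to its feature while sorting
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)

for rank, (name, score) in enumerate(ranked.items(), start=1):
    print("%d. feature %s (%f)" % (rank, name, score))
```

In this notebook, the equivalent call would be `pd.Series(forest.feature_importances_, index=X.columns)`.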

We can see that GDP per capita, healthy life expectancy, and social support have much higher feature importance scores than the other features. Once again, this confirms our previous results. Among the top 3 factors, GDP per capita seems to be the most important factor affecting a country's happiness level.

## Conclusion

From this analysis of the World Happiness Report 2019 dataset, we can conclude that at least 3 factors strongly affect a country's happiness level: GDP per capita, social support, and healthy life expectancy. Of these 3 factors, GDP per capita has the highest impact on the happiness score. There is also a very strong correlation between GDP per capita and healthy life expectancy.