## World Happiness Report Analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import math
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

In [None]:
df15 = pd.read_csv('2015.csv')
df16 = pd.read_csv('2016.csv')
df17 = pd.read_csv('2017.csv')
df18 = pd.read_csv('2018.csv')
df19 = pd.read_csv('2019.csv')

In [None]:
df15.head()

In [None]:
df16.head()

In [None]:
df17.head()

In [None]:
df18.head()

In [None]:
df19.head()

In [None]:
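# numeric_only=True skips text columns such as the country name (needed on pandas 2.x)
sns.heatmap(df19.corr(numeric_only=True), annot=True)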

Observations from the heatmap:
* The correlation between **Score and GDP is 0.79**, the strongest correlation any single factor has with the Score. A correlation coefficient measures how closely two variables move together linearly; it is not the amount by which the Score changes per unit change in GDP.
* **Social support and Healthy life expectancy each have a correlation of 0.78** with the Score. Note also that the correlation between **GDP and Healthy life expectancy is 0.84, the highest value among all pairs of factors**.
* "Perception of Corruption" and "Freedom of Choice" also correlate noticeably with the Score.
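The difference between a correlation coefficient and a regression slope can be checked numerically. The sketch below uses synthetic data (not the report's values) and NumPy only; the variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic data, for illustration only: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 3.0 * x + rng.normal(scale=2.0, size=500)

# Pearson correlation: unitless, always between -1 and 1.
r = np.corrcoef(x, y)[0, 1]

# Least-squares slope: change in y per unit change in x.
slope, intercept = np.polyfit(x, y, 1)

# The two are linked by the ratio of standard deviations.
slope_from_r = r * y.std() / x.std()

print(f"correlation r    = {r:.2f}")
print(f"regression slope = {slope:.2f}")
print(f"r * sd(y)/sd(x)  = {slope_from_r:.2f}")
```

The slope matches `r` only when both variables have the same standard deviation, which is why a correlation of 0.79 cannot be read directly as "0.79 units of Score per unit of GDP".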

> So let us look at how GDP per capita varies with rank position across the five report years, 2015 to 2019.

In [None]:
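# Each file is assumed to be sorted by rank, so row i holds the country at position i in that year.
gdp = pd.DataFrame([df19['GDP per capita'],
                    df18['GDP per capita'],
                    df17['Economy..GDP.per.Capita.'],
                    df16['Economy (GDP per Capita)'],
                    df15['Economy (GDP per Capita)']])
gdpt = gdp.transpose()
gdpt.index.rename(name='OverallRank', inplace=True)
gdpt.columns = ["GDP 2019", "GDP 2018", "GDP 2017", "GDP 2016", "GDP 2015"]
# Years with fewer countries leave NaN in the last rows; those gaps are filled with 0 here.
gdpt.fillna(value=0, inplace=True)
gdpt['Avg GDP'] = gdpt.mean(axis=1)
gdpt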

The table above shows GDP per capita for each year by rank position (the index). The trend is hard to read in tabular form, so let us plot GDP against rank.
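The table above lines rows up by position in each file rather than by country, and fills missing entries with 0. One possible alternative, sketched here on toy frames, is to align the years by country name and let the mean skip missing years. The column names `Country` and `GDP` are hypothetical and would have to be mapped to the real CSV headers.

```python
import pandas as pd

# Toy stand-ins for two yearly files; "Country" and "GDP" are hypothetical column names.
y2018 = pd.DataFrame({"Country": ["A", "B", "C"], "GDP": [1.30, 1.20, 0.90]})
y2019 = pd.DataFrame({"Country": ["B", "A", "D"], "GDP": [1.25, 1.31, 0.80]})

# Index each year by country so that values are matched by name, not by row position.
gdp_2019 = y2019.set_index("Country")["GDP"].rename("GDP 2019")
gdp_2018 = y2018.set_index("Country")["GDP"].rename("GDP 2018")

# Keep the 2019 countries; a country absent from 2018 gets NaN instead of 0.
by_country = pd.concat([gdp_2019, gdp_2018], axis=1).loc[gdp_2019.index]

# mean(axis=1) skips NaN by default, so a missing year does not pull the average toward 0.
by_country["Avg GDP"] = by_country.mean(axis=1)
print(by_country)
```

With this layout each row follows one country through the years, at the cost of having to reconcile country names that differ between files.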

In [None]:
sns.lineplot(x=gdpt.index,y=gdpt['Avg GDP'])

* In the graph above, some **subranges** of rank show **noticeable, irregular variation** in GDP, for example ranks 0 to 20.
* Across **all ranks**, however, these local variations are **small relative to the overall trend**, and the curve is **roughly decreasing**. This motivates trying linear regression to model rank from GDP.
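The visual impression of a roughly decreasing trend can be backed by a number. The sketch below uses a synthetic stand-in for the average GDP column (the real values are not reproduced here) and compares the Pearson correlation with a Spearman-style correlation computed on ranks.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 'Avg GDP' column: a downward trend plus noise.
rng = np.random.default_rng(1)
rank = pd.Series(np.arange(150))
avg_gdp = pd.Series(1.5 - 0.008 * rank + rng.normal(scale=0.15, size=150))

# Pearson measures how well a straight line fits the trend.
pearson = rank.corr(avg_gdp)

# Spearman is Pearson applied to ranks: it measures how monotonic the trend is.
spearman = rank.rank().corr(avg_gdp.rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Values close to -1 for both would support treating the relationship as approximately linear and decreasing; a strong Spearman value with a weaker Pearson value would suggest a monotonic but curved trend.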

In [None]:
gdpt.reset_index(level=0,inplace=True)

In [None]:
x = gdpt[['GDP 2019','GDP 2018','GDP 2017','GDP 2016','GDP 2015']]
y = gdpt['OverallRank']

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=101)

In [None]:
lm = LinearRegression()
lm.fit(x_train,y_train)

In [None]:
print("Model intercept --> " +str(lm.intercept_))

In [None]:
print(lm.coef_)

In [None]:
cdf = pd.DataFrame(lm.coef_,x_train.columns,columns=['Coeff'])
cdf

#### Predictions

_Now that the model is trained, let us see how well it can **predict a country's happiness rank** from its **GDP performance** in past and present years alone._

In [None]:
y_pred = lm.predict(x_test)
pr = pd.DataFrame(y_pred,y_test,columns=['Predicted Rank'])
pr.reset_index(level=0,inplace=True)
pr.head()

In [None]:
plt.scatter(pr['OverallRank'],pr['Predicted Rank'])

The difference between the predicted rank and the actual overall rank indicates roughly how far ranks can shift when other factors, such as Healthy life expectancy, are left out of the model.

In [None]:
# sns.distplot is deprecated; histplot with kde=True draws the same kind of plot
sns.histplot(y_test - y_pred, bins=10, kde=True)

In [None]:
pr['Considering other facts'] = pr['OverallRank'] - pr['Predicted Rank']
pr.head()

In [None]:
print("The Root mean square error is")
math.sqrt(mean_squared_error(y_test,y_pred))

The RMSE is about 13.5, which means that a model using GDP alone typically misses a country's actual rank by about 13.5 positions. All the other factors together account for shifts of roughly that size in the current ranks.
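An RMSE figure is easier to judge next to a baseline that ignores GDP entirely. The sketch below uses made-up actual and predicted ranks (not the notebook's output) and compares the model's RMSE with that of always predicting the mean rank.

```python
import numpy as np

# Made-up actual and predicted ranks, for illustration only.
y_true = np.array([5, 20, 41, 63, 88, 110, 131, 150], dtype=float)
y_hat = np.array([12, 15, 55, 60, 70, 118, 120, 139], dtype=float)

# RMSE: square root of the mean squared difference between actual and predicted.
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Baseline: predict the mean rank for every country, using no features at all.
baseline = np.full_like(y_true, y_true.mean())
baseline_rmse = np.sqrt(np.mean((y_true - baseline) ** 2))

print(f"model RMSE:    {rmse:.1f}")
print(f"baseline RMSE: {baseline_rmse:.1f}")
```

The further the model's RMSE falls below the baseline's, the more of the variation in rank the GDP columns explain on their own.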
So what does this analysis suggest about the World Happiness Report?
> Is it merely a comparison of how well countries use their GDP to provide better health services, educational facilities (literacy is not considered in the report), and so on?
> Freedom of speech, corruption, social support and generosity do not appear significant enough in the happiness index.

This seems like an illusion, or a vague understanding, of this report!