# WORLD HAPPINESS REPORT
 
Name: Tarek Saidee

NID: ts3732

Happiness is subjective: each person derives it from their own experiences and ideas. Even so, certain general metrics can be used to estimate the overall happiness of a community or society. This is where the dataset I'm using comes in. https://www.kaggle.com/mathurinache/world-happiness-report contains over 20 columns of data, each a metric measured for a given country. Those metrics, which include life expectancy, generosity, GDP per capita and so on, are used to determine each country's happiness score and, from that, its rank relative to other countries.

The goal here is to use that data to see which metrics or features are the most important in determining the happiness rank of a country, and how tweaking a single data point such as life expectancy would affect the happiness level. Leaders or politicians could use this to see which aspects of life matter most to their citizens' happiness, and build their platforms accordingly. For example, someone could input some data into the model to get a happiness level, then tweak the life expectancy and observe how the predicted happiness level changes. Doing so for each feature indicates how much focusing on that aspect of life matters to the people of that country. There are many other possible applications.
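
The "tweak one input and compare predictions" idea can be sketched as follows. This is a minimal illustration on synthetic data, not the real dataset: the two features, the generating formula, and the names `what_if_model`, `baseline`, and `tweaked` are all assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the real dataset): two hypothetical features.
rng = np.random.default_rng(0)
life_expectancy = rng.uniform(50, 80, size=200)
gdp = rng.uniform(6, 12, size=200)
X_demo = np.column_stack([life_expectancy, gdp])
# Hypothetical relationship used only to generate toy happiness scores.
score_demo = 0.05 * life_expectancy + 0.3 * gdp + rng.normal(0, 0.1, size=200)

what_if_model = LinearRegression().fit(X_demo, score_demo)

# Predict for a baseline input, then raise life expectancy and hold GDP fixed.
baseline = np.array([[65.0, 9.0]])
tweaked = baseline.copy()
tweaked[0, 0] += 5.0

baseline_pred = what_if_model.predict(baseline)[0]
tweaked_pred = what_if_model.predict(tweaked)[0]
delta = tweaked_pred - baseline_pred
print(f"baseline={baseline_pred:.2f}, tweaked={tweaked_pred:.2f}, change={delta:+.2f}")
```

For a linear model the change equals the feature's coefficient times the size of the tweak; for tree-based models the effect can differ depending on the starting point.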

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
import graphviz
from seaborn import heatmap
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import plotly.express as px
from matplotlib import rcParams
import seaborn as sns
from sklearn.model_selection import KFold


In [None]:
df = pd.read_csv('../input/world-happiness-report/2020.csv')
df.head(10)


# Data Analysis

In [None]:
columns = df.columns
columns_df = pd.DataFrame({"names":columns})
print("All columns in our dataset")
columns_df

In [None]:
print("There are {} rows and {} columns in the dataset.".format(df.shape[0], df.shape[1]))

In [None]:
df.describe()

In [None]:
fig = px.bar(data_frame = df.nlargest(10,"Ladder score"),
             y="Country name",
             x="Ladder score",
             orientation='h',
             color="Country name",
             text="Ladder score",
             color_discrete_sequence=px.colors.qualitative.D3)
print("Top 10 happiest countries")
fig.show()

In [None]:
fig = px.bar(data_frame = df.nsmallest(10,"Ladder score"),
             y="Country name",
             x="Ladder score",
             orientation='h',
             color="Country name",
             text="Ladder score",
             color_discrete_sequence=px.colors.qualitative.D3)
print("Top 10 unhappiest countries")
fig.show()

In [None]:
rcParams["figure.figsize"] = 20,10
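plt.title("Correlation between different features")
# Restrict to numeric columns; country and region names cannot be correlated
sns.heatmap(df.select_dtypes(include="number").corr(),annot=True,cmap="YlGnBu")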


In [None]:
cols=['Explained by: Log GDP per capita', 'Explained by: Social support',
       'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices',
       'Explained by: Generosity', 'Explained by: Perceptions of corruption',
       'Dystopia + residual']
for a in cols:
    plt.figure(figsize=(10,5))
    sns.regplot(x=a,y='Ladder score',data=df,color='b')
    plt.show()

In [None]:
def feature_analysis(df, feature):
    # Average the feature within each region
    grouped_df = df.groupby(["Regional indicator"]).agg({feature : "mean"}).reset_index()
    template='%{text:0.2f}'
    tickformat = None
    if grouped_df[feature].min() < 1:
        template='%{text:0.3%}'
        tickformat = ".3%"
    
    fig = px.bar(grouped_df,
                 x="Regional indicator",
                 y=feature,
                 color="Regional indicator",
                 text=feature,
                 color_discrete_sequence=px.colors.qualitative.D3
                )
    fig.update_traces(textposition='outside') 
    fig.show()
feature_names = ["Logged GDP per capita",
                 "Social support",
                 "Healthy life expectancy",
                 "Freedom to make life choices",
                 "Generosity",
                 "Perceptions of corruption"]
for feature in feature_names:
    feature_analysis(df, feature)

# Transforming the data

In [None]:
y=df['Ladder score']
X=df[['Logged GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Ladder score in Dystopia']]

X_train,X_test,Y_train,Y_test= train_test_split(X,y,test_size=0.2,random_state=1)

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Basic Baseline model

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train,Y_train)
y_hat = lin_reg.predict(X_test)
score=lin_reg.score(X_test, Y_test)
mse = mean_squared_error(Y_test, y_hat)
mae = mean_absolute_error(Y_test, y_hat)
r2 = r2_score(Y_test, y_hat)
r2, mse
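
The value returned by `score()` and `r2_score` for scikit-learn regressors is the coefficient of determination R², not a classification accuracy. The sketch below, on made-up toy numbers, shows how R² is computed and why it can be zero or negative when a model does no better (or worse) than always predicting the mean. The helper `r2_by_hand` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([4.0, 5.0, 6.0, 7.0])      # toy ladder scores
good_pred = np.array([4.1, 5.2, 5.8, 7.1])   # close to the truth
mean_pred = np.full(4, y_true.mean())        # always predict the average
bad_pred = np.array([7.0, 6.0, 5.0, 4.0])    # systematically wrong

def r2_by_hand(y, y_hat):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

for label, pred in [("good", good_pred), ("mean", mean_pred), ("bad", bad_pred)]:
    print(label, r2_by_hand(y_true, pred), r2_score(y_true, pred))
```

A model that always predicts the mean scores exactly zero, so an R² slightly below zero indicates predictions that are marginally worse than that constant baseline.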

In [None]:
coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":np.transpose(lin_reg.coef_)})
fig = px.pie(data_frame = coefficients,
             names="Feature",
             values="Coefficients",
             color="Feature",
             color_discrete_sequence=px.colors.sequential.RdBu)
print("Feature weights")
fig.show()
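
A pie chart can only represent non-negative values, so a feature with a negative coefficient cannot be shown as a meaningful slice. One possible way to turn coefficients into percentage weights is to normalize their absolute values, as sketched below. The coefficient values and feature names here are hypothetical placeholders, not the fitted values from this notebook.

```python
import pandas as pd

# Hypothetical coefficients for illustration only (not the notebook's fitted values).
demo_coefs = pd.Series(
    {"feature_a": 0.30, "feature_b": 0.45, "feature_c": -0.15, "feature_d": 0.10}
)

# Share of the total absolute weight, so negative coefficients still count.
shares = demo_coefs.abs() / demo_coefs.abs().sum()
summary = pd.DataFrame({"coefficient": demo_coefs, "share": shares})
print(summary.sort_values("share", ascending=False))
```

Keeping the signed coefficient next to the share preserves the direction of the effect, which a share alone would hide. This comparison is only sensible when the features are on a common scale, as they are after standardization.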

# Ridge Regression

In [None]:
model = Ridge()
model.fit(X_train, Y_train)
y_hat = model.predict(X_test)
R2_train = model.score(X_train, Y_train)
R2_test = model.score(X_test, Y_test)
mse = mean_squared_error(Y_test, y_hat)
R2_test, mse

In [None]:
coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":np.transpose(model.coef_)})
fig = px.pie(data_frame = coefficients,
             names="Feature",
             values="Coefficients",
             color="Feature",
             color_discrete_sequence=px.colors.sequential.RdBu)
print("Feature weights")
fig.show()

# Lasso Regression

In [None]:
model = Lasso()
model.fit(X_train, Y_train)
y_hat = model.predict(X_test)
R2_train = model.score(X_train, Y_train)
R2_test = model.score(X_test, Y_test)
mse = mean_squared_error(Y_test, y_hat)
R2_test, mse
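
scikit-learn's `Lasso()` defaults to `alpha=1.0`. On standardized features with a target whose spread is around one unit, that penalty can be strong enough to shrink every coefficient to zero, which would be consistent with the all-zero weights and near-zero R² reported for Lasso in this notebook. The sketch below demonstrates the effect on synthetic data; the data, the alternative `alpha=0.01`, and the variable names are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on the first two of three features.
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(150, 3))
y_lasso = 0.6 * X_raw[:, 0] + 0.4 * X_raw[:, 1] + rng.normal(0, 0.3, size=150)
X_std = StandardScaler().fit_transform(X_raw)

strong = Lasso(alpha=1.0).fit(X_std, y_lasso)   # scikit-learn's default penalty
weak = Lasso(alpha=0.01).fit(X_std, y_lasso)    # much lighter penalty (arbitrary choice)

print("alpha=1.0 coefficients:", strong.coef_)
print("alpha=0.01 coefficients:", weak.coef_)
print("alpha=1.0 train R^2:", strong.score(X_std, y_lasso))
print("alpha=0.01 train R^2:", weak.score(X_std, y_lasso))
```

In practice `alpha` would be chosen by cross-validation rather than fixed by hand.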

# ElasticNet Regression

In [None]:
model = ElasticNet()
model.fit(X_train, Y_train)
y_hat = model.predict(X_test)
R2_train = model.score(X_train, Y_train)
R2_test = model.score(X_test, Y_test)
mse = mean_squared_error(Y_test, y_hat)
R2_test, mse

In [None]:
coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":np.transpose(model.coef_)})
fig = px.pie(data_frame = coefficients,
             names="Feature",
             values="Coefficients",
             color="Feature",
             color_discrete_sequence=px.colors.sequential.RdBu)
print("Feature weights")
fig.show()

# DecisionTree Regression

In [None]:
dtr= DecisionTreeRegressor()
dtr.fit(X_train,Y_train)
y_pred = dtr.predict(X_test)
test_mse = mean_squared_error(Y_test, y_pred)
y_pred_train = dtr.predict(X_train)
train_mse = mean_squared_error(Y_train, y_pred_train)
dtr.score(X_test, Y_test), test_mse 

# Random Forest Regression

In [None]:
rf = RandomForestRegressor(n_estimators = 1000)
rf.fit(X_train, Y_train)
y_hat = rf.predict(X_test)
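errors = abs(y_hat - Y_test)  # absolute error per test country
# Note: 1 - error is not a true accuracy; its mean is just 1 minus the mean absolute error
acc = 1 - errors
rf.score(X_test, Y_test), np.mean(acc)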
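
Tree-based models have no `coef_` attribute, but scikit-learn's `DecisionTreeRegressor` and `RandomForestRegressor` expose `feature_importances_`, which sums to one and could serve as a rough counterpart to the linear models' feature weights. A minimal sketch on synthetic data follows; the data and the names `forest`, `x0`, `x1`, `x2` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: x0 matters most, x1 a little, x2 not at all.
rng = np.random.default_rng(2)
X_forest = rng.normal(size=(300, 3))
y_forest = 2.0 * X_forest[:, 0] + 0.5 * X_forest[:, 1] + rng.normal(0, 0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_forest, y_forest)

# Impurity-based importances, normalized to sum to 1.
importances = forest.feature_importances_
for name, value in zip(["x0", "x1", "x2"], importances):
    print(f"{name}: {value:.1%}")
```

These importances carry no sign, so they indicate how much a feature is used by the trees, not whether it raises or lowers the prediction.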

# SVM Regression

In [None]:
svr = SVR(kernel='linear')
svr.fit(X_train, Y_train)
y_hat = svr.predict(X_test)
print(r2_score(Y_test,y_hat))

In [None]:
coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":np.transpose(svr.coef_[0])})
fig = px.pie(data_frame = coefficients,
             names="Feature",
             values="Coefficients",
             color="Feature",
             color_discrete_sequence=px.colors.sequential.RdBu)
print("Feature weights")
fig.show()

# Results so far

|           Model          	| R² (test) 	|
|:------------------------:	|:--------:	|
|  Baseline Model (linear) 	|   0.58   	|
|     Ridge Regression     	|   0.58   	|
|     Lasso Regression     	|  -0.004  	|
|   ElasticNet Regression  	|   0.36   	|
|  DecisionTree Regression 	|   0.29   	|
| Random Forest Regression 	|   0.53   	|
|      SVM Regression      	|   0.59   	|




|           Model          	| Logged GDP 	| Social Support 	| Healthy life expectancy 	| Freedom to make life choices 	| Generosity 	| Ladder score in Dystopia 	|
|:------------------------:	|:----------:	|:--------------:	|:-----------------------:	|:----------------------------:	|:----------:	|:------------------------:	|
|  Baseline Model (linear) 	|    24.8%   	|      33.3%     	|          20.8%          	|             14.7%            	|    6.31%   	|            0%            	|
|     Ridge Regression     	|    24.9%   	|      33.1%     	|           21%           	|             14.8%            	|    6.26%   	|            0%            	|
|     Lasso Regression     	|     0%     	|       0%       	|            0%           	|              0%              	|     0%     	|            0%            	|
|   ElasticNet Regression  	|    32.9%   	|      36.4%     	|          30.7%          	|              0%              	|     0%     	|            0%            	|
|  DecisionTree Regression 	|     N/A    	|       N/A      	|           N/A           	|              N/A             	|     N/A    	|            N/A           	|
| Random Forest Regression 	|     N/A    	|       N/A      	|           N/A           	|              N/A             	|     N/A    	|            N/A           	|
|      SVM Regression      	|    21.6%   	|      32.5%     	|          24.7%          	|             16.5%            	|    4.67%   	|            0%            	|

# KFold with SVM regression

In [None]:
folds = KFold(n_splits=5, shuffle=True, random_state=13579)
score = 0
# Use lists, not sets, for column names: sets are unordered, so labels could be misassigned
Y_train = pd.DataFrame(data=Y_train, columns = ["Ladder score"])
X_train = pd.DataFrame(data=X_train, columns = ['Logged GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Ladder score in Dystopia'])
for i, (x_index, y_index) in enumerate(folds.split(X_train, Y_train['Ladder score'])):
    print('-' * 22, i, '-' * 22)
    svr = SVR(kernel='linear')
    svr.fit(X_train.iloc[x_index], Y_train['Ladder score'].iloc[x_index])
    score += svr.score(X_train.iloc[y_index], Y_train['Ladder score'].iloc[y_index])
    print('score ', svr.score(X_train.iloc[y_index], Y_train['Ladder score'].iloc[y_index]))
    
print('Average R^2', score / folds.n_splits)
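
The manual fold loop above can also be written more compactly with `cross_val_score`. The sketch below uses synthetic data and wraps the scaler and the model in a pipeline so that scaling is refit inside each fold; the data and the names `cv_pipeline` and `cv_scores` are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for the happiness features.
rng = np.random.default_rng(3)
X_cv = rng.normal(size=(120, 4))
y_cv = X_cv @ np.array([1.0, 0.5, 0.0, -0.5]) + rng.normal(0, 0.2, size=120)

# Scaling lives inside the pipeline, so each fold fits its own scaler.
cv_pipeline = make_pipeline(StandardScaler(), SVR(kernel="linear"))
cv = KFold(n_splits=5, shuffle=True, random_state=13579)
cv_scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=cv, scoring="r2")

print("per-fold R^2:", np.round(cv_scores, 3))
print("mean R^2:", cv_scores.mean())
```

Swapping `SVR(kernel="linear")` for another regressor reuses the same folds, which keeps the comparison between models consistent.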


In [None]:
coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":np.transpose(svr.coef_[0])})
fig = px.pie(data_frame = coefficients,
             names="Feature",
             values="Coefficients",
             color="Feature",
             color_discrete_sequence=px.colors.sequential.RdBu)
print("Feature weights")
fig.show()

# KFold with Decision Tree Regression

In [None]:
folds = KFold(n_splits=5, shuffle=True, random_state=13579)
score = 0
# Use lists, not sets, for column names: sets are unordered, so labels could be misassigned
Y_train = pd.DataFrame(data=Y_train, columns = ["Ladder score"])
X_train = pd.DataFrame(data=X_train, columns = ['Logged GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Ladder score in Dystopia'])
for i, (x_index, y_index) in enumerate(folds.split(X_train, Y_train['Ladder score'])):
    print('-' * 22, i, '-' * 22)
    dtr= DecisionTreeRegressor()
    dtr.fit(X_train.iloc[x_index], Y_train['Ladder score'].iloc[x_index])
    score += dtr.score(X_train.iloc[y_index], Y_train['Ladder score'].iloc[y_index])
    print('score ', dtr.score(X_train.iloc[y_index], Y_train['Ladder score'].iloc[y_index]))
    
print('Average R^2', score / folds.n_splits)


# KFold with ElasticNet Regression

In [None]:
folds = KFold(n_splits=5, shuffle=True, random_state=13579)
score = 0
# Use lists, not sets, for column names: sets are unordered, so labels could be misassigned
Y_train = pd.DataFrame(data=Y_train, columns = ["Ladder score"])
X_train = pd.DataFrame(data=X_train, columns = ['Logged GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Ladder score in Dystopia'])
for i, (x_index, y_index) in enumerate(folds.split(X_train, Y_train['Ladder score'])):
    print('-' * 22, i, '-' * 22)
    model = ElasticNet()
    model.fit(X_train.iloc[x_index], Y_train['Ladder score'].iloc[x_index])
    score += model.score(X_train.iloc[y_index], Y_train['Ladder score'].iloc[y_index])
    print('score ', model.score(X_train.iloc[y_index], Y_train['Ladder score'].iloc[y_index]))
    
print('Average R^2', score / folds.n_splits)


In [None]:
coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":np.transpose(model.coef_)})
fig = px.pie(data_frame = coefficients,
             names="Feature",
             values="Coefficients",
             color="Feature",
             color_discrete_sequence=px.colors.sequential.RdBu)
print("Feature weights")
fig.show()

# KFold with Lasso Regression

In [None]:
folds = KFold(n_splits=5, shuffle=True, random_state=13579)
score = 0
# Use lists, not sets, for column names: sets are unordered, so labels could be misassigned
Y_train = pd.DataFrame(data=Y_train, columns = ["Ladder score"])
X_train = pd.DataFrame(data=X_train, columns = ['Logged GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Ladder score in Dystopia'])
for i, (x_index, y_index) in enumerate(folds.split(X_train, Y_train['Ladder score'])):
    print('-' * 22, i, '-' * 22)
    model = Lasso()
    model.fit(X_train.iloc[x_index], Y_train['Ladder score'].iloc[x_index])
    score += model.score(X_train.iloc[y_index], Y_train['Ladder score'].iloc[y_index])
    print('score ', model.score(X_train.iloc[y_index], Y_train['Ladder score'].iloc[y_index]))
    
print('Average R^2', score / folds.n_splits)


# KFold with Ridge Regression

In [None]:
folds = KFold(n_splits=5, shuffle=True, random_state=13579)
score = 0
# Use lists, not sets, for column names: sets are unordered, so labels could be misassigned
Y_train = pd.DataFrame(data=Y_train, columns = ["Ladder score"])
X_train = pd.DataFrame(data=X_train, columns = ['Logged GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Ladder score in Dystopia'])
for i, (x_index, y_index) in enumerate(folds.split(X_train, Y_train['Ladder score'])):
    print('-' * 22, i, '-' * 22)
    model = Ridge()
    model.fit(X_train.iloc[x_index], Y_train['Ladder score'].iloc[x_index])
    score += model.score(X_train.iloc[y_index], Y_train['Ladder score'].iloc[y_index])
    print('score ', model.score(X_train.iloc[y_index], Y_train['Ladder score'].iloc[y_index]))
    
print('Average R^2', score / folds.n_splits)


In [None]:
coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":np.transpose(model.coef_)})
fig = px.pie(data_frame = coefficients,
             names="Feature",
             values="Coefficients",
             color="Feature",
             color_discrete_sequence=px.colors.sequential.RdBu)
print("Feature weights")
fig.show()

# KFold with RandomForestRegressor

In [None]:
folds = KFold(n_splits=5, shuffle=True, random_state=13579)
score = 0
# Use lists, not sets, for column names: sets are unordered, so labels could be misassigned
Y_train = pd.DataFrame(data=Y_train, columns = ["Ladder score"])
X_train = pd.DataFrame(data=X_train, columns = ['Logged GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Ladder score in Dystopia'])
for i, (x_index, y_index) in enumerate(folds.split(X_train, Y_train['Ladder score'])):
    print('-' * 22, i, '-' * 22)
    rfr = RandomForestRegressor(n_estimators=200, n_jobs=-1)
    rfr.fit(X_train.iloc[x_index], Y_train['Ladder score'].iloc[x_index])
    score += rfr.score(X_train.iloc[y_index], Y_train['Ladder score'].iloc[y_index])
    print('score ', rfr.score(X_train.iloc[y_index], Y_train['Ladder score'].iloc[y_index]))
    
print('Average R^2', score / folds.n_splits)



In [None]:
# One row of feature values; lists (not sets) keep the values and column names in order
testing_vals = pd.DataFrame(data=[[2.11085868,0.94068184,1.17120687,0.30922594,-0.4856469,-0.58227342,2]],
                            columns = ['Logged GDP per capita', 'Social support', 'Healthy life expectancy',
                                       'Freedom to make life choices', 'Generosity','Perceptions of corruption',
                                       'Ladder score in Dystopia'])

print(rfr.predict(testing_vals))


# Final Summary

|           Model          	| R² (test) 	|
|:------------------------:	|:--------:	|
|  Baseline Model (linear) 	|   0.58   	|
|     Ridge Regression     	|   0.58   	|
|     Lasso Regression     	|  -0.004  	|
|   ElasticNet Regression  	|   0.36   	|
|  DecisionTree Regression 	|   0.29   	|
| Random Forest Regression 	|   0.53   	|
|      SVM Regression      	|   0.59   	|


|     Model with KFold     	| Mean R² (5-fold) 	|
|:------------------------:	|:--------:	|
|     Ridge Regression     	|   0.68   	|
|     Lasso Regression     	|   -0.08  	|
|   ElasticNet Regression  	|   0.39   	|
|  DecisionTree Regression 	|   0.71   	|
| Random Forest Regression 	|   0.77   	|
|      SVM Regression      	|   0.65   	|


<br/><br/>
<br/>
<br/>


|           Model          	| Logged GDP 	| Social Support 	| Healthy life expectancy 	| Freedom to make life choices 	| Generosity 	| Ladder score in Dystopia 	|
|:------------------------:	|:----------:	|:--------------:	|:-----------------------:	|:----------------------------:	|:----------:	|:------------------------:	|
|  Baseline Model (linear) 	|    24.8%   	|      33.3%     	|          20.8%          	|             14.7%            	|    6.31%   	|            0%            	|
|     Ridge Regression     	|    24.9%   	|      33.1%     	|           21%           	|             14.8%            	|    6.26%   	|            0%            	|
|     Lasso Regression     	|     0%     	|       0%       	|            0%           	|              0%              	|     0%     	|            0%            	|
|   ElasticNet Regression  	|    32.9%   	|      36.4%     	|          30.7%          	|              0%              	|     0%     	|            0%            	|
|  DecisionTree Regression 	|     N/A    	|       N/A      	|           N/A           	|              N/A             	|     N/A    	|            N/A           	|
| Random Forest Regression 	|     N/A    	|       N/A      	|           N/A           	|              N/A             	|     N/A    	|            N/A           	|
|      SVM Regression      	|    21.6%   	|      32.5%     	|          24.7%          	|             16.5%            	|    4.67%   	|            0%            	|

<br/>

|     Model with KFold     	| Logged GDP 	| Social Support 	| Healthy life expectancy 	| Freedom to make life choices 	| Generosity 	| Ladder score in Dystopia 	|
|:------------------------:	|:----------:	|:--------------:	|:-----------------------:	|:----------------------------:	|:----------:	|:------------------------:	|
|     Ridge Regression     	|    26.9%   	|      32.1%     	|          17.2%          	|             16.1%            	|    7.73%   	|            0%            	|
|     Lasso Regression     	|     0%     	|       0%       	|            0%           	|              0%              	|     0%     	|            0%            	|
|   ElasticNet Regression  	|     29%    	|      42.5%     	|          27.1%          	|             1.47%            	|     0%     	|            0%            	|
|  DecisionTree Regression 	|     N/A    	|       N/A      	|           N/A           	|              N/A             	|     N/A    	|            N/A           	|
| Random Forest Regression 	|     N/A    	|       N/A      	|           N/A           	|              N/A             	|     N/A    	|            N/A           	|
|      SVM Regression      	|    19.8%   	|      25.5%     	|          22.7%          	|             17.8%            	|    14.2%   	|            0%            	|