# **Data Content**
* The happiness scores and rankings use data from the Gallup World Poll.
>     Gallup World Poll: In 2005, Gallup began its World Poll, which continually surveys citizens in 160 countries, representing more than 98% of the world's adult population. The Gallup World Poll consists of more than 100 global questions as well as region-specific items.
* The columns following the happiness score estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contributes to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors.
    > * **Ladder score**: The happiness score, or subjective well-being. This is the national average response to the life-evaluation question.
    > * **Logged GDP per capita**: The log of GDP per capita. The GDP-per-capita time series is extended from 2019 to 2020 using country-specific forecasts of real GDP growth in 2020.
    > * **Social support**: Assistance or support provided to an individual by members of their social network.
    > * **Healthy life expectancy**: The average number of years lived in good health, that is, without irreversible limitation of activity in daily life or incapacities, for a fictitious generation subject to the conditions of mortality and morbidity prevailing that year.
    > * **Freedom to make life choices**: The national average of binary responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?” ... It is defined as the average of laughter and enjoyment for other waves where the happiness question was not asked.
    > * **Generosity**: The residual of regressing the national average of responses to the GWP question “Have you donated money to a charity in the past month?” on GDP per capita.
    > * **Perceptions of corruption**: The national average of the survey responses to two GWP questions: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?”
    > * **Ladder score in Dystopia**: Dystopia is an imaginary country that has the world's least-happy people, with values equal to the world’s lowest national averages for each of the six factors. It serves as a benchmark against which to compare the contributions from each of the six factors. ... Since life would be very unpleasant in a country with the world's lowest incomes, lowest life expectancy, lowest generosity, most corruption, least freedom, and least social support, it is referred to as “Dystopia,” in contrast to Utopia.
* World Happiness Report official website: https://worldhappiness.report/
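
As a rough illustration of the Generosity definition above, the following sketch regresses a donation rate on log GDP per capita and keeps the residual. The numbers are made-up toy values, not Gallup data, and `np.polyfit` is just one convenient way to fit the line; the report's exact regression specification is not reproduced here.

```python
import numpy as np

# Hypothetical toy values (not Gallup data): log GDP per capita and the
# share of respondents who donated to charity in the past month.
log_gdp = np.array([7.5, 8.0, 9.0, 10.0, 11.0])
donated = np.array([0.10, 0.25, 0.20, 0.45, 0.40])

# Fit donated = slope * log_gdp + intercept by ordinary least squares.
slope, intercept = np.polyfit(log_gdp, donated, deg=1)

# The residual is the part of the donation rate not explained by GDP.
generosity = donated - (slope * log_gdp + intercept)
print(generosity.round(3))
```

A positive residual means a country donates more than its income level alone would suggest; a negative one means it donates less.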

# **Import library**

In [None]:
# Import data-analysis libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# import sklearn library
import sklearn.datasets as datasets
import sklearn.preprocessing as preprocessing
import sklearn.model_selection as model_selection
import sklearn.metrics as metrics
import sklearn.linear_model as linear_model
from sklearn.metrics import mean_squared_error
from sklearn.metrics import max_error

# Import statistics libraries
from scipy import stats
import statsmodels.api as sm

# Import plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px


# **READ DATA**

In [None]:
#Import Data
df = pd.read_csv("/kaggle/input/world-happiness-report-2021/world-happiness-report.csv")
df.info()
df.head()

In [None]:
# Check which countries are in the data
df['Country name'].unique()

In [None]:
# Check the number of rows and columns
print('Data before cleansing:',df.shape)

In [None]:
# Check the number of records per country
df['Country name'].value_counts()

In [None]:
# Check the Life Ladder of each country over the years
fig = px.choropleth(df.sort_values("year"),
                    locations = "Country name",
                    color = "Life Ladder",
                    locationmode = "country names",
                    animation_frame = "year")
fig.update_layout(title = "Ladder Score Comparison by Countries")
fig.show()

# **Explore Data**

In [None]:
# Split the column names into features and target
df_numeric = ['Log GDP per capita','Social support','Healthy life expectancy at birth','Freedom to make life choices','Generosity','Perceptions of corruption']
df_target = ['Life Ladder']
df_all = df_numeric+df_target

In [None]:
# Show summary statistics
df.describe()

In [None]:
# Count missing values per column
df.isna().sum()

In [None]:
# Drop rows that contain missing values
df_clean = df.dropna(axis=0).copy()
# Confirm that no missing values remain
df_clean.isna().sum()

In [None]:
# Check for outliers with box plots
df_clean.boxplot(
    column=df_numeric,
    fontsize=10,
    rot=0,
    grid=False,
    figsize=(10,10),
    vert=False
    )

In [None]:
# Compute the IQR bounds used to filter outliers
Q1 = df_clean[df_numeric].quantile(0.25)
Q3 = df_clean[df_numeric].quantile(0.75)
IQR = Q3 - Q1
boxplot_min = Q1 - 1.5 * IQR
boxplot_max = Q3 + 1.5 * IQR
print('Q1:\n',Q1)
print('\nQ3:\n',Q3)
print('\nIQR:\n',IQR)
print('\nMin:\n',boxplot_min)
print('\nMax:\n',boxplot_max)

In [None]:
# Remove outliers: drop rows that fall outside the IQR bounds in any feature
non_outlier_df = df_clean.copy()
for x in df_numeric:
    filter_min = non_outlier_df[x] < boxplot_min[x]
    filter_max = non_outlier_df[x] > boxplot_max[x]
    non_outlier_df = non_outlier_df[~(filter_min | filter_max)]
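
Because the bounds are computed once before the loop, the same filter can also be written without a loop. The sketch below is one possible vectorized form on made-up toy data; the helper name `iqr_filter` is hypothetical, and it assumes missing values were already dropped (a row containing NaN would be removed here).

```python
import pandas as pd

def iqr_filter(frame, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # A row is kept only if all selected columns are inside their bounds.
    inside = frame[columns].ge(lower, axis=1) & frame[columns].le(upper, axis=1)
    return frame[inside.all(axis=1)]

# Hypothetical toy data: the value 100 in column "a" is an obvious outlier.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
filtered = iqr_filter(toy, ["a", "b"])
print(filtered)
```

In the notebook this would correspond to something like `iqr_filter(df_clean, df_numeric)`.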

In [None]:
# Check multicollinearity
correlation_between_column = non_outlier_df[df_all].corr()
# Boolean mask that hides the upper triangle (including the diagonal)
upper_triangle_mask = np.triu(np.ones_like(correlation_between_column, dtype=bool))

fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(correlation_between_column, annot=True, cmap="YlGnBu", ax=ax, annot_kws={"size":15}, mask=upper_triangle_mask)

In [None]:
# check linearity between target vs feature
scatter_plot_between_target_feature = sns.pairplot(
    data=non_outlier_df,
    y_vars=['Life Ladder'],
    x_vars=df_numeric,
    height=5,
    kind='scatter'
    )

In [None]:
# Reduce multicollinearity by dropping 'Log GDP per capita' from the features
df_var_final = ['Social support','Healthy life expectancy at birth','Freedom to make life choices','Generosity','Perceptions of corruption']
df_target_final = ['Life Ladder']
df_all_final = df_var_final+df_target_final
final_df=non_outlier_df[df_all_final].copy()
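
The notebook judges multicollinearity from the correlation heatmap. One complementary check, not used in the original analysis, is the variance inflation factor (VIF). A minimal sketch, assuming a regression with an intercept: the VIFs equal the diagonal of the inverse of the feature correlation matrix. The helper name `vif_from_corr` and the synthetic data are illustrative only.

```python
import numpy as np

def vif_from_corr(X):
    """Return one VIF per column of X (rows = observations)."""
    corr = np.corrcoef(X, rowvar=False)
    # VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
    # of the inverse of the feature correlation matrix.
    return np.diag(np.linalg.inv(corr))

# Synthetic illustration: x2 is nearly a copy of x0, x1 is independent.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + 0.1 * rng.normal(size=500)
vif = vif_from_corr(np.column_stack([x0, x1, x2]))
print(vif.round(2))
```

A VIF close to 1 indicates little overlap with the other features; large values flag features that are nearly linear combinations of the rest.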

In [None]:
# Check the data shape after cleansing
print('Data after cleansing:',final_df.shape)

# **Modeling with Linear Regression**

In [None]:
# Convert the features and target to NumPy arrays
var_array = final_df[df_var_final].to_numpy()
target_array = final_df[df_target_final].to_numpy()
print('shape of final feature:',var_array.shape)
print('shape of target:',target_array.shape)

In [None]:
# split the data into test and train
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    var_array,
    target_array,
    train_size=0.80,
    random_state=0
    )

In [None]:
# check the shape data
print('Shape Data X Train:')
print(X_train.shape)
print('\nShape Data X Test:')
print(X_test.shape)
print('\nShape Data y Train:')
print(y_train.shape)
print('\nShape Data y Test:')
print(y_test.shape)

In [None]:
# Create the model
lm = linear_model.LinearRegression()


In [None]:
#fit model
lm.fit(X_train, y_train)

In [None]:
# model result
print('Coefficients:\n Social support, Healthy life, Freedom, Generosity, Perceptions of corruption \n',lm.coef_)
print('Intercept:',lm.intercept_)
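
For intuition about what the coefficients and intercept represent, ordinary least squares can be sketched directly with NumPy. This is an illustration on made-up numbers, not the notebook's data; appending a column of ones is one common way to estimate the intercept.

```python
import numpy as np

# Hypothetical toy data generated exactly from y = 1 + 2*x1 + 3*x2.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Append a column of ones so the last fitted weight acts as the intercept.
X_design = np.column_stack([X, np.ones(len(X))])
weights, *_ = np.linalg.lstsq(X_design, y, rcond=None)

coefficients, intercept = weights[:-1], weights[-1]
print('Coefficients:', coefficients.round(6))
print('Intercept:', round(float(intercept), 6))
```

Each coefficient is the expected change in the target for a one-unit change in that feature with the other features held fixed.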

In [None]:
# Predict on the training, test, and full data
y_train_pred = lm.predict(X_train)
y_test_pred = lm.predict(X_test)
target_array_pred = lm.predict(var_array)

In [None]:
# check the prediction data & real data
print('Real Data')
print(y_test[:10])
print('\n Predicted Data')
print(y_test_pred[:10])
print('\n Diff')
print(y_test[:10]-y_test_pred[:10])

In [None]:
# Compare actual and predicted values in a DataFrame
final_with_pred_df = final_df.copy()
# Store predictions in a new column so the actual 'Life Ladder' values are kept
final_with_pred_df['Predicted Life Ladder'] = target_array_pred.reshape(-1,)
final_with_pred_df.head(5)

# **Assumptions of Linear Regression**

In [None]:
# Check linearity: predicted values against real values on the test set
plt.scatter(y_test,y_test_pred)
plt.xlabel('Real data')
plt.ylabel('Predicted data')
plt.title('Relationship between predicted and real data')
plt.show()

In [None]:
# Check the distribution of the residuals visually
# (histplot replaces the deprecated distplot)
sns.histplot((y_test - y_test_pred).ravel(), kde=True)
plt.title('Residuals', size=18)

# **Evaluation Data**

In [None]:
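# Test the residuals for normality
residual = (y_test - y_test_pred).ravel()
sw = stats.shapiro(residual)
# Compare against a normal distribution with the residuals' own mean and standard deviation;
# plain 'norm' would test against the standard normal N(0, 1).
# The p-value is approximate because the parameters are estimated from the same data.
ks = stats.kstest(residual, 'norm', args=(residual.mean(), residual.std(ddof=1)))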

print('Shapiro-Wilk test ---- statistic: {}, p-value: {}'.format(sw[0],sw[1]))
print('Kolmogorov-Smirnov test ---- statistic: {}, p-value: {}'.format(ks.statistic,ks.pvalue))

In [None]:
# Evaluate the regression model - R squared
print('R^2 score (training):',lm.score(X_train, y_train))
print('R^2 score (test):',lm.score(X_test, y_test))

In [None]:
# Evaluate the regression model - RMSE
# np.sqrt of the MSE is used because the `squared` argument is deprecated in newer scikit-learn releases
rmse_training = np.sqrt(mean_squared_error(y_true=y_train,y_pred=y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_true=y_test,y_pred=y_test_pred))

print('RMSE Training Data: {}'.format(rmse_training))
print('RMSE Test Data: {}'.format(rmse_test))
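
To make the two metrics concrete, here is a small sketch that computes RMSE and R squared by hand with NumPy. The helper names `rmse` and `r_squared` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical toy values, not taken from the happiness data.
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 9.0])
print(rmse(actual, predicted), r_squared(actual, predicted))
```

RMSE is expressed in the units of the target (ladder points here), while R squared is the share of the target's variance that the predictions explain.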

In [None]:
# Compare performance between regularized linear models
list_model = [['Ridge',linear_model.Ridge()],['Lasso',linear_model.Lasso()],['LassoLars',linear_model.LassoLars()],['BayesianRidge',linear_model.BayesianRidge()]]
performance_result = {}

for model_name,regression_model in list_model:
    # ravel() gives the 1-D target that BayesianRidge expects
    regression_model.fit(X_train, y_train.ravel())
    # Separate names so the LinearRegression predictions above are not overwritten
    model_train_pred = regression_model.predict(X_train)
    model_test_pred = regression_model.predict(X_test)

    model_rmse_training = np.sqrt(mean_squared_error(y_true=y_train,y_pred=model_train_pred))
    model_rmse_test = np.sqrt(mean_squared_error(y_true=y_test,y_pred=model_test_pred))

    # R^2 on the training data
    r_score = regression_model.score(X_train, y_train.ravel())

    performance_result[model_name]={'training':model_rmse_training,'test':model_rmse_test,'R^2 score':r_score}

performance_result
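
The comparison above relies on a single train/test split. A hedged alternative, not part of the original notebook, is k-fold cross-validation, which averages the error over several splits. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical) and only two of the models for brevity.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and target (not the real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=200)

cv_rmse = {}
for name, model in [('Ridge', linear_model.Ridge()), ('Lasso', linear_model.Lasso())]:
    # scikit-learn maximizes scores, so RMSE is reported negated; flip the sign.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='neg_root_mean_squared_error')
    cv_rmse[name] = -scores.mean()

print(cv_rmse)
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `var_array` and `target_array.ravel()`.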

# **Predict Happiness in 2021**

In [None]:
# Import the 2021 data
df2021 = pd.read_csv("/kaggle/input/world-happiness-report-2021/world-happiness-report-2021.csv")
df2021.info()
df2021.head()

In [None]:
# Split the column names into features and target (the 2021 file uses different column names)
df_var_final = ['Social support','Healthy life expectancy','Freedom to make life choices','Generosity','Perceptions of corruption']
df_target_final = ['Ladder score']
df_all_final = df_var_final+df_target_final
final_df=df2021[df_all_final].copy()

In [None]:
# Build the 2021 test data
var_array = final_df[df_var_final].to_numpy()
target_array = final_df[df_target_final].to_numpy()

X_test2021 = var_array
y_test2021 = target_array

In [None]:
# Predict the 2021 ladder scores with the fitted linear regression model
y_test2021_pred = lm.predict(X_test2021)
target_array_pred = lm.predict(var_array)

print('Real Data')
print(y_test2021[:10])
print('\n Predicted Data')
print(y_test2021_pred[:10])
print('\n Diff')
print(y_test2021[:10]-y_test2021_pred[:10])

# Evaluate the regression model on the 2021 data - RMSE
rmse_test2021 = np.sqrt(mean_squared_error(y_true=y_test2021,y_pred=y_test2021_pred))

print('RMSE Test Data: {}'.format(rmse_test2021))
print('R^2 score:',lm.score(X_test2021, y_test2021)) 

In [None]:
# Check linearity: predicted values against real values for 2021
plt.scatter(y_test2021,y_test2021_pred)
plt.xlabel('Real data')
plt.ylabel('Predicted data')
plt.title('Multiple Linear Regression')
plt.show()