![](https://agora.md/cdn/p/news/big/republica-moldova--pe-locul-52-in-clasamentul-celor-mai-fericite-tari-8066.jpg)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings 
warnings.filterwarnings('ignore')

In [None]:
df=pd.read_csv('../input/world-happiness/2015.csv')

# EDA 

In [None]:
df.head()

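The Country and Happiness Rank columns do not seem useful as predictors: Country is just an identifier, and Happiness Rank is simply the ordering of the target, Happiness Score.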

In [None]:
df.shape

Dataframe has 158 rows and 12 columns

In [None]:
df.dtypes

All the columns hold numerical data except for the Country and Region columns.

In [None]:
df.isnull().sum()

Our dataset has no missing values

In [None]:
df.nunique()

Among the features, only Region appears to be a discrete (categorical) variable; all the others are continuous.

In [None]:
df.Region.unique()

These are the categories present in the Region column.

In [None]:
print('Percentage of countries in each Region')
# normalize=True returns each category's share of the rows directly
print(df['Region'].value_counts(normalize=True)*100)

'Sub-Saharan Africa' and 'Central and Eastern Europe' occur most frequently, while 'Australia and New Zealand' and 'North America' occur rarely.

In [None]:
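# numeric_only=True skips the text columns (Country, Region), which recent pandas versions refuse to aggregate
df.skew(numeric_only=True)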

Standard Error, Family, Health (Life Expectancy), Trust (Government Corruption) and Generosity show skewness that needs to be dealt with.

In [None]:
df.describe()

The count of every column is 158, so there are no missing values.
The mean and median are close to each other in most columns, which suggests roughly symmetric distributions. This alone does not establish normality, and the skewness values above show that some columns are skewed.
Every column except Happiness Rank and Happiness Score has a small variance.
There seem to be few or no outliers, as the quartiles are fairly evenly spaced; the boxplots below check this more directly.
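
To make the mean-versus-median reasoning concrete, here is a minimal sketch on synthetic data (not the happiness dataset). The column names and the random samples are assumptions made purely for illustration. It expresses the gap between mean and median in units of the standard deviation and prints it next to the skewness.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: one symmetric column and one right-skewed column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=5.0, scale=1.0, size=1000),
    "right_skewed": rng.exponential(scale=1.0, size=1000),
})

# Gap between mean and median, in units of each column's standard deviation
gap = (demo.mean() - demo.median()) / demo.std()

# Put the gap next to the skewness so the two can be compared column by column
summary = pd.DataFrame({"mean_median_gap": gap, "skew": demo.skew()})
print(summary)
```

A gap near zero is consistent with a symmetric column, while a clearly positive gap goes together with right skew. The same two lines could be applied to the numerical columns of `df`.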

## Univariate Analysis

In [None]:
plt.figure(figsize=(10,8))
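# pass the column by name; recent seaborn versions no longer accept it as a positional x argument
sns.countplot(x='Region',data=df)
plt.xticks(rotation = 90)
plt.title('Number of countries per Region')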

Here we can see that the categories of the Region column are imbalanced.

In [None]:
numeric_features=['Happiness Score','Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual']
print(len(numeric_features))

Here we segregate columns having numerical values to plot various graphs

In [None]:
fig,axes = plt.subplots(3,3,figsize=(12,12))
# axes.flat walks the 3x3 grid row by row, so no manual row/col bookkeeping is needed
for feature,axis in zip(numeric_features,axes.flat):
    df[feature].plot(kind="box",ax=axis)

From the boxplots above we can conclude that Standard Error and Trust (Government Corruption) have several outliers, whereas Generosity and Dystopia Residual have very few.

In [None]:
fig,axes = plt.subplots(3,3,figsize=(12,12))
# axes.flat walks the 3x3 grid row by row, so no manual row/col bookkeeping is needed
for feature,axis in zip(numeric_features,axes.flat):
    sns.histplot(df[feature],kde=True,ax=axis)

From the histograms above we can conclude that Dystopia Residual is approximately normally distributed, while all the other features are slightly right or left skewed.

## Bivariate Analysis

In [None]:
sns.scatterplot(x='Happiness Score',y='Generosity',data=df)

In the scatterplot above we find no clear correlation between Generosity and Happiness Score.

In [None]:
plt.figure(figsize=(10,10))
sns.scatterplot(x='Happiness Score',y='Economy (GDP per Capita)',data=df,hue='Region')

The scatter plot above shows a positive correlation between Happiness Score and Economy (GDP per Capita), i.e. countries with a stronger economy tend to have a higher Happiness Score.

In [None]:
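# lmplot is a figure-level function that creates its own figure, so the size is set with height rather than plt.figure
sns.lmplot(x='Happiness Score',y='Health (Life Expectancy)',data=df,height=10)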

There is a positive correlation between Health (Life Expectancy) and Happiness Score, although some outliers are present.

## Multivariate Analysis

In [None]:
sns.pairplot(df)

The pairplot above shows that our target feature (Happiness Score) has a positive linear correlation with almost all features, except for a few that show no correlation, and a negative correlation with Happiness Rank.

In [None]:
plt.figure(figsize=(15,10))
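# correlations are only defined for numeric columns, so Country and Region are left out explicitly
sns.heatmap(df.select_dtypes('number').corr(),annot=True)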

Happiness Score shows a strong positive correlation with Economy, Family, Health and Dystopia Residual, and a negative correlation with Happiness Rank.
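
Reading the relevant row off a heatmap can be tedious, so the sketch below ranks features by the absolute value of their correlation with a target column. The data and the column names (`target`, `strong_feature`, `unrelated_feature`) are synthetic assumptions. On the real data the same idea would use the `Happiness Score` column of the correlation matrix.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature drives the target, the other is pure noise
rng = np.random.default_rng(1)
n = 200
strong = rng.normal(size=n)
demo = pd.DataFrame({
    "target": 2.0 * strong + rng.normal(scale=0.5, size=n),
    "strong_feature": strong,
    "unrelated_feature": rng.normal(size=n),
})

# Correlation of every feature with the target (the target itself is dropped)
corr_with_target = demo.corr()["target"].drop("target")

# Order the features by absolute correlation, strongest first
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```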

# Feature Engineering 

In [None]:
df.head()

In [None]:
print(df.shape)
df.drop(['Country','Happiness Rank'],axis=1,inplace=True)
df.shape

In [None]:
from sklearn.preprocessing import LabelEncoder
e=LabelEncoder()

Importing LabelEncoder and creating an instance of it.

In [None]:
df['Region']=e.fit_transform(df['Region'])
df.head()

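We converted the Region column from object to int type. Note that LabelEncoder assigns the integer codes in sorted (alphabetical) order of the category names, so the numeric order carries no real meaning.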
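
Because the integer codes have no natural order, one alternative worth trying is one-hot encoding. This is not what this notebook does; the sketch below only illustrates the idea with `pd.get_dummies` on a tiny made-up frame (the numbers are invented for the example).

```python
import pandas as pd

# Tiny made-up frame, only to show the shape of the result
demo = pd.DataFrame({
    "Region": ["Western Europe", "North America", "Western Europe", "Sub-Saharan Africa"],
    "Freedom": [0.66, 0.63, 0.64, 0.36],
})

# Each category becomes its own indicator column; the original Region column is removed
encoded = pd.get_dummies(demo, columns=["Region"])
print(encoded.columns.tolist())
```

The trade-off is more columns: one per region instead of a single integer column.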

In [None]:
from scipy.stats import zscore
z=np.abs(zscore(df))
df=df[(z<3).all(axis=1)]

We remove outliers using the z-score: a row is kept only if every column has an absolute z-score below 3.
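
To see what the filter does row by row, here is a small sketch on a made-up frame with one obvious outlier. The column names `a` and `b` are placeholders. It uses the same `zscore` call and the same mask as the cell above.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Made-up data: column "a" ends with an extreme value, column "b" is well behaved
demo = pd.DataFrame({
    "a": list(range(20)) + [1000],
    "b": list(range(21)),
})

z = np.abs(zscore(demo))          # absolute z-score of every cell, column by column
mask = (z < 3).all(axis=1)        # keep a row only if all of its columns are below 3
filtered = demo[mask]
print(len(demo), "->", len(filtered))
```

Note that a single extreme value in any column is enough to drop the whole row.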

In [None]:
fig,axes = plt.subplots(3,3,figsize=(12,12))
# axes.flat walks the 3x3 grid row by row, so no manual row/col bookkeeping is needed
for feature,axis in zip(numeric_features,axes.flat):
    df[feature].plot(kind="box",ax=axis)

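Even after removing outliers, the boxplots still show some. This is expected: a boxplot flags points lying more than 1.5 times the interquartile range beyond the quartiles, which is usually a stricter rule than |z| < 3.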
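
The two outlier rules can be compared directly. The sketch below counts, on a synthetic and roughly normal sample (an assumption, not the happiness data), how many points fall outside the boxplot whiskers and how many have an absolute z-score of 3 or more.

```python
import numpy as np

# Synthetic, roughly normal sample
rng = np.random.default_rng(2)
values = rng.normal(size=10_000)

# Rule 1: absolute z-score of 3 or more
z = np.abs((values - values.mean()) / values.std())
n_z = int((z >= 3).sum())

# Rule 2: the boxplot rule, more than 1.5 * IQR beyond the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
beyond_whiskers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
n_box = int(beyond_whiskers.sum())

print("z-score rule:", n_z, "boxplot rule:", n_box)
```

For normal-looking data the whiskers sit at about 2.7 standard deviations from the mean, so the boxplot rule flags more points than the z-score filter removes.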

In [None]:
df.skew()

After removing outliers the skewness is reduced, but not eliminated.

In [None]:
df.head()

In [None]:
from sklearn.preprocessing import PowerTransformer
pt=PowerTransformer()
dfpt=pt.fit_transform(df)
df=pd.DataFrame(dfpt,columns=df.columns)

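Since some features have negative skewness (i.e. they are left skewed), a log transformation would not help, because it only reduces right skew. Hence we use PowerTransformer, whose default Yeo-Johnson method can correct skew in either direction.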
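
As a quick check of that reasoning, the following sketch builds a synthetic left-skewed sample (an assumption, unrelated to the dataset) and compares its skewness before and after `PowerTransformer` with default settings.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic left-skewed sample: a constant minus an exponential variable
rng = np.random.default_rng(3)
left_skewed = 10.0 - rng.exponential(scale=1.0, size=500)

skew_before = pd.Series(left_skewed).skew()

# PowerTransformer expects a 2D array, hence the reshape to a single column
pt_demo = PowerTransformer()   # default method is 'yeo-johnson'
transformed = pt_demo.fit_transform(left_skewed.reshape(-1, 1)).ravel()

skew_after = pd.Series(transformed).skew()
print("before:", skew_before, "after:", skew_after)
```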

In [None]:
df.skew()

Now all the skew values are between -0.5 and +0.5.

In [None]:
y=df['Happiness Score']
x=df.copy()
x.drop('Happiness Score',axis=1,inplace=True)

We separate our dependent and independent features.

In [None]:
from sklearn.preprocessing import MinMaxScaler
s=MinMaxScaler()

Even though there is not much need for scaling, there is still a big difference in scale between the Standard Error and Dystopia Residual features, hence we scale the values.
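
For reference, MinMaxScaler maps each column to the range 0 to 1 with (x - min) / (max - min). The sketch below reproduces that by hand on synthetic data. It also shows a common variant in which the scaler is fitted on the training rows only and then reused for the test rows. That variant is an assumption on our part and not what the cells in this notebook do, since they scale all rows before splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: two features on very different scales
rng = np.random.default_rng(4)
X = rng.normal(loc=[0.05, 2.0], scale=[0.01, 0.5], size=(100, 2))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=7)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn min and max from the training rows only
X_test_scaled = scaler.transform(X_test)         # reuse those values for the test rows

# The same result by hand, column by column
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_train - col_min) / (col_max - col_min)
print(np.allclose(manual, X_train_scaled))
```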

In [None]:
xs=s.fit_transform(x)
x=pd.DataFrame(xs,columns=x.columns)
x.head()

Since there are only 9 columns, we do not perform PCA.

NOTE - For now I am not performing feature selection, as the number of features is already very low. I will perform this step if our model accuracy is low.

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

In [None]:
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.25,random_state=7)

In [None]:
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score

In [None]:
models=[KNeighborsRegressor(),SVR(),DecisionTreeRegressor(),LinearRegression(),Ridge(),Lasso(),
        RandomForestRegressor(),AdaBoostRegressor(),GradientBoostingRegressor(),XGBRegressor()]

In [None]:
maelist=[]
mselist=[]
rmselist=[]
r2list=[]
def create_model(model):
    m=model
    m.fit(xtrain,ytrain)
    p=m.predict(xtest)
    
    # sklearn metrics take (y_true, y_pred); these three are symmetric, but we keep the conventional order
    mae=mean_absolute_error(ytest,p)
    mse=mean_squared_error(ytest,p)
    rmse=np.sqrt(mse)
    r2=r2_score(ytest,p)
    
    maelist.append(mae)
    mselist.append(mse)
    rmselist.append(rmse)
    r2list.append(r2)
    
    print(m)
    print('Mean absolute error',mae)
    print('Mean squared error',mse)
    print('Root Mean squared error',rmse)
    print('R2 Score',r2)
    print('---------------------------------------------------------------------------------------------------------')

In [None]:
for i in models:
    create_model(i)

In [None]:
print('Minimum Mean Absolute error is shown by ',models[maelist.index(min(maelist))],min(maelist))
print('Minimum Mean squared error is shown by ',models[mselist.index(min(mselist))],min(mselist))
print('Minimum Root Mean squared error is shown by ',models[rmselist.index(min(rmselist))],min(rmselist))
print('Maximum R2 Score is shown by ',models[r2list.index(max(r2list))],max(r2list))
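
For readers who want to see what the four metrics measure, here is a minimal sketch that computes them by hand with NumPy on a few made-up values and compares them with the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))    # average absolute error
mse = np.mean(errors ** 2)       # average squared error
rmse = np.sqrt(mse)              # back in the units of the target
# R2 compares the model's squared error with that of always predicting the mean
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)),
      np.isclose(mse, mean_squared_error(y_true, y_pred)),
      np.isclose(r2, r2_score(y_true, y_pred)))
```

Lower is better for the three error metrics, while a higher R2 is better, which is why the cell above takes the minimum of the first three lists and the maximum of the last.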

In [None]:
from sklearn.model_selection import GridSearchCV

### We tune the hyperparameters of Ridge, since it is also performing well

In [None]:
ridge=Ridge()
param_grid={'alpha':[1e-15,1e-10,1e-8,1e-5,1e-3,0.1,1,5,10,15,20,30,35,45,50,55,65,100,110,150,1000]}
m=GridSearchCV(ridge,param_grid,cv=10)
m.fit(xtrain,ytrain)
p=m.predict(xtest)
mae=mean_absolute_error(p,ytest)
mse=mean_squared_error(p,ytest)
rmse=np.sqrt(mean_squared_error(p,ytest))
r2=r2_score(ytest,p)
print('Mean absolute error',mae)
print('Mean squared error',mse)
print('Root Mean squared error',rmse)
print('R2 Score',r2)


Even after hyperparameter tuning, Ridge regression gives the same result as the Linear Regression model.
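
One plausible explanation, which we have not verified on this dataset, is that the grid search settles on a very small `alpha`, and Ridge with a near-zero penalty behaves like ordinary least squares. The sketch below illustrates that effect on synthetic data, where the features and coefficients are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem with known coefficients
rng = np.random.default_rng(5)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

ols = LinearRegression().fit(X, y)
ridge_tiny = Ridge(alpha=1e-8).fit(X, y)    # almost no penalty
ridge_big = Ridge(alpha=100.0).fit(X, y)    # strong penalty

# With a tiny alpha the coefficients match ordinary least squares
tiny_matches_ols = np.allclose(ols.coef_, ridge_tiny.coef_, atol=1e-6)
# With a large alpha the coefficients are shrunk towards zero
big_is_shrunk = np.linalg.norm(ridge_big.coef_) < np.linalg.norm(ols.coef_)
print(tiny_matches_ols, big_is_shrunk)
```

Printing `m.best_params_` after the grid search would show which `alpha` was actually selected.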

### Tuning the AdaBoost regressor with Linear Regression as the base learner

In [None]:
param={'n_estimators':[50, 100, 150, 200, 250, 300],'learning_rate':[0.0001,0.001,0.01,.1,1,2,5,10]}
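# the base learner is passed as `estimator` in scikit-learn >= 1.2 (older versions call it `base_estimator`)
m=GridSearchCV(AdaBoostRegressor(estimator=LinearRegression()),param,cv=10,n_jobs=-2)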
m.fit(xtrain,ytrain)
p=m.predict(xtest)
mae=mean_absolute_error(p,ytest)
mse=mean_squared_error(p,ytest)
rmse=np.sqrt(mean_squared_error(p,ytest))
r2=r2_score(ytest,p)
print('Mean absolute error',mae)
print('Mean squared error',mse)
print('Root Mean squared error',rmse)
print('R2 Score',r2)

It is also giving the same result as a single Linear Regression model

### Tuning the SVR model: its default kernel is rbf, so switching to a linear kernel may give good results, since the Linear Regression model is performing well

In [None]:
param_grid={'C':[1,20,40,60,80,100,200,300,500,1000],'kernel':['linear', 'poly', 'rbf', 'sigmoid'],'degree':[1,2,3,4,5,6]}
grid=GridSearchCV(SVR(),param_grid)
grid.fit(xtrain,ytrain)
p=grid.predict(xtest)

In [None]:
mae=mean_absolute_error(p,ytest)
mse=mean_squared_error(p,ytest)
rmse=np.sqrt(mean_squared_error(p,ytest))
r2=r2_score(ytest,p)
print('Mean absolute error',mae)
print('Mean squared error',mse)
print('Root Mean squared error',rmse)
print('R2 Score',r2)

Still not better than Linear Regression model

#### Now we check whether our model performs better if we drop the Region feature from our dataset

In [None]:
x_new=x.drop('Region',axis=1)
x_new.head()

In [None]:
xtrain,xtest,ytrain,ytest=train_test_split(x_new,y,test_size=0.25,random_state=7)

In [None]:
m=LinearRegression()
m.fit(xtrain,ytrain)
p=m.predict(xtest)
mae=mean_absolute_error(p,ytest)
mse=mean_squared_error(p,ytest)
rmse=np.sqrt(mean_squared_error(p,ytest))
r2=r2_score(ytest,p)
print('Mean absolute error',mae)
print('Mean squared error',mse)
print('Root Mean squared error',rmse)
print('R2 Score',r2)

The score goes down with this change as well. Therefore we conclude that the simple linear model with the Region column included is the best model.
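
A single train/test split on a small dataset can be noisy, so a comparison like this one could also be made with cross-validation. The sketch below is a hedged illustration on synthetic data with hypothetical feature names (`f1`, `f2`, `extra`). It scores the same model with and without one column using `cross_val_score`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the target depends on f1 and f2, while "extra" is noise
rng = np.random.default_rng(6)
X = pd.DataFrame(rng.normal(size=(150, 3)), columns=["f1", "f2", "extra"])
y = 1.0 * X["f1"] - 0.5 * X["f2"] + rng.normal(scale=0.3, size=150)

# Five-fold cross-validated R2 with all columns, then with one column dropped
scores_all = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
scores_dropped = cross_val_score(LinearRegression(), X.drop("extra", axis=1), y, cv=5, scoring="r2")

print("all columns:", scores_all.mean(), "without extra:", scores_dropped.mean())
```

Comparing the mean and the spread of the fold scores gives a steadier picture than one split.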

## Finally creating the best model

In [None]:
m=LinearRegression()
m.fit(x,y)

In [None]:
import joblib
joblib.dump(m,'Happinessmodel.obj')
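
A saved model can be read back with `joblib.load`. The sketch below shows the round trip on a small synthetic model written to a temporary directory; the file name `demo_model.obj` and the data are placeholders, not this notebook's model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic model standing in for the real one
rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, 3.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.obj")   # placeholder file name
    joblib.dump(model, path)                     # save to disk
    restored = joblib.load(path)                 # load it back

# The restored model should give the same predictions as the original
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(same_predictions)
```

When the real model is loaded later, new rows would need the same preprocessing as the training data (label encoding, power transform and scaling) before calling `predict`.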