- [Load and Check Data](#-1)
- [Data Preparation](#0)
- [Simple Linear Regressions](#1)
- [Multiple Linear Regression](#2)
- [Ridge Regression](#4)
- [Lasso Regression](#5)
- [ElasticNet Regression](#6)

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")  # the "seaborn-whitegrid" matplotlib style name is not available in recent matplotlib releases
import warnings            
warnings.filterwarnings("ignore") 

# Load and Check Data <a id="-1"></a>

In [None]:
y_2018 = pd.read_csv("/kaggle/input/world-happiness/2018.csv");
y_2019 = pd.read_csv("/kaggle/input/world-happiness/2019.csv");

data = pd.concat([y_2018,y_2019],sort=False,ignore_index=True)  # unique row labels 0..n-1
data

# Variable Description

1. Overall rank: Ranking of countries by happiness level
2. Country or region: Country or region names
3. Score: Happiness scores
4. GDP per capita: Value representing the country's income and expense levels
5. Social support
6. Healthy life expectancy
7. Freedom to make life choices
8. Generosity
9. Perceptions of corruption

In [None]:
data.describe().T

In [None]:
data.info()

Let's change the column names for convenience.

In [None]:
data.rename(columns={
    "Overall rank": "rank",
    "Country or region": "country",
    "Score": "score",
    "GDP per capita": "gdp",
    "Social support": "social",
    "Healthy life expectancy": "healthy",
    "Freedom to make life choices": "freedom",
    "Generosity": "generosity",
    "Perceptions of corruption": "corruption"
},inplace=True)
del data["rank"]

# Missing Value 

In [None]:
data.columns[data.isnull().any()]

There are empty elements in only one column. Let's look at how many.

In [None]:
data.isnull().sum()

In [None]:
data[data["corruption"].isnull()]

In [None]:
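# Mean corruption of the rows with score above 6.774; selecting the column first avoids averaging the text column
avg_data_corruption = data.loc[data["score"] > 6.774, "corruption"].mean()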
data.loc[data["corruption"].isnull(),["corruption"]] = avg_data_corruption
data[data["corruption"].isnull()]

# Data Preparation <a id="0"></a>
## Inconsistent Observation
* Preprocessing is often said to make up 95% of a machine learning project, with model selection accounting for the remaining 5%. A model can only learn correctly from data that has been prepared correctly, so we apply certain preprocessing methods before training. One of them is outlier analysis. An outlier is a data point that differs substantially from the rest of the observations in a data set; in other words, it is an observation that lies far from the general trend.

![](https://miro.medium.com/max/854/1*RW-vfIbKZh-UGsLfTAWpyw.png)

Outliers behave differently from the rest of the data and can increase the error by pulling the model toward overfitting, so they must be detected and handled before training.
### 1. Using a Box Plot
Outliers can be spotted with many visualization techniques. One of them is the box plot: outliers are drawn as individual points, while the rest of the observations are summarized by the box and its whiskers.
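The points a box plot draws individually are, with seaborn's default settings, the ones that fall outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR. The sketch below applies that rule to a small made-up sample; the `iqr_fences` helper and the numbers are illustrative assumptions, not part of the happiness data.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy sample, not taken from the happiness data.
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])
lower, upper = iqr_fences(sample)

# Anything outside the fences would be drawn as a separate point.
outliers = sample[(sample < lower) | (sample > upper)]
print(lower, upper, outliers)
```

Only the value 50 falls outside the fences here, so it is the only point that would be flagged.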

In [None]:
df = data.copy()
df = df.select_dtypes(include=["float64","int64"])
df.head()

In [None]:
column_list = ["score","gdp","social","healthy","freedom","generosity","corruption"]
for col in column_list:
    sns.boxplot(x = df[col])
    plt.xlabel(col)
    plt.show()

The plots show outliers in the "social" and "corruption" columns. These may hurt the model during training.

In [None]:
# for corruption
df_table = df["corruption"]

Q1 = df_table.quantile(0.25)
Q3 = df_table.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR
print("lower bound is " + str(lower_bound))
print("upper bound is " + str(upper_bound))
print("Q1: ", Q1)
print("Q3: ", Q3)

In [None]:
outliers_vector = (df_table < (lower_bound)) | (df_table > (upper_bound))
outliers_vector

In [None]:
outliers_vector = df_table[outliers_vector]
outliers_vector.index.values

Deleting rows is not suitable for this data set, so we will replace the outliers with the column mean.

In [None]:
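df_table = data.copy()
# Recompute the outlier mask on the copy and assign through .loc;
# chained .iloc assignment with index labels may modify the wrong rows or a temporary copy.
outlier_mask = ((df_table["corruption"] < lower_bound) | (df_table["corruption"] > upper_bound)).to_numpy()
df_table.loc[outlier_mask, "corruption"] = df_table["corruption"].mean()
df_table.loc[outlier_mask, "corruption"]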

In [None]:
data = df_table
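When frames are concatenated without resetting the index, row labels repeat, and selecting rows by label can then touch more rows than intended. The sketch below shows one way to replace outliers by boolean mask so that only the flagged rows change. The two toy frames and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames standing in for two yearly files.
a = pd.DataFrame({"corruption": [0.10, 0.12, 0.90]})
b = pd.DataFrame({"corruption": [0.11, 0.13, 0.09]})
toy = pd.concat([a, b])  # index labels are 0, 1, 2, 0, 1, 2

q1, q3 = toy["corruption"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (toy["corruption"] < q1 - 1.5 * iqr) | (toy["corruption"] > q3 + 1.5 * iqr)

# Assign by boolean mask so that only the flagged row changes,
# even though the label 2 appears twice in the index.
toy.loc[mask.to_numpy(), "corruption"] = toy["corruption"].mean()
print(toy)
```

As in the notebook, the mean used for the replacement still includes the outlier itself; using the median or the mean of the non-outlier rows would be a reasonable alternative.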

# Simple Linear Regressions <a id="1"></a>
Simple linear regression is a statistical method that allows us to summarize and analyze the relationship between two continuous (quantitative) variables.
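For a single predictor, the least-squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the two means. Here is a minimal sketch on made-up data that lies exactly on a line, so the estimates can be checked by eye.

```python
import numpy as np

# Hypothetical toy data that lies exactly on y = 2 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# intercept = mean_y - slope * mean_x
intercept = y.mean() - slope * x.mean()
print(intercept, slope)
```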

## score - gdp
First, let's look at the relationship between gdp and score with a plot.
* independent variable : x
* dependent variable : y

In [None]:
sns.jointplot(x="gdp",y="score",data=df_table,kind="reg")
plt.show()

In [None]:
from sklearn.linear_model import LinearRegression

X = data[["gdp"]]
X.head()

In [None]:
y = data[["score"]]
y.head()

In [None]:
reg = LinearRegression()
model = reg.fit(X,y)
print("intercept: ", model.intercept_)
print("coef: ", model.coef_)
print("R^2: ", model.score(X,y))

Meaning of the R² score (`model.score`):
* The gdp variable used here explains about 63% of the variance in score.
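To make the definition concrete, R² can be computed by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations from the mean. The `r_squared` helper and the numbers below are illustrative assumptions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                  # predictions match exactly
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
print(perfect, baseline)
```

Perfect predictions give 1, and a model that always predicts the mean gives 0, so a value of about 0.63 sits between those two reference points.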

In [None]:
# prediction
plt.figure(figsize=(12,6))
g = sns.regplot(x=data["gdp"],y=data["score"],ci=None,scatter_kws = {'color':'r','s':9})
g.set_title("Model Equation")
g.set_ylabel("score")
g.set_xlabel("gdp")
plt.show()

The goal here is to answer questions such as: what is the predicted happiness score of a country whose gdp value is 1? In other words, we estimate an unseen value from the existing data set.

In [None]:
# model.intercept_ + model.coef_ * 1
model.predict([[1]])

In [None]:
gdp_list = [[0.25],[0.50],[0.75],[1.00],[1.25],[1.50]]
model.predict(gdp_list)
for value in gdp_list:
    print("The happiness value of the country with a gdp value of ",value,": ",model.predict([value]))

Let's wrap these steps in a function to make the job easier.

In [None]:
def linear_reg(col,text,prdctn):
    
    sns.jointplot(x=col,y="score",data=df_table,kind="reg")
    plt.show()
    
    X = data[[col]]
    y = data[["score"]]
    reg = LinearRegression()
    model = reg.fit(X,y)
    
    # prediction
    plt.figure(figsize=(12,6))
    g = sns.regplot(x=data[col],y=data["score"],ci=None,scatter_kws = {'color':'r','s':9})
    g.set_title("Model Equation")
    g.set_ylabel("score")
    g.set_xlabel(col)
    plt.show()
    
    print(text,": ", model.predict([[prdctn]]))

## score - social

In [None]:
linear_reg("social","The happiness value of the country whose social support value is 2",2)

In [None]:
column_list = ["score","gdp","social","healthy","freedom","generosity","corruption"]

## score - healthy

In [None]:
linear_reg("healthy","The happiness value of the country whose healthy life expectancy value is 1.20",1.20)

## score - freedom

In [None]:
linear_reg("freedom","The happiness value of the country whose freedom value is 0.89",0.89)

# Multiple Linear Regression <a id="2"></a>
The main purpose is to find the linear function that expresses the relationship between dependent and independent variables.

In [None]:
import statsmodels.api as sms

X = df.drop("score",axis=1)
y = df["score"]

# OLS(dependent,independent)
lm = sms.OLS(y,X)
model = lm.fit()
model.summary()

R-squared: the share of the variance in the dependent variable that is explained by the independent variables. <br>
F-statistic: tests the overall significance of the model. <br>
coef: the estimated coefficients. <br>
std err: the standard errors of the coefficients. <br><br>

We can read the coefficients as follows, in each case holding the other variables constant.
- When the gdp value increases by 1 unit, the score increases by 0.8114.
- When the social value increases by 1 unit, the score increases by 1.9740.
...
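The "increase by 1 unit" reading can be checked numerically. The sketch below fits a multiple regression by least squares on made-up, noise-free data and compares two predictions that differ by one unit in a single feature. Note that it adds a column of ones so the model has an intercept; statsmodels' `OLS` fits no intercept unless a constant column is supplied, for example with `add_constant`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical features and a noise-free target: y = 1 + 2*x1 - 0.5*x2
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

# Add a column of ones so that the model includes an intercept.
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Raise x1 by one unit while x2 stays fixed: the prediction changes by beta[1].
base = np.array([1.0, 0.3, 0.7])
bumped = base + np.array([0.0, 1.0, 0.0])
effect = bumped @ beta - base @ beta
print(beta, effect)
```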

In [None]:
# create model with scikit-learn

lm = LinearRegression()
model = lm.fit(X,y)
print("constant: ",model.intercept_)
print("coefficient: ",model.coef_)

In [None]:
# PREDICTION
# Score = intercept + 0.929921*gdp + 1.06504217*social + 0.94321492*healthy + 1.40426054*freedom + 0.52070628*generosity + 0.88114008*corruption

# One new observation, with values in the same column order as X
new_data = pd.DataFrame([[1, 2, 1.25, 1.75, 1.50, 0.75]], columns=X.columns)
new_data

In [None]:
model.predict(new_data)

In [None]:
# calculating the amount of error

from sklearn.metrics import mean_squared_error

MSE = mean_squared_error(y,model.predict(X))
RMSE = np.sqrt(MSE)

print("MSE: ", MSE)
print("RMSE: ", RMSE)

# Simple Linear & Multiple Linear Regression - Model Tuning <a id="3"></a>

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop("score",axis=1)
y = df["score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head()

In [None]:
X_test.head()

In [None]:
y_train.head()

In [None]:
y_test.head()

In [None]:
lm = LinearRegression()
model = lm.fit(X_train, y_train)  # refit on the training split only, so the test rows stay unseen
print("Training error",np.sqrt(mean_squared_error(y_train,model.predict(X_train))))
print("Test error",np.sqrt(mean_squared_error(y_test,model.predict(X_test))))

The result depends on how the data happens to be split: each random_state value produces a different train/test split and therefore a different error. To get an estimate that does not hinge on a single split, we use k-fold cross-validation, which averages the error over several splits.

In [None]:
from sklearn.model_selection import cross_val_score

cross_val_score(model, X_train, y_train, cv=10, scoring="neg_mean_squared_error")

In [None]:
cvs_avg_mse = np.mean(-cross_val_score(model, X_train, y_train, cv=20, scoring="neg_mean_squared_error"))
cvs_avg_rmse = np.sqrt(cvs_avg_mse)

print("Cross Val Score MSE = ",cvs_avg_mse)
print("Cross Val Score RMSE = ",cvs_avg_rmse)
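To show what the cross-validated score is averaging, here is a hand-rolled k-fold loop on made-up data. It is only a sketch: the `kfold_rmse` helper is hypothetical, it shuffles the rows before splitting (which `cross_val_score` with an integer `cv` does not do for regressors), and it fits the linear model with `np.linalg.lstsq` instead of scikit-learn.

```python
import numpy as np

def kfold_rmse(X, y, k=5, seed=0):
    """Average RMSE of an intercept + linear model over k held-out folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    mse_per_fold = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        # Fit on k-1 folds, evaluate on the fold that was left out.
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        mse_per_fold.append(np.mean((y[test_idx] - A_test @ beta) ** 2))
    # Average the fold MSEs first, then take the square root.
    return np.sqrt(np.mean(mse_per_fold))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
cv_rmse = kfold_rmse(X, y, k=5)
print(cv_rmse)
```

Every row is used for testing exactly once, which is why the averaged error is less sensitive to one lucky or unlucky split.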

# Ridge Regression <a id="4"></a>
The aim is to find the coefficients that minimize the sum of squared errors while applying a penalty to the size of those coefficients.
* It is resistant to overfitting.
* It is biased, but its variance is low.
* It is better than OLS when there are many parameters.
* It builds a model with all variables. It does not remove unrelated variables from the model; it shrinks their coefficients toward zero.

![](https://i.ibb.co/2SJtqyB/Ek-A-klama-2020-04-21-202339.jpg)

* The penalty parameter (lambda, called `alpha` in scikit-learn) that gives the smallest cross-validation error is selected.
* With this value selected, the model is fitted to the observations again.
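The shrinking behaviour described above can be seen from the closed-form ridge solution, (XᵀX + alpha·I)⁻¹Xᵀy. The sketch below assumes no intercept and uses made-up data; `ridge_coefs` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def ridge_coefs(X, y, alpha):
    """Closed-form ridge solution without intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

alphas = [0.0, 10.0, 1000.0]
norms = []
for alpha in alphas:
    coefs = ridge_coefs(X, y, alpha)
    norms.append(np.linalg.norm(coefs))
    print(alpha, coefs)
```

The coefficient vector gets shorter as alpha grows, but its entries stay non-zero, which is why Ridge keeps every variable in the model.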

## Ridge Regression - Model

In [None]:
# Required Libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn.linear_model import RidgeCV

In [None]:
X = df.drop("score",axis=1)
y = df["score"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ridge_model = Ridge(alpha=0.1).fit(X_train, y_train)
ridge_model

In [None]:
ridge_model.coef_

Next, we fit the model for a range of alpha values and examine how the coefficients change with each one.

In [None]:
ridge_model.intercept_

In [None]:
lambdas = 10**np.linspace(10,-2,100)*0.5 # Log-spaced grid of penalty values (deterministic, not random)
ridge_model =  Ridge()
coefs = []

for i in lambdas:
    ridge_model.set_params(alpha=i)
    ridge_model.fit(X_train,y_train)
    coefs.append(ridge_model.coef_)
    
ax = plt.gca()
ax.plot(lambdas, coefs)
ax.set_xscale("log")

The graph above shows how the coefficients of the variables in our data set change across the different alpha values. As can be seen, the coefficients approach zero as alpha increases.

## Ridge Regression - Prediction

In [None]:
ridge_model = Ridge().fit(X_train,y_train)

y_pred = ridge_model.predict(X_train)

print("predict: ", y_pred[0:10])
print("real: ", y_train[0:10].values)

In [None]:
RMSE = np.sqrt(mean_squared_error(y_train,y_pred)) # rmse = square root of the mean of error squares
print("train error: ", RMSE)

In [None]:
Verified_RMSE = np.sqrt(np.mean(-cross_val_score(ridge_model, X_train, y_train, cv=20, scoring="neg_mean_squared_error")))
print("Verified_RMSE: ", Verified_RMSE)

There are two values above. The first is the RMSE computed directly on the training data; the second is the cross-validated RMSE. The training error tends to be optimistic because the model is evaluated on the same rows it was fitted on, so the cross-validated value is the more reliable estimate of the error on new data.

In [None]:
# test error
y_pred = ridge_model.predict(X_test)
RMSE = np.sqrt(mean_squared_error(y_test,y_pred))
print("test error: ", RMSE)

## Ridge Model - Model Tuning

In [None]:
ridge_model = Ridge(10).fit(X_train,y_train)
y_pred = ridge_model.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))

In [None]:
ridge_model = Ridge(30).fit(X_train,y_train)
y_pred = ridge_model.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))

In [None]:
ridge_model = Ridge(90).fit(X_train,y_train)
y_pred = ridge_model.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))

We can find out which value will work better by trial and error. But with the method we will use below, we can find the most appropriate value more easily and quickly.

In [None]:
lambdas1 = 10**np.linspace(10,-2,100)
lambdas2 = np.random.randint(0,1000,100)

# The normalize argument was removed from RidgeCV in scikit-learn 1.2, so it is omitted here.
ridgeCV = RidgeCV(alphas = lambdas1,scoring = "neg_mean_squared_error", cv=10)
ridgeCV.fit(X_train,y_train)

We can read the selected value from the `alpha_` attribute.

In [None]:
ridgeCV.alpha_

In [None]:
# final model
ridge_tuned = Ridge(alpha = ridgeCV.alpha_).fit(X_train,y_train)
y_pred = ridge_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))

In [None]:
# for lambdas2
ridgeCV = RidgeCV(alphas = lambdas2,scoring = "neg_mean_squared_error", cv=10)  # normalize was removed in scikit-learn 1.2
ridgeCV.fit(X_train,y_train)
ridge_tuned = Ridge(alpha = ridgeCV.alpha_).fit(X_train,y_train)
y_pred = ridge_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))
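If the features should be put on a common scale before the penalty is applied (the role the `normalize` option played in older scikit-learn releases), one option is to chain a `StandardScaler` with `RidgeCV` in a pipeline. This is a sketch on made-up data, and standardizing is not numerically identical to the old `normalize` behaviour, so the selected alpha may differ.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data with features on very different scales.
X = rng.normal(size=(120, 4)) * np.array([1.0, 10.0, 100.0, 0.1])
y = X @ np.array([1.0, 0.2, -0.01, 5.0]) + rng.normal(scale=0.5, size=120)

alphas = 10 ** np.linspace(3, -3, 50)
pipe = make_pipeline(
    StandardScaler(),  # rescale each feature to zero mean and unit variance
    RidgeCV(alphas=alphas, scoring="neg_mean_squared_error", cv=10),
)
pipe.fit(X, y)

# make_pipeline names each step after its lowercased class name.
best_alpha = pipe.named_steps["ridgecv"].alpha_
print(best_alpha)
```

Because the scaler is part of the pipeline, calling `pipe.predict` on new rows applies the same scaling automatically.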

# Lasso Regression <a id="5"></a>
The aim is to find the coefficients that minimize the sum of error squares by applying a penalty to these coefficients.
* Lasso regression = L1
* Ridge regression = L2

* It was proposed to overcome a disadvantage of Ridge regression, which keeps every variable in the model whether it is relevant or not.
* Like Ridge, Lasso shrinks the coefficients toward zero.
* Unlike Ridge, when lambda is large enough the L1 penalty sets some coefficients exactly to zero. In this way Lasso also performs variable selection.
* Choosing lambda correctly is very important; cross-validation is used here too.
* Neither Ridge nor Lasso is superior to the other in every case.
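The difference between the two penalties is easiest to see in a simplified setting. Assuming an orthonormal design (XᵀX = I) and the objectives ½‖y − Xw‖² + λ‖w‖₁ for Lasso and ½‖y − Xw‖² + (λ/2)‖w‖² for Ridge, the Lasso solution soft-thresholds each least-squares coefficient, while Ridge rescales it. The coefficient values below are made up for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso solution for an orthonormal design: shrink by lam, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Ridge solution for an orthonormal design: proportional shrinkage."""
    return z / (1.0 + lam)

# Hypothetical least-squares coefficients before any penalty is applied.
ols_coefs = np.array([3.0, -0.4, 0.2, -1.5])
lam = 0.5
lasso_coefs = soft_threshold(ols_coefs, lam)
ridge_coefs = ridge_shrink(ols_coefs, lam)
print(lasso_coefs)
print(ridge_coefs)
```

The two coefficients whose absolute value is below λ become exactly zero under Lasso, while Ridge only makes them smaller.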

## Lasso Regression - Model 

In [None]:
# Required Libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge,Lasso
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import model_selection
from sklearn.linear_model import RidgeCV, LassoCV

In [None]:
X = df.drop("score",axis=1)
y = df["score"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

lasso_model = Lasso().fit(X_train,y_train)

In [None]:
print("intercept: ", lasso_model.intercept_)
print("coef: ", lasso_model.coef_)

In [None]:
# coefficients for different lambda values

alphas = np.sort(np.random.randint(1,10000,10))  # sorted for a readable plot; alpha=0 is avoided because Lasso does not handle it well
lasso = Lasso()
coefs = []

for a in alphas:
    lasso.set_params(alpha=a)
    lasso.fit(X_train,y_train)
    coefs.append(lasso.coef_)

In [None]:
ax = plt.gca()
ax.plot(alphas,coefs)
ax.set_xscale("log")

## Lasso Regression - Prediction

In [None]:
lasso_model

In [None]:
lasso_model.predict(X_train)[0:5]

In [None]:
lasso_model.predict(X_test)[0:5]

In [None]:
y_pred = lasso_model.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))

In [None]:
r2_score(y_test,y_pred)

## Lasso Regression - Model Tuning

In [None]:
lasso_cv_model = LassoCV(cv=10,max_iter=100000).fit(X_train,y_train)
lasso_cv_model

In [None]:
lasso_cv_model.alpha_

In [None]:
lasso_tuned = Lasso().set_params(alpha= lasso_cv_model.alpha_).fit(X_train,y_train)
y_pred = lasso_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))

# ElasticNet Regression <a id="6"></a>
* The aim is to find the coefficients that minimize the sum of error squares by applying a penalty.
* ElasticNet combines the L1 and L2 approaches.
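In scikit-learn's `ElasticNet`, the mix is controlled by `alpha` (overall strength) and `l1_ratio` (share of the L1 part): the penalty added to the squared-error term is alpha·l1_ratio·‖w‖₁ + 0.5·alpha·(1 − l1_ratio)·‖w‖². The `enet_penalty` helper below is a hypothetical function that only evaluates this penalty for a given coefficient vector.

```python
import numpy as np

def enet_penalty(coefs, alpha=1.0, l1_ratio=0.5):
    """Penalty term of the ElasticNet objective for a coefficient vector."""
    coefs = np.asarray(coefs, dtype=float)
    l1 = np.sum(np.abs(coefs))   # L1 norm
    l2 = np.sum(coefs ** 2)      # squared L2 norm
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2

w = [1.0, -2.0, 0.0]
only_l1 = enet_penalty(w, l1_ratio=1.0)   # Lasso-style penalty
only_l2 = enet_penalty(w, l1_ratio=0.0)   # Ridge-style penalty
mixed = enet_penalty(w, l1_ratio=0.5)     # equal mix of both
print(only_l1, only_l2, mixed)
```

Setting `l1_ratio` to 1 leaves only the L1 part and setting it to 0 leaves only the L2 part, so both earlier penalties are special cases.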

## ElasticNet Regression - Model & Prediction

In [None]:
# Required Libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge,Lasso,ElasticNet
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import model_selection
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

In [None]:
X = df.drop("score",axis=1)
y = df["score"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

enet_model = ElasticNet().fit(X_train,y_train)

In [None]:
enet_model.coef_

In [None]:
enet_model.intercept_

In [None]:
# prediction
enet_model.predict(X_train)[0:10]

In [None]:
enet_model.predict(X_test)[0:10]

In [None]:
y_pred = enet_model.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))