# Ridge Regression

Ridge regression finds the coefficients that minimize the residual sum of squares plus a penalty on the size of those coefficients.

<img src="https://datavedas.com/wp-content/uploads/2018/04/image001-1.png" />

In the formula, the first y is the observed value and the second y is the predicted value. Once the prediction is expanded in terms of the betas and the minimization is solved, what remains are the coefficient estimates.
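
For reference, the ridge objective is commonly written in the form below. This is the standard textbook notation and is assumed, not confirmed, to match the formula in the image; n is the number of observations, p the number of predictors, and λ ≥ 0 the penalty strength.

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
$$

The first sum is the residual sum of squares and the second is the penalty on the coefficients; the intercept β₀ is usually left unpenalized.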

- It is resistant to overfitting.
- It is biased, but its variance is low.
- It performs better than OLS when there are many parameters.
- It offers a solution to the problem of high dimensionality.
- It is effective when there is multicollinearity, that is, a high correlation between independent variables. In other words, one variable carries much of the same information as another.
- It builds a model with all variables. It does not remove irrelevant variables from the model; it shrinks their coefficients toward zero.
- λ is the critical parameter of the model. It controls the relative weight of the two terms in the formula.
- It is important to find a good value for λ. Cross-validation (CV) is used for this.
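
The shrinkage and multicollinearity points in the list can be illustrated with a minimal NumPy sketch. It uses the textbook closed-form ridge solution on synthetic data; `ridge_closed_form` is a hypothetical helper written for this illustration, the data is made up, and this is not necessarily how scikit-learn's `Ridge` solves the problem internally.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (Xc^T Xc + lam * I) beta = Xc^T yc on centered data.

    Centering keeps the intercept out of the penalty.
    """
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - X_mean @ beta
    return beta, intercept

# synthetic data: x2 is almost a copy of x1, so the two predictors are collinear
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

# lam = 0 is plain OLS; larger lam shrinks the coefficient vector
for lam in [0.0, 1.0, 100.0]:
    beta, b0 = ridge_closed_form(X, y, lam)
    print(f"lambda={lam:>6}: coefficients={np.round(beta, 3)}, "
          f"L2 norm={np.linalg.norm(beta):.3f}")
```

With λ = 0 the two collinear coefficients are poorly determined; as λ grows, the norm of the coefficient vector decreases, but both variables stay in the model.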

<img src="https://i.ibb.co/2qMjXG8/Untitled.png" />

- The left part of the formula is the classical regression term (the residual sum of squares).
- When λ is zero, the model reduces to OLS.
- A set of candidate values for λ is selected, and the cross-validation test error is calculated for each.
- The λ that gives the smallest cross-validation error is chosen as the tuning parameter.
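
The selection procedure in the last two points can be sketched by hand. This is a minimal illustration under stated assumptions: the data is synthetic, and `ridge_fit` and `cv_mse` are hypothetical helpers, not library functions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit on centered data; returns (coefficients, intercept)."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - X_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return beta, y_mean - X_mean @ beta

def cv_mse(X, y, lam, k=5, seed=0):
    """Mean squared error averaged over k held-out folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    fold_errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)              # everything not in the held-out fold
        beta, b0 = ridge_fit(X[train], y[train], lam)
        residuals = y[fold] - (X[fold] @ beta + b0)
        fold_errors.append(np.mean(residuals ** 2))
    return float(np.mean(fold_errors))

# synthetic data: 10 predictors, only the first 3 carry signal
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 10))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=60)

lambdas = 10 ** np.linspace(3, -3, 25)             # candidate grid, log-spaced
errors = [cv_mse(X, y, lam) for lam in lambdas]    # CV test error for each candidate
best_lambda = lambdas[int(np.argmin(errors))]      # smallest CV error wins
print("best lambda:", best_lambda)
```

The same folds are reused for every candidate (fixed `seed`), so the errors are comparable across the grid.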

In [None]:
# import the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from scipy.stats import boxcox
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import RidgeCV

In [None]:
# load data
data = "../input/insurance/insurance.csv"
df = pd.read_csv(data)

# show the first 6 rows
df.head(6)

## Ridge Regression Model

In [None]:
df_encode = pd.get_dummies(data = df, columns = ['sex','smoker','region'])
df_encode.head()

In [None]:
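# target transformation: Box-Cox estimates the power parameter lam
# (a value near 0 supports a log transform); the log is what is actually applied
y_bc, lam, ci = boxcox(df_encode['charges'], alpha=0.05)
df_encode['charges'] = np.log(df_encode['charges'])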

df_encode.head()
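
The cell above estimates a Box-Cox parameter and then applies a plain logarithm to the target. The two are related: the Box-Cox transform approaches log(y) as its parameter approaches zero, so an estimated parameter close to zero is one common justification for using the log. Whether that holds for this dataset depends on the estimated `lam`, which is not shown here. A minimal sketch of the relationship, using made-up positive values and a hypothetical `box_cox` helper (not the SciPy function):

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform for strictly positive y; lam = 0 is defined as log(y)."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

# made-up positive values, for illustration only
charges = np.array([1121.87, 4740.29, 9382.03, 16639.91, 63770.43])

# the gap to log(y) shrinks as lam approaches 0
for lam in [1.0, 0.5, 0.1, 0.001]:
    gap = np.max(np.abs(box_cox(charges, lam) - np.log(charges)))
    print(f"lambda={lam}: max |box_cox - log| = {gap:.4f}")
```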

In [None]:
X = df_encode.drop("charges",axis=1)
y = df_encode["charges"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ridge_model = Ridge(alpha=0.1).fit(X_train, y_train)
ridge_model

In [None]:
ridge_model.coef_

In [None]:
ridge_model.intercept_

In [None]:
lambdas = 10**np.linspace(10,-2,100)*0.5 # log-spaced grid of candidate lambda values (not random)
ridge_model =  Ridge()
coefs = []

for i in lambdas:
    ridge_model.set_params(alpha=i)
    ridge_model.fit(X_train,y_train)
    coefs.append(ridge_model.coef_)
    
ax = plt.gca()
ax.plot(lambdas, coefs)
ax.set_xscale("log")
ax.set_xlabel("lambda (alpha)")
ax.set_ylabel("coefficient")

The graph above shows how the coefficients of the variables in our data set change across the different λ values. As can be seen, as λ increases, the coefficients approach zero.

## Ridge Regression - Prediction


In [None]:
ridge_model = Ridge().fit(X_train,y_train)

y_pred = ridge_model.predict(X_train)

print("Predict: ", y_pred[0:10])
print("Real: ", y_train[0:10].values)

In [None]:
RMSE = np.sqrt(mean_squared_error(y_train,y_pred)) # rmse = square root of the mean of the squared errors
print("train error: ", RMSE)

In [None]:
Verified_RMSE = np.sqrt(np.mean(-cross_val_score(ridge_model, X_train, y_train, cv=20, scoring="neg_mean_squared_error")))
print("Verified_RMSE: ", Verified_RMSE)

Two values are printed above. The first is the training RMSE, computed on the same data the model was fit to. The second is the cross-validated RMSE, computed on held-out folds. Because the training error is measured on data the model has already seen, it tends to be optimistic; the cross-validated value is the more reliable estimate of the error on unseen data.
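
A comparison like this is only meaningful when both numbers are on the same scale. The mean squared error (MSE) and its square root (RMSE) are easy to mix up, and for errors smaller than 1 the MSE is the smaller of the two, which can make one figure look better than it is. A minimal sketch with made-up values; `mse` and `rmse` are hypothetical helpers written for this illustration:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def rmse(y_true, y_pred):
    """Square root of the mean of the squared errors."""
    return float(np.sqrt(mse(y_true, y_pred)))

# made-up log-scale targets and predictions, for illustration only
y_true = np.array([9.0, 9.5, 8.7, 10.2])
y_pred = np.array([9.4, 9.1, 9.2, 9.9])

print("MSE :", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
```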

## Model Tuning

In [None]:
ridge_model = Ridge(alpha=10).fit(X_train,y_train)
y_pred = ridge_model.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))

In [None]:
ridge_model = Ridge(alpha=30).fit(X_train,y_train)
y_pred = ridge_model.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))

In [None]:
ridge_model = Ridge(alpha=90).fit(X_train,y_train)
y_pred = ridge_model.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))

We can find out which value works better by trial and error, but the method used below finds a suitable value more easily and quickly.

In [None]:
lambdas1 = 10**np.linspace(10,-2,100)
lambdas2 = np.random.randint(1,10000,100) # start at 1: recent scikit-learn versions require strictly positive alphas in RidgeCV

# the normalize argument was removed in scikit-learn 1.2, so it is not passed here
ridgeCV = RidgeCV(alphas = lambdas1,scoring = "neg_mean_squared_error", cv=10)
ridgeCV.fit(X_train,y_train)

The selected value is available through the alpha_ attribute.

In [None]:
ridgeCV.alpha_

In [None]:
# final model
ridge_tuned = Ridge(alpha = ridgeCV.alpha_).fit(X_train,y_train)
y_pred = ridge_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))*100

In [None]:
# for lambdas2
# the normalize argument was removed in scikit-learn 1.2, so it is not passed here
ridgeCV = RidgeCV(alphas = lambdas2,scoring = "neg_mean_squared_error", cv=10)
ridgeCV.fit(X_train,y_train)
ridge_tuned = Ridge(alpha = ridgeCV.alpha_).fit(X_train,y_train)
y_pred = ridge_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))*100
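
Older scikit-learn versions offered a `normalize` option on `RidgeCV` for rescaling features; it was removed in version 1.2. Where feature scaling is wanted, one option is to put a scaler and `RidgeCV` in a pipeline. The sketch below uses `StandardScaler`, which is a swapped-in technique: it standardizes to zero mean and unit variance, so it is not numerically identical to the old option. The data is synthetic and the names `X_demo`, `y_demo` and `model` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data with features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = X_demo @ np.array([0.5, 0.05, 0.005, 0.0005]) + rng.normal(scale=0.1, size=200)

alphas = 10 ** np.linspace(3, -3, 50)   # log-spaced candidate grid

# the scaler lives inside the pipeline, so the same scaling is applied
# at fit time and at predict time
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=alphas, scoring="neg_mean_squared_error", cv=10),
)
model.fit(X_demo, y_demo)

best_alpha = model.named_steps["ridgecv"].alpha_   # make_pipeline names steps by lowercased class name
print("selected alpha:", best_alpha)
print("R^2 on the demo data:", model.score(X_demo, y_demo))
```

Because the penalty acts on coefficient size, features on larger scales are penalized less unless they are rescaled first, which is the usual reason for scaling before ridge regression.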