# Ridge Regression with Python

## Intro

Ridge regression is one of several regularized linear models. Regularization penalizes the coefficients of a model, either by removing variables altogether or by reducing their impact. Ridge regression shrinks the coefficients of problematic variables toward zero but never fully removes them.
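
To make the shrinkage concrete, here is a minimal sketch using NumPy and synthetic data (not the VietNamI dataset). It solves the ridge problem in closed form for a few penalty strengths. The helper name `ridge_closed_form`, the data, and the alpha values are all assumptions chosen for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min ||y - Xb||^2 + alpha * ||b||^2 (no intercept, for simplicity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data: 100 rows, 3 features, known coefficients plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

alphas = [0.0, 10.0, 1000.0]
coefs = [ridge_closed_form(X, y, a) for a in alphas]
norms = [float(np.linalg.norm(b)) for b in coefs]

# Larger alpha pulls every coefficient closer to zero without zeroing it out.
for a, b in zip(alphas, coefs):
    print(f"alpha={a:>7}: {np.round(b, 4)}")
```

With `alpha=0.0` this reduces to ordinary least squares; as alpha grows the coefficient vector gets shorter, which is the behavior described above.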

We will work through an example of ridge regression using the VietNamI dataset, available in the pydataset library. Our goal is to predict household expenses (`lnhhexp`) from the other variables. We will complete this task in the following steps:

1. Data preparation
2. Baseline model development
3. Ridge regression model

Below is the initial code.

In [1]:
from pydataset import data
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Data Preparation

The data preparation is simple. All we have to do is load the data and convert the sex variable to a dummy variable. We also need to set up our X and y datasets. Below is the code.

In [2]:
df=pd.DataFrame(data('VietNamI'))
df

Unnamed: 0,pharvis,lnhhexp,age,sex,married,educ,illness,injury,illdays,actdays,insurance,commune
1,0,2.730363,3.761200,male,1,2,1,0,7,0,0,192
2,0,2.737248,2.944439,female,0,0,1,0,4,0,0,167
3,0,2.266935,2.564950,male,0,4,0,0,0,0,1,76
4,1,2.392753,3.637586,female,1,3,1,0,3,0,1,123
5,1,3.105335,3.295837,male,1,3,1,0,10,0,0,148
...,...,...,...,...,...,...,...,...,...,...,...,...
27762,0,1.847290,1.609438,female,0,5,2,0,3,0,0,115
27763,0,2.461460,2.833213,female,0,6,0,0,0,0,0,115
27764,0,2.460262,2.564950,female,0,5,0,0,0,0,0,116
27765,0,1.920169,4.007333,female,1,4,2,0,20,0,1,116


In [3]:
df.loc[df.sex== 'male', 'sex'] = 0
df.loc[df.sex== 'female','sex'] = 1
df['sex'] = df['sex'].astype(int)
X=df[['pharvis','age','sex','married','educ','illness','injury',
      'illdays','actdays','insurance']]
y=df['lnhhexp']

In [4]:
df

Unnamed: 0,pharvis,lnhhexp,age,sex,married,educ,illness,injury,illdays,actdays,insurance,commune
1,0,2.730363,3.761200,0,1,2,1,0,7,0,0,192
2,0,2.737248,2.944439,1,0,0,1,0,4,0,0,167
3,0,2.266935,2.564950,0,0,4,0,0,0,0,1,76
4,1,2.392753,3.637586,1,1,3,1,0,3,0,1,123
5,1,3.105335,3.295837,0,1,3,1,0,10,0,0,148
...,...,...,...,...,...,...,...,...,...,...,...,...
27762,0,1.847290,1.609438,1,0,5,2,0,3,0,0,115
27763,0,2.461460,2.833213,1,0,6,0,0,0,0,0,115
27764,0,2.460262,2.564950,1,0,5,0,0,0,0,0,116
27765,0,1.920169,4.007333,1,1,4,2,0,20,0,1,116
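
As an alternative to the two `.loc` assignments, the same recoding can be written in one step with `Series.map`. The sketch below uses a tiny made-up frame rather than the real data, so the name `toy` and its values are purely illustrative.

```python
import pandas as pd

# Hypothetical miniature of the dataset; only the columns needed here.
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "lnhhexp": [2.73, 2.74, 2.27, 2.39],
})

# Map each label to its dummy code (male -> 0, female -> 1, as in the text).
toy["sex"] = toy["sex"].map({"male": 0, "female": 1}).astype(int)

print(toy)
```

One property of this form: any label missing from the dictionary becomes `NaN`, so the `astype(int)` call raises an error instead of silently leaving a string in the column.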


## Baseline Model

The metric we are using is the mean squared error. Our baseline is an ordinary linear regression with no regularization. Note that the error here is computed on the same data the model was fit on. Below is the code and output.

In [5]:
regression=LinearRegression()
regression.fit(X,y)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)

0.35528915032173053
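
For reference, the mean squared error is simply the average of the squared residuals. The sketch below checks this by hand on a few made-up numbers; `y_true` and `y_pred` here are illustrative and are not values from the model above.

```python
import numpy as np

y_true = np.array([2.0, 3.0, 4.0, 5.0])
y_pred = np.array([2.5, 2.5, 4.0, 6.0])

residuals = y_true - y_pred            # [-0.5, 0.5, 0.0, -1.0]
mse = float(np.mean(residuals ** 2))   # (0.25 + 0.25 + 0.0 + 1.0) / 4

print(mse)  # 0.375
```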


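## Ridge Regression Model

To build the ridge model, we first set up the estimator and a grid search over the hyperparameter alpha.

In [7]: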
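# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older release to run as written.
ridge=Ridge(normalize=True)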
search=GridSearchCV(estimator=ridge,param_grid={'alpha':np.logspace(-5,2,8)},
                    scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)

The search object takes several arguments. Alpha is the hyperparameter we are trying to tune. `np.logspace(-5,2,8)` supplies the candidate values: eight numbers from 1e-5 to 1e2, evenly spaced on a log scale. Our metric is the mean squared error, which scikit-learn expresses as `neg_mean_squared_error` so that higher scores are better. Setting `refit=True` refits the best model on the full dataset once the search is done, and `cv` is the number of folds to use for cross-validation. We can now use the `.fit` method to run the search and then the `.best_params_` and `.best_score_` attributes to see the chosen alpha and the model's strength. Below is the code and output.
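
The grid passed to `param_grid` can be inspected directly. This short sketch only calls `np.logspace` with the same arguments as the cell above; the variable name `alphas` is an assumption for illustration.

```python
import numpy as np

# 8 points whose exponents are evenly spaced between -5 and 2.
alphas = np.logspace(-5, 2, 8)

# The exponents are -5, -4, ..., 2, so the grid runs 1e-05, 1e-04, ..., 1e+02.
print(np.log10(alphas))
print(alphas)
```

Each candidate is ten times the previous one, which is a common way to scan a penalty strength whose useful range spans several orders of magnitude.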

In [8]:
search.fit(X,y)
search.best_params_

{'alpha': 0.01}

Above is the most appropriate alpha, and below we see the mean squared error if we use this alpha.

In [9]:
abs(search.best_score_)

0.3801325693754134

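In this case, the cross-validated value of 0.38 is worse than the baseline model's 0.355. However, the baseline error was computed on the training data, so the two numbers are not directly comparable. To compare like with like, we can fit a ridge model with the selected alpha on the full data and compute its mean squared error the same way. This is done below.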

In [10]:
ridge=Ridge(normalize=True,alpha=0.01)
ridge.fit(X,y)
second_model=(mean_squared_error(y_true=y,y_pred=ridge.predict(X)))
print(second_model)

0.35529321992606566


This 0.355 is lower than the 0.38 because, like the baseline, it is computed on the training data rather than by cross-validation. More importantly, it is almost identical to the baseline's 0.355, which indicates that there is little difference between the ridge and baseline models. The coefficients of each model, shown below, confirm this.
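
The gap between a training error and a cross-validated error can be reproduced on synthetic data. The sketch below is a hand-rolled k-fold loop in NumPy with contiguous, unshuffled folds; the helpers `ols_fit`, `ols_predict`, and `mse`, and the data itself, are assumptions for illustration rather than what `GridSearchCV` runs internally.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def ols_predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Synthetic data with many features relative to rows, to make the gap visible.
rng = np.random.default_rng(42)
n, p, k = 50, 20, 5
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)

# In-sample error: fit and score on the same rows.
train_mse = mse(y, ols_predict(ols_fit(X, y), X))

# k-fold error: each row is scored by a model that never saw it.
folds = np.array_split(np.arange(n), k)
fold_errors = []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    beta = ols_fit(X[train_idx], y[train_idx])
    fold_errors.append(mse(y[test_idx], ols_predict(beta, X[test_idx])))
cv_mse = float(np.mean(fold_errors))

print(f"training MSE: {train_mse:.3f}, cross-validated MSE: {cv_mse:.3f}")
```

The training error is the optimistic one, which is why only errors computed the same way should be compared between models.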

In [11]:
# Pair each coefficient with its feature; X.columns matches the order used in fit
coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,X.columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline

{'pharvis': 0.013282050886950928,
 'age': 0.06480086550467873,
 'sex': 0.004012412278795713,
 'married': -0.08739614349708995,
 'educ': 0.07527646383836195,
 'illness': -0.0618092130060029,
 'injury': 0.04087038457896233,
 'illdays': -0.002763768716569035,
 'actdays': -0.00671706331089314,
 'insurance': 0.14687843649771137}

In [12]:
# Pair each coefficient with its feature; X.columns matches the order used in fit
coef_dict_ridge = {}
for coef, feat in zip(ridge.coef_,X.columns):
    coef_dict_ridge[feat] = coef
coef_dict_ridge

{'pharvis': 0.012881937698185299,
 'age': 0.06335455237380982,
 'sex': 0.0038966233212979276,
 'married': -0.08465416379615645,
 'educ': 0.07451889604357705,
 'illness': -0.060987237789927026,
 'injury': 0.03943060792205406,
 'illdays': -0.002779341753010456,
 'actdays': -0.006551280792122475,
 'insurance': 0.14663287713359738}

The coefficient values are about the same. This means that the penalization made little difference with this dataset.
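
To put a number on "about the same", the two sets of coefficients can be compared directly. The sketch below copies the values from the two outputs above, in the order printed; the list names are assumptions for illustration.

```python
# Coefficients copied from the two outputs above, in the order printed.
baseline_coefs = [
    0.013282050886950928, 0.06480086550467873, 0.004012412278795713,
    -0.08739614349708995, 0.07527646383836195, -0.0618092130060029,
    0.04087038457896233, -0.002763768716569035, -0.00671706331089314,
    0.14687843649771137,
]
ridge_coefs = [
    0.012881937698185299, 0.06335455237380982, 0.0038966233212979276,
    -0.08465416379615645, 0.07451889604357705, -0.060987237789927026,
    0.03943060792205406, -0.002779341753010456, -0.006551280792122475,
    0.14663287713359738,
]

# Absolute change in each coefficient after adding the ridge penalty.
abs_diffs = [abs(b - r) for b, r in zip(baseline_coefs, ridge_coefs)]
max_diff = max(abs_diffs)

print(f"largest absolute change: {max_diff:.5f}")  # 0.00274
```

No coefficient moves by more than about 0.003, which is consistent with the near-identical mean squared errors of the two models.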