
### **Methods to reduce model complexity**


1.   Sequential Feature Selection
2.   Regularization


#### **Sequential Feature Selection**
Sequential feature selection greedily adds (forward) or removes (backward) one feature at a time, based on the model's cross-validated performance, until a subset of the desired size k is reached.
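
As a minimal sketch of the idea, the snippet below runs forward selection on a small synthetic dataset. The data from `make_regression` and the `_demo` names are assumptions made for illustration and are not part of the auto dataset used later.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): 6 features, only 2 of them informative
X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=2,
                                 noise=1.0, random_state=0)

# Forward selection: start with no features and add one at a time,
# keeping the feature that most improves the cross-validated score
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=2,
                                direction='forward',
                                cv=5)
sfs.fit(X_demo, y_demo)

# Boolean mask marking the columns that were kept
selected_mask = sfs.get_support()
X_selected = sfs.transform(X_demo)
print(selected_mask)
print(X_selected.shape)
```

Setting `direction='backward'` would instead start from all features and remove one at a time.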


#### **Regularization**

Regularization is a method for controlling the complexity of a model. With regularization, we:

*   Keep all the features, even when there are a large number of them
*   Adjust complexity using a single parameter, alpha

The parameter alpha behaves as follows:

*   As alpha increases, model complexity decreases
*   If alpha is large, the parameters theta are heavily constrained and shrink toward zero
*   If alpha = 0, the model is not constrained at all and reduces to standard linear regression
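
The effect of alpha described above can be sketched with `Ridge` on synthetic data. The dataset and the alpha values are assumptions chosen for illustration. A near-zero alpha stands in for alpha = 0.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data, assumed here purely for illustration
X_demo, y_demo = make_regression(n_samples=100, n_features=5,
                                 noise=5.0, random_state=0)

# Unregularized baseline
ols = LinearRegression().fit(X_demo, y_demo)

# With a near-zero alpha the ridge solution matches ordinary least squares
ridge_tiny = Ridge(alpha=1e-8).fit(X_demo, y_demo)
matches_ols = np.allclose(ridge_tiny.coef_, ols.coef_, atol=1e-6)
print('Tiny alpha matches linear regression:', matches_ols)

# As alpha grows, the coefficient vector shrinks toward zero
alphas = [0.1, 10, 1000, 100000]
coef_norms = [np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_)
              for a in alphas]
for a, norm in zip(alphas, coef_norms):
    print(f'alpha={a}: ||theta|| = {norm:.4f}')
```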



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SequentialFeatureSelector, SelectFromModel
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn import set_config
set_config(display="diagram")

In [2]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [3]:
df = pd.read_csv('sample_data/auto.csv')
df.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |


In [4]:
df.shape

(392, 9)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   year          392 non-null    int64  
 7   origin        392 non-null    int64  
 8   name          392 non-null    object 
dtypes: float64(4), int64(4), object(1)
memory usage: 27.7+ KB


In [6]:
# Checking for nulls
df.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
year            0
origin          0
name            0
dtype: int64

In [7]:
X = df.drop(['mpg', 'name'], axis=1)
y = df['mpg']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Linear regression model with Sequential feature selector

In [8]:
selector_pipe = Pipeline([('selector', SequentialFeatureSelector(LinearRegression())),
                         ('model', LinearRegression())])
selector_pipe

In [9]:
selector_pipe.get_params()

{'memory': None,
 'steps': [('selector',
   SequentialFeatureSelector(estimator=LinearRegression())),
  ('model', LinearRegression())],
 'verbose': False,
 'selector': SequentialFeatureSelector(estimator=LinearRegression()),
 'model': LinearRegression(),
 'selector__cv': 5,
 'selector__direction': 'forward',
 'selector__estimator__copy_X': True,
 'selector__estimator__fit_intercept': True,
 'selector__estimator__n_jobs': None,
 'selector__estimator__positive': False,
 'selector__estimator': LinearRegression(),
 'selector__n_features_to_select': 'warn',
 'selector__n_jobs': None,
 'selector__scoring': None,
 'selector__tol': None,
 'model__copy_X': True,
 'model__fit_intercept': True,
 'model__n_jobs': None,
 'model__positive': False}

In [10]:
param = {'selector__n_features_to_select': [2,3,4,5]}
selector_grid = GridSearchCV(selector_pipe, param)
selector_grid.fit(X_train, y_train)
# mean_squared_error expects (y_true, y_pred)
selector_train_mse = mean_squared_error(y_train, selector_grid.predict(X_train))
selector_test_mse = mean_squared_error(y_test, selector_grid.predict(X_test))
print(f'Linear regression Train MSE: {selector_train_mse}')
print(f'Linear regression Test MSE: {selector_test_mse}')


Linear regression Train MSE: 11.678679189481475
Linear regression Test MSE: 10.07626629529782


In [11]:
best_estimator = selector_grid.best_estimator_
best_selector = best_estimator.named_steps['selector']
best_model = selector_grid.best_estimator_.named_steps['model']
feature_names = X_train.columns[best_selector.get_support()]
coefs = best_model.coef_

print(best_estimator, best_selector, best_model)
print(f'Features from best selector: {feature_names}.')
print('Coefficient values: ')
print('===================')
pd.DataFrame([coefs.T], columns = feature_names, index = ['model'])

Pipeline(steps=[('selector',
                 SequentialFeatureSelector(estimator=LinearRegression(),
                                           n_features_to_select=3)),
                ('model', LinearRegression())]) SequentialFeatureSelector(estimator=LinearRegression(), n_features_to_select=3) LinearRegression()
Features from best selector: Index(['weight', 'year', 'origin'], dtype='object').
Coefficient values: 


| | weight | year | origin |
|---|---|---|---|
| model | -0.006098 | 0.774392 | 1.406635 |


### Ridge regression

Ridge regression (L2 regularization) is a regularized version of linear regression. A regularization term is added to the cost function, which forces the learning algorithm to fit the data while keeping the model weights as small as possible. The term is alpha times the squared L2 norm of the weight vector (the sum of the squared weights).
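
To make the penalty concrete, the sketch below solves the ridge problem in closed form and compares the result with scikit-learn's `Ridge`. It assumes the objective ||y - Xw||² + alpha·||w||² with an unpenalized intercept (handled here by centering the data). The synthetic data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Small synthetic problem (illustrative assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
alpha = 10.0

# Center X and y so that the intercept is left out of the penalty
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()

# Closed-form ridge weights: solve (X^T X + alpha * I) w = X^T y
w_closed = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)

# scikit-learn's Ridge with the same alpha
ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)

print(w_closed)
print(ridge_demo.coef_)
```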


In [12]:
ridge_param = {'ridge__alpha': np.logspace(0,10,50)}
ridge_pipe = Pipeline([('scaler', StandardScaler()),
                       ('ridge', Ridge())])

ridge_grid = GridSearchCV(ridge_pipe, ridge_param)
ridge_grid.fit(X_train, y_train)
ridge_coefs = ridge_grid.best_estimator_.named_steps['ridge'].coef_
print(ridge_coefs)

[-0.51336281  1.22523874 -1.10779273 -4.7257141   0.06113437  2.64512492
  1.33540885]


In [13]:
# mean_squared_error expects (y_true, y_pred)
ridge_train_mse = mean_squared_error(y_train, ridge_grid.predict(X_train))
ridge_test_mse = mean_squared_error(y_test, ridge_grid.predict(X_test))
print(f'Ridge regression Train MSE: {ridge_train_mse}')
print(f'Ridge regression Test MSE: {ridge_test_mse}')

Ridge regression Train MSE: 11.401707362625789
Ridge regression Test MSE: 10.163697964588271


### Lasso Regression
Lasso regression (L1 regularization) also adds a regularization term to the cost function, but uses the L1 norm of the weight vector (the sum of the absolute values of the weights). Lasso tends to drive many of the coefficients to exactly zero.
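
The sparsity effect can be illustrated by counting exactly-zero coefficients for `Lasso` and `Ridge` fit on the same data. This is a sketch on synthetic data; the dataset, the alpha value and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (illustrative assumption): 10 features, 3 informative
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=1.0, random_state=0)

lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print('Zero coefficients with Lasso:', n_zero_lasso)
print('Zero coefficients with Ridge:', n_zero_ridge)
```

Ridge shrinks the weights but generally leaves them non-zero, while the L1 penalty removes some features from the model entirely.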

In [14]:
# Lasso pipeline
lasso_pipe = Pipeline([('polyfeatures', PolynomialFeatures(degree=3, include_bias=False)),
                       ('scaler', StandardScaler()),
                       ('lasso', Lasso(random_state=42))])
lasso_pipe.fit(X_train, y_train)
lasso_coefs = lasso_pipe.named_steps['lasso'].coef_
print(lasso_coefs)


[-0.         -0.         -0.         -3.06660503  0.          0.
  0.         -0.         -0.         -0.         -0.         -0.
 -0.          0.         -0.         -0.         -0.         -0.0880862
 -0.         -0.         -0.         -0.         -1.42250731 -0.
 -0.         -0.         -0.         -0.          0.          0.
  0.          0.          0.          0.          0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.          0.
 -0.          0.          0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.          0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.  

In [15]:
# mean_squared_error expects (y_true, y_pred)
lasso_train_mse = mean_squared_error(y_train, lasso_pipe.predict(X_train))
lasso_test_mse = mean_squared_error(y_test, lasso_pipe.predict(X_test))
print(f'Lasso regression Train MSE: {lasso_train_mse}')
print(f'Lasso regression Test MSE: {lasso_test_mse}')

Lasso regression Train MSE: 11.860728888695974
Lasso regression Test MSE: 8.984776169896323


In [16]:
feature_names = lasso_pipe.named_steps['polyfeatures'].get_feature_names_out()
lasso_df = pd.DataFrame({'feature': feature_names,
                         'coef': lasso_coefs})
lasso_df.loc[lasso_df['coef']!=0]

| | feature | coef |
|---|---|---|
| 3 | weight | -3.066605 |
| 17 | displacement acceleration | -0.088086 |
| 22 | horsepower acceleration | -1.422507 |
| 111 | acceleration^2 origin | 0.689516 |
| 113 | acceleration year origin | 0.423159 |
| 115 | year^3 | 1.87257 |


### Lasso as feature selector

Rather than serving as the final estimator, `Lasso` can be used to select features that are then passed to a `LinearRegression` estimator.
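
Before applying this to the auto data, here is a minimal sketch of the mechanism on synthetic data. `SelectFromModel` keeps the features whose Lasso coefficients are (to my understanding, by default for L1-penalized estimators) above a very small threshold, and the final regression sees only those columns. The dataset and the `demo_` names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic data (illustrative assumption): 10 features, 3 informative
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=1.0, random_state=0)

demo_pipe = Pipeline([('selector', SelectFromModel(Lasso(alpha=1.0))),
                      ('linreg', LinearRegression())])
demo_pipe.fit(X_demo, y_demo)

# Boolean mask of the features that Lasso kept
selected_mask = demo_pipe.named_steps['selector'].get_support()

# The final LinearRegression has one coefficient per selected feature
linreg_coefs = demo_pipe.named_steps['linreg'].coef_
print('Selected features:', selected_mask.sum())
print('LinearRegression coefficients:', linreg_coefs.shape[0])
```

Unlike using Lasso directly, the coefficients of the final model are not shrunk by the L1 penalty, since Lasso only decides which columns are passed on.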

In [17]:
model_selector_pipe = Pipeline([('poly_features', PolynomialFeatures(degree=3, include_bias=False)),
                                ('scaler', StandardScaler()),
                                ('selector', SelectFromModel(Lasso())),
                                ('linreg', LinearRegression())])
model_selector_pipe.fit(X_train, y_train)

In [18]:
# mean_squared_error expects (y_true, y_pred)
selector_train_mse = mean_squared_error(y_train, model_selector_pipe.predict(X_train))
selector_test_mse = mean_squared_error(y_test, model_selector_pipe.predict(X_test))
print('train mse: ', selector_train_mse)
print('test mse: ', selector_test_mse)

train mse:  9.93192512323042
test mse:  8.899543694287043
