___

<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>
___
<center><em>Copyright by Pierian Data Inc.</em></center>
<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>

# Regularization with SciKit-Learn

Previously we created a new polynomial feature set and then applied our standard linear regression on it, but we can be smarter about model choice and utilize regularization.

Regularization minimizes the RSS (residual sum of squares) *plus* a penalty term. The penalty grows with the size of the coefficients, so models with very large coefficients are discouraged. Some regularization methods can shrink the coefficients of unhelpful features all the way to zero, in which case the model ignores those features entirely.
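
As a rough illustration of "RSS plus a penalty", the sketch below evaluates both kinds of penalized cost for a fixed coefficient vector. The helper names (`rss`, `l2_penalty`, `l1_penalty`) and the tiny arrays are assumptions made up for this example; they are not part of scikit-learn or of the Advertising data.

```python
import numpy as np

def rss(X, y, w, b=0.0):
    """Residual sum of squares of the linear model X @ w + b."""
    residuals = y - (X @ w + b)
    return float(residuals @ residuals)

def l2_penalty(w, alpha):
    """Ridge-style penalty: alpha times the sum of squared coefficients."""
    return float(alpha * np.sum(w ** 2))

def l1_penalty(w, alpha):
    """Lasso-style penalty: alpha times the sum of absolute coefficients."""
    return float(alpha * np.sum(np.abs(w)))

# Toy data and coefficients (illustrative only)
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_demo = np.array([1.0, 2.0, 3.0])
w_demo = np.array([0.5, -2.0])

ridge_cost = rss(X_demo, y_demo, w_demo) + l2_penalty(w_demo, alpha=10)
lasso_cost = rss(X_demo, y_demo, w_demo) + l1_penalty(w_demo, alpha=10)
print(ridge_cost, lasso_cost)
```

Libraries may scale the RSS term differently (the `LassoCV` help text further down divides it by `2 * n_samples`), so the numeric value of alpha is not directly comparable between estimators.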

Let's explore two regularization methods, Ridge Regression and Lasso. We'll apply them to the polynomial feature set, since regularization has much less to act on with only the few original features in X.

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data and Setup

In [2]:
df = pd.read_csv("Advertising.csv")
X = df.drop('sales',axis=1)
y = df['sales']

### Polynomial Conversion

In [3]:
from sklearn.preprocessing import PolynomialFeatures

In [4]:
polynomial_converter = PolynomialFeatures(degree=3,include_bias=False)

In [5]:
poly_features = polynomial_converter.fit_transform(X)

### Train | Test Split

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

----
----

## Scaling the Data

While the original features in this data set are all on the same scale (thousands of dollars spent), that typically won't be the case, and the polynomial terms we just created already span several orders of magnitude. Because the penalty in a regularized model adds the coefficients' magnitudes together, it's important to standardize the features. Review the theory videos for more info, as well as a discussion on why we only **fit** to the training data, and **transform** on both sets separately.
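
A minimal sketch of "fit on train, transform both", written in plain NumPy on synthetic data. The array names and the random data are assumptions for illustration only. `StandardScaler` performs the same per-feature computation, using the population standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic features on very different scales
X_all = rng.normal(loc=[100.0, 5.0], scale=[30.0, 2.0], size=(50, 2))
X_tr, X_te = X_all[:35], X_all[35:]

# "fit": learn the statistics from the training rows only
train_mean = X_tr.mean(axis=0)
train_std = X_tr.std(axis=0)  # population std (ddof=0)

# "transform": apply the training statistics to both sets
X_tr_scaled = (X_tr - train_mean) / train_std
X_te_scaled = (X_te - train_mean) / train_std

print(X_tr_scaled.mean(axis=0))  # approximately zero by construction
print(X_te_scaled.mean(axis=0))  # close to zero, but generally not exactly zero
```

The test set is scaled with statistics it never contributed to, which is what keeps information about the test rows from leaking into training.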

In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
# help(StandardScaler)

In [10]:
scaler = StandardScaler()

In [11]:
scaler.fit(X_train)

StandardScaler()

In [12]:
X_train = scaler.transform(X_train)

In [13]:
X_test = scaler.transform(X_test)

## Ridge Regression

Make sure to view the video lectures for a full explanation of Ridge Regression and of choosing an alpha.
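
To make the role of alpha concrete, here is a sketch of the textbook closed-form ridge solution in NumPy, applied to synthetic data. The function name `ridge_closed_form` is hypothetical, and centering the data so that the intercept is not penalized is an assumption of this sketch, not a statement about how scikit-learn's solvers are implemented.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min ||y - Xw - b||^2 + alpha * ||w||^2 with an unpenalized intercept b."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean          # centering removes the intercept
    n_features = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    b = y_mean - X_mean @ w
    return w, b

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=40)

w_small, b_small = ridge_closed_form(X_demo, y_demo, alpha=0.1)
w_large, b_large = ridge_closed_form(X_demo, y_demo, alpha=100.0)
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

A larger alpha pulls the coefficient vector toward zero, and alpha = 0 recovers ordinary least squares.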

In [14]:
from sklearn.linear_model import Ridge

In [15]:
ridge_model = Ridge(alpha=10)

In [16]:
ridge_model.fit(X_train,y_train)

Ridge(alpha=10)

In [17]:
test_predictions = ridge_model.predict(X_test)

In [18]:
from sklearn.metrics import mean_absolute_error,mean_squared_error

In [19]:
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

In [20]:
MAE

0.5774404204714181

In [21]:
RMSE

0.8946386461319672

How did it perform on the training set? (This will be used later on for comparison)

In [22]:
# Training Set Performance
train_predictions = ridge_model.predict(X_train)
MAE = mean_absolute_error(y_train,train_predictions)
MAE

0.5288348183025319

### Choosing an alpha value with Cross-Validation

Review the theory video for full details.
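
The idea behind choosing alpha by cross-validation can be sketched with an explicit loop: score each candidate alpha on held-out folds of the training data and keep the best. The synthetic data and the choice of 5 folds are assumptions for this example; by default `RidgeCV` uses an efficient form of leave-one-out cross-validation instead.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo @ np.array([1.5, 0.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=60)

candidate_alphas = [0.1, 1.0, 10.0]
mean_scores = {}
for alpha in candidate_alphas:
    # Each score is a negated MAE, so values are <= 0 and higher is better
    scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo,
                             scoring="neg_mean_absolute_error", cv=5)
    mean_scores[alpha] = scores.mean()

# Pick the alpha whose mean score is highest (smallest average MAE)
best_alpha = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_alpha)
```

Only training data enters the loop, so a separate test set stays untouched for the final report.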

In [23]:
from sklearn.linear_model import RidgeCV

In [24]:
# help(RidgeCV)

In [25]:
# Choosing a scoring: https://scikit-learn.org/stable/modules/model_evaluation.html
# Negative MAE so all metrics follow the convention "higher is better"

# See all options: sklearn.metrics.SCORERS.keys()
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0),scoring='neg_mean_absolute_error')

In [26]:
# The more alpha options you pass, the longer this will take.
# Fortunately our data set is still pretty small
ridge_cv_model.fit(X_train,y_train)

RidgeCV(alphas=array([ 0.1,  1. , 10. ]), scoring='neg_mean_absolute_error')

In [27]:
ridge_cv_model.alpha_

0.1

In [28]:
test_predictions = ridge_cv_model.predict(X_test)

In [29]:
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

In [30]:
MAE

0.4273774884329612

In [31]:
RMSE

0.618071992693697

In [32]:
# Training Set Performance
train_predictions = ridge_cv_model.predict(X_train)
MAE = mean_absolute_error(y_train,train_predictions)
MAE

0.30941321056334764

In [33]:
ridge_cv_model.coef_

array([ 5.40769392,  0.5885865 ,  0.40390395, -6.18263924,  4.59607939,
       -1.18789654, -1.15200458,  0.57837796, -0.1261586 ,  2.5569777 ,
       -1.38900471,  0.86059434,  0.72219553, -0.26129256,  0.17870787,
        0.44353612, -0.21362436, -0.04622473, -0.06441449])


-----

## Lasso Regression

In [34]:
from sklearn.linear_model import LassoCV

In [35]:
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
lasso_cv_model = LassoCV(eps=0.1,n_alphas=100,cv=5)

In [36]:
lasso_cv_model.fit(X_train,y_train)

LassoCV(cv=5, eps=0.1)

In [37]:
lasso_cv_model.alpha_

0.49430709092258285

In [38]:
test_predictions = lasso_cv_model.predict(X_test)

In [39]:
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

In [40]:
MAE

0.6541723161252854

In [41]:
RMSE

1.1308001022762533

In [42]:
# Training Set Performance
train_predictions = lasso_cv_model.predict(X_train)
MAE = mean_absolute_error(y_train,train_predictions)
MAE

0.6912807140820697

In [43]:
lasso_cv_model.coef_

array([1.002651  , 0.        , 0.        , 0.        , 3.79745279,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])

## Elastic Net

Elastic Net combines the penalties of ridge regression and lasso in an attempt to get the best of both worlds!
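
A sketch of how the two penalties are blended, following the penalty form given in scikit-learn's `ElasticNet` documentation: `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2`. The function name `elastic_net_penalty` and the example vector are assumptions for illustration.

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    """Blend of L1 and L2 penalties controlled by l1_ratio in [0, 1]."""
    l1 = np.sum(np.abs(w))
    l2_squared = np.sum(w ** 2)
    return float(alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared)

w_demo = np.array([3.0, -4.0, 0.0])
pure_lasso = elastic_net_penalty(w_demo, alpha=2.0, l1_ratio=1.0)  # L1 term only
pure_ridge = elastic_net_penalty(w_demo, alpha=2.0, l1_ratio=0.0)  # L2 term only
blend = elastic_net_penalty(w_demo, alpha=2.0, l1_ratio=0.5)       # half of each
print(pure_lasso, pure_ridge, blend)
```

With `l1_ratio=1` the L2 term vanishes and the model behaves like lasso, which is why a fitted `l1_ratio_` of 1.0 means the ridge component was not used.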

In [44]:
from sklearn.linear_model import ElasticNetCV

In [45]:
elastic_model = ElasticNetCV(l1_ratio=[.1, .5, .7,.9, .95, .99, 1],tol=0.01)

In [46]:
elastic_model.fit(X_train,y_train)

ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1], tol=0.01)

In [47]:
elastic_model.l1_ratio_

1.0

In [48]:
test_predictions = elastic_model.predict(X_test)

In [49]:
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

In [50]:
MAE

0.5663262117569451

In [51]:
RMSE

0.7485546215633726

In [52]:
# Training Set Performance
train_predictions = elastic_model.predict(X_train)
MAE = mean_absolute_error(y_train,train_predictions)
MAE

0.4307582990472369

In [53]:
elastic_model.coef_

array([ 3.78993643,  0.89232919,  0.28765395, -1.01843566,  2.15516144,
       -0.3567547 , -0.271502  ,  0.09741081,  0.        , -1.05563151,
        0.2362506 ,  0.07980911,  1.26170778,  0.01464706,  0.00462336,
       -0.39986069,  0.        ,  0.        , -0.05343757])

-----
---

In [54]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data and Setup

In [55]:
df = pd.read_csv("Advertising.csv")
X = df.drop('sales',axis=1)
y = df['sales']

In [56]:
from sklearn.preprocessing import PolynomialFeatures

In [57]:
poly_converter=PolynomialFeatures(degree=3, include_bias=False)

In [58]:
poly_features=poly_converter.fit_transform(X)

In [59]:
poly_features.shape

(200, 19)

In [60]:
from sklearn.model_selection import train_test_split

In [61]:
help(train_test_split)

Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
    Split arrays or matrices into random train and test subsets.
    
    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.
    
    test_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to 

In [62]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

In [63]:
from sklearn.preprocessing import StandardScaler

In [64]:
standard_scaler=StandardScaler()

In [65]:
# Fit only on the training set to avoid data leakage. fit() learns each feature's mean and standard deviation from X_train;
# transform() then uses them to standardize the data (z-score normalization).

In [66]:
standard_scaler.fit(X_train)

StandardScaler()

In [67]:
X_train=standard_scaler.transform(X_train)

In [68]:
X_test=standard_scaler.transform(X_test)

In [69]:
X_train[0]

array([ 0.49300171, -0.33994238,  1.61586707,  0.28407363, -0.02568776,
        1.49677566, -0.59023161,  0.41659155,  1.6137853 ,  0.08057172,
       -0.05392229,  1.01524393, -0.36986163,  0.52457967,  1.48737034,
       -0.66096022, -0.16360242,  0.54694754,  1.37075536])

In [70]:
poly_features[0]

array([2.30100000e+02, 3.78000000e+01, 6.92000000e+01, 5.29460100e+04,
       8.69778000e+03, 1.59229200e+04, 1.42884000e+03, 2.61576000e+03,
       4.78864000e+03, 1.21828769e+07, 2.00135918e+06, 3.66386389e+06,
       3.28776084e+05, 6.01886376e+05, 1.10186606e+06, 5.40101520e+04,
       9.88757280e+04, 1.81010592e+05, 3.31373888e+05])

In [71]:
from sklearn.linear_model import Ridge

In [97]:
help(Ridge)
#  ||y - Xw||^2_2 + alpha * ||w||^2_2
# Everything is the same as linear regression, except that we must choose alpha (the lambda of the penalty term).
# Later on we use cross-validation to choose the alpha value (i.e. tune the hyperparameter).

Help on class Ridge in module sklearn.linear_model._ridge:

class Ridge(sklearn.base.MultiOutputMixin, sklearn.base.RegressorMixin, _BaseRidge)
 |  Ridge(alpha=1.0, *, fit_intercept=True, normalize='deprecated', copy_X=True, max_iter=None, tol=0.001, solver='auto', positive=False, random_state=None)
 |  
 |  Linear least squares with l2 regularization.
 |  
 |  Minimizes the objective function::
 |  
 |  ||y - Xw||^2_2 + alpha * ||w||^2_2
 |  
 |  This model solves a regression model where the loss function is
 |  the linear least squares function and regularization is given by
 |  the l2-norm. Also known as Ridge Regression or Tikhonov regularization.
 |  This estimator has built-in support for multi-variate regression
 |  (i.e., when y is a 2d-array of shape (n_samples, n_targets)).
 |  
 |  Read more in the :ref:`User Guide <ridge_regression>`.
 |  
 |  Parameters
 |  ----------
 |  alpha : {float, ndarray of shape (n_targets,)}, default=1.0
 |      Regularization strength; must be 

In [72]:
ridge_instance=Ridge(alpha=10)

In [73]:
model=ridge_instance.fit(X_train, y_train)

In [74]:
predict_test=model.predict(X_test)

In [75]:
from sklearn.metrics import mean_squared_error, mean_absolute_error 

In [76]:
mean_absolute_error(y_test, predict_test)

0.5774404204714181

In [77]:
np.sqrt(mean_squared_error(y_test, predict_test))

0.8946386461319672

In [104]:
# L2 regularization: ridge regression with cross-validation
from sklearn.linear_model import RidgeCV
# Cross-validation computes the average error metric for each candidate alpha and selects the alpha with the smallest error.
# The cv parameter is the number of folds. The default, cv=None, uses an efficient form of leave-one-out cross-validation, which can still be slow on a very large data set.
# Recall the discussion of cross-validation versus a hold-out test set (or rewatch the video).
# Here X_test is the hold-out set: RidgeCV sees only X_train and y_train when choosing alpha, so cross-validation handles the hyperparameter tuning and the hold-out set is reserved for final reporting.
# During cross-validation, part of the training set acts as a validation set for scoring each alpha, so the test set is never touched while tuning.

In [106]:
ridge_instance=RidgeCV(alphas=(0.1, 0.5, 1.0, 10.0), scoring='neg_mean_absolute_error')
# To choose a scoring metric, list the available scorer names (SCORERS, shown further down) and copy the one appropriate for your model.
# scikit-learn calls lambda "alpha".
# Internally RidgeCV uses cross-validation to pick the best alpha among those given, scoring each with a "scorer object", here 'neg_mean_absolute_error'.
# All scorer objects follow the convention that higher return values are better than lower ones. That holds naturally for accuracy, where higher is better, but not for mean absolute error:
# a higher MAE is worse, so the scorer reports its negative. The same applies to other error metrics, e.g. a higher negative root mean squared error is better.
# Reporting negated errors gives a uniform framework across all models and tasks.

In [107]:
model=ridge_instance.fit(X_train, y_train)

In [109]:
model.alpha_
# Returns the best of the alphas we supplied, as judged by the scoring we provided.

0.1

In [115]:
predict_test=model.predict(X_test)

In [116]:
from sklearn.metrics import mean_squared_error, mean_absolute_error 

In [117]:
mean_absolute_error(y_test, predict_test)

0.4273774884329612

In [118]:
np.sqrt(mean_squared_error(y_test, predict_test))

0.618071992693697

In [128]:
model.coef_
# With L2 regularization, none of these coefficients is exactly zero, although some are small.
# With L1 regularization (lasso), coefficients can be exactly zero, which can yield a sparse model.

array([ 5.40769392,  0.5885865 ,  0.40390395, -6.18263924,  4.59607939,
       -1.18789654, -1.15200458,  0.57837796, -0.1261586 ,  2.5569777 ,
       -1.38900471,  0.86059434,  0.72219553, -0.26129256,  0.17870787,
        0.44353612, -0.21362436, -0.04622473, -0.06441449])

In [127]:
# model.best_score_
# The highest negative mean absolute error, i.e. the score of the selected alpha

In [119]:
from sklearn.metrics import SCORERS  # removed in scikit-learn 1.3; newer releases provide sklearn.metrics.get_scorer_names() instead

In [120]:
SCORERS

{'explained_variance': make_scorer(explained_variance_score),
 'r2': make_scorer(r2_score),
 'max_error': make_scorer(max_error, greater_is_better=False),
 'neg_median_absolute_error': make_scorer(median_absolute_error, greater_is_better=False),
 'neg_mean_absolute_error': make_scorer(mean_absolute_error, greater_is_better=False),
 'neg_mean_absolute_percentage_error': make_scorer(mean_absolute_percentage_error, greater_is_better=False),
 'neg_mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False),
 'neg_mean_squared_log_error': make_scorer(mean_squared_log_error, greater_is_better=False),
 'neg_root_mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False, squared=False),
 'neg_mean_poisson_deviance': make_scorer(mean_poisson_deviance, greater_is_better=False),
 'neg_mean_gamma_deviance': make_scorer(mean_gamma_deviance, greater_is_better=False),
 'accuracy': make_scorer(accuracy_score),
 'top_k_accuracy': make_scorer(top_k_accuracy_score, ne

In [121]:
SCORERS.keys()
# Every scoring metric is transformed so that higher is better.
# These are all the scorers you can try as scoring metrics.

dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_wei

In [130]:
# LASSO

In [134]:
# Lasso can force some coefficient estimates to exactly zero when the tuning parameter lambda (alpha in scikit-learn) is sufficiently large.
# Like subset selection, lasso performs variable selection: setting a coefficient to zero means that feature is not considered, which makes lasso models much easier to interpret.
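
One way to see why an L1 penalty produces exact zeros is the soft-thresholding operator, which is the lasso solution for a single coefficient in the idealized case of orthonormal features. The sketch below contrasts it with ridge-style proportional shrinkage under the same idealization; the function name and the sample values are assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink each value toward zero by `threshold`; values inside the band become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

# Hypothetical unpenalized coefficient estimates
z_demo = np.array([3.0, -0.5, 1.0, -2.5])

shrunk_l1 = soft_threshold(z_demo, threshold=1.0)  # lasso-style: small values are zeroed
shrunk_l2 = z_demo / (1.0 + 1.0)                   # ridge-style with alpha=1: scaled, never zeroed

print(shrunk_l1)
print(shrunk_l2)
```

Ridge rescales every coefficient but leaves all of them nonzero, while the L1 update removes any coefficient whose magnitude falls below the threshold.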


In [136]:
# LassoCV in scikit-learn checks a number of alphas within a range by default, instead of taking an explicit list of alphas as we passed to RidgeCV.
# LASSO = least absolute shrinkage and selection operator. "Selection operator" refers to zeroing some coefficients, which selects a subset of the features.
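
A sketch of how a default alpha grid of this kind can be built from `eps` and `n_alphas`. It reflects my understanding of scikit-learn's behaviour (largest alpha taken as the smallest value that zeroes every coefficient, then log-spaced values down to `eps` times that), so treat the details as an assumption; `lasso_alpha_grid` is a hypothetical helper, not a library function.

```python
import numpy as np

def lasso_alpha_grid(X, y, eps=1e-3, n_alphas=100):
    """Log-spaced alphas from alpha_max down to alpha_max * eps."""
    Xc = X - X.mean(axis=0)   # center, as when an intercept is fitted
    yc = y - y.mean()
    # Smallest alpha for which every lasso coefficient is zero
    alpha_max = np.max(np.abs(Xc.T @ yc)) / X.shape[0]
    return np.geomspace(alpha_max, alpha_max * eps, num=n_alphas)

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.5, size=50)

grid = lasso_alpha_grid(X_demo, y_demo, eps=0.01, n_alphas=100)
print(grid[0], grid[-1], grid[-1] / grid[0])
```

In the outputs recorded in this notebook, the selected `alpha_` changes by the same factor as `eps` (about 0.494, 0.0494 and 0.00494 for `eps` of 0.1, 0.01 and 0.001), which is consistent with the smallest alpha on each grid being chosen.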

In [138]:
from sklearn.linear_model import Lasso  # the plain estimator with a fixed alpha; not used below

In [139]:
from sklearn.linear_model import LassoCV

In [140]:
help(LassoCV)

Help on class LassoCV in module sklearn.linear_model._coordinate_descent:

class LassoCV(sklearn.base.RegressorMixin, LinearModelCV)
 |  LassoCV(*, eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize='deprecated', precompute='auto', max_iter=1000, tol=0.0001, copy_X=True, cv=None, verbose=False, n_jobs=None, positive=False, random_state=None, selection='cyclic')
 |  
 |  Lasso linear model with iterative fitting along a regularization path.
 |  
 |  See glossary entry for :term:`cross-validation estimator`.
 |  
 |  The best model is selected by cross-validation.
 |  
 |  The optimization objective for Lasso is::
 |  
 |      (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
 |  
 |  Read more in the :ref:`User Guide <lasso>`.
 |  
 |  Parameters
 |  ----------
 |  eps : float, default=1e-3
 |      Length of the path. ``eps=1e-3`` means that
 |      ``alpha_min / alpha_max = 1e-3``.
 |  
 |  n_alphas : int, default=100
 |      Number of alphas along the regulariz

In [161]:
# LassoCV also accepts alphas, a list of alpha values as in RidgeCV. Left as None, the alphas are set automatically from eps and n_alphas.
# The n_alphas values are spaced on a log scale along the regularization path, whose length is defined by eps.
# See the lasso section of the scikit-learn docs for the different types of lasso.
lassocv_instance=LassoCV(eps=0.001, n_alphas=100, cv=5 )

In [162]:
lassocv_instance.fit(X_train, y_train)
# To resolve the ConvergenceWarning (the coordinate descent solver did not converge within max_iter iterations), increase eps so the path stops at a larger minimum alpha, or raise max_iter.

  model = cd_fast.enet_coordinate_descent(


LassoCV(cv=5)

In [163]:
lassocv_instance=LassoCV(eps=0.001, n_alphas=100, cv=5, max_iter=10000000 )

In [174]:
lassocv_instance_model=LassoCV(eps=0.01, n_alphas=100, cv=5)

In [175]:
lassocv_instance_model.fit(X_train, y_train)


LassoCV(cv=5, eps=0.01)

In [176]:
lassocv_instance_model.alpha_

0.049430709092258295

In [177]:
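# Note: this evaluates lasso_cv_model, the eps=0.1 model fit earlier in the notebook, not the
# lassocv_instance_model (eps=0.01) fit just above, so the metrics below belong to the eps=0.1 model.
test_predict=lasso_cv_model.predict(X_test)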

In [179]:
mean_absolute_error(y_test, test_predict)

0.6541723161252854

In [180]:
np.sqrt(mean_squared_error(y_test, test_predict))

1.1308001022762533

In [182]:
lasso_cv_model.coef_
# The vast majority are zero, so this eps=0.1 model considers only two features. That makes it easy to see which features matter most, and the error is still reasonable with just two features.

array([1.002651  , 0.        , 0.        , 0.        , 3.79745279,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])

In [235]:
lassocv_instance=LassoCV(eps=0.001, n_alphas=100, cv=5, max_iter=1000000)

In [236]:
lassocv_instance.fit(X_train, y_train)


LassoCV(cv=5, max_iter=1000000)

In [237]:
lassocv_instance.alpha_

0.004943070909225827

In [238]:
test_predict=lassocv_instance.predict(X_test)

In [239]:
mean_absolute_error(y_test, test_predict)

0.43350346185900707

In [240]:
np.sqrt(mean_squared_error(y_test, test_predict))

0.6063140748984027

In [241]:
lasso_cv_model.coef_
# Note: these are again the coefficients of the earlier eps=0.1 model, not of lassocv_instance (eps=0.001), whose metrics are shown just above. Use lassocv_instance.coef_ to inspect that model.
# The vast majority are zero, so the eps=0.1 model considers only two features.

array([1.002651  , 0.        , 0.        , 0.        , 3.79745279,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])

In [217]:
# Lasso can drop many features. Here the sparse two-feature model (eps=0.1) had a test MAE of about 0.65, while the eps=0.001 model reached about 0.43, on par with ridge regression.

In [218]:
from sklearn.linear_model import ElasticNetCV

In [219]:
help(ElasticNetCV)

Help on class ElasticNetCV in module sklearn.linear_model._coordinate_descent:

class ElasticNetCV(sklearn.base.RegressorMixin, LinearModelCV)
 |  ElasticNetCV(*, l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize='deprecated', precompute='auto', max_iter=1000, tol=0.0001, cv=None, copy_X=True, verbose=0, n_jobs=None, positive=False, random_state=None, selection='cyclic')
 |  
 |  Elastic Net model with iterative fitting along a regularization path.
 |  
 |  See glossary entry for :term:`cross-validation estimator`.
 |  
 |  Read more in the :ref:`User Guide <elastic_net>`.
 |  
 |  Parameters
 |  ----------
 |  l1_ratio : float or list of float, default=0.5
 |      Float between 0 and 1 passed to ElasticNet (scaling between
 |      l1 and l2 penalties). For ``l1_ratio = 0``
 |      the penalty is an L2 penalty. For ``l1_ratio = 1`` it is an L1 penalty.
 |      For ``0 < l1_ratio < 1``, the penalty is a combination of L1 and L2
 |      This parameter can 

In [220]:
elastic_model=ElasticNetCV( l1_ratio=[.1, .5, .7, .9, .95, .99, 1], eps=0.001,n_alphas=100,max_iter=1000000)

In [221]:
elastic_model.fit(X_train,y_train)

ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1], max_iter=1000000)

In [222]:
elastic_model.l1_ratio

[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1]

In [229]:
elastic_model.l1_ratio_
# The trailing underscore marks an attribute learned during fitting, here the best l1_ratio found.
# l1_ratio_ = 1.0 means a pure L1 penalty: the ridge component is dropped and lasso alone is the way to go here.

1.0

In [233]:
elastic_model.alpha_

0.004943070909225827

In [243]:
lassocv_instance.alpha_
# Same for both elastic net and lasso, because the elastic net here reduces entirely to lasso.

0.004943070909225827

In [244]:
predict_test=elastic_model.predict(X_test)

In [245]:
mean_absolute_error(y_test, predict_test)

0.43350346185900707

In [246]:
np.sqrt(mean_squared_error(y_test, predict_test))

0.6063140748984027

In [247]:
# Hence, rather than deciding between lasso and ridge up front, it can be simpler to use elastic net, since it searches over blends of both penalties.