<img src='rreg13.png' width=600>

__LASSO - least absolute shrinkage and selection operator__

- The key property of L1 regularization is that it can yield sparse models, in which some coefficients become exactly zero.

- When the tuning parameter lambda is sufficiently large, lasso regression performs something very similar to subset selection: a coefficient set to zero means the corresponding feature is dropped from the model.
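To see the sparsity effect in isolation, here is a minimal sketch on synthetic data. The data, the names `X_demo` and `y_demo`, and the alpha values are assumptions chosen for illustration; they are not part of the advertising example. Note that sklearn calls the tuning parameter `alpha` rather than lambda.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in data (an assumption, not the Advertising set):
# only the first two of ten features actually drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Fit the same model at increasing penalty strengths and count
# how many coefficients remain non-zero.
nonzero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.count_nonzero(model.coef_))
    print(f"alpha={alpha:<5} non-zero coefficients: {nonzero_counts[alpha]}")
```

With a large enough alpha every coefficient is shrunk to exactly zero, which is the limiting case of the feature-dropping behaviour described above.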

----

LassoCV in sklearn can generate and check __a number of alphas within a range__, instead of requiring the alphas to be provided directly.

---

In [2]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv('Advertising.csv')

In [4]:
df.head()

| | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |


In [5]:
X = df.drop('sales', axis=1) 

In [6]:
y = df['sales']

---

In [8]:
from sklearn.preprocessing import PolynomialFeatures

In [9]:
poly_converter = PolynomialFeatures(degree=3, include_bias=False)

In [10]:
poly_features = poly_converter.fit_transform(X)

In [12]:
poly_features.shape

(200, 19)

In [14]:
poly_converter.get_feature_names_out()

array(['TV', 'radio', 'newspaper', 'TV^2', 'TV radio', 'TV newspaper',
       'radio^2', 'radio newspaper', 'newspaper^2', 'TV^3', 'TV^2 radio',
       'TV^2 newspaper', 'TV radio^2', 'TV radio newspaper',
       'TV newspaper^2', 'radio^3', 'radio^2 newspaper',
       'radio newspaper^2', 'newspaper^3'], dtype=object)

---

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

---

In [17]:
from sklearn.preprocessing import StandardScaler

In [18]:
scaler = StandardScaler()

In [19]:
scaler.fit(X_train)

StandardScaler()

In [20]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

---

In [21]:
from sklearn.linear_model import Lasso

In [22]:
# Lasso(alpha=)

A plain Lasso model requires a single, fixed alpha value to be provided up front.

---

Realistically, though, the correct alpha is rarely known beforehand.
<br>So instead we use cross-validation to find the best value of this hyperparameter.
This is where LassoCV differs slightly from RidgeCV in how the alpha values are provided.

The documentation for LassoCV describes two ways of specifying the candidate alphas:
- One way is to do the same thing we did last time: provide an explicit list of alphas.
- The other is to specify eps and n_alphas: n_alphas is the number of alphas along the regularization path, and eps is the ratio of the smallest alpha to the largest (alpha_min / alpha_max).
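The second option can be inspected through the fitted model's `alphas_` attribute. The sketch below uses synthetic data (an assumption, as are the names `X_demo`, `y_demo`, and `grid_model`) and leaves the number of alphas at its default of 100. The manual `alpha_max` calculation reflects how the upper end of the grid is derived when an intercept is fitted; treat it as an illustration rather than a copy of the library's code.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in data (an assumption, not the Advertising set).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

eps = 0.001
grid_model = LassoCV(eps=eps, cv=5).fit(X_demo, y_demo)  # 100 alphas by default

# Largest alpha on the path: the smallest penalty at which every
# coefficient is zero (computed on centered data, since an intercept is fit).
X_centered = X_demo - X_demo.mean(axis=0)
y_centered = y_demo - y_demo.mean()
alpha_max = np.abs(X_centered.T @ y_centered).max() / len(y_demo)

print("alphas tried:      ", len(grid_model.alphas_))
print("largest alpha:     ", grid_model.alphas_.max(), "vs manual", alpha_max)
print("smallest / largest:", grid_model.alphas_.min() / grid_model.alphas_.max())
print("selected alpha_:   ", grid_model.alpha_)
```

The ratio of the smallest to the largest alpha should equal eps, so a smaller eps extends the search toward weaker regularization.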

In [23]:
from sklearn.linear_model import LassoCV

In [25]:
lasso_cv_model = LassoCV(eps=.001, n_alphas=100, cv=5) # parameters with their default values

By default, LassoCV uses 5-fold cross-validation, so cv=5 matches the default.

In [26]:
lasso_cv_model.fit(X_train, y_train)

  model = cd_fast.enet_coordinate_descent(


LassoCV(cv=5)

This is a convergence warning from the coordinate descent solver: for at least one alpha on the path, the optimization of the coefficients did not converge within max_iter iterations.

Let's increase the max_iter parameter.

In [27]:
lasso_cv_model = LassoCV(eps=.001, n_alphas=100, cv=5, max_iter=1000000) # default values, except max_iter (default is 1000)

In [28]:
lasso_cv_model.fit(X_train, y_train)

LassoCV(cv=5, max_iter=1000000)
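One way to check whether a given max_iter is enough is to record the warnings raised during fitting. The following is a minimal sketch on synthetic data; the data, the helper name `count_convergence_warnings`, and the max_iter values tried are all assumptions for illustration.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption, not the Advertising set):
# three raw features expanded to 19 highly correlated polynomial terms.
rng = np.random.default_rng(0)
raw = rng.uniform(0, 100, size=(200, 3))
y_demo = (0.05 * raw[:, 0] + 0.1 * raw[:, 1]
          + 0.001 * raw[:, 0] * raw[:, 1] + rng.normal(size=200))
X_demo = StandardScaler().fit_transform(
    PolynomialFeatures(degree=3, include_bias=False).fit_transform(raw)
)

def count_convergence_warnings(max_iter):
    """Fit LassoCV and return how many ConvergenceWarnings were raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, no de-duplication
        LassoCV(eps=0.001, cv=5, max_iter=max_iter).fit(X_demo, y_demo)
    return sum(issubclass(w.category, ConvergenceWarning) for w in caught)

warnings_by_max_iter = {m: count_convergence_warnings(m) for m in (10, 100, 5000)}
for max_iter, n_warnings in warnings_by_max_iter.items():
    print(f"max_iter={max_iter:>5}: {n_warnings} ConvergenceWarning(s)")
```

If the count is still above zero at the largest value, the iteration limit can be raised further or the alpha range narrowed.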

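Alternatively, the warning can be avoided by raising eps, which drops the smallest (hardest to fit) alphas from the search range: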


In [29]:
lasso_cv_model = LassoCV(eps=0.1, n_alphas=100, cv=5) # default values, except eps (default is 0.001)

In [30]:
lasso_cv_model.fit(X_train, y_train)

LassoCV(cv=5, eps=0.1)

In [31]:
lasso_cv_model.alpha_

0.4943070909225828

In [32]:
test_predictions = lasso_cv_model.predict(X_test)

In [33]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [34]:
MAE = mean_absolute_error(y_test, test_predictions)

In [35]:
MAE

0.6541723161252854

In [36]:
RMSE = np.sqrt(mean_squared_error(y_test, test_predictions))

In [37]:
RMSE

1.130800102276253

Compared with the Ridge model, which uses only L2 regularization, this Lasso model actually performs worse.

So what benefit do we have from Lasso?
<br>I think it becomes clearer if we check the coefficients of our LassoCV model.


In [38]:
lasso_cv_model.coef_

array([1.002651  , 0.        , 0.        , 0.        , 3.79745279,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])

The vast majority of them are exactly zero.
That means the model uses only two features, TV and TV radio (positions 0 and 4 in the feature names listed earlier); every other feature is ignored.

---

In [39]:
lasso_cv_model = LassoCV(eps=0.001, n_alphas=100, cv=5, max_iter=1000000)

In [40]:
lasso_cv_model.fit(X_train, y_train)

LassoCV(cv=5, max_iter=1000000)

In [41]:
test_predictions = lasso_cv_model.predict(X_test)

In [42]:
MAE = mean_absolute_error(y_test, test_predictions)

In [43]:
MAE

0.43350346185900673

In [44]:
RMSE = np.sqrt(mean_squared_error(y_test, test_predictions))

In [45]:
RMSE

0.6063140748984039

In [46]:
lasso_cv_model.alpha_

0.004943070909225827

In [48]:
lasso_cv_model.coef_ # a slightly more complex model

array([ 4.86023329,  0.12544598,  0.20746872, -4.99250395,  4.38026519,
       -0.22977201, -0.        ,  0.07267717, -0.        ,  1.77780246,
       -0.69614918, -0.        ,  0.12044132, -0.        , -0.        ,
       -0.        ,  0.        ,  0.        , -0.        ])
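The two fits above differ in how far down the alpha range extends, which trades sparsity against accuracy. The following sketch puts that comparison in one loop on synthetic data. The data, the `summary` dictionary, and the max_iter value are assumptions, and the numbers it prints will not match the advertising results.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption, not the Advertising set).
rng = np.random.default_rng(0)
raw = rng.uniform(0, 100, size=(200, 3))
y_demo = 0.05 * raw[:, 0] + 0.002 * raw[:, 0] * raw[:, 1] + rng.normal(size=200)

features = PolynomialFeatures(degree=3, include_bias=False).fit_transform(raw)
X_tr, X_te, y_tr, y_te = train_test_split(
    features, y_demo, test_size=0.3, random_state=101
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

summary = {}
for eps in (0.1, 0.001):
    # max_iter is raised above the default; convergence warnings may still appear.
    model = LassoCV(eps=eps, cv=5, max_iter=10_000).fit(X_tr, y_tr)
    preds = model.predict(X_te)
    summary[eps] = {
        "alpha": float(model.alpha_),
        "nonzero": int(np.count_nonzero(model.coef_)),
        "MAE": float(mean_absolute_error(y_te, preds)),
        "RMSE": float(np.sqrt(mean_squared_error(y_te, preds))),
    }
    print(f"eps={eps}: {summary[eps]}")
```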