# Reducing Features with Lasso Regression

In [15]:
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
# Note: load_boston was removed in scikit-learn 1.2; this cell needs an older release.
from sklearn.datasets import load_boston
import numpy as np

In [16]:
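def pretty_print_linear(coefs, names=None, sort=False):
    """Format a fitted linear model as a readable 'coef * name' sum."""
    # Fall back to generic names (X0, X1, ...) when none are supplied.
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        # Order terms by decreasing absolute coefficient.
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                      for coef, name in lst)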

In [17]:
boston = load_boston()
scaler = StandardScaler()

X = scaler.fit_transform(boston["data"])
y = boston["target"]
names = boston["feature_names"]

In [18]:
reg = LassoCV(cv=5, alphas=np.linspace(0.01,1,2000), random_state=0).fit(X, y)

reg.alpha_
reg.score(X, y)

0.7281616841537072

In [19]:
lasso1 = Lasso(alpha=reg.alpha_)

lasso1.fit(X, y)

lasso1.score(X, y)

mean_squared_error(y, lasso1.predict(X))

22.948469969983584

In [20]:
lasso2 = Lasso(alpha=.3)

lasso2.fit(X, y)

lasso2.score(X, y)

mean_squared_error(y, lasso2.predict(X))

24.287137191247194

In [21]:
print("Lasso model: ", pretty_print_linear(lasso1.coef_, names, sort = True))

Lasso model:  -3.72 * LSTAT + 2.906 * RM + -2.055 * DIS + -1.851 * PTRATIO + -1.324 * NOX + 0.716 * B + 0.644 * CHAS + 0.511 * ZN + -0.475 * CRIM + 0.422 * RAD + -0.198 * TAX + -0.069 * INDUS + -0.0 * AGE


In [12]:
print("Lasso model: ", pretty_print_linear(lasso2.coef_, names, sort = True))

Lasso model:  -3.705 * LSTAT + 2.993 * RM + -1.756 * PTRATIO + -1.081 * DIS + -0.699 * NOX + 0.628 * B + 0.54 * CHAS + -0.242 * CRIM + 0.082 * ZN + -0.0 * INDUS + -0.0 * AGE + 0.0 * RAD + -0.0 * TAX


This output is immediately valuable. With the cross-validated α, AGE is shrunk to zero and INDUS is close to zero. With α = 0.3, INDUS, AGE, RAD and TAX all have zero coefficients, so they add nothing to that model and are candidates for removal. The in-sample cost of the sparser model is modest: the mean squared error rises from about 22.95 to about 24.29. For the next pass, we can rerun the model without the features that LASSO has zeroed out.

A number of features have a coefficient of exactly 0. Increasing α further makes the solution sparser and sparser, i.e. more and more features end up with 0 as their coefficient.
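The effect of α on sparsity can be sketched on synthetic data. This is a minimal illustration, not the notebook's analysis: `make_regression` stands in for the Boston data, and the names `X_demo`, `y_demo`, `alpha_max` and `nonzero_counts` are introduced here for the example only. It relies on the fact that, for scikit-learn's Lasso objective on centered data, every coefficient is zero once α exceeds max_j |x_j · (y − ȳ)| / n.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the Boston data used above).
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Smallest alpha at which all coefficients vanish for centered features:
# alpha_max = max_j |x_j . (y - mean(y))| / n_samples
n_samples = X_demo.shape[0]
alpha_max = np.max(np.abs(X_demo.T @ (y_demo - y_demo.mean()))) / n_samples

# Count surviving (nonzero) coefficients at increasing fractions of alpha_max.
nonzero_counts = {}
for frac in (0.001, 0.01, 0.1, 0.5, 1.01):
    model = Lasso(alpha=frac * alpha_max, max_iter=10000).fit(X_demo, y_demo)
    nonzero_counts[frac] = int(np.sum(model.coef_ != 0))
    print("alpha = %.3f * alpha_max -> %d nonzero coefficients"
          % (frac, nonzero_counts[frac]))
```

Sweeping fractions of `alpha_max` rather than fixed values keeps the grid meaningful regardless of the scale of the target.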

Note, however, that L1-regularized regression is unstable in much the same way as unregularized linear models: when the data contains correlated features, the coefficients (and thus the feature ranks) can vary significantly even with small changes to the data. This instability is what motivates L2 regularization.
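The instability can be illustrated with two strongly correlated features. The sketch below is an assumption-laden toy example: the data-generating process, the helper `fit_on_fresh_sample` and the chosen `alpha=0.05` are all hypothetical and not taken from the notebook. Each run draws a fresh sample from the same distribution, where the true model is `y = x1 + x2 + noise`, and refits the lasso.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_on_fresh_sample(seed, n_samples=200):
    """Fit a lasso on a new draw of two highly correlated features."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n_samples)              # shared latent signal
    x1 = z + 0.1 * rng.normal(size=n_samples)   # x1 and x2 are nearly copies of z
    x2 = z + 0.1 * rng.normal(size=n_samples)
    X_corr = np.column_stack([x1, x2])
    y_corr = x1 + x2 + 0.5 * rng.normal(size=n_samples)
    model = Lasso(alpha=0.05, tol=1e-6, max_iter=100000).fit(X_corr, y_corr)
    return model.coef_

# Refit on 20 independent samples and compare the spread of the coefficients.
coefs = np.array([fit_on_fresh_sample(seed) for seed in range(20)])
coef_sums = coefs.sum(axis=1)

print("std of coefficient on x1:  %.3f" % coefs[:, 0].std())
print("std of coefficient on x2:  %.3f" % coefs[:, 1].std())
print("std of coefficient sum:    %.3f" % coef_sums.std())
```

The individual coefficients move around far more than their sum does: the data pins down the combined effect of the correlated pair, but not how that effect is split between the two features.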

The practical benefit of this sparsity is that we could include 100 features in our feature matrix and then, by adjusting lasso's α hyperparameter, produce a model that uses only the 10 (for instance) most important ones. This lets us reduce variance while improving the interpretability of our model, since a model with fewer features is easier to explain.
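As a rough sketch of that workflow, the example below builds a synthetic 100-feature problem in which only 10 features carry signal, then uses scikit-learn's `SelectFromModel` with a lasso to keep the features whose coefficients are not zero. The data, the variable names (`X_wide`, `y_wide`, `selector`, `kept`, `X_reduced`) and the value `alpha=2.0` are assumptions made for illustration; in practice α would be tuned, for example with `LassoCV` as above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 features, only 10 of which influence the target.
X_wide, y_wide = make_regression(n_samples=200, n_features=100,
                                 n_informative=10, noise=5.0, random_state=0)
X_wide = StandardScaler().fit_transform(X_wide)

# Fit the lasso and keep only the features with (effectively) nonzero weights.
selector = SelectFromModel(Lasso(alpha=2.0, max_iter=10000)).fit(X_wide, y_wide)
kept = np.flatnonzero(selector.get_support())
X_reduced = selector.transform(X_wide)

print("kept %d of %d features" % (X_reduced.shape[1], X_wide.shape[1]))
print("indices of kept features:", kept)
```

The reduced matrix `X_reduced` can then be passed to any downstream model. How many features survive depends on the chosen α and on the data, so the count should be checked rather than assumed.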