**Regularized Linear Models: Lasso**

To try out Kaggle datasets and kernels, we briefly explore the "Hitters" dataset and use the sklearn package to fit several Lasso (or LASSO, least absolute shrinkage and selection operator) models that predict the salaries of baseball players.

This kernel is partly based on R. Jordan Crouser's Python adaptation (Smith College, Spring 2016) of pages 251-254 (Ridge Regression) of “An Introduction to Statistical Learning with Applications in R” (ISLR) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.




**1. Read data set and explore data structure**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed

# import necessary functions
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.linear_model import Lasso, LassoCV, Ridge, RidgeCV
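# sklearn.cross_validation was removed in scikit-learn 0.20; alias its replacement
from sklearn import model_selection as cross_validation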
from sklearn.metrics import mean_squared_error

# read data set and drop missing values 
Hitters = pd.read_csv("../input/Hitters.csv").dropna() 
Hitters.info()


**2. Prepare sklearn-style objects X (features) and y (target)**

Here we adopt the model definition style that is commonly used in Python and, more generally, in machine learning.

In [None]:
y = Hitters.Salary
# Drop Salary (target) and the categorical columns that are replaced by dummy variables below
X_ = Hitters.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
# Encode the categorical columns as dummy variables
dummies = pd.get_dummies(Hitters[['League', 'Division', 'NewLeague']])
# Define the feature set X, keeping one dummy column per categorical variable
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']].astype('float64')], axis=1)
X.info()

**3. Fit a Set of Lasso Models**

We use the Lasso() estimator to perform regularized linear regression. Lasso() has an alpha argument (elsewhere called λ) that controls the strength of the penalty and is used to tune the model.
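For reference, the objective that scikit-learn documents for Lasso() can be written as follows. Here n is the number of samples, X the feature matrix, y the target and w the coefficient vector; these symbols are notation introduced for this note, not names from the code.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

A larger alpha puts more weight on the L1 term and pushes more coefficients to exactly zero; alpha = 0 would reduce the problem to ordinary least squares.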

We’ll generate an array of alpha values ranging from very big to very small, essentially covering the full range of scenarios from the null model containing only the intercept, to the least squares fit.

We normalize the features (via normalize=True) and fit a Lasso model for each value of alpha.

In [None]:
alphas = 10**np.linspace(6,-2,50)*0.5
alphas
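As a quick sanity check of the grid above, the following sketch rebuilds it under a separate name (alpha_grid, chosen here so it does not overwrite the notebook's alphas) and confirms that it is log-spaced from 0.5e6 down to 0.5e-2.

```python
import numpy as np

# Same grid as in the cell above: 50 values, log-spaced and decreasing.
alpha_grid = 10 ** np.linspace(6, -2, 50) * 0.5

# np.logspace builds the identical grid in a single call.
alpha_grid_alt = 0.5 * np.logspace(6, -2, 50)

print("largest alpha: ", alpha_grid[0])
print("smallest alpha:", alpha_grid[-1])
print("grids match:   ", np.allclose(alpha_grid, alpha_grid_alt))
```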

In [None]:
lasso = Lasso(max_iter=10000, normalize=True)
coefs = []

for a in alphas:
    lasso.set_params(alpha=a)
    lasso.fit(X, y)
    coefs.append(lasso.coef_)
    
np.shape(coefs)
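The normalize=True argument used in the cell above was deprecated in scikit-learn 1.0 and removed in 1.2. On newer versions, one possible replacement is to scale the features explicitly in a pipeline. The sketch below assumes synthetic data from make_regression as a stand-in for Hitters, and all names ending in _demo are hypothetical. Note that StandardScaler divides by the standard deviation, whereas normalize=True divided by the L2 norm, so the alpha values of the two approaches are not on the same scale.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the Hitters dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 n_informative=3, noise=10.0, random_state=0)

# Scale the features first, then let LassoCV pick alpha by 10-fold CV.
model_demo = make_pipeline(StandardScaler(), LassoCV(cv=10, max_iter=100000))
model_demo.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name.
best_alpha_demo = model_demo.named_steps["lassocv"].alpha_
print("alpha chosen by CV:", best_alpha_demo)
```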

**4. Plot Lasso tuning parameter alpha**

Now we plot the relationship between alpha and the weights (regression coefficients), with one line per feature.


In [None]:
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('weights')

On the right-hand side we can see the null model, containing only the intercept: the penalty is so high that every coefficient is shrunk to zero. On the left-hand side there is almost no penalty. Notice that in the coefficient plot, depending on the choice of the tuning parameter, some of the coefficients are exactly equal to zero.
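The effect described above can also be checked numerically by counting the non-zero coefficients at each alpha. This is a minimal sketch on synthetic data from make_regression (an assumption, used so the example runs without Hitters.csv); the names X_demo, y_demo, alpha_grid and lasso_demo are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in data (assumption: NOT the Hitters dataset).
X_demo, y_demo = make_regression(n_samples=100, n_features=10,
                                 n_informative=3, noise=5.0, random_state=0)

# Same decreasing grid of penalties as in section 3.
alpha_grid = 10 ** np.linspace(6, -2, 50) * 0.5
lasso_demo = Lasso(max_iter=10000)

# Refit for each alpha and record how many coefficients are non-zero.
nonzero_counts = []
for a in alpha_grid:
    lasso_demo.set_params(alpha=a).fit(X_demo, y_demo)
    nonzero_counts.append(int(np.sum(lasso_demo.coef_ != 0)))

print("non-zero coefficients at largest alpha: ", nonzero_counts[0])
print("non-zero coefficients at smallest alpha:", nonzero_counts[-1])
```

At the largest alpha the count is zero (the null model), and coefficients enter the model as the penalty shrinks.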


**5. Cross Validation Lasso**

We now split the samples into a training set and a test set in order to estimate the test error. We perform 10-fold cross-validation on the training set to choose the best alpha, refit the model, compute the associated test error and print the best model's coefficients.

In [None]:
# Use the cross-validation package to split data into training and test sets
X_train, X_test , y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.5, random_state=1)

lassocv = LassoCV(alphas=None, cv=10, max_iter=100000, normalize=True)
lassocv.fit(X_train, y_train)
lasso.set_params(alpha=lassocv.alpha_)
print("Alpha=", lassocv.alpha_)
lasso.fit(X_train, y_train)
print("mse = ",mean_squared_error(y_test, lasso.predict(X_test)))
print("best model coefficients:")
pd.Series(lasso.coef_, index=X.columns)

We notice that 13 of the 19 coefficient estimates are exactly zero. The Lasso thus has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse. To complete the picture we need the corresponding cross-validation results of ridge regression.

**6. Ridge Regression, Cross Validated**

We again choose the best alpha by cross-validation (with its default cv=None, RidgeCV uses an efficient leave-one-out scheme rather than 10 folds), refit the model, compute the associated test error and print the best model's coefficients.


In [None]:
ridgecv = RidgeCV(alphas=alphas, normalize=True)
ridgecv.fit(X_train, y_train)
print("Alpha=", ridgecv.alpha_)
ridge6 = Ridge(alpha=ridgecv.alpha_, normalize=True)
ridge6.fit(X_train, y_train)
print("mse = ",mean_squared_error(y_test, ridge6.predict(X_test)))
print("best model coefficients:")
pd.Series(ridge6.coef_, index=X.columns)
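RidgeCV only performs k-fold cross-validation when cv is set; with the default cv=None it uses an efficient leave-one-out scheme. The sketch below shows both variants side by side on synthetic data from make_regression (an assumption, so that it runs without Hitters.csv). The names ending in _demo, alpha_grid, ridge_loo and ridge_10fold are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data (assumption: NOT the Hitters dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 n_informative=3, noise=10.0, random_state=0)
alpha_grid = 10 ** np.linspace(6, -2, 50) * 0.5

# Default: efficient leave-one-out cross-validation.
ridge_loo = RidgeCV(alphas=alpha_grid).fit(X_demo, y_demo)

# Explicit 10-fold cross-validation.
ridge_10fold = RidgeCV(alphas=alpha_grid, cv=10).fit(X_demo, y_demo)

print("alpha (leave-one-out):", ridge_loo.alpha_)
print("alpha (10-fold):      ", ridge_10fold.alpha_)
```

The two schemes may or may not select the same alpha; both always pick a value from the supplied grid.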

The models' performance at a glance:

In [None]:
print("mse ridge = ",mean_squared_error(y_test, ridge6.predict(X_test)))
print("mse lasso = ",mean_squared_error(y_test, lasso.predict(X_test)))


**Summary**

This is an adaptation of fairly popular R code (and its textbook). It shows how the Lasso and cross-validation can be performed on the Hitters dataset using a Kaggle kernel.

With alpha chosen by cross-validation, the test MSE of the Lasso in this example is a little worse than the test MSE of ridge regression. This should not be generalized; see the discussion in the aforementioned book "ISLR", pp. 223-224.

The lasso has a major advantage over ridge regression, in that it produces simpler and more interpretable models that involve only a subset of the predictors. 


Please feel free to fork this kernel and play around with different parameters. 
