Statsmodels regularized linear regression (lasso)
=================================================

This notebook demonstrates how to fit a linear regression model using L1 regularization (the "lasso").

In [0]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

The data consist of measurements made on people sampled from a population that is at high risk for developing diabetes.  The data are here:

http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Pima.te.csv

We will be regressing serum glucose on several factors that may influence it.

In [0]:
data = pd.read_csv("Pima.te.csv")
print(data.columns)

The glucose data are slightly right-skewed, so following convention, we log-transform them.

In [0]:
data["log_glu"] = np.log(data["glu"])

Below we will be fitting models with interactions.  In general, it is easier to interpret the effects of an interaction if we center or standardize the components of the interaction before forming the product that defines the interaction.  The next cell standardizes all of the variables that will be used below.

In [0]:
data["type"] = 1.0*(data["type"] == "Yes")

for vname in "type", "age", "bmi", "skin", "log_glu", "bp":
    data[vname] = (data[vname] - data[vname].mean()) / data[vname].std()
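
As an aside, the effect of centering on an interaction term can be seen on synthetic data.  The sketch below is illustrative only: it assumes two independent normal covariates with made-up nonzero means (not the Pima variables), and `standardize` is a helper name introduced here.  It compares how strongly the product term is correlated with one of its components before and after standardizing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic covariates with nonzero means (illustrative, not the Pima data).
n = 1000
x1 = rng.normal(loc=5.0, scale=1.0, size=n)
x2 = rng.normal(loc=3.0, scale=1.0, size=n)


def standardize(x):
    """Center to mean zero and scale to unit sample standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)


# Product formed from the raw variables versus from the standardized ones.
raw_product = x1 * x2
z1, z2 = standardize(x1), standardize(x2)
std_product = z1 * z2

# Correlation between each product and its first component.
corr_raw = np.corrcoef(x1, raw_product)[0, 1]
corr_std = np.corrcoef(z1, std_product)[0, 1]

print("correlation of x1 with raw product:          %.3f" % corr_raw)
print("correlation of z1 with standardized product: %.3f" % corr_std)
```

When the components are uncentered, the product largely duplicates the main effects, which makes the main-effect and interaction coefficients hard to separate.  After centering, the product carries mostly new information.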

So far we have standardized each individual variable.  We will also want to standardize the interaction products.  When using the lasso, it is simplest to standardize all variables, including the outcome.  This has two benefits.  First, we do not need an intercept term, which would otherwise have to be left unpenalized and so handled differently from the other variables.  Second, because all covariates are on the same scale, we can use the same penalty weight for all of them.  The following cell creates all pairwise interaction products and standardizes them.

In [0]:
vnames = ["age", "bmi", "skin", "bp", "type"]

# Form each pairwise product once (v2 precedes v1 in vnames), then standardize it.
for j1, v1 in enumerate(vnames):
    for v2 in vnames[:j1]:
        x = data[v1] * data[v2]
        x = (x - x.mean()) / x.std()
        data[v1 + "_x_" + v2] = x
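
The claim that standardizing removes the need for an intercept can be checked numerically.  The following is a minimal sketch on synthetic data (the covariates, coefficients, and the `standardize` helper are all made up for illustration).  It fits least squares with and without an intercept column after standardizing both the covariates and the outcome.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic covariates on very different scales, and an outcome with a
# nonzero intercept (all values here are arbitrary illustrations).
n = 200
X = rng.normal(loc=[10.0, -2.0, 50.0], scale=[1.0, 5.0, 20.0], size=(n, 3))
y = 4.0 + X @ np.array([0.5, -0.2, 0.01]) + rng.normal(size=n)


def standardize(a):
    """Center each column to mean zero and scale to unit sample SD."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)


Xs = standardize(X)
ys = standardize(y)

# Least squares with an explicit intercept column...
design = np.column_stack([np.ones(n), Xs])
coef_with_icept, *_ = np.linalg.lstsq(design, ys, rcond=None)

# ...and without one.
coef_no_icept, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

print("fitted intercept:", coef_with_icept[0])
print("slopes with intercept:   ", coef_with_icept[1:])
print("slopes without intercept:", coef_no_icept)
```

Because every column has mean zero, the fitted intercept is zero up to rounding error and the slopes are the same in both fits, so dropping the intercept (the `0 +` term in a formula) loses nothing.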

We save the data, so we can compare to R below.

In [0]:
data.to_csv("diabetes.csv", float_format="%.4f")

We start with a big model that contains the main effects and all interactions with the `type` variable.  This model is fit using unregularized OLS.

In [0]:
print(data.columns)
fml = "log_glu ~ 0 + bp + skin + age + bmi + type + type_x_bmi + type_x_age + type_x_skin + type_x_bp"
mod = sm.OLS.from_formula(fml, data)
rslt = mod.fit()
print(rslt.summary())

Next we fit a sequence of models using the lasso.  We collect the parameter estimates and display them as a table.  This is a crude way to obtain the "solution path".  In the near future, we will provide a function to do this for you.

In [0]:
mat = []
alphas = np.arange(0, 0.26, 0.05)
for alpha in alphas:
    mod = sm.OLS.from_formula(fml, data)
    rslt = mod.fit_regularized(alpha=alpha)
    mat.append(rslt.params)
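# Label each row with its penalty weight, rounded to avoid float noise such as 0.15000000000000002.
mat = pd.DataFrame(mat, index=["%.2f" % x for x in alphas])
print(mat.T)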
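
To make the role of `alpha` concrete, here is a minimal NumPy sketch of a lasso solver based on cyclic coordinate descent.  It is not the statsmodels implementation.  It assumes the objective `0.5*RSS/n + alpha*sum(|beta|)`, which is the form the statsmodels documentation gives for a pure L1 penalty, and it runs on synthetic standardized data.  The names `soft_threshold` and `lasso_cd` are introduced here for illustration.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, alpha, n_iter=200):
    """Minimize 0.5*RSS/n + alpha*sum(|beta|) by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of covariate j added back in.
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
    return beta


# Synthetic data: only the first and fourth covariates matter (made-up values).
rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 5))
y = X @ np.array([0.5, 0.0, 0.0, -0.3, 0.0]) + 0.5 * rng.normal(size=n)

# Standardize covariates and outcome so no intercept is needed.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
ys = (y - y.mean()) / y.std(ddof=1)

# Same grid of penalty weights as in the cell above.
alphas = np.arange(0, 0.26, 0.05)
betas = np.array([lasso_cd(Xs, ys, alpha) for alpha in alphas])

for alpha, b in zip(alphas, betas):
    print("alpha=%.2f  nonzero=%d  coefs=%s" % (alpha, np.count_nonzero(b), np.round(b, 3)))
```

With `alpha=0` the update reduces to ordinary least squares.  As `alpha` grows, the soft-thresholding step sets more coefficients exactly to zero, which is the behavior the solution path is meant to display.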

Here is R code that performs the same analysis using the glmnet library:

```
library(glmnet)

# The data set created above.
data = read.csv("diabetes.csv")

x = cbind(data$bp, data$skin, data$age, data$bmi, data$type, data$type_x_bmi, data$type_x_age,
    data$type_x_skin, data$type_x_bp)
y = data$log_glu

model = glmnet(x, y, lambda=c(0, 0.05, 0.1, 0.15, 0.2, 0.25))

print(model$beta[,6])
print(model$beta[,1])
```

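glmnet orders the lambda sequence from largest to smallest, so column 6 of `model$beta` corresponds to lambda=0 and column 1 to lambda=0.25.  The results below agree with what we obtained above for alpha=0 and alpha=0.25, respectively: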

```
          V1           V2           V3           V4           V5           V6 
0.059751856  0.050783199  0.064140463  0.072454663  0.458089343 -0.009592389 
          V7           V8           V9 
-0.025364554 -0.065635870  0.016002218 
       V1        V2        V3        V4        V5        V6        V7        V8 
0.0000000 0.0000000 0.0000000 0.0000000 0.2536029 0.0000000 0.0000000 0.0000000 
       V9 
0.0000000 
```