# LASSO Regression

# Firing up graphlab 

In [3]:
import graphlab

# Loading house sales data



In [4]:
sales = graphlab.SFrame('kc_house_data.gl/')

# Create new features

As in Week 2, we consider features that are some transformations of inputs.

In [5]:
from math import log, sqrt
sales['sqft_living_sqrt'] = sales['sqft_living'].apply(sqrt)
sales['sqft_lot_sqrt'] = sales['sqft_lot'].apply(sqrt)
sales['bedrooms_square'] = sales['bedrooms']*sales['bedrooms']

# In the dataset, 'floors' was defined with type string, 
# so we'll convert them to float, before creating a new feature.
sales['floors'] = sales['floors'].astype(float) 
sales['floors_square'] = sales['floors']*sales['floors']

* Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4) since 1^2 = 1 but 4^2 = 16. Consequently this variable will mostly affect houses with many bedrooms.
* On the other hand, taking square root of sqft_living will decrease the separation between big house and small house. The owner may not be exactly twice as happy for getting a house that is twice as big.

# Learn regression weights with L1 penalty

Let us fit a model with all the features available, plus the features we just created above.

In [6]:
all_features = ['bedrooms', 'bedrooms_square',
            'bathrooms',
            'sqft_living', 'sqft_living_sqrt',
            'sqft_lot', 'sqft_lot_sqrt',
            'floors', 'floors_square',
            'waterfront', 'view', 'condition', 'grade',
            'sqft_above',
            'sqft_basement',
            'yr_built', 'yr_renovated']

Lasso regression model can be implemented just by applying L1 penalty which requires adding an extra parameter (`l1_penalty`) to the linear regression call in a graphlab's linearr_regression.create function.

In [7]:
model_all = graphlab.linear_regression.create(sales, target='price', features=all_features,
                                              validation_set=None, 
                                              l2_penalty=0., l1_penalty=1e10)

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 21613
PROGRESS: Number of features          : 17
PROGRESS: Number of unpacked features : 17
PROGRESS: Number of coefficients    : 18
PROGRESS: Starting Accelerated Gradient (FISTA)
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+---------------+
PROGRESS: Tuning step size. First iteration could take longer than subsequent iterations.
PROGRESS: | 1         | 2        | 0.000002  | 1.323708     | 6962915.603493     | 426631.749026 |
PROGRESS: | 2         | 3        | 0.000002  | 1.354911     | 6843144.200219     | 392488.929838 |
PROGRESS: | 3         | 4      

In [8]:
# Finding features with non-zero weights
# non_zero_weight = model_all.get("coefficients")["value"]
non_zero_weight = model_all["coefficients"][model_all["coefficients"]["value"] > 0]
non_zero_weight.print_rows(num_rows=20)

+------------------+-------+---------------+
|       name       | index |     value     |
+------------------+-------+---------------+
|   (intercept)    |  None |  274873.05595 |
|    bathrooms     |  None | 8468.53108691 |
|   sqft_living    |  None | 24.4207209824 |
| sqft_living_sqrt |  None | 350.060553386 |
|      grade       |  None | 842.068034898 |
|    sqft_above    |  None | 20.0247224171 |
+------------------+-------+---------------+
[6 rows x 3 columns]



So, a majority of the weights have been set to zero. So by setting an L1 penalty that's large enough, we are performing a subset selection. So, implementing L1 penalty ignores the less important features and makes the computation task easier and faster.


# Selecting an L1 penalty

To find a good L1 penalty, we will explore multiple values using a validation set. Let us do three way split into train, validation, and test sets:
* Split our sales data into 2 sets: training and test
* Further split our training data into two sets: train, validation

In [9]:
(training_and_validation, testing) = sales.random_split(.9,seed=1) # initial (train+valid) and test split
(training, validation) = training_and_validation.random_split(0.5, seed=1) # splitting equally into train and validate

Let's fit a regression model on different values of L1 penalty and i will compute their RSS on validation data i declared above. The L1 penalty value which produces the lowest RSS on validation data is the best value.

In [46]:
import numpy as np
import pprint 

validation_rss = {}
for l1_penalty in np.logspace(1, 7, num=13):
    model = graphlab.linear_regression.create(training, target='price', features=all_features,
                                              validation_set=None, verbose = False,
                                              l2_penalty=0., l1_penalty=l1_penalty)
    predictions = model.predict(validation)
    residuals = validation['price'] - predictions
    rss = sum(residuals**2)
    validation_rss[l1_penalty] = rss

print min(validation_rss.items(), key=lambda x: x[1]) 

(10.0, 625766285142461.2)


So best value of L1 is 10 .

In [47]:
model_test = graphlab.linear_regression.create(training, target='price', features=all_features,
                                              validation_set=None, verbose = False,
                                              l2_penalty=0., l1_penalty=10.0)
predictions_test = model.predict(testing)
residuals_test = testing['price'] - predictions_test
rss_test = sum(residuals_test**2)
print rss_test

1.56972779669e+14


In [49]:
non_zero_weight_test = model_test["coefficients"][model_test["coefficients"]["value"] > 0]
print model_test["coefficients"]["value"].nnz()
non_zero_weight_test.print_rows(num_rows=20)

18
+------------------+-------+------------------+
|       name       | index |      value       |
+------------------+-------+------------------+
|   (intercept)    |  None |  18993.4272128   |
|     bedrooms     |  None |  7936.96767903   |
| bedrooms_square  |  None |  936.993368193   |
|    bathrooms     |  None |  25409.5889341   |
|   sqft_living    |  None |  39.1151363797   |
| sqft_living_sqrt |  None |  1124.65021281   |
|     sqft_lot     |  None | 0.00348361822299 |
|  sqft_lot_sqrt   |  None |  148.258391011   |
|      floors      |  None |   21204.335467   |
|  floors_square   |  None |  12915.5243361   |
|    waterfront    |  None |  601905.594545   |
|       view       |  None |  93312.8573119   |
|    condition     |  None |  6609.03571245   |
|      grade       |  None |  6206.93999188   |
|    sqft_above    |  None |  43.2870534193   |
|  sqft_basement   |  None |  122.367827534   |
|     yr_built     |  None |  9.43363539372   |
|   yr_renovated   |  None |  56.0720

Using this value of L1 penalty, there are 18 non-zero weights. 