In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from sklearn.linear_model import Lasso

---

In [4]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 
              'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 
              'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':float, 'condition':int, 
              'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 
              'view':int}

In [5]:
sales = pd.read_csv('./data/kc_house_data.csv', dtype=dtype_dict, index_col=0)

---

Create new features by applying the following transformations to the inputs:

In [6]:
sales['sqft_living_sqrt'] = sales['sqft_living'].apply(np.sqrt)
sales['sqft_lot_sqrt'] = sales['sqft_lot'].apply(np.sqrt)
sales['bedrooms_square'] = sales['bedrooms']*sales['bedrooms']
sales['floors_square'] = sales['floors']*sales['floors']

- Squaring bedrooms increases the separation between houses with few bedrooms (e.g. 1) and houses with many bedrooms (e.g. 4), since 1^2 = 1 but 4^2 = 16. Consequently, this feature will mostly affect houses with many bedrooms.
- Taking the square root of sqft_living, on the other hand, decreases the separation between big and small houses. The owner may not be exactly twice as happy with a house that is twice as big.
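
The effect of the two transformations can be checked on a couple of made-up values. This is only an illustrative sketch; the numbers below are not taken from the housing data.

```python
import numpy as np

bedrooms = np.array([1.0, 4.0])            # few vs. many bedrooms
sqft_living = np.array([1000.0, 4000.0])   # small vs. big house (made-up sizes)

# Squaring stretches the gap between the two bedroom counts.
gap_raw = bedrooms[1] - bedrooms[0]              # 3.0
gap_squared = bedrooms[1]**2 - bedrooms[0]**2    # 15.0

# The square root shrinks the ratio between the two house sizes.
ratio_raw = sqft_living[1] / sqft_living[0]                      # 4.0
ratio_sqrt = np.sqrt(sqft_living[1]) / np.sqrt(sqft_living[0])   # about 2.0

print(gap_raw, gap_squared)
print(ratio_raw, ratio_sqrt)
```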

---

Using the entire house dataset, learn regression weights using an L1 penalty of 5e2. Make sure to add "normalize=True" when creating the Lasso object. Refer to the following code snippet for the list of features.

In [7]:
all_features = ['bedrooms', 'bedrooms_square',
                'bathrooms',
                'sqft_living', 'sqft_living_sqrt',
                'sqft_lot', 'sqft_lot_sqrt',
                'floors', 'floors_square',
                'waterfront', 'view', 'condition', 'grade',
                'sqft_above',
                'sqft_basement',
                'yr_built', 'yr_renovated']

model_all = Lasso(alpha=5e2, normalize=True).fit(sales[all_features], sales['price'])

In [8]:
np.array(all_features)[model_all.coef_ != 0]

array(['sqft_living', 'view', 'grade'], dtype='<U16')
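
Note that the `normalize` argument of `Lasso` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells in this notebook need an older release to run unchanged. According to the older documentation, `normalize=True` subtracted the mean from each feature column and divided it by its L2 norm before fitting. A minimal numpy sketch of that preprocessing follows; `center_and_l2_normalize` is a hypothetical helper (not part of scikit-learn), and the toy matrix stands in for `sales[all_features]`.

```python
import numpy as np

def center_and_l2_normalize(X):
    """Center each column, then scale it to unit L2 norm (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    norms = np.linalg.norm(X_centered, axis=0)
    norms[norms == 0] = 1.0  # constant columns stay at zero instead of dividing by 0
    return X_centered / norms, norms

# Toy feature matrix: 4 houses, 2 features (values are made up).
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 1500.0],
                  [3.0, 2000.0],
                  [4.0, 3500.0]])

X_scaled, norms = center_and_l2_normalize(X_toy)
print(np.linalg.norm(X_scaled, axis=0))  # each column now has unit L2 norm
```

Weights fitted on the scaled columns would have to be divided by `norms` to be comparable with weights on the original scale. Whether a given penalty then selects the same features as the output above is not something this sketch verifies.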

---

To find a good L1 penalty, we will explore multiple values using a validation set. We use a three-way split into training, validation, and test sets. Download the provided CSV files containing the training, validation, and test sets.

In [9]:
testing = pd.read_csv('./data/wk3_kc_house_test_data.csv', dtype=dtype_dict, index_col=0)
training = pd.read_csv('./data/wk3_kc_house_train_data.csv', dtype=dtype_dict, index_col=0)
validation = pd.read_csv('./data/wk3_kc_house_valid_data.csv', dtype=dtype_dict, index_col=0)

In [10]:
# Apply the same four feature transformations to each split.
for df in (testing, training, validation):
    df['sqft_living_sqrt'] = df['sqft_living'].apply(np.sqrt)
    df['sqft_lot_sqrt'] = df['sqft_lot'].apply(np.sqrt)
    df['bedrooms_square'] = df['bedrooms']*df['bedrooms']
    df['floors_square'] = df['floors']*df['floors']

---

Now, for each l1_penalty in \[10^1, 10^1.5, 10^2, 10^2.5, ..., 10^7\] (in Python, np.logspace(1, 7, num=13) produces these values):

- Learn a model on TRAINING data using the specified l1_penalty. Make sure to specify normalize=True in the constructor.
- Compute the RSS on VALIDATION for the current model (print or save the RSS).

In [11]:
l1_penalty_values = np.logspace(1, 7, num=13)
RSS_validation = np.zeros_like(l1_penalty_values)

for i, l1_penalty in enumerate(l1_penalty_values):
    model = Lasso(alpha=l1_penalty, normalize=True).fit(training[all_features], training.price)
    yhat = model.predict(validation[all_features])
    RSS_validation[i] = np.sum((validation.price - yhat)**2)

In [12]:
l1_penalty_v = l1_penalty_values[np.argmin(RSS_validation)]
l1_penalty_v

10.0
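
The penalty grid used in this search is evenly spaced in the exponent rather than in the penalty itself. A quick check of that, as a small sketch (the name `l1_grid` is introduced here only for illustration):

```python
import numpy as np

l1_grid = np.logspace(1, 7, num=13)

# Taking log10 recovers the exponents 1, 1.5, 2, ..., 7.
exponents = np.log10(l1_grid)
print(exponents)

# Consecutive penalties differ by a constant factor of 10**0.5.
print(l1_grid[1:] / l1_grid[:-1])
```

Because the selected value, 10.0, is the smallest penalty on this grid, the validation RSS may still be decreasing at that edge, so extending the grid to smaller penalties could be worth trying.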

---

Now that you have selected an L1 penalty, compute the RSS on TEST data for the model with the best L1 penalty.

In [13]:
model_v = Lasso(alpha=l1_penalty_v, normalize=True).fit(training[all_features], training.price)

In [14]:
np.count_nonzero(model_v.coef_) + np.count_nonzero(model_v.intercept_)

15
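
The cells above report the number of non-zero weights but do not show the test RSS itself. A minimal sketch of that computation follows, using a hypothetical helper `rss` and made-up prices rather than the housing data.

```python
import numpy as np

def rss(y_true, y_pred):
    """Residual sum of squares (hypothetical helper, not from the notebook)."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sum(residuals ** 2))

# Made-up prices and predictions, for illustration only.
y_toy = [300000.0, 450000.0, 520000.0]
yhat_toy = [310000.0, 440000.0, 500000.0]

print(rss(y_toy, yhat_toy))  # 600000000.0
```

With the objects defined in this notebook, the corresponding call would presumably be `rss(testing['price'], model_v.predict(testing[all_features]))`.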

---

What if we absolutely wanted to limit ourselves to, say, 7 features? This may be important if we want to derive "a rule of thumb": an interpretable model that has only a few features in it.

You are going to implement a simple, two-phase procedure to achieve this goal:

- Explore a large range of ‘l1_penalty’ values to find a narrow region of ‘l1_penalty’ values where models are likely to have the desired number of non-zero weights.
- Further explore the narrow region you found to find a good value for ‘l1_penalty’ that achieves the desired sparsity. Here, we will again use a validation set to choose the best value for ‘l1_penalty’.

In [15]:
max_nonzeros = 7
l1_penalty_values = np.logspace(1, 4, num=20)

- Fit a regression model with a given l1_penalty on TRAIN data. Add "alpha=l1_penalty" and "normalize=True" to the parameter list.
- Extract the weights of the model and count the number of nonzeros. Take account of the intercept as we did in #8, adding 1 whenever the intercept is nonzero. Save the number of nonzeros to a list.

In [16]:
nonzeros_values = np.zeros_like(l1_penalty_values)

for i, l1_penalty in enumerate(l1_penalty_values):
    model = Lasso(alpha=l1_penalty, normalize=True).fit(training[all_features], training.price)
    nonzeros_values[i] = np.count_nonzero(model.coef_) + np.count_nonzero(model.intercept_)
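
The counting rule used in the loop above (non-zero coefficients plus one if the intercept is non-zero) can be illustrated on a toy weight vector. `count_nonzeros` is a hypothetical helper name, and the weights are made up.

```python
import numpy as np

def count_nonzeros(coef, intercept):
    """Count non-zero coefficients, adding 1 when the intercept is non-zero."""
    return int(np.count_nonzero(coef) + np.count_nonzero(intercept))

coef_toy = np.array([0.0, 12.5, 0.0, -3.0, 0.0])  # two non-zero weights

print(count_nonzeros(coef_toy, 1000.0))  # 3: two weights plus the intercept
print(count_nonzeros(coef_toy, 0.0))     # 2: the intercept is zero
```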

---

Out of this large range, we want to find the two ends of our desired narrow range of l1_penalty. At one end, we will have l1_penalty values that have too few non-zeros, and at the other end, we will have an l1_penalty that has too many non-zeros.

More formally, find:

- The largest l1_penalty that has more non-zeros than ‘max_nonzeros’ (if we pick a penalty smaller than this value, we will definitely have too many non-zero weights). Store this value in the variable ‘l1_penalty_min’ (we will use it later).
- The smallest l1_penalty that has fewer non-zeros than ‘max_nonzeros’ (if we pick a penalty larger than this value, we will definitely have too few non-zero weights). Store this value in the variable ‘l1_penalty_max’ (we will use it later).

Hint: there are many ways to do this, e.g.:

- Programmatically within the loop above
- Creating a list with the number of non-zeros for each value of l1_penalty and inspecting it to find the appropriate boundaries.

In [17]:
idx_min = np.where(nonzeros_values > max_nonzeros)[0][-1]
idx_max = np.where(nonzeros_values < max_nonzeros)[0][0]

In [18]:
l1_penalty_min = l1_penalty_values[idx_min]
l1_penalty_max = l1_penalty_values[idx_max]

In [19]:
l1_penalty_min, l1_penalty_max

(127.42749857031335, 263.6650898730358)

In [20]:
l1_penalty_values[nonzeros_values == max_nonzeros]

array([183.29807108])
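
The bracketing logic above can be traced on a small made-up grid. This sketch assumes the number of non-zeros does not increase as the penalty grows, and it adds a guard for the case where the grid does not bracket the target, in which the `[0][-1]` and `[0][0]` indexing would raise an `IndexError`. All names ending in `_toy` are hypothetical.

```python
import numpy as np

# Made-up penalties and non-zero counts, for illustration only.
penalties_toy = np.array([10.0, 20.0, 40.0, 80.0, 160.0, 320.0])
nonzeros_toy = np.array([15, 12, 10, 7, 6, 3])
max_nonzeros = 7

too_many = np.where(nonzeros_toy > max_nonzeros)[0]  # indices 0, 1, 2
too_few = np.where(nonzeros_toy < max_nonzeros)[0]   # indices 4, 5

# Without both sides present there is no bracket to narrow down.
if too_many.size == 0 or too_few.size == 0:
    raise ValueError("penalty grid does not bracket the target sparsity")

l1_penalty_min_toy = penalties_toy[too_many[-1]]  # largest penalty with too many non-zeros
l1_penalty_max_toy = penalties_toy[too_few[0]]    # smallest penalty with too few non-zeros

print(l1_penalty_min_toy, l1_penalty_max_toy)  # 40.0 160.0
```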

---

Exploring a narrower range of l1_penalty

We now explore the region of l1_penalty we found: between ‘l1_penalty_min’ and ‘l1_penalty_max’. We look for the L1 penalty in this range that produces exactly the right number of nonzeros and also minimizes RSS on the VALIDATION set.

For l1_penalty in np.linspace(l1_penalty_min,l1_penalty_max,20):

- Fit a regression model with a given l1_penalty on TRAIN data. As before, use "alpha=l1_penalty" and "normalize=True".
- Measure the RSS of the learned model on the VALIDATION set

Find the model that has the lowest RSS on the VALIDATION set and has sparsity equal to ‘max_nonzeros’. (Again, take account of the intercept when counting the number of nonzeros.)

In [21]:
l1_penalty_values = np.linspace(l1_penalty_min, l1_penalty_max, 20)
RSS_validation = np.zeros_like(l1_penalty_values)
nonzeros_values = np.zeros_like(l1_penalty_values)

for i, l1_penalty in enumerate(l1_penalty_values):
    model = Lasso(alpha=l1_penalty, normalize=True).fit(training[all_features], training.price)
    yhat = model.predict(validation[all_features])
    RSS_validation[i] = np.sum((validation.price - yhat)**2)
    nonzeros_values[i] = np.count_nonzero(model.coef_) + np.count_nonzero(model.intercept_)

In [22]:
idx_targets = np.where(nonzeros_values == max_nonzeros)[0]
idx_final = idx_targets[np.argmin(RSS_validation[idx_targets])]
l1_penalty_final = l1_penalty_values[idx_final]
l1_penalty_final

156.10909673930755
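
The selection step above first restricts the candidates to penalties with exactly the target sparsity and only then minimizes the validation RSS. A toy trace with made-up numbers shows why the order matters: the globally lowest RSS belongs to a model with too many non-zeros and is therefore skipped. The `_toy` names are hypothetical.

```python
import numpy as np

# Made-up values, for illustration only.
penalties_toy = np.array([100.0, 110.0, 120.0, 130.0, 140.0])
nonzeros_toy = np.array([8, 7, 7, 7, 6])
rss_toy = np.array([1.0, 1.3, 1.1, 1.2, 1.5])
max_nonzeros = 7

# Keep only the positions whose sparsity matches the target.
idx_targets = np.where(nonzeros_toy == max_nonzeros)[0]    # indices 1, 2, 3

# argmin runs over the filtered RSS values, so map back to the original index.
idx_final = idx_targets[np.argmin(rss_toy[idx_targets])]   # index 2
best_penalty_toy = penalties_toy[idx_final]

print(best_penalty_toy)  # 120.0
```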

In [23]:
model_final = Lasso(alpha=l1_penalty_final, normalize=True).fit(training[all_features], training.price)

In [24]:
np.array(all_features)[model_final.coef_ != 0]

array(['bathrooms', 'sqft_living', 'waterfront', 'view', 'grade',
       'yr_built'], dtype='<U16')

---

# LASSO Solver via Coordinate Descent