# Homework 3

### Due: Tues Nov. 20 @ 9pm

In this homework we will perform model evaluation, model selection, and feature selection in both a regression and a classification setting.

The data we will be looking at are a subset of home sales data from King County, Washington, as we might see on a real-estate website.


## Instructions

Follow the comments below and fill in the blanks (____) to complete each cell.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Part 1: Regression

Here we try to build a model to predict adjusted sales price from a set of building features.

### Load data

In [None]:
# Load data from file
# DO NOT CHANGE THIS (needed for grading)
infile_name = '../data/house_sales_subset_normed.csv'
df = pd.read_csv(infile_name)

# Use a subset of the columns as features
X = df[['SqFtTotLiving_norm','SqFtLot_norm','Bathrooms','Bedrooms','TrafficNoise']]

# Extract the target, adjusted sale price, in units of $100,000
# Note: the '_r' suffix denotes the regression target ('_c' is used later for classification)
y_r = df.AdjSalePrice / 100000

### Create a held-aside set

In [None]:
# Split into 80% train and 20% test using train_test_split and random_state=42
from sklearn.model_selection import train_test_split
X_train_r, X_test_r, y_train_r, y_test_r = ____

### Measure baseline performance

In [None]:
# Instantiate and train a dummy model on the training set using DummyRegressor
from sklearn.dummy import DummyRegressor
dummy_r = ____

In [None]:
# Calculate and print RMSE training set error of the dummy model
from sklearn.metrics import mean_squared_error
dummy_r_training_rsme = ____
print('dummy RMSE: {:.3f}'.format(dummy_r_training_rsme))

In [None]:
# Calculate and print the R2 training set score of the dummy model
# hint: you can use the model's 'score' method
dummy_r_training_r2 = ____
print('dummy R2: {:.3f}'.format(dummy_r_training_r2))
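
As a sanity check on the two baseline metrics above, here is a minimal sketch on synthetic data (the names `X_toy`, `y_toy`, and `baseline` are hypothetical and not part of the assignment). Assuming the dummy model predicts the training mean, its training RMSE should equal the standard deviation of the target and its training R2 should be 0 up to floating-point error.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (hypothetical, not the homework dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = rng.normal(loc=5.0, scale=2.0, size=50)

# A mean-predicting baseline ignores X entirely
baseline = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_toy, pred))
r2 = r2_score(y_toy, pred)

print('toy baseline RMSE: {:.3f}'.format(rmse))
print('std of y_toy     : {:.3f}'.format(y_toy.std()))
print('toy baseline R2  : {:.3f}'.format(r2))
```

Any real model should beat these numbers; if it does not, something is likely wrong in the pipeline.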

In [None]:
# Calculate and print the mean 5-fold cross-validation R2 score of the dummy model
from sklearn.model_selection import cross_val_score
dummy_r_cv = ____
print('dummy mean cv R2: {:.3f}'.format(____))

### Measure performance of Linear Regression

In [None]:
# Instantiate and train a LinearRegression model on the training set
from sklearn.linear_model import LinearRegression
lr = ____

In [None]:
# Calculate RMSE training set error of the linear model
# There should be an improvement over the dummy model
lr_rmse = ____
print('lr RMSE: {:.3f}'.format(lr_rmse))

In [None]:
# Calculate and print the R2 training set score of the linear model
lr_r2 = ____
print('lr R2: {:.4f}'.format(lr_r2))

In [None]:
# Calculate mean 5-fold Cross Validation R2 score of the linear model on the training set using cross_val_score
from sklearn.model_selection import cross_val_score
scores = ____
print('lr mean cv R2: {:.4f}'.format(____))
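
To make the shape of the cross-validation output concrete, here is a small sketch on synthetic data (all names are hypothetical). It assumes the default scorer, which for scikit-learn regressors is R2, and shows that `cross_val_score` returns one score per fold that we then average.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with a little noise (hypothetical, not the homework dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# One R2 score per fold; the model is refit from scratch on each training split
cv_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5)
mean_cv_r2 = cv_scores.mean()

print('per-fold R2: {}'.format(np.round(cv_scores, 4)))
print('mean cv R2 : {:.4f}'.format(mean_cv_r2))
```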

### Model selection

In [None]:
# We'll also train an Elastic Net model using regularization
# Perform GridSearch over different proportions of the l1_ratio = [.1,.5,.9,1] using the training set
# The only parameter in our search is this l1_ratio
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
params = ____
gs = ____

In [None]:
# Print out the best R2 score found using grid search and the best parameter setting found
print('gs best R2 score : {:.4f}'.format(____))
print('gs best params: {}'.format(____))
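
For reference, the sketch below shows how the grid-search attributes relate to each other on synthetic data. It deliberately searches over a different hyperparameter (`alpha`, with made-up values) than the one this homework asks for, and all names are hypothetical. With the default `refit=True`, the best setting is also refit on all the data passed to `fit`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic data (hypothetical, not the homework dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Illustrative grid over alpha only (values chosen arbitrarily)
param_grid = {"alpha": [0.01, 0.1, 1.0]}
search = GridSearchCV(ElasticNet(), param_grid, cv=5)
search.fit(X_toy, y_toy)

# best_score_ is the highest mean cross-validated score across the grid
mean_scores = search.cv_results_["mean_test_score"]
print('mean cv score per setting: {}'.format(np.round(mean_scores, 4)))
print('best score : {:.4f}'.format(search.best_score_))
print('best params: {}'.format(search.best_params_))
```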

In [None]:
# Using the best parameter setting found via grid search in the previous step,
#   calculate and print the mean 5-fold cv R2 score on the training set
en = ____
scores = ____
print('en mean cv R2  : {:.4f}'.format(____))

In [None]:
# Retrain the ElasticNet model on the full training set and get predictions on the full training set
y_hat = ____

In [None]:
# Plot predictions (x-axis) vs residuals (y-axis) using plt.scatter() with alpha=0.2
# Set axis names appropriately ('y_hat' and 'residuals')
# recall: residual = y_hat - y
residuals = ____
_ = ____
_ = ____
_ = ____
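
A tiny worked example of the sign convention used in this notebook (residual = y_hat - y), with made-up numbers: under this convention a positive residual means the model over-predicted.

```python
import numpy as np

# Made-up targets and predictions (hypothetical values for illustration)
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([4.0, 5.0, 6.0])

# Notebook convention: residual = y_hat - y
residuals_toy = y_pred - y_true
over_predicted = residuals_toy > 0

print('residuals     : {}'.format(residuals_toy))
print('over-predicted: {}'.format(over_predicted))
```

In a residual plot, points above zero are then over-predictions and points below zero are under-predictions.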

### Evaluate trained models on Test

In [None]:
# Using our trained models, calculate RMSE on the test set
print('dummy_r test RMSE  : {:.3f}'.format(____))
print('lr test RMSE       : {:.3f}'.format(____))
print('en test RMSE       : {:.3f}'.format(____))

### Feature selection

In [None]:
# Using the ElasticNet model we trained before, what features have a non-zero coefficient?
print('kept columns: {}'.format(____))

In [None]:
# Now, select the top 3 most informative features from the training set
#   using SelectKBest and the f_regression metric
# First, instantiate and fit SelectKBest on the training set
from sklearn.feature_selection import SelectKBest, f_regression
skb = ____

In [None]:
# Print out the selected features using skb.get_support() and the column names from X_train_r
# In this case, they should match the features kept by the ElasticNet model
kept_columns = ____
print('kept columns: {}'.format(kept_columns))
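
Selectors such as `SelectKBest` report their choice through `get_support()`, which returns a boolean mask with one entry per input feature. The sketch below shows how such a mask picks out column names; the mask and the feature names here are written by hand and are purely hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and a hand-written boolean mask
columns = pd.Index(['feat_a', 'feat_b', 'feat_c', 'feat_d'])
mask = np.array([True, False, True, False])

# Boolean indexing keeps only the entries where the mask is True
kept = columns[mask].tolist()
print('kept columns: {}'.format(kept))
```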

---

## Part 2: Classification

Here we try to build a model to predict low vs. high adjusted sales price.

### Create classification target

In [None]:
# First, we create a binary target by thresholding AdjSalePrice at its median
y_c = (df.AdjSalePrice > df.AdjSalePrice.median()).astype(int)

In [None]:
# What is the proportion of 'high' labels in our dataset?
print('proportion of high to low: {:.3f}'.format(____))

### Create a held-aside set

In [None]:
# Split into 80% train and 20% test using train_test_split with random_state=42
# Use our new y_c target and the same X we used for regression
X_train_c, X_test_c, y_train_c, y_test_c = ____

### Measure baseline performance

In [None]:
# Train a dummy classification model on the training set
from sklearn.dummy import DummyClassifier
dummy_c = ____

In [None]:
# Calculate training set Accuracy of the dummy classifier
# This should be close to the proportion of 'high' labels computed above
dummy_c_acc = ____
print('dummy accuracy: {:.3f}'.format(dummy_c_acc))

In [None]:
# Get P(y=1|x) for the test set using the dummy model (we'll use this later)
# Note: we only want P(y=1|x) even though predict_proba returns two columns
pypos_dummy = ____
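
To see why only one column of `predict_proba` is needed, here is a sketch on synthetic data (all names are hypothetical). The columns follow the order of the fitted model's `classes_` attribute, and each row sums to 1, so the column for class 1 carries all the information in the binary case.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (hypothetical, not the homework dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

# Locate the column that corresponds to the positive class (label 1)
pos_col = list(clf.classes_).index(1)
p_pos = proba[:, pos_col]

print('classes_     : {}'.format(clf.classes_))
print('proba shape  : {}'.format(proba.shape))
print('P(y=1|x) head: {}'.format(np.round(p_pos[:3], 3)))
```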

### Measure performance of a Logistic Regression model

In [None]:
# Instantiate and train a logistic regression model using default hyperparameters
from sklearn.linear_model import LogisticRegression
logr = ____

In [None]:
# What is the training set accuracy of our logistic regression model?
trainset_acc = ____
print('logr training set accuracy: {:.3f}'.format(trainset_acc))

In [None]:
# What is the 5 fold cross-validation accuracy of the logistic regression model on the training set?
scores = ____
print('logr mean cv accuracy: {:.3f}'.format(____))

In [None]:
# Get P(y=1|x) for the test set using the logistic regression model (we'll use this later)
pypos_logr = ____

### Model selection using a Random Forest model

In [None]:
# Perform 3-fold cross validated grid search over the number of trees
# The parameter settings to try are n_estimators = [5,50,100] 
# Perform the search using the training set
from sklearn.ensemble import RandomForestClassifier
params = ____
gs = ____

In [None]:
# Print out the best score found and the best parameter setting found
print('gs best accuracy: {:.3f}'.format(____))
print('gs best params  : {}'.format(____))

In [None]:
# Retrain on the entire training set using the best number of trees found
rf = ____

In [None]:
# Get P(y=1|x) for the test set using the trained rf model
pypos_rf = ____

### Plotting Precision-Recall curve for the Random Forest model

In [None]:
# Plot Precision (y-axis) vs. Recall (x-axis) curve for the Random Forest model
# First calculate precision and recall using the y_test_c and pypos_rf 
from sklearn.metrics import precision_recall_curve
precision, recall, _ = ____

In [None]:
# Next, plot the curve using plt.step()
# Recall should be on the x-axis
# Label the x and y axes appropriately
_ = ____
_ = ____
_ = ____
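
As a reminder of what `precision_recall_curve` returns, here is a sketch with four made-up labels and scores (hypothetical values). The precision and recall arrays are one element longer than the thresholds array, because a final point with precision 1 and recall 0 is appended.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and positive-class scores (hypothetical values)
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

prec, rec, thresholds = precision_recall_curve(y_true_toy, scores_toy)

print('precision : {}'.format(np.round(prec, 3)))
print('recall    : {}'.format(np.round(rec, 3)))
print('thresholds: {}'.format(thresholds))
```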

### Plotting ROC curves for all models

In [None]:
# Plot the ROC curves of our 3 trained models (dummy, logr and rf) 
# First calculate fpr and tpr for each model using y_test_c and each set of pypos values
from sklearn.metrics import roc_curve
fpr_dummy,tpr_dummy,_ = ____
fpr_logr,tpr_logr,_ = ____
fpr_rf,tpr_rf,_ = ____

In [None]:
# Next, plot each of the 3 curves using plt.step()
# Each curve should be a different color (dummy:blue, logr:red, rf:green)
# Include a legend by adding label='model_name' to each plt.step call and calling plt.legend()
# Label the axes as 'fpr' and 'tpr' appropriately
_ = ____ # curve for dummy
_ = ____ # curve for logr
_ = ____ # curve for rf
_ = ____ # add a legend
_ = ____ # set x-axis label
_ = ____ # set y-axis label

In [None]:
# Calculate and print the ROC AUC values on the test set for each model
from sklearn.metrics import roc_auc_score
dummy_auc = ____
logr_auc = ____
rf_auc = ____
print('dummy auc: {:.3f}'.format(dummy_auc))
print('logr auc : {:.3f}'.format(logr_auc))
print('rf auc   : {:.3f}'.format(rf_auc))
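
To help interpret the AUC values, here is a sketch with made-up labels and scores (hypothetical values). ROC AUC can be read as the fraction of (positive, negative) pairs that the scores rank correctly, and a model that gives every example the same score lands at 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and positive-class scores (hypothetical values)
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

# 3 of the 4 (positive, negative) pairs are ranked correctly
auc_toy = roc_auc_score(y_true_toy, scores_toy)

# A constant score carries no ranking information
auc_const = roc_auc_score(y_true_toy, np.full(4, 0.5))

print('toy auc     : {:.3f}'.format(auc_toy))
print('constant auc: {:.3f}'.format(auc_const))
```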

### Feature selection

In [None]:
# Using the feature importances from the trained Random Forest model, 
#  print the feature name and feature importances for each feature in X
# Each row should look like this, for example: SqFtLot_norm : 0.025
____

In [None]:
# Select the most informative features using SelectFromModel with 'mean' as the threshold
# Use prefit=True since the model is already trained and does not need to be refit
from sklearn.feature_selection import SelectFromModel
sfm = ____

In [None]:
# Print out the selected features using X.columns and sfm.get_support()
kept_columns = ____
print('kept columns: {}'.format(kept_columns))
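
The mean-threshold rule can be checked by hand. The sketch below uses made-up importances and feature names (both hypothetical) and assumes the rule keeps features whose importance is greater than or equal to the mean importance.

```python
import numpy as np

# Made-up feature names and importances (hypothetical values)
names = np.array(['feat_a', 'feat_b', 'feat_c', 'feat_d'])
importances = np.array([0.5, 0.05, 0.3, 0.15])

# Keep features whose importance is at least the mean importance
threshold = importances.mean()
mask = importances >= threshold
kept = names[mask].tolist()

for name, importance in zip(names, importances):
    print('{} : {:.3f}'.format(name, importance))
print('threshold   : {:.3f}'.format(threshold))
print('kept columns: {}'.format(kept))
```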