# Assignment 5: Data Preprocessing

# Setup

Import Python libraries.

In [None]:
%pylab inline

import pandas as pd
import seaborn

from sklearn import model_selection
from sklearn import pipeline, feature_selection, linear_model, preprocessing, metrics

seaborn.set_style("whitegrid")

We'll be using the Ames Housing dataset during this exercise. Below we'll read the data in and display the first five rows.

In [5]:
ames_df = pd.read_csv("http://www.amstat.org/publications/jse/v19n3/Decock/AmesHousing.txt", delimiter="\t")
ames_df.head()

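|   | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141 | 31770 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80 | 11622 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81 | 14267 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 93 | 11160 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 74 | 13830 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | 189900 |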


We see that there is a mixture of numerical, categorical, and NaN values in the dataset. Below we will keep only the numerical features and drop any rows that contain NaN values.

In [6]:
num_cols = [c for c in ames_df.columns if ames_df[c].dtype in ['int64', 'float64']]
ames_df = ames_df[num_cols].copy()

ames_df.dropna(inplace=True)
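As a rough illustration of what the cell above does, here is a small sketch on a made-up frame (`toy_df` and its values are hypothetical, not the Ames data). It uses `select_dtypes(include="number")`, which is a slightly broader filter than the `int64`/`float64` check above because it also keeps other numeric dtypes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for ames_df (hypothetical values, not the real data).
toy_df = pd.DataFrame({
    "Lot Area": [31770, 11622, 14267],
    "Lot Frontage": [141.0, np.nan, 81.0],
    "Street": ["Pave", "Pave", "Pave"],
})

# Keep only numeric columns, then drop every row that contains a NaN.
toy_num = toy_df.select_dtypes(include="number").dropna()

print(toy_num.columns.tolist())  # the categorical "Street" column is gone
print(len(toy_num))              # the row with the missing frontage is gone
```

Note that `dropna()` removes whole rows, so a single missing value in any column costs the entire observation.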

In [7]:
ames_df.head()

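|   | Order | PID | MS SubClass | Lot Frontage | Lot Area | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Mas Vnr Area | ... | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | Mo Sold | Yr Sold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | 141 | 31770 | 6 | 5 | 1960 | 1960 | 112 | ... | 210 | 62 | 0 | 0 | 0 | 0 | 0 | 5 | 2010 | 215000 |
| 1 | 2 | 526350040 | 20 | 80 | 11622 | 5 | 6 | 1961 | 1961 | 0 | ... | 140 | 0 | 0 | 0 | 120 | 0 | 0 | 6 | 2010 | 105000 |
| 2 | 3 | 526351010 | 20 | 81 | 14267 | 6 | 6 | 1958 | 1958 | 108 | ... | 393 | 36 | 0 | 0 | 0 | 0 | 12500 | 6 | 2010 | 172000 |
| 3 | 4 | 526353030 | 20 | 93 | 11160 | 7 | 5 | 1968 | 1968 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 2010 | 244000 |
| 4 | 5 | 527105010 | 60 | 74 | 13830 | 5 | 5 | 1997 | 1998 | 0 | ... | 212 | 34 | 0 | 0 | 0 | 0 | 0 | 3 | 2010 | 189900 |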


We'll be working to predict the sale price of each house. Below we store the target sale price in a variable named `y`. We remove `SalePrice`, `Order`, and `PID` as features and store the remaining columns in a variable named `x`. We then split the dataset into a training set and a testing set.

In [None]:
y = ames_df['SalePrice']
x = ames_df.drop(columns=['Order', 'PID', 'SalePrice'])

x_train, x_test, y_train, y_test = model_selection.train_test_split(x,y)
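For reference, a minimal sketch of how `train_test_split` behaves on stand-in data (`X_demo` and `y_demo` are hypothetical). When no `test_size` is given, 25% of the rows are held out. The `random_state` argument is an addition that the cell above does not use; it only makes the shuffle repeatable.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical stand-in data: 100 rows, 3 features.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# With no test_size given, 25% of the rows are held out for testing.
# random_state (not used in the assignment cell) fixes the shuffle so reruns match.
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)  # (75, 3) (25, 3)
```

Without a fixed `random_state`, each run produces a different split, so scores will vary slightly from run to run.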

## Demo

We're going to use sklearn's pipelines to organize our workflow, along with the following components for preprocessing and modeling:
* StandardScaler: scales each input feature to have a mean of zero and a standard deviation of one
* SelectPercentile: keeps the highest-scoring percentage of features (the demo below keeps the top 50%)
* LinearRegression: the model we'll use today
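
To make the first two components concrete, here is a small sketch on synthetic data (the arrays `X` and `y` are made up for illustration, with the target depending only on the first two columns). It checks what `StandardScaler` does to each column and how many columns `SelectPercentile` keeps at `percentile=50`.

```python
import numpy as np
from sklearn import feature_selection, preprocessing

# Synthetic data (hypothetical): 200 rows, 10 features, none centered or unit-scaled.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 10))
# The target depends on the first two columns only.
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# After scaling, every column has mean ~0 and standard deviation ~1.
X_scaled = preprocessing.StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))

# Keep the top 50% of features, ranked by their univariate F-score against y.
selector = feature_selection.SelectPercentile(
    feature_selection.f_regression, percentile=50
)
X_selected = selector.fit_transform(X_scaled, y)
print(X_selected.shape)        # (200, 5)
print(selector.get_support())  # boolean mask of the columns that were kept
```

The selector scores each feature on its own, so it can miss features that only matter in combination with others.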

Below is a demo of how to fit and evaluate a pipeline. 

In [None]:
first_pipe = pipeline.Pipeline([
    ("scale", preprocessing.StandardScaler()),
    ("selection", feature_selection.SelectPercentile(feature_selection.f_regression, percentile=50)),
    ("regression", linear_model.LinearRegression()),
])

In [None]:
score = model_selection.cross_validate(first_pipe, x_train,y_train, scoring="r2", cv=5, return_train_score=True)
train_score = score['train_score'].mean()
test_score = score['test_score'].mean()

print(f"train score: {train_score}")
print(f"test score: {test_score}")

# **Exercise 1: Learning Curve**

Today's lesson discussed data: how to get it, what it should look like, and what to look out for. One important part of this topic is knowing whether our model is underfit or overfit.

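To tell whether a model is overfit or underfit, we can plot a learning curve. A learning curve plots performance (either error or score) against some measure of complexity. (scikit-learn calls a plot against a model parameter a validation curve and reserves the name learning curve for plots against training-set size; we follow the lesson's usage here.) We'll make our first learning curve using the number of features as the measure of complexity.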

Take the training procedure in the cell above and run it for different values of `percentile` in `SelectPercentile()`. Plot the train score and the test score against `percentile`. For which values of `percentile` is the model underfit and for which is it overfit?


Hint: `first_pipe.get_params()` shows you which parameters the pipeline uses. `first_pipe.set_params()` allows you to set the parameters. How can we use this to set a different `percentile` instead of building a new model each time?
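
As a sketch of the mechanism in the hint, the example below reuses one pipeline across several settings on synthetic data (`demo_pipe`, `X_demo`, and `y_demo` are hypothetical names, not part of the assignment). Nested parameters are addressed as `<step name>__<parameter name>`, so the selector's `percentile` is `selection__percentile`. Plotting the two score lists is left to you.

```python
import numpy as np
from sklearn import (feature_selection, linear_model, model_selection,
                     pipeline, preprocessing)

# Synthetic data (hypothetical): only the first 3 of 20 features carry signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 20))
y_demo = X_demo[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=150)

demo_pipe = pipeline.Pipeline([
    ("scale", preprocessing.StandardScaler()),
    ("selection", feature_selection.SelectPercentile(feature_selection.f_regression)),
    ("regression", linear_model.LinearRegression()),
])

percentiles = [10, 25, 50, 75, 100]
train_scores, test_scores = [], []
for p in percentiles:
    # Update the existing pipeline in place instead of building a new one.
    demo_pipe.set_params(selection__percentile=p)
    cv = model_selection.cross_validate(
        demo_pipe, X_demo, y_demo, scoring="r2", cv=5, return_train_score=True
    )
    train_scores.append(cv["train_score"].mean())
    test_scores.append(cv["test_score"].mean())

for p, tr, te in zip(percentiles, train_scores, test_scores):
    print(f"percentile={p:3d}  train R^2={tr:.3f}  test R^2={te:.3f}")
```

`cross_validate` clones the pipeline for every fold, so the parameter set on `demo_pipe` is what each fold's copy uses.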

# **Exercise 2: Feature Engineering**

One way to add features is through feature engineering, that is, making new features. Revisit your pipeline and add a new step before `"scale"`. Use `preprocessing.PolynomialFeatures(degree=2, include_bias=False)` for this. You may vary the degree, but be sure to keep `include_bias=False`. If you get many errors, you may need to add a step to your pipeline that removes features with zero variance. Try `("var", feature_selection.VarianceThreshold())`.

When building models, you may engineer new features by hand, either by picking out interactions that are likely to be helpful or by choosing transformations. It is also valid to generate features as part of your pipeline, so that you as the modeler do not hand-select them; that way, you let the model select what's useful so that you don't miss anything.

How does this step affect when and whether the model is overfit? If this were a production model, which solution would you recommend?


# **Exercise 3: Regularization**

Adding new features gives models more capacity to overfit. For the last exercise, we'll explore one more method for responding to overfitting: regularization. Regularization generally refers to anything that reduces the model's performance on the training data but improves performance on the test data. As you saw in the first exercise, when a large number of features is available, feature selection (in other words, reducing the number of features) tends to regularize models. Another common method is adding noise to the training data. Here, we will regularize the model directly by penalizing large coefficients, so that it must trade off fitting the training data closely against relying heavily on many features.

Here, replace the `LinearRegression()` model with a version that uses regularization, either `Lasso()`, `Ridge()`, or `ElasticNet()`. Each of these models has an `alpha` parameter that you should tweak. Often, the greater the number of features, the higher `alpha` needs to be. 
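
One possible way to make the swap, sketched on synthetic data (`demo_pipe`, `X_demo`, and `y_demo` are hypothetical): a pipeline step can be replaced by name with `set_params`, and its `alpha` is then reachable as `regression__alpha`. The example uses `Ridge` and prints the size of the coefficient vector to show the effect of a larger `alpha`.

```python
import numpy as np
from sklearn import linear_model, pipeline, preprocessing

# Synthetic data (hypothetical): 30 features, only the first one carries signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 30))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=60)

demo_pipe = pipeline.Pipeline([
    ("scale", preprocessing.StandardScaler()),
    ("regression", linear_model.LinearRegression()),
])

# Replace the whole "regression" step by name with a regularized model.
demo_pipe.set_params(regression=linear_model.Ridge())

alphas = [0.01, 1.0, 100.0]
coef_norms = []
for alpha in alphas:
    # set_params returns the pipeline itself, so fit can be chained.
    demo_pipe.set_params(regression__alpha=alpha).fit(X_demo, y_demo)
    coef = demo_pipe.named_steps["regression"].coef_
    coef_norms.append(float(np.linalg.norm(coef)))
    print(f"alpha={alpha:>6}: coefficient norm = {coef_norms[-1]:.3f}")
```

A larger `alpha` shrinks the coefficients toward zero, which is the direct form of regularization this exercise asks you to explore.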

Note: If you find yourself getting many `ConvergenceWarning`s then you may either increase the `max_iter` parameter, increase the `tol` parameter, or ignore the warning.
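
The options in the note could look like the sketch below, again on synthetic data (`X_demo` and `y_demo` are hypothetical, and the specific values of `alpha`, `max_iter`, and `tol` are arbitrary choices for illustration). `Lasso` defaults to `max_iter=1000` and `tol=1e-4`; the example raises both and also shows how to silence the warning for a single fit.

```python
import warnings

import numpy as np
from sklearn import linear_model
from sklearn.exceptions import ConvergenceWarning

# Synthetic data (hypothetical): 30 features, only the first one carries signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 30))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=60)

# Give the solver more iterations and a looser stopping tolerance than the defaults.
lasso = linear_model.Lasso(alpha=0.5, max_iter=10_000, tol=1e-3)

# Alternatively (or additionally), ignore the warning only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    lasso.fit(X_demo, y_demo)

n_zero = int(np.sum(lasso.coef_ == 0))
print(f"{n_zero} of {lasso.coef_.size} coefficients are exactly zero")
```

Ignoring the warning is the least safe option, since a model that has not converged may have coefficients that are far from the optimum.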

How does this step affect when and where the model is overfit or underfit?

Go through each step in your pipeline and discuss whether the step pushes the model to be more biased (in other words, to be more underfit) or have more variance (in other words, to be more overfit).