# UTSC Machine Learning Workshop
## Cross-validation for feature selection with Linear Regression
*From the video series: [Introduction to machine learning with scikit-learn](https://github.com/justmarkham/scikit-learn-videos)*

## Agenda

- Put together what we have learned, using **cross-validation** to select **features** for linear regression models.
- Practice on a different problem.

## Cross-validation example: feature selection

## Model Evaluation Metrics for Regression

For classification problems, classification accuracy is the only evaluation metric we have used so far. What metrics can we use for regression problems?

**Mean Absolute Error** (MAE) is the mean of the absolute values of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
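
As a quick worked example (the numbers are made up for illustration and are not taken from the advertising data), suppose the true values are $y = (10, 12, 15, 9)$ and the predictions are $\hat{y} = (11, 10, 15, 13)$, so that the errors $y_i-\hat{y}_i$ are $(-1, 2, 0, -4)$ and $n = 4$:

$$\text{MAE} = \frac 14(1+2+0+4) = 1.75$$

$$\text{MSE} = \frac 14(1+4+0+16) = 5.25$$

$$\text{RMSE} = \sqrt{5.25} \approx 2.29$$

RMSE is never smaller than MAE; the two are equal only when every absolute error has the same size.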

## Read More ##
http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter


**Goal**: Decide whether the Newspaper feature should be included in the linear regression model for the advertising dataset.

In [6]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score

In [7]:
# read in the advertising dataset
data = pd.read_csv('data/Advertising.csv', index_col=0)

In [8]:
# create a Python list of three feature names
feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the DataFrame (X)
X = data[feature_cols]

# select the Sales column as the response (y)
y = data.Sales

In [9]:
# 10-fold cross-validation with all three features
# (error metrics are returned as negative scores, so that higher is always better)
lm = LinearRegression()
MAEscores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_absolute_error')
print(MAEscores)

[-1.41470822 -1.42067103 -1.18520036 -1.39731782 -0.90578551 -0.96357362
 -2.00464419 -1.17610998 -1.18157732 -1.37291164]


MSE is more popular than MAE because squaring the errors "punishes" larger errors more heavily. RMSE is even more popular than MSE because it is expressed in the same units as "y", which makes it easier to interpret.
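
To make the "punishes larger errors" point concrete, here is a small sketch with two made-up sets of errors that share the same MAE but spread the error differently. The helper names and the error values are illustrative assumptions, not part of the original notebook.

```python
import math

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean of the squared errors."""
    return sum(e * e for e in errors) / len(errors)

def rmse(errors):
    """Square root of the mean squared error."""
    return math.sqrt(mse(errors))

# Hypothetical residuals (y - y_hat): same total absolute error, different spread
even_errors = [2.0, -2.0, 2.0, -2.0]
one_big_error = [0.0, 0.0, 0.0, 8.0]

for name, errs in [("even", even_errors), ("one big", one_big_error)]:
    print("{}: MAE={:.2f} MSE={:.2f} RMSE={:.2f}".format(
        name, mae(errs), mse(errs), rmse(errs)))
# even: MAE=2.00 MSE=4.00 RMSE=2.00
# one big: MAE=2.00 MSE=16.00 RMSE=4.00
```

MAE cannot tell the two cases apart, while MSE and RMSE are larger when the error is concentrated in a single prediction.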

In [10]:
# The MSE scores can be calculated the same way (again returned as negative values):
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print(scores)

[-3.56038438 -3.29767522 -2.08943356 -2.82474283 -1.3027754  -1.74163618
 -8.17338214 -2.11409746 -3.04273109 -2.45281793]


In [11]:
# fix the sign of MSE scores
mse_scores = -scores
print(mse_scores)

[ 3.56038438  3.29767522  2.08943356  2.82474283  1.3027754   1.74163618
  8.17338214  2.11409746  3.04273109  2.45281793]


In [12]:
# convert from MSE to RMSE
rmse_scores = np.sqrt(mse_scores)
print(rmse_scores)

[ 1.88689808  1.81595022  1.44548731  1.68069713  1.14139187  1.31971064
  2.85891276  1.45399362  1.7443426   1.56614748]


In [13]:
# calculate the average RMSE
print(rmse_scores.mean())

1.69135317081


In [14]:
# 10-fold cross-validation with two features (excluding Newspaper)
feature_cols = ['TV', 'Radio']
X = data[feature_cols]
print(np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')).mean())

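1.67967484191

The average RMSE is slightly lower without Newspaper (about 1.68 versus 1.69), so by this criterion Newspaper should be left out of the model.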
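
For readers who want to see what goes into a cross-validated RMSE, below is a NumPy-only sketch of the procedure. It assumes contiguous, unshuffled folds and ordinary least squares with an intercept. The helper name `cv_rmse` and the synthetic data (two informative columns plus one pure-noise column standing in for a weak feature) are assumptions for illustration; this is not the advertising data and not scikit-learn's implementation.

```python
import numpy as np

def cv_rmse(X, y, k=10):
    """Mean RMSE over k contiguous folds for least-squares linear regression."""
    n = len(y)
    rmses = []
    for test_idx in np.array_split(np.arange(n), k):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Prepend a column of ones so the fit includes an intercept
        A_train = np.column_stack([np.ones(train.sum()), X[train]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef = np.linalg.lstsq(A_train, y[train], rcond=None)[0]
        resid = y[test_idx] - A_test.dot(coef)
        rmses.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(rmses))

# Synthetic data (assumption): the third column has no effect on y_syn
rng = np.random.RandomState(0)
X_syn = rng.normal(size=(200, 3))
y_syn = 3.0 * X_syn[:, 0] + 2.0 * X_syn[:, 1] + rng.normal(scale=1.0, size=200)

rmse_all = cv_rmse(X_syn, y_syn)
rmse_two = cv_rmse(X_syn[:, :2], y_syn)
print("all three columns: {:.3f}".format(rmse_all))
print("first two columns: {:.3f}".format(rmse_two))
```

With 10 folds of 20 rows each, every row is used for testing exactly once, and the candidate feature set with the lower average RMSE is preferred.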


**TASK**

Select the best polynomial order for the Girth feature when predicting Volume in the trees dataset.

In [15]:
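# load the trees dataset; note that pydataset's data() function
# replaces the advertising DataFrame that was named `data` above
from pydataset import data
trees = data('trees')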

In [32]:
# Set up the features and the response, then find the cross-validation
# score for higher polynomial features.
# A.S.: Higher orders tend to overfit the data; the lower the RMSE, the better.
# Cross terms are probably needed too.
feature_cols = ["Girth", "Height"]
y = trees.Volume
print('i j score')
for i in range(2, 10):
    name = "Girth" + str(i)
    trees[name] = trees["Girth"]**i
    for j in range(1, i):
        trees[name + "Height" + str(j)] = trees[name] * trees["Height"]**j
        # NOTE: this appends the Girth power (once per inner iteration), not the
        # cross term created above, so the cross terms never enter X
        feature_cols.append(name)
        X = trees[feature_cols]
        # average RMSE over 10 folds
        rmse = np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')).mean()
        print('{} {} {}'.format(i, j, rmse))

i j score
2 1 2.97984997122
3 1 3.3624184288
3 2 3.3624184288
4 1 4.00623739093
4 2 4.00623739093
4 3 4.00623739093
5 1 10.5424723888
5 2 10.5424723891
5 3 10.5424723888
5 4 10.5424723891
6 1 11.4203465064
6 2 11.4203465046
6 3 11.4203465039
6 4 11.4203465054
6 5 11.4203465163
7 1 85.0431288351
7 2 85.0432073859
7 3 85.0432485939
7 4 85.0430002369
7 5 85.0427045846
7 6 85.0434710968
8 1 215.363665052
8 2 215.36618385
8 3 215.361627334
8 4 215.365825614
8 5 215.360057816
8 6 239.872089708
8 7 183.865926413
9 1 275.749264723
9 2 273.854043765
9 3 272.331805204
9 4 275.640293895
9 5 282.963692618
9 6 277.690039909
9 7 277.877507814
9 8 270.389322363
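
In the loop above, `feature_cols` keeps growing across iterations and the cross-term columns are created but never appended to it, so apart from adding duplicate columns the inner index `j` does not change the model. One possible alternative, sketched below, rebuilds the column list from scratch for each degree. The function name `poly_columns` and its `cross_terms` flag are hypothetical; the column naming follows the `Girth2`, `Girth2Height1` pattern used in the loop.

```python
def poly_columns(degree, cross_terms=False):
    """Column names for a Girth polynomial of the given degree.

    Always includes Girth and Height, adds Girth2 .. Girth<degree>, and
    optionally the Girth<i>Height<j> cross terms for 1 <= j < i.
    """
    cols = ["Girth", "Height"]
    for i in range(2, degree + 1):
        cols.append("Girth" + str(i))
        if cross_terms:
            for j in range(1, i):
                cols.append("Girth" + str(i) + "Height" + str(j))
    return cols

print(poly_columns(3))
# ['Girth', 'Height', 'Girth2', 'Girth3']
print(poly_columns(3, cross_terms=True))
# ['Girth', 'Height', 'Girth2', 'Girth2Height1', 'Girth3', 'Girth3Height1', 'Girth3Height2']
```

Each candidate degree `d` could then be scored with `X = trees[poly_columns(d)]`, provided those columns have already been added to `trees` as in the loop above.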


**Feature engineering and selection within cross-validation iterations**

- Normally, feature engineering and selection occur **before** cross-validation, using the whole dataset
- Instead, perform all feature engineering and selection **within each cross-validation iteration**, using only that iteration's training data
- This gives a more reliable estimate of out-of-sample performance since it **better mimics** the application of the model to out-of-sample data
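
The idea in the list above can be sketched in plain NumPy. In this sketch the selection step is redone on each training fold and never sees the held-out rows. Ranking columns by absolute correlation with the response is a simple stand-in for a univariate selector such as the `SelectKBest` and `f_regression` imported earlier; the function names, the fold layout (contiguous, unshuffled) and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Indices of the k columns of X with the largest |Pearson r| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = Xc.T.dot(yc) / (np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.sort(np.argsort(-np.abs(r))[:k])

def cv_rmse_with_selection(X, y, k_features, n_folds=10):
    """Mean RMSE where feature selection is repeated inside every fold."""
    n = len(y)
    rmses, chosen = [], []
    for test_idx in np.array_split(np.arange(n), n_folds):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Selection uses the training rows only
        cols = top_k_by_correlation(X[train], y[train], k_features)
        A_train = np.column_stack([np.ones(train.sum()), X[train][:, cols]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx][:, cols]])
        coef = np.linalg.lstsq(A_train, y[train], rcond=None)[0]
        resid = y[test_idx] - A_test.dot(coef)
        rmses.append(np.sqrt(np.mean(resid ** 2)))
        chosen.append(tuple(int(c) for c in cols))
    return float(np.mean(rmses)), chosen

# Synthetic data (assumption): only columns 0 and 1 influence y_syn
rng = np.random.RandomState(1)
X_syn = rng.normal(size=(200, 5))
y_syn = 3.0 * X_syn[:, 0] + 2.0 * X_syn[:, 1] + rng.normal(size=200)

score, chosen = cv_rmse_with_selection(X_syn, y_syn, k_features=2)
print("CV RMSE: {:.3f}".format(score))
print("columns picked per fold: {}".format(chosen))
```

Because the selected columns are allowed to differ from fold to fold, the resulting score estimates the performance of the whole procedure (selection plus fitting) rather than of one fixed feature set.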
