### Lesson 7 - Model Fit Evaluation

Let's go through the usual library imports:

In [3]:
import numpy as np
import pandas as pd
from sklearn import linear_model, metrics


## Cross validation
#### Intro to cross validation with bike share data from last time. We will be modeling casual ridership. 

In [17]:
from sklearn import cross_validation
bikeshare = pd.read_csv('bikeshare.csv')

#### Create dummy variables and set outcome (dependent) variable

In [18]:
weather = pd.get_dummies(bikeshare.weathersit, prefix='weather')
modeldata = bikeshare[['temp', 'hum']].join(weather[['weather_1', 'weather_2', 'weather_3']])
y = bikeshare.casual 

#### Create a cross validation with 5 folds

In [10]:
# The first parameter is the total number of data points
# The second parameter is the number of folds (K=5)
# shuffle=True randomizes the row order before splitting; the folds still never overlap
kf = cross_validation.KFold(len(modeldata), n_folds=5, shuffle=True)
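
To make the fold mechanics concrete, here is a minimal standard-library sketch of how row indices could be dealt into K folds. The helper `k_fold_indices` and its `seed` argument are hypothetical names used only for illustration; this is not scikit-learn's implementation.

```python
import random

def k_fold_indices(n_samples, n_folds=5, shuffle=True, seed=0):
    """Yield (train_index, test_index) pairs; every row is tested exactly once."""
    indices = list(range(n_samples))
    if shuffle:
        # Shuffling only reorders the rows before they are dealt into folds;
        # the folds themselves never overlap.
        random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_index = indices[start:start + size]
        train_index = indices[:start] + indices[start + size:]
        yield train_index, test_index
        start += size

# Toy example: 12 rows split into 5 folds.
folds = list(k_fold_indices(12, n_folds=5))
for train_index, test_index in folds:
    print(sorted(test_index))
```

Each row lands in exactly one test fold, so every observation is used for testing once and for training in the other K-1 iterations.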

Let's train and test the model on each fold, saving the MSE and R2 values from each iteration.

In [111]:
mse_values = []
scores = []
n = 0
print "~~~~ CROSS VALIDATION each fold ~~~~"
for train_index, test_index in kf:
    # Fit on the training rows of this fold only
    lm = linear_model.LinearRegression().fit(modeldata.iloc[train_index], y.iloc[train_index])
    # MSE on the held-out rows of this fold
    mse = metrics.mean_squared_error(y.iloc[test_index], lm.predict(modeldata.iloc[test_index]))
    mse_values.append(mse)
    # Note: R2 is scored on the full data set, not only the held-out rows,
    # which is why it barely changes from fold to fold
    scores.append(lm.score(modeldata, y))
    n += 1
    print 'Model', n
    print 'MSE:', mse_values[n-1]
    print 'R2:', scores[n-1]


print "~~~~ SUMMARY OF CROSS VALIDATION ~~~~"
print 'Mean of MSE for all folds:', np.mean(mse_values)
print 'Mean of R2 for all folds:', np.mean(scores)

~~~~ CROSS VALIDATION each fold ~~~~
Model 1
MSE: 931.998393898
R2: 0.307405662658
Model 2
MSE: 1867.76931132
R2: 0.300084687568
Model 3
MSE: 615.67122778
R2: 0.310323515558
Model 4
MSE: 3264.08298499
R2: 0.307216506476
Model 5
MSE: 2225.37428617
R2: 0.308187875546
~~~~ SUMMARY OF CROSS VALIDATION ~~~~
Mean of MSE for all folds: 1780.97924083
Mean of R2 for all folds: 0.306643649561


In [108]:
lm = linear_model.LinearRegression().fit(modeldata, y)
print "~~~~ Single Model ~~~~"
print 'MSE of single model:', metrics.mean_squared_error(y, lm.predict(modeldata))
print 'R2: ', lm.score(modeldata, y)

~~~~ Single Model ~~~~
MSE of single model: 1672.58110765
R2:  0.311934605989
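
For reference, the two metrics reported above can be computed by hand. The sketch below uses made-up toy numbers, not the bike-share data, and the helper names `mse_by_hand` and `r2_by_hand` are hypothetical; it only illustrates the standard definitions of MSE, RMSE, and R2.

```python
def mse_by_hand(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / float(len(y_true))

def r2_by_hand(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values chosen for easy arithmetic (illustration only).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mse = mse_by_hand(y_true, y_pred)
rmse = mse ** 0.5          # RMSE is the square root of MSE, in the units of y
r2 = r2_by_hand(y_true, y_pred)
print("MSE: %.3f  RMSE: %.3f  R2: %.3f" % (mse, rmse, r2))
```

Because RMSE is just the square root of MSE, the two rank models identically; RMSE is simply easier to read in the units of the outcome.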


### Check
While the cross-validated approach here produced a higher MSE (1780.98 versus 1672.58 for the single model), which of the two approaches would predict new data more accurately: the single model or the cross-validated, averaged one?


### Advanced: There are ways to improve our model with regularization. 
Let's check out the effects on RMSE and R2.
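
As a rough picture of what the ridge penalty does, the sketch below fits a one-feature model with no intercept, where the ridge slope has the closed form sum(x*y) / (sum(x*x) + alpha). The data are made-up toy values and `ridge_slope` is a hypothetical helper; scikit-learn's ridge estimators also fit an intercept by default, which this simplification leaves out.

```python
def ridge_slope(x, y, alpha):
    """Slope of a no-intercept, one-feature ridge fit."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # alpha = 0 gives the ordinary least squares slope;
    # larger alpha pulls the slope toward zero.
    return sxy / (sxx + alpha)

# Toy data roughly following y = 2x (illustration only).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

for alpha in [0.0, 0.1, 1.0, 10.0]:
    print("alpha=%5.1f  slope=%.4f" % (alpha, ridge_slope(x, y, alpha)))
```

The slope shrinks steadily as alpha grows, which is the trade-off `RidgeCV` explores when it picks among candidate alphas.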

In [5]:
print "### OLS ###"
lm = linear_model.LinearRegression().fit(modeldata, y)
print 'OLS RMSE: ', metrics.mean_squared_error(y, lm.predict(modeldata))**.5
print 'OLS R2:', lm.score(modeldata, y)

print "~~~ Ridge ~~~"
ridge = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
ridge.fit(modeldata,y)       
print 'Ridge RMSE: ', metrics.mean_squared_error(y, ridge.predict(modeldata))**.5
print 'Ridge R2:', ridge.score(modeldata, y)


#### Examine cross-validation performance

In [None]:
OLS_rmse_values = []
Ridge_rmse_values = []
print "#### CROSS VALIDATION each fold ####"
for train_index, test_index in kf:
    X_train, y_train = modeldata.iloc[train_index], y.iloc[train_index]
    X_test, y_test = modeldata.iloc[test_index], y.iloc[test_index]

    # Refit each model on the training rows only, then score on the held-out rows
    lm = linear_model.LinearRegression().fit(X_train, y_train)
    mse = metrics.mean_squared_error(y_test, lm.predict(X_test))
    OLS_rmse_values.append(mse**.5)

    ridge = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X_train, y_train)
    mse = metrics.mean_squared_error(y_test, ridge.predict(X_test))
    Ridge_rmse_values.append(mse**.5)


print "####  SUMMARY OF CROSS VALIDATION #####"
print 'Mean of RMSE for OLS:', np.mean(OLS_rmse_values)
print 'Mean of RMSE for Ridge:', np.mean(Ridge_rmse_values)


## Example Application of Gradient Descent 

In [117]:
lm = linear_model.SGDRegressor()
lm.fit(modeldata, y)
print "Gradient Descent R2:", lm.score(modeldata, y)
print "Gradient Descent MSE:", metrics.mean_squared_error(y, lm.predict(modeldata))

Gradient Descent R2: 0.30853517891
Gradient Descent MSE: 1680.84459185


Check: Untuned, how well did gradient descent perform compared to OLS?

# Independent Practice: Bike data revisited

There are tons of ways to approach a regression problem. The regularization techniques added to ordinary least squares penalize the size of the coefficients to limit overfitting. Gradient descent also introduces a learning rate (how aggressively do we solve the problem?), epsilon (at what point do we say the error margin is acceptable?), and iterations (when should we stop no matter what?).
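
The three knobs named above can be seen in a minimal sketch of gradient descent for a one-feature linear model. This is plain batch gradient descent on toy data, not the stochastic variant that `SGDRegressor` uses, and the function name, argument names, and default values are assumptions chosen for illustration.

```python
def gradient_descent(x, y, learning_rate=0.05, epsilon=1e-9, max_iterations=10000):
    """Fit y ~ slope * x + intercept by batch gradient descent on the MSE."""
    slope, intercept = 0.0, 0.0
    n = float(len(x))
    previous_mse = float("inf")
    for iteration in range(1, max_iterations + 1):
        errors = [slope * xi + intercept - yi for xi, yi in zip(x, y)]
        mse = sum(e * e for e in errors) / n
        # epsilon: stop once the error has effectively stopped improving
        if abs(previous_mse - mse) < epsilon:
            break
        previous_mse = mse
        # Gradient of the MSE with respect to each parameter
        grad_slope = 2.0 * sum(e * xi for e, xi in zip(errors, x)) / n
        grad_intercept = 2.0 * sum(errors) / n
        # learning_rate: how large a step to take against the gradient
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    # max_iterations: the loop ends here no matter what
    return slope, intercept, iteration

# Toy data lying exactly on y = 2x + 1 (illustration only).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept, iterations_used = gradient_descent(x, y)
print("slope=%.3f intercept=%.3f after %d iterations" % (slope, intercept, iterations_used))
```

Raising the learning rate too far makes the updates overshoot and diverge, while a very small one needs many more iterations, which is the tension the exercise below asks you to explore.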

For this deliverable, our goals are to:

- implement the gradient descent approach to our bike-share modeling problem,
- show how gradient descent solves and optimizes the solution,
- demonstrate the grid_search module!

While exploring the Gradient Descent regressor object, you'll build a grid search using the stochastic gradient descent estimator for the bike-share data set. Continue with either the model you evaluated last class or the simpler one from today. In particular, be sure to implement the "param_grid" in the grid search to get answers for the following questions:

- With a set of alpha values between 10^-10 and 10^-1, how does the mean squared error change?
- Based on the data, we know when to properly use l1 vs l2 regularization. By using a grid search with l1_ratios between 0 and 1 (increasing every 0.05), does that statement hold true? If not, did gradient descent have enough iterations?
- How do these results change when you alter the learning rate (eta0)?

**Bonus**: Can you see the advantages and disadvantages of using gradient descent after finishing this exercise?

### Starter Code

In [None]:
from sklearn import grid_search

params = {} # put your gradient descent parameters here
gs = grid_search.GridSearchCV(
    estimator=linear_model.SGDRegressor(),
    cv=cross_validation.KFold(len(modeldata), n_folds=5, shuffle=True),
    param_grid=params,
    scoring='mean_squared_error',
    )

gs.fit(modeldata, y)

print 'BEST ESTIMATOR'
print -gs.best_score_
print gs.best_estimator_
print 'ALL ESTIMATORS'
print gs.grid_scores_
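
One possible way to lay out the empty `params` dictionary for the first two questions is sketched below, using only the standard library to build the value lists. The keys `penalty`, `alpha`, and `l1_ratio` are `SGDRegressor` parameter names; the specific values are assumptions drawn from the ranges in the questions, and `l1_ratio` only has an effect when the penalty is elastic net.

```python
import itertools

# Hypothetical grid; adjust the ranges to the question being explored.
alphas = [10.0 ** exponent for exponent in range(-10, 0)]     # 1e-10 ... 1e-1
l1_ratios = [round(0.05 * step, 2) for step in range(21)]    # 0.0, 0.05, ..., 1.0

params = {
    'penalty': ['elasticnet'],   # l1_ratio is only used with the elastic-net penalty
    'alpha': alphas,
    'l1_ratio': l1_ratios,
}

# Grid search fits one model per combination (and per fold), so grid size matters.
n_candidates = len(list(itertools.product(*params.values())))
print("candidate settings: %d" % n_candidates)
```

Searching both dimensions at once multiplies the number of fits, so it can be cheaper to vary one parameter at a time while answering each question.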

### Independent Practice Solution

This code shows the variety of challenges and some student gotchas. The plots will help showcase what should be learned.

1. With a set of alpha values between 10^-10 and 10^-1, how does the mean squared error change?
2. We know when to properly use l1 vs l2 regularization based on the data. By using a grid search with l1_ratios between 0 and 1 (increasing every 0.05), does that statement hold true?
    * (if it didn't look like it, did gradient descent have enough iterations?)
3. How do results change when you alter the learning rate (eta0)?