# Problem setup:

The last step in most machine learning problems is to tune a model with a hyperparameter search (here, a randomized search over the grid of candidate values). However, you have to be careful about how you evaluate the results of the search.

In [2]:
# Note: load_boston was removed in scikit-learn 1.2; this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor
from scipy.stats import randint
import numpy as np

# Load the data
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Define (roughly) our hyperparameters
hyper = {
    'max_depth': randint(3, 10),
    'n_estimators': randint(25, 250),
    'learning_rate': np.linspace(0.001, 0.01, 20),
    'min_samples_leaf': [1, 5, 10]
}

# Define our CV class (remember to always shuffle!)
cv = KFold(shuffle=True, n_splits=3, random_state=1)

# Define our estimator
search = RandomizedSearchCV(GradientBoostingRegressor(random_state=42),
                            scoring='neg_mean_squared_error', n_iter=25,
                            param_distributions=hyper, cv=cv,
                            random_state=12, n_jobs=4)

# Fit the randomized search
search.fit(X_train, y_train)

RandomizedSearchCV(cv=KFold(n_splits=3, random_state=1, shuffle=True),
          error_score='raise',
          estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=100,
             presort='auto', random_state=42, subsample=1.0, verbose=0,
             warm_start=False),
          fit_params={}, iid=True, n_iter=25, n_jobs=4,
          param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x10388d9b0>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x117542a20>, 'learning_rate': array([0.001  , 0.00147, 0.00195, 0.00242, 0.00289, 0.00337, 0.00384,
       0.00432, 0.00479, 0.00526, 0.00574, 0.00621, 0.00668, 0.00716,
       0.00763, 0.00811, 0.00858, 0.009

Now we want to know if the model is good enough. __Does this model meet business requirements?__

## Wrong approach:

If you repeatedly expose your model to your test set, you risk "p-hacking":

In [3]:
from sklearn.metrics import mean_squared_error

# Evaluate:
print("Test MSE: %.3f" % mean_squared_error(y_test, search.predict(X_test)))

Test MSE: 12.394


This is the wrong approach because you have now seen the test score, and that information can leak into your modeling decisions. If you adjust the model to improve the test score, you are effectively fitting the test set indirectly, and the test score stops being an unbiased estimate of performance on unseen data.
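To make the risk concrete, here is a minimal sketch of selecting a model by its test score. It is not part of the original analysis: the data is pure noise generated for illustration, and the 200 candidates (linear models on random feature subsets) are a hypothetical stand-in for repeated rounds of model tweaking. Because the target is independent of the features, no candidate has real predictive power, so any advantage the selected one shows on the test set comes from selection alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
n_features = 30


def noise_data(n):
    # Features and target are independent draws (illustrative assumption)
    return rng.normal(size=(n, n_features)), rng.normal(size=n)


X_train, y_train = noise_data(60)
X_test, y_test = noise_data(60)
X_fresh, y_fresh = noise_data(5000)  # data never used for selection

test_mses, fresh_mses = [], []
for _ in range(200):
    # Each "candidate" is a linear model on 5 randomly chosen columns
    cols = rng.choice(n_features, size=5, replace=False)
    model = LinearRegression().fit(X_train[:, cols], y_train)
    test_mses.append(mean_squared_error(y_test, model.predict(X_test[:, cols])))
    fresh_mses.append(mean_squared_error(y_fresh, model.predict(X_fresh[:, cols])))

# Pick the candidate that looks best on the test set
picked = int(np.argmin(test_mses))
print("Selected candidate, test MSE:  %.3f" % test_mses[picked])
print("Selected candidate, fresh MSE: %.3f" % fresh_mses[picked])
print("Average candidate, test MSE:   %.3f" % np.mean(test_mses))
```

The selected candidate's test MSE is by construction the lowest of the 200, and it will typically look better than its score on the fresh data, which is the optimism you introduce when the test set guides model choices.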

The more appropriate approach is to examine the CV scores of the model.

## Better approach:

In [4]:
import pandas as pd

pd.DataFrame(search.cv_results_)\
  .sort_values('mean_test_score',
               # sort descending: scores are negated MSE, so higher is better
               ascending=False)\
  .head()


| | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_learning_rate | param_max_depth | param_min_samples_leaf | param_n_estimators | params |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | -9.779118 | -36.088421 | -11.244133 | -19.012796 | 12.065257 | 1 | -3.539486 | -2.213193 | -3.721718 | -3.158132 | 0.672302 | 0.166742 | 0.009509 | 0.001527 | 0.000196 | 0.01 | 7 | 5 | 228 | {'learning_rate': 0.01, 'max_depth': 7, 'min_s... |
| 2 | -13.972549 | -38.82143 | -15.160443 | -22.628574 | 11.437726 | 2 | -8.256319 | -6.447233 | -9.416248 | -8.039934 | 1.221714 | 0.114991 | 0.00264 | 0.001153 | 0.000166 | 0.00905263 | 7 | 5 | 166 | {'learning_rate': 0.009052631578947368, 'max_d... |
| 13 | -14.944225 | -39.504609 | -15.895012 | -23.425511 | 11.353793 | 3 | -9.331705 | -7.41946 | -10.591472 | -9.114212 | 1.304069 | 0.164653 | 0.007149 | 0.00152 | 0.000124 | 0.00621053 | 7 | 5 | 228 | {'learning_rate': 0.0062105263157894745, 'max_... |
| 9 | -16.173797 | -39.453283 | -18.064387 | -24.541685 | 10.551537 | 4 | -13.251544 | -10.359841 | -14.557962 | -12.723116 | 1.754134 | 0.090297 | 0.003094 | 0.000796 | 0.000126 | 0.00715789 | 4 | 5 | 183 | {'learning_rate': 0.007157894736842105, 'max_d... |
| 8 | -19.206718 | -41.374121 | -17.530125 | -26.018966 | 10.857879 | 5 | -15.349376 | -11.007713 | -15.80478 | -14.053956 | 2.162028 | 0.082872 | 0.000505 | 0.000931 | 0.000122 | 0.00857895 | 7 | 10 | 148 | {'learning_rate': 0.008578947368421054, 'max_d... |
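In the results above, the top-ranked configuration has a mean CV MSE of about 19.0, but one fold scores 36.1 while the other two are near 10, so the spread across folds deserves as much attention as the mean. The following is a minimal sketch of pulling those numbers out of a fitted search programmatically. It assumes a small synthetic dataset from `make_regression` and a reduced search space so that it runs on its own; neither is the notebook's actual data or configuration.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X, y = make_regression(n_samples=150, n_features=8, noise=10.0,
                       random_state=0)

cv = KFold(n_splits=3, shuffle=True, random_state=1)
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions={'max_depth': randint(2, 5),
                         'n_estimators': randint(20, 60)},
    n_iter=5, scoring='neg_mean_squared_error', cv=cv, random_state=12)
search.fit(X, y)

# Row of cv_results_ that belongs to the best configuration
best = search.best_index_

# Per-fold scores are negated MSE, so flip the sign to get MSE
fold_mse = -np.array([search.cv_results_['split%d_test_score' % i][best]
                      for i in range(cv.get_n_splits())])
mean_mse = -search.cv_results_['mean_test_score'][best]
std_mse = search.cv_results_['std_test_score'][best]

print("Best params:  %s" % search.best_params_)
print("Fold MSEs:    %s" % np.round(fold_mse, 2))
print("CV MSE:       %.2f +/- %.2f" % (mean_mse, std_mse))
print("CV RMSE:      %.2f" % np.sqrt(mean_mse))
```

Reporting the RMSE alongside the MSE puts the error back in the units of the target, which tends to be easier to compare against a business requirement.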


## CV outside scope of grid search:

You typically don't go straight into a grid search. First, you try several models. scikit-learn lets you fit a model under cross-validation and examine the per-fold scores, which is useful for determining whether a model will perform in the ballpark of business requirements before a lengthy tuning process:

In [5]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# Set our CV seed
cv = KFold(n_splits=3, random_state=0, shuffle=True)

# Fit and score a model in CV:
cross_val_score(GradientBoostingRegressor(random_state=42),
                X_train, y_train, cv=cv, scoring='neg_mean_squared_error')

array([ -7.62352454, -15.10931642, -16.47872053])
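The same call extends naturally to the "try several models" step. Below is a minimal sketch that scores a few candidate estimators on identical folds. The candidate list (`Ridge`, `RandomForestRegressor`, `GradientBoostingRegressor` with mostly default settings) and the synthetic `make_regression` data are assumptions chosen for illustration, not models or data from the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X, y = make_regression(n_samples=150, n_features=8, noise=10.0,
                       random_state=0)

# Reuse one seeded splitter so every model sees the same folds
cv = KFold(n_splits=3, shuffle=True, random_state=0)

# Hypothetical shortlist of untuned candidates
candidates = {
    'ridge': Ridge(alpha=1.0),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=42),
    'gbr': GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring='neg_mean_squared_error')
    results[name] = -scores  # flip the sign: negated MSE -> MSE
    print("%-14s fold MSEs: %s  mean: %.1f"
          % (name, np.round(-scores, 1), -scores.mean()))
```

Sharing a single `KFold` instance with a fixed `random_state` keeps the comparison fair, since differences between models are then not confounded with differences between splits.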