# Cross-validation and Grid Search

In this notebook we demonstrate some of scikit-learn's tools for cross-validation and grid search.


# Cross-validation

Cross-validation is a method for robustly estimating how well a model will perform on data it has not seen.
First, let's train and test a Decision Tree model on the iris dataset.

In [4]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split


iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 3)
clf = DecisionTreeClassifier().fit(X_train, y_train)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Accuracy of Decision Tree classifier on training set: 1.00
Accuracy of Decision Tree classifier on test set: 0.97


We obtained a perfect score on the training set and a very good one on the test set. However, Decision Trees are very sensitive to variations in the training data, so there is a chance that the tree is overfitting the training data. It is therefore worthwhile to check whether our model generalizes well. For this purpose we can use scikit-learn's K-fold cross-validation.

The following code splits the full dataset (X, y) into 10 distinct subsets called folds, then trains and evaluates the Decision Tree model 10 times, using a different fold for evaluation each time and training on the other 9 folds. Because the estimator is a classifier, passing an integer cv produces stratified folds, and the data is not shuffled by default. The result is an array containing the 10 evaluation scores:

In [6]:
import numpy as np
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(clf, X, y, cv=10)

print('Cross-validation scores (10-fold):', cv_scores)
print('Mean cross-validation score (10-fold): {:.3f}'
     .format(np.mean(cv_scores)))

Cross-validation scores (10-fold): [1.         0.93333333 1.         0.93333333 0.93333333 0.86666667
 0.93333333 0.93333333 1.         1.        ]
Mean cross-validation score (10-fold): 0.953


We can see that the score indeed varies from fold to fold (from 0.87 to 1.00), and that the mean cross-validation score (0.953) is somewhat lower than the test-set accuracy (0.97) we obtained initially.
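
The procedure that cross_val_score carries out can also be written out by hand. The following is a minimal sketch, not scikit-learn's actual implementation: it assumes stratified, unshuffled folds (the default for a classifier when cv is an integer) and fixes random_state=0 on the tree, which the original cell does not do, so that repeated runs give the same scores.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# random_state=0 is an assumption added here for repeatability
model = DecisionTreeClassifier(random_state=0)

# 10 stratified folds, no shuffling
folds = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, val_idx in folds.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])  # train on the other 9 folds
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))  # accuracy on the held-out fold

manual_scores = np.array(manual_scores)
print('Manual 10-fold scores:', manual_scores)
print('Mean: {:.3f}, standard deviation: {:.3f}'
      .format(manual_scores.mean(), manual_scores.std()))
```

Reporting the standard deviation next to the mean gives a rough idea of how much the estimate depends on the particular fold.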

# Grid Search


Grid search is a method to perform hyperparameter optimization.

Let's assume that at some point in designing your machine learning pipeline you have a shortlist of promising models that you now need to fine-tune.
One way to do that would be to experiment with the hyperparameters manually until you find a good combination of values. This would be very tedious work, and you may not have time to explore many combinations.
Instead, you can let scikit-learn's GridSearchCV search for you. All you need to do is tell it which hyperparameters to experiment with and which values to try, and it will evaluate all possible combinations of hyperparameter values using cross-validation.
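
To make "evaluate all possible combinations" concrete, here is a rough sketch of an exhaustive search written as an explicit loop. It is an illustration of the idea, not how GridSearchCV is implemented. The small grid, the variable names, and the choice of cv=3 are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# A deliberately small, illustrative grid (these values are assumptions)
small_grid = {'max_depth': [2, 3, 4], 'min_samples_split': [2, 3, 4]}

best_score, best_params = -1.0, None
for candidate in ParameterGrid(small_grid):  # every combination: 3 * 3 = 9 candidates
    model = DecisionTreeClassifier(random_state=42, **candidate)
    # Mean accuracy over 3 cross-validation folds of the training set
    mean_score = cross_val_score(model, X_train, y_train, cv=3).mean()
    if mean_score > best_score:
        best_score, best_params = mean_score, candidate

print('Best parameters:', best_params)
print('Best mean CV score: {:.3f}'.format(best_score))
```

Beyond this loop, GridSearchCV by default (refit=True) also refits the best candidate on the whole training set, so the fitted search object can be used directly for prediction.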


In this example we apply grid search to three hyperparameters of the Decision Tree:
    max_leaf_nodes
    max_depth
    min_samples_split
The hyperparameters are optimized by cross-validated grid search over this parameter grid.



In [7]:
from sklearn.model_selection import GridSearchCV
# Define the parameter grid: 98 * 3 * 3 = 882 candidate combinations
params = {
    'max_leaf_nodes': list(range(2, 100)),
    'min_samples_split': [2, 3, 4],
    'max_depth': list(range(2, 5)),
}
# cv=3 matches the recorded output below (3 folds for each candidate)
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, verbose=1, cv=3)

grid_search_cv.fit(X_train, y_train)

Fitting 3 folds for each of 882 candidates, totalling 2646 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 2646 out of 2646 | elapsed:    1.9s finished


GridSearchCV(cv=3, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=42,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid={'max_depth': [2, 3, 4],
                         'max_l

To check the best model, use the best_estimator_ attribute:

In [8]:
grid_search_cv.best_estimator_

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=2, max_features=None, max_leaf_nodes=3,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

In [10]:
print(grid_search_cv.best_score_)

0.9554291133238503
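
The value of best_score_ is a mean cross-validation score computed on the training data only. As a possible follow-up, the sketch below reruns the same search in a self-contained form, prints the winning hyperparameters through best_params_, and scores the refitted model on the held-out test set. The variable name search is an assumption for this example. No output is shown because this cell was not part of the original run.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# Same grid as in the notebook cell above
params = {
    'max_leaf_nodes': list(range(2, 100)),
    'min_samples_split': [2, 3, 4],
    'max_depth': list(range(2, 5)),
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=3)
search.fit(X_train, y_train)

# Only the hyperparameters that were searched over
print('Best parameters:', search.best_params_)

# With the default refit=True, score() uses the best estimator
# refitted on the whole training set
test_accuracy = search.score(X_test, y_test)
print('Test accuracy of the tuned tree: {:.2f}'.format(test_accuracy))
```

Because the test set played no part in the search, this accuracy is a fairer estimate of generalization than best_score_, which was used to select the hyperparameters.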
