# Grid Search and Randomized Search

We compare randomized search and grid search for optimizing the hyperparameters of a random forest.
All parameters that influence learning are searched simultaneously, except for the number of estimators, which is held fixed.

In [1]:
import numpy as np

from time import time
from scipy.stats import randint

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

In [2]:
# Digits data
digits = load_digits()
X, y = digits.data, digits.target

# Build a classifier
clf = RandomForestClassifier(n_estimators=20)

In [3]:
# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

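The randomized search and the grid search explore the same region of parameter
space: every value in the grid lies inside the ranges the randomized search
samples from, although the grid covers only a coarse subset of those ranges.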

In [4]:
# scipy.stats.randint(low, high) samples integers from low up to high - 1
param_dist_r = {"max_depth": [3, None],
                "max_features": randint(1, 11),
                "min_samples_split": randint(2, 11),
                "min_samples_leaf": randint(1, 11),
                "bootstrap": [True, False],
                "criterion": ["gini", "entropy"]}

param_grid_g = {"max_depth": [3, None],
                "max_features": [1, 3, 10],
                "min_samples_split": [2, 3, 10],
                "min_samples_leaf": [1, 3, 10],
                "bootstrap": [True, False],
                "criterion": ["gini", "entropy"]}
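To make the relationship between the two search spaces concrete, the following sketch counts the distinct settings in each one using only the standard library. The dictionaries `grid_values` and `sampled_values` are illustrative names, not part of the notebook; the integer ranges stand in for the `randint` distributions under the assumption that `randint(low, high)` excludes `high`.

```python
from math import prod

# Values enumerated by the grid search (copied from param_grid_g).
grid_values = {
    "max_depth": [3, None],
    "max_features": [1, 3, 10],
    "min_samples_split": [2, 3, 10],
    "min_samples_leaf": [1, 3, 10],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}

# Values the randomized search can draw (stand-in for param_dist_r):
# randint(1, 11) -> 1..10 and randint(2, 11) -> 2..10.
sampled_values = {
    "max_depth": [3, None],
    "max_features": list(range(1, 11)),
    "min_samples_split": list(range(2, 11)),
    "min_samples_leaf": list(range(1, 11)),
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}

# Number of distinct settings is the product of the per-parameter counts.
n_grid = prod(len(values) for values in grid_values.values())
n_sampled = prod(len(values) for values in sampled_values.values())

# Check that every grid value can also be drawn by the randomized search.
grid_is_subset = all(
    set(grid_values[name]) <= set(sampled_values[name]) for name in grid_values
)

print(n_grid, n_sampled, grid_is_subset)
```

The grid enumerates all of its settings, whereas the randomized search draws only `n_iter` of the settings available to it.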

In [5]:
# Run Randomized Search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist_r,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)

RandomizedSearchCV took 5.42 seconds for 20 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.932 (std: 0.016)
Parameters: {'bootstrap': False, 'min_samples_leaf': 2, 'min_samples_split': 8, 'criterion': 'gini', 'max_features': 10, 'max_depth': None}

Model with rank: 2
Mean validation score: 0.920 (std: 0.016)
Parameters: {'bootstrap': True, 'min_samples_leaf': 4, 'min_samples_split': 7, 'criterion': 'entropy', 'max_features': 7, 'max_depth': None}

Model with rank: 3
Mean validation score: 0.914 (std: 0.021)
Parameters: {'bootstrap': True, 'min_samples_leaf': 3, 'min_samples_split': 3, 'criterion': 'gini', 'max_features': 10, 'max_depth': None}



In [6]:
# Run Grid Search
grid_search = GridSearchCV(clf, param_grid=param_grid_g)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))
report(grid_search.cv_results_)

GridSearchCV took 56.20 seconds for 216 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.929 (std: 0.007)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'min_samples_split': 3, 'criterion': 'gini', 'max_features': 10, 'max_depth': None}

Model with rank: 2
Mean validation score: 0.928 (std: 0.013)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'min_samples_split': 3, 'criterion': 'gini', 'max_features': 3, 'max_depth': None}

Model with rank: 2
Mean validation score: 0.928 (std: 0.008)
Parameters: {'bootstrap': False, 'min_samples_leaf': 3, 'min_samples_split': 3, 'criterion': 'gini', 'max_features': 3, 'max_depth': None}

Model with rank: 2
Mean validation score: 0.928 (std: 0.011)
Parameters: {'bootstrap': False, 'min_samples_leaf': 3, 'min_samples_split': 2, 'criterion': 'gini', 'max_features': 10, 'max_depth': None}

Model with rank: 2
Mean validation score: 0.928 (std: 0.000)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'min_sa

The best parameter settings found by the two searches are quite similar, while the randomized search ran roughly ten times faster (5.42 seconds for 20 candidates versus 56.20 seconds for 216 candidates).
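The runtime gap can be broken down with a little arithmetic. The sketch below uses only the timings and candidate counts printed in the two runs above; the variable names are illustrative.

```python
# Timings and candidate counts copied from the two runs reported above.
random_seconds, random_candidates = 5.42, 20
grid_seconds, grid_candidates = 56.20, 216

# Overall speedup of the randomized search over the grid search.
speedup = grid_seconds / random_seconds

# Average cost of evaluating a single candidate in each search.
random_per_candidate = random_seconds / random_candidates
grid_per_candidate = grid_seconds / grid_candidates

print(f"Overall speedup: {speedup:.1f}x")
print(f"Seconds per candidate: random={random_per_candidate:.2f}, "
      f"grid={grid_per_candidate:.2f}")
```

The per-candidate cost is nearly the same in both searches, which suggests the saving comes almost entirely from evaluating fewer candidates.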

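In this run the best mean validation score is marginally higher for the randomized search (0.932 versus 0.929), but the gap is well within the reported standard deviations, so it is most likely a noise effect and would not carry over to a held-out test set.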
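One way to check whether a difference in validation score carries over is to hold out a test set before searching. The following is a minimal sketch, not part of the original notebook: the 25% split, the `random_state` values, the reduced parameter space, and `cv=3` are all assumptions chosen to keep the run short.

```python
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_digits(return_X_y=True)

# Assumed split: keep 25% of the data aside; the search never sees it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Small illustrative search space and budget (assumptions, not the
# notebook's settings) so that the sketch runs quickly.
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions={"max_features": randint(1, 11),
                         "min_samples_split": randint(2, 11)},
    n_iter=5, cv=3, random_state=0)
search.fit(X_train, y_train)

# Cross-validated score of the best candidate versus its accuracy on
# the held-out data (the best estimator is refit on all training data).
cv_score = search.best_score_
test_score = search.score(X_test, y_test)
print("CV score: %.3f, held-out score: %.3f" % (cv_score, test_score))
```

Running the same procedure for the grid search and comparing the two held-out scores gives a fairer picture than comparing the validation scores alone.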

Note that in practice, one would not search over this many different parameters simultaneously using grid search, but pick only the ones deemed most important.
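As a rough illustration of that advice, the sketch below runs a grid search over only two parameters. Which parameters matter most depends on the problem; `max_features` and `min_samples_split` are picked here purely as a hypothetical example, and `cv=3` and `random_state=0` are assumptions made to keep the run short and repeatable.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)

# Hypothetical reduced grid: two parameters, all others left at defaults.
reduced_grid = {"max_features": [1, 3, 10],
                "min_samples_split": [2, 3, 10]}

reduced_search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid=reduced_grid, cv=3)
reduced_search.fit(X, y)

# 3 x 3 = 9 candidates instead of the 216 in the full grid.
n_candidates = len(reduced_search.cv_results_["params"])
print("Evaluated %d candidates; best: %s"
      % (n_candidates, reduced_search.best_params_))
```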