# Hyperparameter Tuning

## Parameter vs Hyperparameter

## Overview of Methods

### Grid Search

Machine learning models are parameterized so that their behavior can be tuned for a given problem. Models can have many parameters and finding the best combination of parameters can be treated as a search problem.
 
Grid search is an approach to parameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.

The following code evaluates different alpha values for the Ridge Regression algorithm on the diabetes dataset.

#### Load Data and Libraries

In [1]:
# Grid Search for Algorithm Tuning
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.grid_search import GridSearchCV

# load the diabetes datasets
dataset = datasets.load_diabetes()

#### Define the Grid

What is the Grid?

In [2]:
# prepare a range of alpha values to test
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])

#### Perform the Grid Search

In [3]:
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(dataset.data, dataset.target)

print(grid)

GridSearchCV(cv=None, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': array([  1.00000e+00,   1.00000e-01,   1.00000e-02,   1.00000e-03,
         1.00000e-04,   0.00000e+00])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)


#### Summarize the Result

In [4]:
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

0.488790204461
0.001


### Random Search

Random search is an approach to parameter tuning that will sample algorithm parameters from a random distribution (i.e. uniform) for a fixed number of iterations. A model is constructed and evaluated for each combination of parameters chosen. 
 
The following code evaluates different alpha random values between 0 and 1 for the Ridge Regression algorithm on the diabetes dataset. 

#### Load Data and Libraries

In [5]:
# Randomized Search for Algorithm Tuning
import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.grid_search import RandomizedSearchCV

# load the diabetes datasets
dataset = datasets.load_diabetes()

#### Define the Grid

In [6]:
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}

#### Perform the Random Search

In [7]:
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(dataset.data, dataset.target)
print(rsearch)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
          fit_params={}, iid=True, n_iter=100, n_jobs=1,
          param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x10427bcf8>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          scoring=None, verbose=0)


#### Summarize the Result

In [8]:
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

0.489165385076
0.055347884724043395


### Grid Search vs Random Search: Which one is better?

* http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
* https://medium.com/rants-on-machine-learning/smarter-parameter-sweeps-or-why-grid-search-is-plain-stupid-c17d97a0e881#.z27w2p3sr
* http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html
* Hyperparameter Tuning - http://blog.sigopt.com/post/144221180573/evaluating-hyperparameter-optimization-strategies

In [7]:
import numpy as np

from time import time
from operator import itemgetter
from scipy.stats import randint as sp_randint

from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# get some data
digits = load_digits()
X, y = digits.data, digits.target

# build a classifier
clf = RandomForestClassifier(n_estimators=20)


# Utility function to report best scores
def report(grid_scores, n_top=3):
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("Parameters: {0}".format(score.parameters))
        print("")


# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(1, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 100
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.grid_scores_)

# use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.grid_scores_)))
report(grid_search.grid_scores_)

RandomizedSearchCV took 10.78 seconds for 100 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.931 (std: 0.014)
Parameters: {'bootstrap': False, 'criterion': 'entropy', 'max_features': 4, 'min_samples_leaf': 3, 'max_depth': None, 'min_samples_split': 2}

Model with rank: 2
Mean validation score: 0.930 (std: 0.011)
Parameters: {'bootstrap': False, 'criterion': 'gini', 'max_features': 8, 'min_samples_leaf': 4, 'max_depth': None, 'min_samples_split': 7}

Model with rank: 3
Mean validation score: 0.927 (std: 0.011)
Parameters: {'bootstrap': True, 'criterion': 'gini', 'max_features': 8, 'min_samples_leaf': 3, 'max_depth': None, 'min_samples_split': 2}

GridSearchCV took 21.07 seconds for 216 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.939 (std: 0.006)
Parameters: {'bootstrap': False, 'criterion': 'gini', 'max_features': 10, 'min_samples_leaf': 1, 'max_depth': None, 'min_samples_split': 3}

Model with rank: 2
Mean validation score: 0.93

* The randomized search and the grid search explore exactly the same space of parameters.
* The result in parameter settings is quite similar, while the run time for randomized search can be drastically lower.
* The performance is slightly worse for the randomized search, though this is most likely a noise effect and would not carry over to a held-out test set.
* Note that in practice, one would not search over this many different parameters simultaneously using grid search, but pick only the ones deemed most important.

# Parameter Estimation with grid search and cross-validation

* development set comprises only half of the available labeled data
* the performance of the selected hyper-parameters and trained model is then measured on a dedicated evaluation set that was not used during the model selection step

In [8]:
from __future__ import print_function

from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC


# Loading the Digits dataset
digits = datasets.load_digits()

# To apply an classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5,
                       scoring='%s_weighted' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    for params, mean_score, scores in clf.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

Automatically created module for IPython interactive environment
# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'gamma': 0.001, 'C': 10, 'kernel': 'rbf'}

Grid scores on development set:

0.987 (+/-0.018) for {'gamma': 0.001, 'C': 1, 'kernel': 'rbf'}
0.959 (+/-0.030) for {'gamma': 0.0001, 'C': 1, 'kernel': 'rbf'}
0.988 (+/-0.018) for {'gamma': 0.001, 'C': 10, 'kernel': 'rbf'}
0.982 (+/-0.027) for {'gamma': 0.0001, 'C': 10, 'kernel': 'rbf'}
0.988 (+/-0.018) for {'gamma': 0.001, 'C': 100, 'kernel': 'rbf'}
0.982 (+/-0.026) for {'gamma': 0.0001, 'C': 100, 'kernel': 'rbf'}
0.988 (+/-0.018) for {'gamma': 0.001, 'C': 1000, 'kernel': 'rbf'}
0.982 (+/-0.026) for {'gamma': 0.0001, 'C': 1000, 'kernel': 'rbf'}
0.974 (+/-0.014) for {'kernel': 'linear', 'C': 1}
0.974 (+/-0.014) for {'kernel': 'linear', 'C': 10}
0.974 (+/-0.014) for {'kernel': 'linear', 'C': 100}
0.974 (+/-0.014) for {'kernel': 'linear', 'C': 1000}

Detailed classification report:

The model 