# Hyperparameter Optimization
> Hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters are learned. The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. 
Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model, one that minimizes a predefined loss function on given independent data. The objective function takes a tuple of hyperparameters and returns the associated loss. Cross-validation is often used to estimate this generalization performance.
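
To make the idea of an objective function concrete, here is a minimal sketch in plain Python. It is a toy, not part of the random forest workflow: the model is a one-dimensional ridge regression, the hyperparameter is the penalty `lam`, and the learned parameter is the slope `w`. The function names, the data, and the fold layout are all assumptions chosen for illustration.

```python
# Toy illustration: `lam` is a hyperparameter (chosen by us),
# `w` is a parameter (learned from the training data).

def fit_slope(xs, ys, lam):
    """Closed-form 1-D ridge regression without intercept: learns w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def objective(hyperparams, xs, ys, n_splits=5):
    """Map a tuple of hyperparameters to the mean validation loss over k folds."""
    (lam,) = hyperparams
    fold_losses = []
    for fold in range(n_splits):
        # Every n_splits-th point (offset by `fold`) is held out for validation.
        val_idx = set(range(fold, len(xs), n_splits))
        x_train = [x for i, x in enumerate(xs) if i not in val_idx]
        y_train = [y for i, y in enumerate(ys) if i not in val_idx]
        w = fit_slope(x_train, y_train, lam)
        errors = [(ys[i] - w * xs[i]) ** 2 for i in val_idx]
        fold_losses.append(sum(errors) / len(errors))
    return sum(fold_losses) / n_splits


# Synthetic data: y is roughly 2 * x with a small alternating offset.
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]

candidates = [(0.0,), (1.0,), (100.0,), (1000.0,)]
losses = {c: objective(c, xs, ys) for c in candidates}
best = min(losses, key=losses.get)
print("best hyperparameters:", best, "loss:", losses[best])
```

Every search method below does the same thing at a larger scale: it proposes hyperparameter tuples, calls an objective like this one, and keeps the tuple with the lowest loss.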

For each method, we'll show how to search for the optimal structure of a random forest classifier. Random forests are an ensemble model composed of a collection of decision trees.
Hyperparameters to keep in mind:
* How many estimators (i.e. decision trees) should be utilized?
* What should be the maximum allowable depth for each decision tree?
* What criterion to pick to measure the quality of a split?

# Import Libraries

In [None]:
import pandas as pd
import numpy as np

from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection

# Load Dataset and define features and target

In [None]:
df = pd.read_csv("../input/mobile-price-classification/train.csv")
X = df.drop("price_range", axis = 1).values #Features
y = df.price_range.values #target

# Grid Search with Random Forest Classifier
> Grid search is essentially an optimization algorithm which lets you select the best parameters for your optimization problem from a list of parameter options that you provide, hence automating the 'trial-and-error' method. It is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set.

In [None]:
classifier = ensemble.RandomForestClassifier(n_jobs=-1) #n_jobs = -1 means using all processors.
param_grid = {
    "n_estimators": [100, 200, 300, 400], 
    "max_depth": [1, 3, 7, 5],
    "criterion": ["gini", "entropy"],
}

model = model_selection.GridSearchCV(
    estimator=classifier,
    param_grid=param_grid, #Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries
    scoring="accuracy", #A single str to evaluate the predictions on the test set.
    n_jobs=1, #Number of jobs to run in parallel
    cv=5, #Determines the cross-validation splitting strategy. If the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
)

model.fit(X, y)

In [None]:
print(model.best_score_)
print(model.best_estimator_.get_params())

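Grid search evaluates every combination in the grid, so its cost grows multiplicatively with each hyperparameter or candidate value that is added. This makes it quite inefficient for larger search spaces.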
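
The cost of the grid can be counted without training anything. The sketch below enumerates the same grid with `itertools.product`; the variable names `cv_folds`, `combinations` and `n_fits` are introduced here for illustration only.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "n_estimators": [100, 200, 300, 400],
    "max_depth": [1, 3, 7, 5],
    "criterion": ["gini", "entropy"],
}
cv_folds = 5

# Cartesian product of all candidate values, one dict per configuration.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each configuration is trained once per cross-validation fold.
n_fits = len(combinations) * cv_folds
print(len(combinations), "configurations,", n_fits, "model fits")
# prints: 32 configurations, 160 model fits
```

With the default `refit=True`, `GridSearchCV` also trains the best configuration once more on the full dataset. Adding one more hyperparameter with four candidate values would multiply the count by four.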

# Random Search with Random Forest Classifier
> Random search replaces the exhaustive enumeration of all combinations with random sampling. It applies directly to the discrete setting described above, but also generalizes to continuous and mixed spaces. It can outperform grid search, especially when only a small number of hyperparameters affect the final performance of the machine learning algorithm. Random search differs from grid search in that we no longer need to provide a discrete set of values for each hyperparameter; instead, we can provide a statistical distribution from which values are randomly sampled.
We can also set how many configurations to sample when searching for the best model.
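
The sampling step on its own can be sketched in plain Python. This is an illustration of the idea, not how `RandomizedSearchCV` is implemented: the samplers are hypothetical lambdas, the seed is arbitrary, and `max_features` is added only to show a continuous hyperparameter.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Each hyperparameter maps to a sampler rather than to a fixed list of values.
param_distributions = {
    "n_estimators": lambda: rng.randrange(100, 1500, 100),  # 100, 200, ..., 1400
    "max_depth": lambda: rng.randint(1, 19),                 # integers 1..19
    "criterion": lambda: rng.choice(["gini", "entropy"]),
    "max_features": lambda: rng.uniform(0.01, 1.0),          # continuous range
}

n_iter = 10  # number of configurations to draw

samples = [
    {name: draw() for name, draw in param_distributions.items()}
    for _ in range(n_iter)
]

for sample in samples:
    print(sample)
```

The number of model fits is `n_iter` times the number of folds, regardless of how large the space is. In scikit-learn itself, `RandomizedSearchCV` accepts either lists of values or distribution objects that expose an `rvs` method, such as those in `scipy.stats`.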

In [None]:
classifier = ensemble.RandomForestClassifier(n_jobs=-1)
param_grid = {
    "n_estimators": np.arange(100, 1500, 100),
    "max_depth": np.arange(1, 20),
    "criterion": ["gini", "entropy"],
}
# Random search only evaluates n_iter sampled configurations, so its cost does not grow with the size of the search space.
model = model_selection.RandomizedSearchCV(
    estimator=classifier,
    param_distributions=param_grid,
    n_iter=10, #Number of parameter settings that are sampled. n_iter trades off runtime vs quality of the solution.
    scoring="accuracy",
    n_jobs=1,
    cv=5,
)
model.fit(X, y)

In [None]:
print(model.best_score_)    
print(model.best_estimator_.get_params())

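This search method works best under the assumption that not all hyperparameters are equally important: every sampled configuration tries a new value for each hyperparameter, so the few that matter get explored more densely than in a grid of the same size.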

# Bayesian Optimization with Gaussian Process 
> Bayesian optimization is a global optimization method for noisy black-box functions. Applied to hyperparameter optimization, Bayesian optimization builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set. By iteratively evaluating a promising hyperparameter configuration based on the current model, and then updating it, Bayesian optimization aims to gather observations revealing as much information as possible about this function and, in particular, the location of the optimum. It tries to balance exploration (hyperparameters for which the outcome is most uncertain) and exploitation (hyperparameters expected to be close to the optimum).

> Gaussian Processes (GPs) provide a rich and flexible class of non-parametric statistical models over function spaces with domains that can be continuous, discrete, mixed, or even hierarchical in nature. 

In [None]:
from functools import partial
# Sequential model-based optimization in Python
from skopt import space # Initialize a search space from given specifications.
from skopt import gp_minimize # Bayesian optimization using Gaussian Processes.

The idea is to approximate the objective function with a Gaussian process. In other words, the function values at any finite set of points are assumed to follow a multivariate Gaussian distribution, whose covariance is given by a GP kernel evaluated between the corresponding hyperparameter values. The next configuration to evaluate is then chosen by maximizing an acquisition function computed from the GP posterior, which is much cheaper to evaluate than the objective itself.
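
A single step of this loop can be sketched with NumPy alone. The code below is a simplified one-dimensional illustration, not what `gp_minimize` does internally: it assumes a zero-mean GP with a unit-variance RBF kernel, uses expected improvement as the acquisition function, and replaces the cross-validated loss with a hypothetical cheap function `toy_loss`.

```python
import math

import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k_obs = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_cross = rbf_kernel(x_obs, x_new)
    solved = np.linalg.solve(k_obs, k_cross)        # K^{-1} K_*
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_cross * solved, axis=0)    # k(x, x) = 1 for this kernel
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best):
    """Expected improvement for minimization: E[max(best - f(x), 0)]."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def toy_loss(x):
    """Hypothetical stand-in for an expensive cross-validated loss."""
    return np.sin(3.0 * x) + 0.1 * (x - 2.0) ** 2


# Three configurations evaluated so far.
x_obs = np.array([0.0, 1.5, 4.0])
y_obs = toy_loss(x_obs)

# Score a dense set of candidates with the cheap acquisition function.
candidates = np.linspace(0.0, 5.0, 101)
mean, std = gp_posterior(x_obs, y_obs, candidates)
ei = expected_improvement(mean, std, best=y_obs.min())

next_x = candidates[np.argmax(ei)]
print("next configuration to evaluate:", next_x)
```

Expected improvement is large where the predicted mean is low (exploitation) or where the predicted uncertainty is high (exploration). After evaluating `next_x`, the new observation is appended to `x_obs` and `y_obs` and the step repeats.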

In [None]:
# Function to minimize. Should take a single list of parameters and return the objective value.
def optimize(params, param_names, x, y):
    params = dict(zip(param_names, params)) # Create a dictionary of parameter names and values to feed into the model.
    model = ensemble.RandomForestClassifier(**params)
    kf = model_selection.KFold(n_splits=5)
    accuracies = []
    for train_idx, test_idx in kf.split(X=x, y=y):
        xtrain = x[train_idx]
        ytrain = y[train_idx]

        xtest = x[test_idx]
        ytest = y[test_idx]

        model.fit(xtrain, ytrain)
        preds = model.predict(xtest)
        fold_acc = metrics.accuracy_score(ytest, preds)
        accuracies.append(fold_acc)

    return -1.0 * np.mean(accuracies)

In [None]:
# Initialize a search space of max_depth, n_estimators, criterion and max_features
param_space = [
    space.Integer(3, 15, name="max_depth"),
    space.Integer(100, 600, name="n_estimators"),
    space.Categorical(["gini", "entropy"], name="criterion"),
    space.Real(0.01, 1, prior = "uniform", name="max_features")
]

param_names = [
    "max_depth",
    "n_estimators",
    "criterion",
    "max_features"
]

optimization_function = partial(
    optimize,
    param_names=param_names,
    x=X,
    y=y
)

result = gp_minimize(
    optimization_function,  # Function to minimize. Should take a single list of parameters and return the objective value.
    dimensions=param_space, # List of search space dimensions.
    n_calls=15, # Number of calls to func
    n_random_starts=10, # Number of random evaluations before the GP model is used (newer scikit-optimize releases call this n_initial_points). With n_calls=15, only 5 calls are model-guided.
    verbose=True # Print progress for each call. Advised for long optimization runs.
)

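print(dict(zip(param_names, result.x)))
print(-result.fun) # The objective returns negative accuracy, so flip the sign to get the best mean CV accuracy.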

Bayesian optimization can obtain better results in fewer evaluations than grid search and random search, because it uses the results of earlier evaluations to judge how promising a configuration is before running it.

# Hyperopt
> Hyperopt is a library for searching a hyperparameter space. For example, it can use the Tree-structured Parzen Estimator (TPE) algorithm, which explores the search space intelligently while narrowing down to the estimated best parameters. This is a guided random search, in contrast with grid search, where hyperparameter values are fixed in advance at regular steps. Guided random search of this kind has proven to be an effective technique for hyperparameter optimization.
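
The core idea behind TPE can be sketched in a few lines of plain Python. This is a heavily simplified one-dimensional illustration, not Hyperopt's implementation: the fixed bandwidth, the `gamma` split, the number of candidates, and the `toy_loss` function are all assumptions made for the example.

```python
import math
import random


def kde(points, bandwidth):
    """Return a Parzen (Gaussian kernel) density estimate built from `points`."""
    def density(x):
        total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
        return total / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    return density


def tpe_suggest(history, rng, gamma=0.25, n_candidates=24, bandwidth=0.5):
    """Propose the next x from (x, loss) pairs by maximizing l(x) / g(x)."""
    ranked = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, math.ceil(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]   # lowest losses
    bad = [x for x, _ in ranked[n_good:]]    # everything else
    l, g = kde(good, bandwidth), kde(bad, bandwidth)
    # Draw candidates from l(x): pick a good point and perturb it.
    candidates = [rng.gauss(rng.choice(good), bandwidth) for _ in range(n_candidates)]
    # Prefer points that are likely under "good" and unlikely under "bad".
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))


def toy_loss(x):
    """Hypothetical stand-in for an expensive cross-validated loss."""
    return (x - 3.0) ** 2


rng = random.Random(0)

# Random warm-up evaluations, then TPE-guided ones.
history = [(x, toy_loss(x)) for x in (rng.uniform(0.0, 10.0) for _ in range(10))]
warmup_best = min(loss for _, loss in history)

for _ in range(20):
    x_next = tpe_suggest(history, rng)
    history.append((x_next, toy_loss(x_next)))

best_x, best_loss = min(history, key=lambda pair: pair[1])
print("best x:", best_x, "loss:", best_loss)
```

Unlike the Gaussian process approach, TPE models the density of good and bad configurations rather than the loss itself, which is what lets it handle tree-structured and categorical spaces.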

In [None]:
from hyperopt import hp, fmin, tpe, Trials
from hyperopt.pyll.base import scope

In [None]:
# Optimization is finding the input value or set of values to an objective function that yields the lowest output value, called a “loss”. 
def optimize(params, x, y):
    model = ensemble.RandomForestClassifier(**params)
    kf = model_selection.KFold(n_splits=5)
    accuracies = []
    for train_idx, test_idx in kf.split(X=x, y=y):
        xtrain = x[train_idx]
        ytrain = y[train_idx]

        xtest = x[test_idx]
        ytest = y[test_idx]

        model.fit(xtrain, ytrain)
        preds = model.predict(xtest)
        fold_acc = metrics.accuracy_score(ytest, preds)
        accuracies.append(fold_acc)

    return -1.0 * np.mean(accuracies)

The way to use hyperopt is to describe:

* the objective function to minimize
* the space over which to search
* the database in which to store all the point evaluations of the search
* the search algorithm to use

In [None]:
# There are also quantized versions of these functions, which round the
# generated values to multiples of "q":
#     ∙ hp.quniform(label, low, high, q)
#     ∙ hp.qloguniform(label, low, high, q)

param_space = {
    "max_depth": scope.int(hp.quniform("max_depth", 3, 15, 1)),
    "n_estimators": scope.int(hp.quniform("n_estimators", 100, 600, 1)),
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
    "max_features": hp.uniform("max_features", 0.01, 1)
}


optimization_function = partial(
    optimize,
    x=X,
    y=y
)

trials = Trials() # The Trials object records every evaluation, so we can inspect what happened inside the hyperopt black box.

result = fmin(
    fn=optimization_function,
    space=param_space,
    algo=tpe.suggest,
    max_evals=15,
    trials=trials,
)

print(result)
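
One thing to watch when reading the printed result: `fmin` reports `hp.choice` parameters as an index into the list of options, and `hp.quniform` parameters as floats. The sketch below decodes a result by hand. The `raw_result` values are made up for illustration and are not output from a real run; Hyperopt's own `space_eval(param_space, result)` helper performs this conversion for you.

```python
# Hypothetical example of the kind of dictionary `fmin` returns for the
# search space above. These numbers are invented, not real results.
raw_result = {
    "criterion": 1,
    "max_depth": 9.0,
    "max_features": 0.42,
    "n_estimators": 350.0,
}

criterion_options = ["gini", "entropy"]  # same order as passed to hp.choice

decoded = {
    "criterion": criterion_options[raw_result["criterion"]],  # index -> label
    "max_depth": int(raw_result["max_depth"]),                 # float -> int
    "n_estimators": int(raw_result["n_estimators"]),           # float -> int
    "max_features": raw_result["max_features"],
}
print(decoded)
# prints: {'criterion': 'entropy', 'max_depth': 9, 'n_estimators': 350, 'max_features': 0.42}
```

The decoded dictionary can be passed directly to `ensemble.RandomForestClassifier(**decoded)` to train the final model.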