# Introduction

This tutorial introduces the concept of hyperparameter tuning. In particular, it focuses on a relatively new algorithm called LIPO, which is notable because it has no hyperparameters of its own and has been proven to be better than random search in many real situations.

# Installing the libraries

For this tutorial we're going to use two different libraries: ```scikit-learn```, and ```dlib```. ```Scikit-learn``` gives us a wide range of machine learning algorithms, as well as some good hyperparameter tuning algorithms to benchmark against. It does not implement LIPO however, so we're also going to use ```dlib``` which does have an implementation. To install both using conda:

```
conda install scikit-learn
conda install -c menpo dlib
```

Note that this is installing dlib from a different channel, so it's possible issues might arise in the future with this step.

We can now make sure this works by importing both ```dlib``` and ```scikit-learn```, as well as ```numpy``` to manipulate data and ```scipy``` for the sampling distributions we'll need later.

In [106]:
import dlib
import scipy.stats  # import the submodule explicitly; scipy.stats.expon is used later
import numpy as np
from sklearn import svm
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score

# Hyperparameters

A hyperparameter is a parameter of a machine learning algorithm that determines something about how the algorithm will run. In particular, hyperparameters are set before the algorithm runs and do not change value during the run. Some algorithms don't have any hyperparameters; a least squares regression, for example, requires nothing up front. In comparison, a support vector classifier with an RBF kernel has two significant hyperparameters, and a decision tree classifier has many more. When using any algorithm with hyperparameters, the values chosen are very important, as they significantly impact the performance of the algorithm. Choosing them, however, is a fairly difficult problem, especially for algorithms that have a large number of them.

As an example, let's use the aforementioned support vector classifier to classify irises based on their petal and sepal dimensions. There is a sample dataset with these features built into ```scikit-learn```; we can load it and look at the first few elements of both the data (features) and targets (actual classes). Looking at the frequency of each class label, we can see that we have 50 of each type of iris.

In [107]:
iris_data = datasets.load_iris()
print(iris_data.data[:5])
print()
print(iris_data.target[:5])
print()
print(np.asarray(np.unique(iris_data.target, return_counts=True)))

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

[0 0 0 0 0]

[[ 0  1  2]
 [50 50 50]]


Now we can create a few different support vector classifiers with different hyperparameters to see how they can change performance. For brevity's sake we'll just look at three different values for the ```C``` hyperparameter. In essence, this hyperparameter specifies how heavily misclassified training points are penalized: the larger ```C``` is, the harder the classifier tries to get every training point right.

Note that to determine the accuracy of each classifier we're using a stratified 10-fold cross validation. This means we divide the data into 10 smaller sets, each with roughly the same number of each type of flower. Then for each smaller set we train the classifier on the other 9 sets, before scoring it on the holdout set. The overall accuracy is the average of the 10 accuracies.
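To make the fold construction described above concrete, here is a minimal sketch in plain Python. The helper name `stratified_folds` and the round-robin dealing are assumptions made for illustration; ```scikit-learn``` builds its stratified folds internally and may assign samples differently.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign sample indices to folds so each fold has a similar class mix.

    Hypothetical helper for illustration only.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal each class's samples round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# 150 labels shaped like the iris targets: 50 of each class.
labels = [0] * 50 + [1] * 50 + [2] * 50
folds = stratified_folds(labels, n_folds=10)

train_sizes = []
for held_out in folds:
    held_out_set = set(held_out)
    # Train on the other 9 folds; a real run would fit and score here.
    train = [i for i in range(len(labels)) if i not in held_out_set]
    train_sizes.append(len(train))

fold_sizes = [len(fold) for fold in folds]
class_counts_first_fold = [
    sum(1 for i in folds[0] if labels[i] == c) for c in (0, 1, 2)
]
print(fold_sizes)
print(class_counts_first_fold)
print(train_sizes)
```

With 50 samples per class and 10 folds, each held-out fold ends up with 5 flowers of each type, and each training set has the remaining 135 samples.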

In [108]:
svc_1_classifier = svm.SVC(C=.0001)
svc_1_accuracy = cross_val_score(svc_1_classifier, iris_data.data, iris_data.target, cv=10)

svc_2_classifier = svm.SVC(C=1)
svc_2_accuracy = cross_val_score(svc_2_classifier, iris_data.data, iris_data.target, cv=10)

svc_3_classifier = svm.SVC(C=10000)
svc_3_accuracy = cross_val_score(svc_3_classifier, iris_data.data, iris_data.target, cv=10)

print(svc_1_accuracy.mean())
print(svc_2_accuracy.mean())
print(svc_3_accuracy.mean())

0.9333333333333333
0.9800000000000001
0.9133333333333333


We can see that the choice of hyperparameter makes a big difference! It's very possible that by adjusting it further we could do even better than 98% accuracy, and that's not even considering possible adjustments to the second hyperparameter, ```gamma```.

# Simple optimization methods

The middle value for ```C``` above is just the default in ```scikit-learn```, and the other values were chosen somewhat randomly. In this case it's likely that the value we found to be best is a good choice, but our only criterion for saying so is that it's close to the maximum possible accuracy. It's very possible to imagine a situation where we've hit a local maximum rather than a global one, or where the global maximum lies between two of our (again, randomly) chosen values.

## Grid search

Luckily there are a few simple ways to pick hyperparameters that are better than the guess-and-check we employed above. One of those is grid search, which involves searching across the Cartesian product of sets of values for each hyperparameter in question. We can observe this using our three values for ```C``` plus two other values in between, and picking 5 values for ```gamma``` that seem reasonable. We'll use ```scikit-learn```'s built-in ```GridSearchCV``` and once again measure accuracy with stratified 10-fold cross validation.

Note that we wrap our support vector classifier in the ```GridSearchCV```. When we call ```fit``` it tries all the parameter combinations and picks the best one, which we then score ourselves.
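As a rough illustration of what "all the parameter combinations" means, the sketch below enumerates the Cartesian product of the two value lists with `itertools.product`. It only lists the candidates; it is not how ```GridSearchCV``` is implemented, and the variable names are chosen for this example.

```python
from itertools import product

grid = {
    'C': [0.0001, 0.01, 1, 100, 10000],
    'gamma': [0.01, 0.1, 1, 10, 100],
}

names = list(grid)
# One dict per combination: 5 values of C times 5 values of gamma.
candidates = [
    dict(zip(names, values))
    for values in product(*(grid[name] for name in names))
]

print(len(candidates))
print(candidates[0])
print(candidates[-1])
```

Each of the 25 candidates would then be scored with cross validation, and the best-scoring one kept.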

In [109]:
grid_search_parameters = {
    'C': [0.0001, 0.01, 1, 100, 10000],
    'gamma': [0.01, 0.1, 1, 10, 100]
}

grid_search_svc = svm.SVC()
grid_search_classifier = GridSearchCV(grid_search_svc, grid_search_parameters, cv=10)
grid_search_classifier.fit(iris_data.data, iris_data.target)

grid_search_accuracy = cross_val_score(
    grid_search_classifier.best_estimator_,
    iris_data.data,
    iris_data.target,
    cv=10
)

print(grid_search_accuracy.mean())
print()
print(grid_search_classifier.best_params_)

0.9800000000000001

{'C': 100, 'gamma': 0.01}


The results show that we managed to match our accuracy from before. But to some degree we were guided by the default values from ```scikit-learn```. Grid search can be useful, but it requires some knowledge of what values might work well for each hyperparameter.

## Random Search

It might seem unintuitive, but randomly choosing hyperparameters is actually a relatively effective strategy. One reason is that varying any single hyperparameter (as grid search does along each axis) often doesn't change accuracy by very much, meaning that a lot of the time spent by grid search is wasted. As we saw above, we have a decent idea of what the hyperparameters should be for the data that we're using, so grid search could still be more effective here, but we can demonstrate that random search can also do a good job. Note that we still have to provide a distribution for each of the hyperparameters, but this is just as simple as, if not simpler than, picking values like we did for grid search.
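The sampling step itself is simple. The sketch below draws 25 candidate settings from exponential distributions using only the standard library; `sample_candidate` is a hypothetical helper, and the fixed seed is there only to make the sketch repeatable. Note that `random.expovariate` takes a rate, so a scale of 100 corresponds to a rate of 1/100.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample_candidate(rng):
    """Draw one (C, gamma) pair; hypothetical helper for illustration."""
    return {
        # expovariate takes a rate, which is 1 / scale.
        'C': rng.expovariate(1 / 100),      # mean 100
        'gamma': rng.expovariate(1 / 0.1),  # mean 0.1
    }


candidates = [sample_candidate(rng) for _ in range(25)]

for candidate in candidates[:3]:
    print(candidate)
```

Unlike a grid, no two candidates share a value of ```C``` or ```gamma```, so 25 draws explore 25 distinct values of each hyperparameter.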

In [127]:
random_search_distributions = {
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1)
}

random_search_svc = svm.SVC()
random_search_classifier = RandomizedSearchCV(
    random_search_svc,
    random_search_distributions,
    cv=10,
    n_iter=25)
random_search_classifier.fit(iris_data.data, iris_data.target)

random_search_accuracy = cross_val_score(
    random_search_classifier.best_estimator_,
    iris_data.data,
    iris_data.target,
    cv=10
)

print(random_search_accuracy.mean())
print()
print(random_search_classifier.best_params_)

0.9866666666666667

{'C': 51.617731761813545, 'gamma': 0.008101361429248288}


As it turns out, random search does even better than grid search here. We also only examined 25 different sets of hyperparameters, which is the exact same number that we used for grid search.

Assuming you can pick appropriate distributions for each hyperparameter, random search can be very effective. However, if that isn't the case it might not work so well. For example, if we picked a random number for ```gamma``` between 0 and 10, it's likely our result wouldn't have been nearly as good. So we need to know both a range and a distribution within that range.
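A quick calculation shows why the distribution matters so much. The cutoff of 0.02 below is an assumption chosen for illustration (the good values of ```gamma``` found above were around 0.01); the code compares how likely a single draw, and a batch of 25 draws, is to land under that cutoff for a uniform distribution on [0, 10] versus an exponential distribution with scale 0.1.

```python
import math

threshold = 0.02  # assumed cutoff for a "small enough" gamma; illustrative only
n_draws = 25

# Probability that one draw falls below the threshold.
p_uniform = threshold / 10.0                # uniform on [0, 10]
p_expon = 1.0 - math.exp(-threshold / 0.1)  # exponential with scale 0.1


def at_least_one(p, n):
    """Probability that at least one of n independent draws succeeds."""
    return 1.0 - (1.0 - p) ** n


print(f"uniform:     {p_uniform:.4f} per draw, "
      f"{at_least_one(p_uniform, n_draws):.4f} over {n_draws} draws")
print(f"exponential: {p_expon:.4f} per draw, "
      f"{at_least_one(p_expon, n_draws):.4f} over {n_draws} draws")
```

Under these assumptions the uniform distribution will usually never sample the useful region in 25 tries, while the exponential one almost always will.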

# LIPO

So far none of our algorithms for choosing hyperparameters have looked at the actual function we're trying to optimize; they've just been various ways to choose the hyperparameters themselves. They take all the function outputs (accuracies) and pick the inputs (hyperparameters) that maximize that output. But it turns out that by looking at the inputs and outputs together we can make smarter decisions about what additional inputs to try. This is what LIPO does.

Central to the idea of LIPO are Lipschitz functions: functions for which the slope of the line between any pair of points on the graph is bounded in magnitude by a real number, known as the Lipschitz constant. It's easy to imagine that the function we're trying to optimize, accuracy as a function of the hyperparameters, is a Lipschitz function.
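Written out, a function $f$ is Lipschitz with constant $k$ when the following holds. The norm is assumed here to be the Euclidean one, which is the usual choice; the description above does not specify it.

$$
|f(x) - f(y)| \le k \, \lVert x - y \rVert \quad \text{for all } x, y
$$

Taking $y$ to be a point $x_i$ where $f$ has already been evaluated and rearranging gives a bound on the unknown value at any other point:

$$
f(x) \le f(x_i) + k \, \lVert x - x_i \rVert
$$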

It's also simple to see that if we know the Lipschitz constant for a function we can take our set of known points and bound the function at any point. For example, in the image below the black points are known, and the green lines have slope equal to the Lipschitz constant:

![Lipschitz constant](https://1.bp.blogspot.com/-J6B_BQWCR8o/WkFzO5qfFyI/AAAAAAAAA18/slBcjvnaupoNUlueG-I_V9BVxAwWfGDQwCEwYBhgL/s1600/g4175.png)
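The bound pictured above can be computed directly. In this one-dimensional sketch, each known point contributes a "V" shaped limit that rises with slope `k` on either side, and the tightest of those limits is the upper bound. The function name `upper_bound` and the sample points are made up for illustration.

```python
def upper_bound(x, known_points, k):
    """Largest value f(x) can take, given known (x_i, f(x_i)) pairs and
    a Lipschitz constant k. One-dimensional, for illustration only."""
    return min(fx + k * abs(x - xi) for xi, fx in known_points)


# Made-up points that are consistent with k = 1.
known_points = [(0.0, 1.0), (2.0, 0.5), (5.0, 2.0)]
k = 1.0

print(upper_bound(1.0, known_points, k))  # min(1 + 1, 0.5 + 1, 2 + 4) = 1.5
print(upper_bound(3.5, known_points, k))  # min(1 + 3.5, 0.5 + 1.5, 2 + 1.5) = 2.0
print(upper_bound(5.0, known_points, k))  # at a known point the bound is f itself: 2.0
```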



Keeping the above image in mind, the core of LIPO is relatively intuitive: at every step we evaluate the function at the point with the highest upper bound and then adjust our bounds based on the new point. But how do we know what value to use as the Lipschitz constant? In the paper presenting LIPO, the authors provide a simple method that they show works well: simply taking the largest slope seen so far as the Lipschitz constant.
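The loop described above can be sketched in a few lines for a one-dimensional function. This is a simplified illustration of the idea as described here, not the procedure from the LIPO paper and not what ```dlib``` implements: the fixed candidate grid, the three seed points, and the names `largest_slope`, `upper_bound` and `lipo_sketch` are all assumptions made for this example.

```python
def largest_slope(points):
    """Largest absolute slope between any two evaluated points."""
    return max(
        abs(fa - fb) / abs(xa - xb)
        for i, (xa, fa) in enumerate(points)
        for xb, fb in points[i + 1:]
    )


def upper_bound(x, points, k):
    """Tightest Lipschitz upper bound on f(x) given the evaluated points."""
    return min(fx + k * abs(x - xi) for xi, fx in points)


def lipo_sketch(f, lower, upper, n_evals, n_candidates=201):
    """Greedy 1-D sketch: always evaluate the untried grid point whose
    upper bound is highest. Illustrative only."""
    xs = [lower + (upper - lower) * i / (n_candidates - 1)
          for i in range(n_candidates)]
    # Seed with both ends and the middle so a slope can be estimated
    # (an assumption of this sketch).
    seed_indices = {0, n_candidates // 2, n_candidates - 1}
    points = [(xs[i], f(xs[i])) for i in sorted(seed_indices)]
    untried = [x for i, x in enumerate(xs) if i not in seed_indices]

    while len(points) < n_evals and untried:
        k = largest_slope(points)  # largest slope seen so far
        best_x = max(untried, key=lambda x: upper_bound(x, points, k))
        untried.remove(best_x)
        points.append((best_x, f(best_x)))  # new point tightens the bounds

    return max(points, key=lambda p: p[1])


# Toy objective with its maximum at x = 1.3.
best_x, best_value = lipo_sketch(lambda x: -(x - 1.3) ** 2, 0.0, 4.0, n_evals=25)
print(best_x, best_value)
```

Because the slope estimate comes only from points already seen, it can be too small early on, which is one reason practical implementations add safeguards beyond this basic loop.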

It turns out that there are some other minor issues with LIPO. One is that it's possible for the observed Lipschitz constant to be infinity. The solution to this is fairly complicated, but involves adding noise at points where it's likely an infinite slope would exist. Another problem is that LIPO is not particularly good at converging once the area of the maximum has been identified. Since this is an issue with the core of the algorithm, ```dlib``` implements a version that switches to another method once LIPO identifies the area of the maximum.

In the implementation below we can see one of the major benefits of using LIPO: all we have to provide are bounds for each hyperparameter.

In [133]:
def f(C, gamma):
    LIPO_classifier = svm.SVC(C=C, gamma=gamma)
    LIPO_accuracy = cross_val_score(LIPO_classifier, iris_data.data, iris_data.target, cv=10)
    return LIPO_accuracy.mean()

[C, gamma], LIPO_accuracy = dlib.find_max_global(f, [.001, .001], [1000, 1000], 25)

print(LIPO_accuracy)
print()
print(C, gamma)

0.9933333333333334

16.52036505133443 0.011262444232475159


As it turns out, LIPO does slightly better than random search on this particular problem (about 99.3% accuracy versus 98.7%). The hyperparameters it settles on are also in the same neighborhood as the ones found by random search: a moderate ```C``` and a ```gamma``` close to 0.01.

# Benchmarking

The test that we did so far was relatively simple, as evidenced by the fact that all three of our hyperparameter tuning methods managed to achieve very high (98%+) accuracy with just a few iterations. In particular, the optimal values for the hyperparameters were close to the ```scikit-learn``` defaults, meaning that grid search worked better than it otherwise might.

In addition to the Iris dataset, ```scikit-learn``` includes a dataset of hand-written digits. We can use some code similar to what we wrote above to benchmark the different hyperparameter tuning methods against each other. We begin by loading and examining the data. Each image is 8x8 so we have 64 features for each digit, and we can see that there are roughly equal numbers of each digit.

In [112]:
digit_data = datasets.load_digits()
print(digit_data.data[:5])
print()
print(digit_data.target[:5])
print()
print(np.asarray(np.unique(digit_data.target, return_counts=True)))

[[ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.  0.  0.  3.
  15.  2.  0. 11.  8.  0.  0.  4. 12.  0.  0.  8.  8.  0.  0.  5.  8.  0.
   0.  9.  8.  0.  0.  4. 11.  0.  1. 12.  7.  0.  0.  2. 14.  5. 10. 12.
   0.  0.  0.  0.  6. 13. 10.  0.  0.  0.]
 [ 0.  0.  0. 12. 13.  5.  0.  0.  0.  0.  0. 11. 16.  9.  0.  0.  0.  0.
   3. 15. 16.  6.  0.  0.  0.  7. 15. 16. 16.  2.  0.  0.  0.  0.  1. 16.
  16.  3.  0.  0.  0.  0.  1. 16. 16.  6.  0.  0.  0.  0.  1. 16. 16.  6.
   0.  0.  0.  0.  0. 11. 16. 10.  0.  0.]
 [ 0.  0.  0.  4. 15. 12.  0.  0.  0.  0.  3. 16. 15. 14.  0.  0.  0.  0.
   8. 13.  8. 16.  0.  0.  0.  0.  1.  6. 15. 11.  0.  0.  0.  1.  8. 13.
  15.  1.  0.  0.  0.  9. 16. 16.  5.  0.  0.  0.  0.  3. 13. 16. 16. 11.
   5.  0.  0.  0.  0.  3. 11. 16.  9.  0.]
 [ 0.  0.  7. 15. 13.  1.  0.  0.  0.  8. 13.  6. 15.  4.  0.  0.  0.  2.
   1. 13. 13.  0.  0.  0.  0.  0.  2. 15. 11.  1.  0.  0.  0.  0.  0.  1.
  12. 12.  1.  0.  0.  0.  0.  0.  1. 10.  8.  0.  0.  0.

In [128]:
grid_search_parameters = {
    'C': [0.0001, 0.01, 1, 100, 10000],
    'gamma': [0.01, 0.1, 1, 10, 100]
}

grid_search_svc = svm.SVC()
grid_search_classifier = GridSearchCV(grid_search_svc, grid_search_parameters, cv=10)
grid_search_classifier.fit(digit_data.data, digit_data.target)

grid_search_accuracy = cross_val_score(
    grid_search_classifier.best_estimator_,
    digit_data.data,
    digit_data.target,
    cv=10
)

print(grid_search_accuracy.mean())
print()
print(grid_search_classifier.best_params_)

0.751209710760296

{'C': 100, 'gamma': 0.01}


In [129]:
random_search_distributions = {
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1)
}

random_search_svc = svm.SVC()
random_search_classifier = RandomizedSearchCV(
    random_search_svc,
    random_search_distributions,
    cv=10,
    n_iter=25)
random_search_classifier.fit(digit_data.data, digit_data.target)

random_search_accuracy = cross_val_score(
    random_search_classifier.best_estimator_,
    digit_data.data,
    digit_data.target,
    cv=10
)

print(random_search_accuracy.mean())
print()
print(random_search_classifier.best_params_)

0.9816944170412627

{'C': 37.34961130566053, 'gamma': 0.001381501094060212}


In [131]:
def f(C, gamma):
    LIPO_classifier = svm.SVC(C=C, gamma=gamma)
    LIPO_accuracy = cross_val_score(LIPO_classifier, digit_data.data, digit_data.target, cv=10)
    return LIPO_accuracy.mean()

[C, gamma], LIPO_accuracy = dlib.find_max_global(f, [.001, .001], [1000, 1000], 25)

print(LIPO_accuracy)
print()
print(C, gamma)

0.9822593887926752

999.9999999999998 0.0015964154273407744


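These results are very much in line with what we'd expect. Grid search does fairly poorly (about 75% accuracy): the best ```gamma``` it can pick is 0.01, the smallest value in our grid, while random search and LIPO both settle on values several times smaller. Those two methods both do very well, at roughly 98% accuracy.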

# Conclusion

LIPO certainly isn't perfect; despite the excellent results obtained in the very limited tests we ran above, there are still flaws. That being said, it does work well in many situations (and the authors of the original LIPO paper proved as much). Perhaps even more importantly it's easy to use. The lack of hyperparameters makes it an excellent first (and perhaps last, depending on the quality of the results) option when looking to optimize hyperparameters.