Implement Gaussian Process based hyper-parameter optimizer #474

GaelVaroquaux opened this Issue Dec 18, 2011 · 30 comments


@GaelVaroquaux
scikit-learn member

Following Bergstra NIPS 2011 (@jaberg for the intimates) it would be great to have a GridSearch-like object implementing Gaussian-Process based hyper-parameter optimization.

To keep things simple, we would optimize only on continuous parameters, i.e. in a hyper-cube. The API would specify an initial bounding box, and also whether each parameter should be varied in the log-domain or in the linear-domain.

Some notes from a discussion with James Bergstra:

  • The algorithm goes mostly as follows (a minimal sketch is given after these notes):
    1. Compute the score on a set of random points in the specified hyper-cube.
    2. For i in budget:
       2.a fit a Gaussian Process to the pairs (parameters, scores);
       2.b find the parameters that maximize the expectation of getting a lower score than the best currently available. For this, use a black-box scipy optimizer (such as simulated annealing).

  • A lot of the gain comes from sequentially choosing the points, so, as we don't have a queue mechanism, we should do this sequentially for now. Parallelism can be used in the internal cross-validations, or to compute the initial points.

Remark: to compute the expectation of (X, y) with a Gaussian process, we should override `score` to use something more clever than the standard LinearRegression score.
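
A minimal sketch of such a loop; the GP API (`predict(return_std=True)`), scipy's `differential_evolution` as the black-box optimizer, and expected improvement as the criterion in 2.b are all illustrative choices here, not a settled design:

```python
import numpy as np
from scipy.optimize import differential_evolution
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor


def gp_minimize(score_func, bounds, n_init=10, budget=50, random_state=0):
    """Sequentially minimize score_func over the hyper-cube `bounds`
    (a list of (low, high) pairs), following the scheme above."""
    rng = np.random.RandomState(random_state)
    bounds = np.asarray(bounds, dtype=float)
    # 1. score a set of random points in the specified hyper-cube
    X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, len(bounds)))
    y = np.array([score_func(x) for x in X])

    for _ in range(budget):
        # 2.a fit a Gaussian Process to the (parameters, score) pairs
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        y_best = y.min()

        # 2.b maximize the expected improvement over the current best
        #     with a black-box scipy optimizer
        def neg_expected_improvement(x):
            mu, std = gp.predict(np.atleast_2d(x), return_std=True)
            std = np.maximum(std, 1e-12)
            gamma = (y_best - mu) / std
            ei = std * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
            return -ei[0]

        res = differential_evolution(neg_expected_improvement, bounds, seed=rng)
        X = np.vstack([X, res.x])
        y = np.append(y, score_func(res.x))

    return X[np.argmin(y)], y.min()
```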

@elyase

This would be great to have.

@kastnerkyle
scikit-learn member

Looking at this and other associated papers, there are 3 primary algorithm "families": GP based (Snoek et al.), TPE based (Bergstra et al.), and Random Forest based (SMAC, Leyton-Brown et al.). There are a few examples of these using simple algorithms on Boston housing and MNIST which seem like good fodder for a gist.

What I see as a possible way forward would be to first get a minimum working example, most likely of GP but perhaps TPE (I understand these 2 best). Then, implement some kind of SequentialSearchCV (name to be decided) that can house these hyperparameter optimizers while hiding as much of the implementation complexity as possible; a rough usage sketch is below. Ultimately these authors showed that each of the three algorithms can outperform the others in certain situations. Frankly, I am not sure how to handle that, except to offer options to choose each one while keeping the sanest default (or even choosing the hyperparameter optimizer based on the underlying algorithm).
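
For a rough picture of the kind of interface I have in mind, a hypothetical usage sketch (the class name, constructor arguments and the `strategy` option are all placeholders, nothing here is a settled or existing API):

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target

# SequentialSearchCV does not exist yet; this only mocks how it could look,
# mirroring the RandomizedSearchCV interface.
search = SequentialSearchCV(
    SVC(),
    param_bounds={"C": (1e-3, 1e3, "log"), "gamma": (1e-4, 1e1, "log")},
    strategy="gp",        # hypothetically: "gp", "tpe" or "smac"
    n_iter=50,
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```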

One other interesting thing that these allow you to do is tune for time - many of these papers use expected improvement per second as an optimization metric and find very fast, very performant settings.

cc @ogrisel

@rmcgibbo

I think that Jasper Snoek and co. are working internally on a new release of Spearmint, so it might be worth it for whoever starts working on this to coordinate with them to avoid duplicating effort. Their paper at ICML this year on input warping for the GP (http://arxiv.org/abs/1402.0929) seems like a pretty nice practical advance here. This depends on doing full Bayesian GPs, with MCMC over the GP's scale parameters -- they argue that this does much better than type-II likelihood maximization (the GP module here only does the latter).

The entropy search acquisition function is quite tricky to compute, but otherwise, once you've got the GP implementation, the rest is not very complex (e.g. expected improvement or EI/time).

From an API perspective, making it possible for the "SequentialSearchCV" object to persist itself to disk (e.g. joblib.Memory caching) would be very useful, since these calculations take a long enough time that you want to be able to restart them if they die.
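
For instance, just caching the expensive evaluation function with joblib would already make a search restartable; a small sketch (the cache directory and the `evaluate` signature are only illustrative):

```python
from joblib import Memory
from sklearn.model_selection import cross_val_score

# Results are memoized on disk, so if the search process dies, relaunching it
# replays the already-computed evaluations from the cache instead of refitting.
memory = Memory("./gp_search_cache", verbose=0)


@memory.cache
def evaluate(estimator_cls, params, X, y):
    return cross_val_score(estimator_cls(**params), X, y, cv=3).mean()
```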

@sds-dubois

I'd be happy to implement a GP-based hyperparameter optimizer. I recently worked on a similar project so I should be fine.
I propose to fix a budget (a number of iterations), as for the random-based search, and for each step:

  • model y = f(X) through a GP
  • sample randomly candidates within the hyperparameter space
  • compute the value of the acquisition function for each candidate, thanks to the GP model (where the acquisition function could be the expected improvement or the upper confidence bound)
  • select the candidate that maximizes the acquisition function as the next point to test

Sampling candidates in the parameter space makes it easier to handle discrete and continuous parameters at the same time.
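
A sketch of one such step, assuming the GP exposes predictive standard deviations (`predict(return_std=True)`) and using an upper confidence bound as the acquisition function (all names here are illustrative):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor


def next_candidate(X_observed, y_observed, sample_candidates,
                   n_candidates=500, kappa=2.0, random_state=0):
    """X_observed: (n, d) array of tested parameter vectors; y_observed:
    their cross-validation scores (higher is better). sample_candidates(n, rng)
    draws random parameter vectors, so discrete and continuous dimensions
    can be mixed freely."""
    rng = np.random.RandomState(random_state)
    # model score = f(parameters) with a GP
    gp = GaussianProcessRegressor(normalize_y=True).fit(X_observed, y_observed)
    # sample random candidates and score them with the acquisition function
    candidates = sample_candidates(n_candidates, rng)
    mu, std = gp.predict(candidates, return_std=True)
    ucb = mu + kappa * std  # upper confidence bound
    # the maximizer of the acquisition is the next point to evaluate
    return candidates[np.argmax(ucb)]
```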

@amueller
scikit-learn member

I am not entirely sure if the GP based method is better than SMAC, and implementing both is probably a hassle.
It would be nice to first find a use-case in scikit-learn. GP based methods don't really handle tree-based structures, so I'm not sure what a good application would be. Do you have any in mind?

@GaelVaroquaux
scikit-learn member
@amueller
scikit-learn member

It's all open source and I'm pretty sure they published the code to reproduce their experiments.

@kastnerkyle
scikit-learn member
@GaelVaroquaux
scikit-learn member
@kastnerkyle
scikit-learn member
@kastnerkyle
scikit-learn member
@amueller
scikit-learn member

auto-sklearn is SMAC.

I think we really need to decide what our application is. Something like auto-sklearn doesn't really work with GPs, as the parameter space is tree-shaped (contains a lot of conditionals). GPs can't handle that afaik.

Which algorithms are there in sklearn that need BO?

@sds-dubois

It's true that SMAC handles tree-based structures, but a GP-based method is definitely easier to implement (at least for me), and I think in most cases we don't really need the tree structure. And as @kastnerkyle said, they are complementary, so why not implement both?

Here is an example application:
[deepmining_workflow: pipeline diagram]
This presents a (deliberately simple) pipeline for the handwritten digit problem, with some hyperparameters that can be tuned. It can all be handled by GP-based optimization.
In that case, GP-based search is much faster than the randomized one, but there isn't any smart search currently implemented in sklearn.

Categorical choices can be seen as integer hyperparameters. And even if in some cases a tree structure could be useful, I don't think it's a real problem. In the example above, if we choose not to perform the blurring, then the kernel's std does not matter; still taking it into account when the blurring is not done makes the process a bit slower, but that's all.
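
For illustration, the encoding could look like this (the parameter names are made up, not taken from the actual pipeline):

```python
import numpy as np


def sample_params(rng):
    return {
        "blur": rng.randint(0, 2),          # categorical as integer: 0 = off, 1 = on
        "blur_std": rng.uniform(0.5, 3.0),  # ignored by the pipeline when blur == 0
        "svc_C": 10 ** rng.uniform(-3, 3),  # continuous, log-scaled
    }


params = sample_params(np.random.RandomState(0))
```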

@amueller
scikit-learn member

Or are we catering to people that do deep learning with a sklearn-like interface?

@GaelVaroquaux
scikit-learn member
@amueller
scikit-learn member

Alright, it's not like I'm against it ;) please go ahead.
You should think about whether you want to use the current GP (probably not) or the rewrite #4270 (probably yes). I'm not entirely sure about the status. I think it was basically ready to go. Talk to @jmetzen. Also ping @cangermueller who might want to help you :)

@jmetzen
scikit-learn member

Regarding the GP: the GP regression part of the PR #4270 should be pretty mature. The GP classification part does not support multi-class classification yet (only binary), that's why the PR is still in the WIP status. But since only regression matters in Bayesian Optimization (BO), it should be alright to use the GPs from the PR.
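
For what it's worth, with the #4270 API the kernel specification for BO would look roughly like this (assuming the PR's `GaussianProcessRegressor` and `Matern` kernel names; an anisotropic Matern 5/2 kernel with per-dimension length scales is the usual choice in the BO literature, e.g. Snoek et al.):

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

n_dims = 3  # illustrative number of hyperparameters being tuned
kernel = ConstantKernel(1.0) * Matern(length_scale=[1.0] * n_dims, nu=2.5)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                              n_restarts_optimizer=5)
```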

Regarding the GP-based BO: I already have some private code that performs BO based on #4270 (albeit not for hyperparameter tuning via an sklearn meta-estimator). I would definitely be interested in supporting you in integrating BO in sklearn. Do you already have a base optimizer (for finding the maximum of the acquisition function) in mind? Commonly used are DIRECT (implementation in nlopt) and CMA-ES, but there are no implementations in scipy unfortunately, and further dependencies are probably a no-go. Thus, we have to live with the scipy optimizers or reimplement DIRECT.
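
If we stay with scipy, one simple option is a multi-start bounded local search over the acquisition function; a sketch (purely illustrative, with `acquisition` assumed to map a parameter vector to a value to maximize):

```python
import numpy as np
from scipy.optimize import minimize


def maximize_acquisition(acquisition, bounds, n_restarts=10, random_state=0):
    """Restart L-BFGS-B from several random points in the bounding box and
    keep the best local optimum found."""
    rng = np.random.RandomState(random_state)
    bounds = np.asarray(bounds, dtype=float)
    best_x, best_val = None, -np.inf
    for _ in range(n_restarts):
        x0 = rng.uniform(bounds[:, 0], bounds[:, 1])
        res = minimize(lambda x: -acquisition(x), x0,
                       bounds=bounds, method="L-BFGS-B")
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x
```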

@jmetzen
scikit-learn member

Ah, I see, you want to randomly sample candidates in the parameter space rather than optimizing the acquisition function directly. That's ok for the start, I guess.

@sds-dubois

Yes, I guess it's the most straightforward way to deal with categorical/integer parameters, and it will still provide a great improvement over randomized search. But indeed, directly optimizing the acquisition function can be added later.
Is it fine for you if I write the code with sklearn's current GP and you then update the 3/4 lines where the GP API is called?

@amueller
scikit-learn member

Ah, I also didn't catch that you wanted to sample the space. That is probably fine for the start. Jasper Snoek mentioned he just seeds a grid and then does local optimization (gradient based?) from there.

@amueller
scikit-learn member

If you can make it work with the current GP, sure. I'm not sure how easy it will be to specify the kernels you need.

@kastnerkyle
scikit-learn member
@kegl

"Balasz Kegl recently told me (private communication) that SMAC didn't
work that well, and he wasn't convinced that any of the published
empirical evaluations were to be trusted."

Gael, I don't recall being so blunt, so let me elaborate. SMAC is certainly one of the best Bayesian optimization algorithms out there. What I might have meant is that in general there is no slam dunk experimental result over random search. My feeling is that all these methods are basically running a mixture of random sampling and "repulsion", but it's more an intuition than a formal result. Generally, the difference between SMAC and GP-based hyperopt is the same as the difference between RFs and GPs. RFs scale well both with the dimension and the number of points (and they can handle conditional parameters), whereas GPs are pretty good in small-dimensional spaces. RFs in my experience don't work well in small dimensional spaces as Bayesian regressors, and there is certainly no theoretical justification (afaik) for using the trees as samples from any distribution.

In general, nobody has done impartial large-scale survey experiments comparing the hyperparameter optimizers to figure out these issues. And I wouldn't hold my breath: these experiments take a lot of CPU power, it's not clear how to do them correctly, and there is no research payoff in writing such a paper unless you propose a new method. Note also that these methods are far from being black box. One has to specify an initial interval or a prior distribution for each hyperparameter, and it's not straightforward how to do so. The surrogate regressor also has to be tuned, and we bite our tail there.

My advice would be not to implement anything inside sklearn at this point; rather, let everybody propose hyperopt-like (SMAC-like, optunity-like) add-ons that people can try. It's an active research area, and I'm sure we'll accumulate more knowledge in the coming years.

@GaelVaroquaux
scikit-learn member
@amueller
scikit-learn member

Thank you for your input @kegl. I think we are mostly in agreement. If there are useful applications inside scikit-learn, I would consider adding something to scikit-learn, but it is clear that this is still a field very much in flux and I would not like to move too quickly here.

@sds-dubois

Hi all,

So I've worked a bit on this; please check my copy of sklearn, on the gp_hyperopt branch.
I've made two examples based on the Iris dataset and the 20 newsgroups dataset, which were already used as examples for grid/randomized search.

Here are the results I got (blue line is GP-based search, green line is randomized search; the first 20 iterations are random):

  • Iris dataset: [iris_results plot]
  • 20 newsgroups dataset: [text_results plot]

Let me know if you think there are indeed useful applications (and thus whether it's worth continuing), or if you have any suggestions.

PS: sorry, my code is not documented yet.

@jmetzen
scikit-learn member

The curves look promising for initial results. It would be interesting to also show the performance on a holdout test set. For the x-axis, computation time would be a more realistic metric than the number of iterations.

Could you create a PR so that we can comment on the code directly and see if the unit tests pass? It would also be nice if you could check PEP8 conformity; this would make your code easier to read.

@amueller
scikit-learn member

It would also be nice to compare against grid-search. (and yes, please do a PR)

@sds-dubois

Sorry for the delay.
I created the PR #5185

@FlorianWilhelm

Here is an article from Quantopian about Bayesian Optimization with some references to papers that could be of interest for this issue. I wonder which approach SigOpt is using.
