Following Bergstra's NIPS 2011 paper (@jaberg for the intimates), it would be great to have a GridSearch-like object implementing Gaussian-process-based hyper-parameter optimization.
To keep things simple, we would optimize only on continuous parameters, i.e. in a hyper-cube. The API would specify an initial bounding box, and also whether each parameter should be varied in the log domain or in the linear domain.
Some notes from a discussion with James Bergstra:
For i in budget:
(a) fit a Gaussian process to the pairs (parameters, scores);
(b) find the parameters that maximize the expectation of getting a lower score than the best currently available. For this, use a black-box scipy optimizer, such as simulated annealing.
A lot of the gain comes from sequentially choosing the points, so, as we don't have a queue mechanism, we should do this sequentially for now. Parallelism can still be used in the internal cross-validation, or to compute the initial points.
Remark: to compute this expectation from (X, y) with a Gaussian process, we should override score to use something more clever than the standard LinearRegression score.
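The loop above can be sketched end to end. This is a hypothetical toy illustration, not the proposed implementation: the 1-D objective, the [0, 3] bounding box, and the random candidate set standing in for a scipy black-box optimizer are all made up, and it uses today's GaussianProcessRegressor rather than the GP module discussed in this thread. The acquisition is the probability of scoring lower than the current best, as described in step (b):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(0)

def objective(x):
    # Toy 1-D "cross-validation error" standing in for a real estimator's score.
    return np.sin(3 * x) + 0.1 * (x - 1.5) ** 2

def prob_improvement(gp, X_cand, y_best):
    # Probability that a candidate scores lower than the current best,
    # under the GP posterior (step (b) above).
    mu, sigma = gp.predict(X_cand, return_std=True)
    z = (y_best - mu) / np.maximum(sigma, 1e-12)
    return norm.cdf(z)

X = rng.uniform(0, 3, size=(5, 1))   # initial random design inside the box [0, 3]
y = objective(X.ravel())

for _ in range(15):                  # fixed budget
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)      # step (a)
    X_cand = rng.uniform(0, 3, size=(200, 1))  # cheap stand-in for a scipy optimizer
    x_next = X_cand[np.argmax(prob_improvement(gp, X_cand, y.min()))]
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print(round(y.min(), 3))
```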
this would be great to have.
Looking at this and other associated papers, there are 3 primary algorithm "families": GP based (Snoek et al.), TPE based (Bergstra et al.), and Random Forest based (SMAC, Leyton-Brown et al.). There are a few examples of these using simple algorithms on Boston housing and MNIST which seem like good fodder for a gist.
What I see as a possible way forward would be to first get a minimum working example, most likely of GP but perhaps TPE (I understand these 2 best). Then, implement some kind of SequentialSearchCV (name to be decided) that can house these hyperparameter optimizers, while hiding as much of the implementation complexity as possible. Ultimately these authors showed that each of the three algorithms can have better performance than the others in certain situations. Frankly, I am not sure how to handle that, except have options to choose each one, while keeping the sanest default (or even choosing the hyperparameter optimizer based on the underlying algorithm).
One other interesting thing that these allow you to do is tune for time - many of these papers use expected improvement per second as an optimization metric and find very fast, very performant settings.
I think that Jasper Snoek and co. are working internally on a new release of Spearmint, so it might be worth it for whoever starts working on this to coordinate with them to avoid duplicating effort. Their paper at ICML this year on input warping for the GP (http://arxiv.org/abs/1402.0929) seems like a pretty nice practical advance here. This depends on doing full Bayesian GPs, with MCMC over the GP's scale parameters -- they argue that this does much better than type-II likelihood maximization. (The GP module here only does the latter.)
The entropy search acquisition function is quite tricky to compute, but otherwise, once you've got the GP implementation, the rest is not very complex (e.g. expected improvement or ei/time).
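As a rough sketch of the "ei/time" idea: the closed-form expected improvement (for minimization, under a Gaussian posterior) divided by a separately modeled runtime prediction, so that cheap configurations win when improvements are comparable. The posterior means, standard deviations, and predicted seconds below are made-up numbers, purely for illustration:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    # Closed-form EI for minimization: E[max(y_best - Y, 0)] when
    # Y ~ N(mu, sigma^2) is the GP posterior at a candidate point.
    sigma = np.maximum(sigma, 1e-12)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def ei_per_second(mu, sigma, y_best, predicted_seconds):
    # "EI per second": normalize by a runtime prediction (typically from a
    # second regression model) so fast settings are explored first.
    return expected_improvement(mu, sigma, y_best) / np.maximum(predicted_seconds, 1e-12)

mu = np.array([0.20, 0.25, 0.18])      # hypothetical posterior means
sigma = np.array([0.05, 0.10, 0.02])   # hypothetical posterior stds
seconds = np.array([1.0, 1.0, 60.0])   # hypothetical runtime predictions
print(ei_per_second(mu, sigma, y_best=0.22, predicted_seconds=seconds))
```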
From an API perspective, making it possible for the "SequentialSearchCV" object to persist itself to disk (e.g. joblib.Memory caching) would be very useful, since these calculations take a long enough time that you want to be able to restart them if they die.
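joblib.Memory caches function calls; a cruder but very simple alternative is to dump the accumulated (parameters, score) pairs after each iteration so a crashed run can resume from the evaluated points. A minimal sketch, where the search_state dict is hypothetical:

```python
import os
import tempfile

import joblib

# Hypothetical accumulated state of a sequential search: after each
# iteration, persist it so a killed run can be restarted from here.
search_state = {"params": [{"C": 1.0}, {"C": 10.0}], "scores": [0.81, 0.84]}

path = os.path.join(tempfile.mkdtemp(), "search_state.joblib")
joblib.dump(search_state, path)

restored = joblib.load(path)
print(restored["scores"])
```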
I'd be happy to implement a GP-based hyperparameter optimizer. I recently worked on a similar project so I should be fine.
I propose to fix a budget (a number of iterations), as in the randomized search, and for each step:
Sampling candidates in the parameter space makes it easier to handle discrete and continuous parameters at the same time.
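The candidate-sampling step could reuse the machinery behind randomized search. A sketch using today's ParameterSampler, with a hypothetical mixed space (a continuous parameter expressed as a log10 exponent, an integer one, and a categorical one); the surrogate would then score each candidate with the acquisition function and the best one is evaluated next, so no gradient-based optimizer is needed:

```python
import numpy as np
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

rng = np.random.RandomState(0)

# Hypothetical mixed parameter space.
space = {
    "log10_C": uniform(-3, 6),             # continuous: C between 1e-3 and 1e3
    "n_estimators": list(range(10, 200)),  # integer
    "kernel": ["rbf", "linear", "poly"],   # categorical
}

# Draw 100 candidate settings; the acquisition function picks among them.
candidates = list(ParameterSampler(space, n_iter=100, random_state=rng))
print(candidates[0])
```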
I am not entirely sure if the GP based method is better than SMAC, and implementing both is probably a hassle.
It would be nice to first find a use-case in scikit-learn. GP based methods don't really handle tree-based structures, so I'm not sure what a good application would be. Do you have any in mind?
It's all open source and I'm pretty sure they published the code to reproduce their experiments.
auto-sklearn is SMAC.
I think we really need to decide what our application is. Something like auto-sklearn doesn't really work with GPs, as the parameter space is tree-shaped (contains a lot of conditionals). GPs can't handle that afaik.
Which algorithms are there in sklearn that need BO?
It's true that SMAC handles tree-based structures, but a GP-based method is definitely easier to implement (at least for me) and I think in most cases we don't really need the tree structure. And as @kastnerkyle said, they are complementary, so why not implement both?
Here is an example of application:
This presents a (deliberately simple) pipeline to deal with the handwritten digit problem and some hyperparameters that can be tuned. It can all be handled by GP-based optimization.
In that case, GP-based search is much faster than the randomized one, but there isn't any smart search currently implemented in sklearn..
Categorical choices can be seen as integer hyperparameters. And even if in some cases a tree structure could be useful, I don't think it's a real problem. In the example above, if we choose not to perform the blurring, then the kernel's std does not matter. Still taking it into account when the blurring is not done makes the process a bit slower, but that's all.
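The point about flattening a conditional space can be made concrete. Below is a purely hypothetical toy: the blur choice is encoded as a 0/1 integer next to the continuous blur_std, and when blurring is off the objective simply ignores blur_std, so the "inactive" dimension only costs the surrogate some modeling effort:

```python
import numpy as np

def encode(params):
    # Flatten the conditional space into a fixed-length vector for the GP:
    # categorical blur -> {0, 1}, continuous blur_std kept as-is.
    return np.array([1.0 if params["blur"] else 0.0, params["blur_std"]])

def toy_score(params):
    # Hypothetical validation error: blurring with a well-chosen std helps.
    if not params["blur"]:
        return 0.10                  # blur_std has no effect on this branch
    return 0.08 + 0.01 * abs(params["blur_std"] - 2.0)

a = toy_score({"blur": False, "blur_std": 0.5})
b = toy_score({"blur": False, "blur_std": 7.0})
print(a == b)  # → True: the inactive parameter does not change the score
```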
Or are we catering to people that do deep learning with a sklearn-like interface?
alright. It's not like I'm against it ;) please go ahead.
You should think about whether you want to use the current GP (probably not) or the rewrite #4270 (probably yes). I'm not entirely sure about the status. I think it was basically ready to go. Talk to @jmetzen. Also ping @cangermueller who might want to help you :)
Regarding the GP: the GP regression part of the PR #4270 should be pretty mature. The GP classification part does not support multi-class classification yet (only binary), that's why the PR is still in the WIP status. But since only regression matters in Bayesian Optimization (BO), it should be alright to use the GPs from the PR.
Regarding the GP-based BO: I already have some private code that performs BO based on #4270 (albeit not for hyperparameter tuning via an sklearn meta-estimator). I would definitely be interested in supporting you in integrating BO in sklearn. Do you already have a base optimizer (for finding the maximum of the acquisition function) in mind? Commonly used are DIRECT (implementation in nlopt) and CMA-ES, but there are no implementations in scipy unfortunately, and further dependencies are probably a no-go. Thus, we have to live with the scipy optimizers or reimplement DIRECT.
Ah, I see, you want to randomly sample candidates in the parameter space rather than optimizing the acquisition function directly. That's ok for the start, I guess.
Yes, I guess it's the most straightforward way to deal with categorical parameters / integers, and that will still provide a great improvement from the randomized search. But indeed directly optimizing the acquisition function can be added later.
Is it fine for you if I write the code with sklearn's current GP and then you update the 3/4 lines where the GP API is called?
Ah, I also didn't catch that you wanted to sample the space. That is probably fine for the start. Jasper Snoek mentioned he just seeds a grid and then does local optimization (gradient based?) from there.
If you can make it work with the current GP, sure. I'm not sure how easy it will be to specify the kernels you need.
"Balasz Kegl recently told me (private communication) that SMAC didn't work that well, and he wasn't convinced that any of the published empirical evaluations were to be trusted."
Gael, I don't recall being so blunt, so let me elaborate. SMAC is certainly one of the best Bayesian optimization algorithms out there. What I might have meant is that in general there is no slam dunk experimental result over random search. My feeling is that all these methods are basically running a mixture of random sampling and "repulsion", but it's more an intuition than a formal result. Generally, the difference between SMAC and GP-based hyperopt is the same as the difference between RFs and GPs. RFs scale well both with the dimension and the number of points (and they can handle conditional parameters), whereas GPs are pretty good in small-dimensional spaces. RFs in my experience don't work well in small dimensional spaces as Bayesian regressors, and there is certainly no theoretical justification (afaik) for using the trees as samples from any distribution.
In general, nobody has done impartial large-scale survey experiments comparing the hyperparameter optimizers to figure out these issues. And I wouldn't cross my fingers: these experiments take a lot of CPU power, it's not clear how to do them correctly, and there is no research payoff in writing such a paper if you don't propose a new method. Note also that these methods are far from being black box. One has to specify an initial interval or a prior distribution for each hyperparameter, and it's not straightforward how to specify these. The surrogate regressor also has to be tuned, and we bite our tail there.
As an advice, I would not implement anything inside sklearn at this point, rather let everybody propose hyperopt-like (SMAC-like and optunity-like) add-ons that people can try. It's an active research area, and I'm sure we'll accumulate more knowledge in the next years.
Thank you for your input @kegl. I think we are mostly in agreement. If there are useful applications inside scikit-learn, I would consider adding something to scikit-learn, but it is clear that this is still a field very much in flux and I would not like to move too quickly here.
So I've worked a bit on this, please check it on my copy of sklearn on the gp_hyperopt branch.
I've made two examples based on the Iris dataset and the 20 newsgroups dataset which were already used as examples for grid / randomized search.
Here are the results I got (blue line is GP-based search and green line is randomized search; the first 20 iterations are random) :
Let me know if you think there are indeed useful applications (and so whether it's worth continuing) or if you have any suggestions.
PS: sorry my code is not documented yet..
The curves look promising for initial results. It would be interesting to also illustrate the performance on a holdout test set. For the x-axis, as an alternative to the number of iterations, the computation time would be a more realistic metric.
Could you create a PR such that we can directly comment code and see if unittests are ok? It would also be nice if you could check pep8 conformity; this would make it easier to read your code.
It would also be nice to compare against grid-search. (and yes, please do a PR)
Sorry for the delay..
I created the PR #5185
Here is an article from Quantopian about Bayesian Optimization with some references to papers that could be of interest for this issue. I wonder which approach SigOpt is using.