This repository has been archived by the owner on Feb 28, 2024. It is now read-only.

GBRT based minimizer #34

Merged
merged 8 commits on Apr 20, 2016

Conversation

betatim
Member

@betatim betatim commented Mar 29, 2016

Fixes #24.

This is a minimal implementation of a tree-based minimiser. Right now it only supports EI and can't deal with categorical variables.

The # XXX comments mark places where I am not sure whether it is worth doing something now (for this PR) or leaving it for later.

It looks like _expected_improvement should be refactored so it can be shared with the GP model. After a brief discussion with @glouppe, I think it is worth keeping it like this for the moment. In the longer term we might be able to do something smarter/more appropriate that is specialised to the tree-based minimizer.
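
For concreteness, here is a rough sketch of the kind of EI computation being discussed. It assumes a surrogate whose predict returns the 16%, 50% and 84% quantiles (as GradientBoostingQuantileRegressor does in this PR) and uses the standard closed-form Gaussian EI for minimisation; the helper name and signature are illustrative only, not the code in this PR.

import numpy as np
from scipy.stats import norm

def _expected_improvement(x, surrogate, best_y):
    # Sketch only: treat the 16-84% quantile band as +/- one std around the
    # median and plug that into the closed-form EI for minimisation.
    # Assumes the quantiles come back in (low, median, high) order.
    x = np.expand_dims(np.asarray(x), axis=0)
    low, mu, high = surrogate.predict(x).ravel()
    std = (high - low) / 2.0
    if std == 0.0:
        return 0.0
    z = (best_y - mu) / std
    return std * (z * norm.cdf(z) + norm.pdf(z))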

_random_point could also be shared, and in the future will need to get smarter for tree based models to deal with categorical variables.

Todo:

``bounds[i][0]`` should give the lower bound of each parameter and
``bounds[i][1]`` should give the upper bound of each parameter.

base_estimator: a GradientBoostingQuantileRegressor
Member

Is gradient boosting strictly a tree-based approach? We just fit the negative gradient to the data in every iteration, right? I would mention gradient boosting somewhere in the name of the function.

Member Author

I will add it to L45. For the name, how about gbt_minimize?

Member Author

This is addressed in 757b0b9.

@MechCoder
Member

Are gradient boosting methods popular in Bayesian optimization? As I asked before, we could maybe write a paper if we find that they are useful in some scenarios compared to GP- or RF-based approaches.

cc: @glouppe

@glouppe
Member

glouppe commented Apr 1, 2016

> Are gradient boosting methods popular in Bayesian optimization? As I asked before, we could maybe write a paper if we find out that they are useful in some scenarios as compared to GP or RF based approaches

I have actually never seen them used, but I don't see why they would not work. In general, I believe quite a few things can be explored with tree-based methods (of any kind). SMAC is the only example I know and it is a very successful proof-of-concept, despite the rather simple usage of RF. I expect that it can be improved in many interesting ways :)

@betatim
Member Author

betatim commented Apr 1, 2016

No idea. To be honest, I picked gradient boosting because scikit-learn supports quantile regression for it out of the box... and I like gradient boosting. No deeper reason.

To me it seems like it shouldn't matter what you use (trees, GPs, ...): all you need is a regression model to act as the surrogate function, plus an estimate of how (un)certain it is about its predictions. If your method can do this, you are cooking with gas 🔥.

With non-parametric models like trees you probably lose the "Bayesian" part; at least it becomes harder/impossible to specify a prior.

The advantage is that you don't care so much about high dimensions, categorical variables, weird ranges, etc.

@MechCoder
Member

I can comment in a less blunt manner after the weekend, but it would be great to have a set of benchmarks to see what improvements can be made to the existing GP methods and whether the new GBRT methods are better in different scenarios.

I hope to help contribute to that in the coming week.

@MechCoder
Member

Basically, https://github.com/MechCoder/scikit-optimize/issues/18

@betatim betatim force-pushed the tree-minimise branch 2 times, most recently from 2e9512c to 757b0b9 on April 5, 2016 at 12:47
@betatim
Member Author

betatim commented Apr 5, 2016

I would like to get this to the point where it passes all of the tests that the GP passes.

@betatim
Member Author

betatim commented Apr 5, 2016

Related to #41: I kept finding myself having to do .T, which was a smell for "something wrong here"; now it stacks horizontally so a, b, c = r.predict(X) works.

@glouppe
Member

glouppe commented Apr 5, 2016

Could you take care of #40 as well? :)

@betatim
Member Author

betatim commented Apr 5, 2016

Added to the todo.

@@ -0,0 +1,139 @@
"""Gradient boosted trees based minimization algorithms."""
Member

For consistency, can we rename that file gbrt_opt.py?

@betatim
Member Author

betatim commented Apr 5, 2016

We should also discuss which parts are worth factoring out into a shared place for GP, trees and future methods. It seems silly to have EI and friends duplicated as long as they don't do anything clever that is specialised for the method. Yet right now adapting the minor differences would probably take as much code as each function contains. So maybe a starting point would be a common set of tests that ensures the various acquisition function implementations compute the same thing.
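
As a concrete (hypothetical) starting point for such shared tests, one could check that an EI computed from the 16/50/84% quantiles agrees with the closed-form Gaussian EI whenever the quantiles actually come from a Gaussian; the helper names below are made up for illustration, not existing functions.

import numpy as np
from scipy.stats import norm

def gaussian_ei(mu, std, best_y):
    # closed-form EI for minimisation under a Gaussian posterior
    z = (best_y - mu) / std
    return std * (z * norm.cdf(z) + norm.pdf(z))

def quantile_ei(q16, q50, q84, best_y):
    # EI computed from the 16/50/84% quantiles, as a GBRT surrogate would provide
    std = (q84 - q16) / 2.0
    return gaussian_ei(q50, std, best_y)

def test_acquisitions_agree():
    rng = np.random.RandomState(0)
    mu = rng.uniform(-1.0, 1.0, size=20)
    std = rng.uniform(0.1, 1.0, size=20)
    # for a Gaussian the 16% and 84% quantiles sit at roughly mu -/+ std
    q16, q50, q84 = mu - std, mu, mu + std
    np.testing.assert_allclose(quantile_ei(q16, q50, q84, best_y=0.0),
                               gaussian_ei(mu, std, best_y=0.0))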

return lower + rng.rand(num_params) * delta


def gbt_minimize(func, bounds, base_estimator=None, maxiter=100,
Member

gbt -> gbrt (this is a more common acronym in ML)

Member

Also, as part of #38, I added an n_start argument. Surely this would be helpful here as well.

@betatim
Member Author

betatim commented Apr 6, 2016

Notes:

# do not want EI below zero, happens for points that are (much)
# worse than ``best_y``
ei = np.clip(ei, 0., np.max(ei))

Adding this doesn't actually help / do anything, which confuses me.

All benchmarks pass (with the same settings as for the GP), except for hart6.

Switched to using random sampling to find the minimum of the acquisition function for the moment. I can't think of a smarter way to do it right now.

Updated the todo list.
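
A minimal sketch of the random-sampling step mentioned above (evaluate the acquisition at random candidate points and keep the best one); the function name, signature and the assumption that the acquisition is an EI to be maximised are illustrative, not the PR's actual code.

import numpy as np

def _maximize_acquisition_by_sampling(acquisition, lower_bounds, upper_bounds,
                                      n_candidates=1000, random_state=None):
    # lower_bounds/upper_bounds are arrays with one entry per parameter,
    # following the ``bounds`` convention above.
    rng = np.random.RandomState(random_state)
    n_params = len(lower_bounds)
    X = lower_bounds + (upper_bounds - lower_bounds) * rng.rand(n_candidates, n_params)
    values = np.array([acquisition(x) for x in X])
    # keep the candidate with the largest acquisition value (EI is maximised)
    return X[np.argmax(values)]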


# Default estimator
if base_estimator is None:
base_estimator = GradientBoostingQuantileRegressor()
Member

Should we use a different default? The default max_depth should be smaller, right, since we need weak learners?

Member

Ooh, I just realized that the default max_depth is 3. Scratch this comment.

@MechCoder
Member

OK, so I ran a simple example to see if we are able to approximate the original function. This is what I get for the default values.

from skopt.gbt import GradientBoostingQuantileRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

import numpy as np
import matplotlib.pyplot as plt

X = np.array([0, 1, 2, 3, 4, 5, 6])
y = np.sin(X)
X = np.reshape(X, (-1, 1))
X_new = np.linspace(0, 6, 200)
X_new = np.reshape(X_new, (-1, 1))


gbqr = GradientBoostingQuantileRegressor()
# gpr = GaussianProcessRegressor()
gbqr.fit(X, y)
# predict returns the lower/median/upper quantiles for each point in X_new
l, mean, h = gbqr.predict(X_new)
# treat the quantile band as an approximate one-sigma interval
std = (h - l) / 2.0
plt.plot(X_new, mean)
plt.plot(X_new.ravel(), np.sin(X_new.ravel()))
plt.fill_between(X_new.ravel(), mean - std, mean + std, color='darkorange',
                 alpha=0.2)
plt.show()

[figure: GBRT quantile regression fit to sin(x); the predicted mean and uncertainty band are piecewise constant]

@MechCoder
Member

The graph does not change much when varying max_depth and n_estimators. It would be interesting to see what acquisition function is used with random forests; we could take inspiration from that.

@MechCoder
Member

Hopefully I'll be able to have a closer look over the weekend...

@betatim
Member Author

betatim commented Apr 14, 2016

Reading your email just now, I realise that you were worried about this plot. I misunderstood you: I thought "oh ja, nice plot" and moved on.

Which part of it makes you think there is something up?

@MechCoder
Member

I was just afraid that the mean and std are piecewise constant over quite large intervals, which means the acquisition function is also constant over large intervals, right? (Which also explains why lbfgs does not work as we would like and why you had to resort to random sampling.)

@MechCoder
Member

Something like scikit-learn/scikit-learn#6166 (comment) would be better.

@betatim
Member Author

betatim commented Apr 15, 2016

The piecewise-constant mu and std stop us from using bfgs etc., but I am not sure what else the tree could really predict. If you asked me (the TimRegressor) to make a prediction given only two or three points, I would probably make a constant prediction as well, as there isn't anything else to go on.

scikit-learn/scikit-learn#6166 (comment) still has small areas of constant prediction. I was wondering how "big" a constant region has to be before it trips up bfgs. Say we make a constant prediction over 0.1, 1 or 10% of the range of a variable; at some point the prediction will start appearing "continuous" to bfgs even though in reality it is still piecewise constant. I wonder what that length scale is, or whether we will always be stuck because of this.

Do you know what arXiv:1211.0906 does about this problem?

@@ -50,4 +50,4 @@ def fit(self, X, y):

     def predict(self, X):
         """Predictions for each quantile."""
-        return np.vstack([rgr.predict(X) for rgr in self.regressors_])
+        return np.asarray([rgr.predict(X) for rgr in self.regressors_]).T
Member

Can you rename this file to gbrt.py?

return ei


def _random_point(lower, upper, n_points=1, random_state=None):
Member

Is this function really necessary? We inlined it in gp_minimize by doing X = lb + (ub - lb) * rng.rand(n_points, n_params) instead.

Member Author

I needed it at several points and was planning ahead for when we have categorical variables as well, so I made a function out of it.

Member

Should be _random_points in any case

@betatim
Member Author

betatim commented Apr 20, 2016

Renamed gbt.py -> gbrt.py, but then remembered that we wanted to move the contents to skopt.learning. Should it be skopt.learning.gbrt.GradientBoost...? I would prefer skopt.learning.GradientBoost.... Once that is done, all items on the todo list should be ticked off.

@glouppe
Member

glouppe commented Apr 20, 2016

> Should it be skopt.learning.gbrt.GradientBoost...? I would prefer skopt.learning.GradientBoost....

Create learning/gbrt.py and then expose whatever is needed in learning/__init__.py, so that skopt.learning.GradientBoost... works.
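
For illustration, the re-export could look roughly like this (a sketch, not the actual file contents):

# learning/__init__.py
from .gbrt import GradientBoostingQuantileRegressor

__all__ = ("GradientBoostingQuantileRegressor",)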

@betatim
Member Author

betatim commented Apr 20, 2016

Travis is happy, Tim is happy, @glouppe happy too?

X = np.expand_dims(X, axis=0)

# low and high are assumed to be the 16% and 84% quantiles
low, mu, high = surrogate.predict(X).T
Member

What do you think about adding a return_std option to Quantile.predict that returns just the mean and std? That way we can easily refactor this into the existing implementation and add "LCB" support.

Member Author

Yes, because reimplementing the acquisition functions is silly. No, because we could do something smarter when you can estimate the quantile directly instead of having to assume things are Gaussian.

Member

Oh, we can keep the existing implementation by setting return_std to False by default, if that worries you.

Member

Just to clarify, I'm suggesting keeping both implementations: one that returns the quantiles by default and one that returns the mean and std when return_std is set to True.

If you want, I can do that refactoring separately after this has been merged.

Member Author

Aha! I hadn't even thought of such a behaviour. It is a good idea. We'd have to check that the user asked for the right quantiles to be estimated when they use return_std=True.

I was preoccupied with the thought that for LCB you'd only need to estimate one quantile instead of all three.

New PR?
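
To make the proposal concrete, here is a rough sketch of the return_std behaviour being discussed, assuming the regressors were fitted for the 16%, 50% and 84% quantiles in that order (this is a sketch, not the code that was merged):

import numpy as np

def predict(self, X, return_std=False):
    # default: one column per fitted quantile, as in the current implementation
    quantiles = np.asarray([rgr.predict(X) for rgr in self.regressors_]).T
    if not return_std:
        return quantiles
    # with return_std=True: median as the mean, std derived from the 16-84% band
    low, mu, high = quantiles.T
    return mu, (high - low) / 2.0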

lower_bounds, upper_bounds, n_points=n_start, random_state=rng)
best_x = Xi[:n_start].ravel()
yi[:n_start] = [func(xi) for xi in Xi[:n_start]]
best_y = np.min(yi[:n_start])
Member

best_x should be Xi[np.argmin(yi[:n_start])], right?

@MechCoder
Member

We can merge after Travis passes, but here are some thoughts. For things like lbfgs we need smooth functions; otherwise, as in this case, lbfgs degrades to random sampling since the gradient is zero almost everywhere.

Here is what they do in the SMAC paper, and they claim it gives smooth decision surfaces, though I don't immediately grasp the intuition. Say a split point is x(j, k); any point between x(j, k) and the next point x(j, k+1) works as a split point, so they sample the split point uniformly between x(j, k) and x(j, k+1).

This is described on page 17 of the paper (http://arxiv.org/pdf/1211.0906v2.pdf), just in case my interpretation is wrong.
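
If that reading is right, the core trick reduces to something like the following tiny (hypothetical) helper: any threshold between two consecutive sorted values of a feature induces the same partition of the training data, so the threshold is sampled uniformly in that interval instead of being placed exactly at a data point, which smooths the predicted surface.

import numpy as np

def sample_split_threshold(feature_values, k, random_state=None):
    # pick the split threshold uniformly between the k-th and (k+1)-th sorted
    # values rather than exactly at a training point (sketch of the SMAC idea)
    rng = np.random.RandomState(random_state)
    x_sorted = np.sort(np.asarray(feature_values, dtype=float))
    return rng.uniform(x_sorted[k], x_sorted[k + 1])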

@MechCoder
Member

Merging for now! Thanks :-)

@MechCoder MechCoder merged commit a6570e4 into scikit-optimize:master Apr 20, 2016
@betatim betatim deleted the tree-minimise branch April 20, 2016 14:43
@betatim
Member Author

betatim commented Apr 20, 2016

Thanks for all the comments and help!

@glouppe
Member

glouppe commented Apr 20, 2016

🍻
