This repository has been archived by the owner on Feb 28, 2024. It is now read-only.

GBRT based minimizer #34

Merged
merged 8 commits on Apr 20, 2016

Conversation

betatim
Member

@betatim betatim commented Mar 29, 2016

Fixes #24.

This is a minimal implementation of a tree-based minimiser. Right now it only supports EI and can't deal with categorical variables.

The # XXX comments mark places where I am not sure whether it is worth doing something now (for this PR) or leaving it for later.

It looks like _expected_improvement should be refactored so it can be shared with the GP model. After a brief discussion with @glouppe, I think it is worth keeping it like this for the moment. In the longer term we might be able to do something smarter/more appropriate that is specialised to the tree-based minimizer.
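
For concreteness, here is a rough sketch of the kind of EI computation being discussed. It assumes a surrogate whose predict returns the 16%, 50% and 84% quantiles (as GradientBoostingQuantileRegressor does in this PR) and uses the standard closed-form Gaussian EI for minimisation; the helper name and signature are illustrative only, not the code in this PR.

import numpy as np
from scipy.stats import norm

def _expected_improvement(x, surrogate, best_y):
    # Sketch only: treat the 16-84% quantile band as +/- one std around the
    # median and plug that into the closed-form EI for minimisation.
    # Assumes the quantiles come back in (low, median, high) order.
    x = np.expand_dims(np.asarray(x), axis=0)
    low, mu, high = surrogate.predict(x).ravel()
    std = (high - low) / 2.0
    if std == 0.0:
        return 0.0
    z = (best_y - mu) / std
    return std * (z * norm.cdf(z) + norm.pdf(z))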

_random_point could also be shared, and in the future will need to get smarter for tree based models to deal with categorical variables.

Todo:

``bounds[i][0]`` should give the lower bound of each parameter and
``bounds[i][1]`` should give the upper bound of each parameter.

base_estimator: a GradientBoostingQuantileRegressor
Member

Is gradient boosting strictly a tree-based approach? We just fit the negative gradient to the data in every iteration, right? I would mention gradient boosting somewhere in the name of the function.

Member Author

I will add it to L45. For the name, how about gbt_minimize?

Member Author

This is addressed in 757b0b9.

@MechCoder
Member

Are gradient boosting methods popular in Bayesian optimization? As I asked before, we could maybe write a paper if we find that they are useful in some scenarios compared to GP- or RF-based approaches.

cc: @glouppe

@glouppe
Member

glouppe commented Apr 1, 2016

> Are gradient boosting methods popular in Bayesian optimization? As I asked before, we could maybe write a paper if we find out that they are useful in some scenarios as compared to GP or RF based approaches

I have actually never seen them used, but I don't see why they would not work. In general, I believe quite a few things can be explored with tree-based methods (of any kind). SMAC is the only example I know and it is a very successful proof-of-concept, despite the rather simple usage of RF. I expect that it can be improved in many interesting ways :)

@betatim
Member Author

betatim commented Apr 1, 2016

No idea. To be honest, I picked gradient boosting because scikit-learn supports quantile regression for it out of the box... and I like gradient boosting. No deeper reason.

To me it seems like it shouldn't matter what you use (trees, GPs, ...): all you need is a regression model to act as the surrogate function, plus an estimate of how (un)certain it is about its predictions. If your method can do this, you are cooking with gas 🔥.

With non-parametric models like trees you probably lose the "Bayesian" part; at least it becomes harder/impossible to specify a prior.

The advantage is that you don't care so much about high dimensions, categorical variables, weird ranges, etc.

@MechCoder
Member

I can comment in a less blunt manner after the weekend, but it would be great to have a set of benchmarks to see what improvements can be made to the existing GP methods and whether the new GBRT methods are better in different scenarios.

I hope to help contribute to that in the coming week.

@MechCoder
Member

Basically, https://github.com/MechCoder/scikit-optimize/issues/18

@betatim betatim force-pushed the tree-minimise branch 2 times, most recently from 2e9512c to 757b0b9 on April 5, 2016 at 12:47
@betatim
Member Author

betatim commented Apr 5, 2016

I would like to get this to the point where it passes all of the tests that the GP passes.

@betatim
Member Author

betatim commented Apr 5, 2016

Related to #41: I kept finding myself having to do .T, which was a smell for "something wrong here"; now it stacks horizontally so a, b, c = r.predict(X) works.

@glouppe
Member

glouppe commented Apr 5, 2016

Could you take care of #40 as well? :)

@betatim
Member Author

betatim commented Apr 5, 2016

Added to the todo.

@@ -0,0 +1,139 @@
"""Gradient boosted trees based minimization algorithms."""
Member

For consistency, can we rename that file gbrt_opt.py?

@betatim
Member Author

betatim commented Apr 5, 2016

We should also discuss which parts are worth factoring out into a shared place for GP, trees and future methods. It seems silly to have EI and friends duplicated as long as they don't do anything clever that is specialised for the method. Yet right now adapting the minor differences would probably take as much code as each function contains. So maybe a starting point would be a common set of tests that ensures the various acquisition function implementations compute the same thing.
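
As a concrete (hypothetical) starting point for such shared tests, one could check that an EI computed from the 16/50/84% quantiles agrees with the closed-form Gaussian EI whenever the quantiles actually come from a Gaussian; the helper names below are made up for illustration, not existing functions.

import numpy as np
from scipy.stats import norm

def gaussian_ei(mu, std, best_y):
    # closed-form EI for minimisation under a Gaussian posterior
    z = (best_y - mu) / std
    return std * (z * norm.cdf(z) + norm.pdf(z))

def quantile_ei(q16, q50, q84, best_y):
    # EI computed from the 16/50/84% quantiles, as a GBRT surrogate would provide
    std = (q84 - q16) / 2.0
    return gaussian_ei(q50, std, best_y)

def test_acquisitions_agree():
    rng = np.random.RandomState(0)
    mu = rng.uniform(-1.0, 1.0, size=20)
    std = rng.uniform(0.1, 1.0, size=20)
    # for a Gaussian the 16% and 84% quantiles sit at roughly mu -/+ std
    q16, q50, q84 = mu - std, mu, mu + std
    np.testing.assert_allclose(quantile_ei(q16, q50, q84, best_y=0.0),
                               gaussian_ei(mu, std, best_y=0.0))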

return lower + rng.rand(num_params) * delta


def gbt_minimize(func, bounds, base_estimator=None, maxiter=100,
Member

gbt -> gbrt (this is a more common acronym in ML)

Member

Also, as part of #38, I added an n_start argument. Surely this would be helpful here as well.

@betatim
Member Author

betatim commented Apr 6, 2016

Notes:

# do not want EI below zero, happens for points that are (much)
# worse than ``best_y``
ei = np.clip(ei, 0., np.max(ei))

Adding this doesn't actually help / do anything, which confuses me.

All benchmarks pass (with the same settings as for the GP), except for hart6.

Switched to using random sampling to find the minimum of the acquisition function for the moment. I can't think of a smarter way to do it right now.

Updated the todo list.
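
A minimal sketch of the random-sampling step mentioned above (evaluate the acquisition at random candidate points and keep the best one); the function name, signature and the assumption that the acquisition is an EI to be maximised are illustrative, not the PR's actual code.

import numpy as np

def _maximize_acquisition_by_sampling(acquisition, lower_bounds, upper_bounds,
                                      n_candidates=1000, random_state=None):
    # lower_bounds/upper_bounds are arrays with one entry per parameter,
    # following the ``bounds`` convention above.
    rng = np.random.RandomState(random_state)
    n_params = len(lower_bounds)
    X = lower_bounds + (upper_bounds - lower_bounds) * rng.rand(n_candidates, n_params)
    values = np.array([acquisition(x) for x in X])
    # keep the candidate with the largest acquisition value (EI is maximised)
    return X[np.argmax(values)]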


# Default estimator
if base_estimator is None:
base_estimator = GradientBoostingQuantileRegressor()
Member

Should we use a different default? The default max_depth should be smaller, right, since we need weak learners?

Member

Ooh, I just realized that the default max_depth is 3. Scratch this comment.

@MechCoder
Member

OK, so I ran a simple example to see if we are able to approximate the original function. This is what I get for the default values.

from skopt.gbt import GradientBoostingQuantileRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

import numpy as np
import matplotlib.pyplot as plt

X = np.array([0, 1, 2, 3, 4, 5, 6])
y = np.sin(X)
X = np.reshape(X, (-1, 1))
X_new = np.linspace(0, 6, 200)
X_new = np.reshape(X_new, (-1, 1))


gbqr = GradientBoostingQuantileRegressor()
# gpr = GaussianProcessRegressor()
gbqr.fit(X, y)
# predict returns the lower/median/upper quantiles for each point in X_new
l, mean, h = gbqr.predict(X_new)
# treat the quantile band as an approximate one-sigma interval
std = (h - l) / 2.0
plt.plot(X_new, mean)
plt.plot(X_new.ravel(), np.sin(X_new.ravel()))
plt.fill_between(X_new.ravel(), mean - std, mean + std, color='darkorange',
                 alpha=0.2)
plt.show()

[figure: GBRT quantile regression fit to sin(x); the predicted mean and uncertainty band are piecewise constant]

@MechCoder
Member

The graph does not change much when varying max_depth and n_estimators. It would be interesting to see what acquisition function is used with random forests; we could take inspiration from that.

@MechCoder
Member

Hopefully I'll be able to have a closer look over the weekend...

@betatim
Member Author

betatim commented Apr 14, 2016

Reading your email just now, I realise that you were worried about this plot. I misunderstood you: I thought "oh ja, nice plot" and moved on.

Which part of it makes you think there is something up?

@MechCoder
Member

I was just afraid that the mean and std are piecewise constant over quite large intervals, which means the acquisition function is also constant over large intervals, right? (Which also explains why lbfgs does not work as we would like and why you had to resort to random sampling.)

@MechCoder
Member

Something like scikit-learn/scikit-learn#6166 (comment) would be better.

@betatim
Member Author

betatim commented Apr 15, 2016

The piecewise-constant mu and std stop us from using bfgs etc., but I am not sure what else the tree could really predict. If you asked me (the TimRegressor) to make a prediction given only two or three points, I would probably make a constant prediction as well, as there isn't anything else to go on.

scikit-learn/scikit-learn#6166 (comment) still has small areas of constant prediction. I was wondering how "big" a constant region has to be before it trips up bfgs. Say we make a constant prediction over 0.1, 1 or 10% of the range of a variable; at some point the prediction will start appearing "continuous" to bfgs even though in reality it is still piecewise constant. I wonder what that length scale is, or whether we will always be stuck because of this.

Do you know what arXiv:1211.0906 does about this problem?

@@ -50,4 +50,4 @@ def fit(self, X, y):

     def predict(self, X):
         """Predictions for each quantile."""
-        return np.vstack([rgr.predict(X) for rgr in self.regressors_])
+        return np.asarray([rgr.predict(X) for rgr in self.regressors_]).T
Member

Can you rename this file to gbrt.py?

return ei


def _random_point(lower, upper, n_points=1, random_state=None):
Member

Is this function really necessary? We inlined it in gp_minimize by doing X = lb + (ub - lb) * rng.rand(n_points, n_params) instead.

Member Author

I needed it at several points and was planning ahead for when we have categorical variables as well, so I made a function out of it.

Member

Should be _random_points in any case

@betatim
Member Author

betatim commented Apr 20, 2016

Renamed gbt.py -> gbrt.py, but then remembered that we wanted to move the contents to skopt.learning. Should it be skopt.learning.gbrt.GradientBoost...? I would prefer skopt.learning.GradientBoost.... Once that is done, all items on the todo list should be ticked off.

@glouppe
Member

glouppe commented Apr 20, 2016

> Should it be skopt.learning.gbrt.GradientBoost...? I would prefer skopt.learning.GradientBoost....

Create learning/gbrt.py and then expose whatever is needed in learning/__init__.py, so that skopt.learning.GradientBoost... works.
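
For illustration, the re-export could look roughly like this (a sketch, not the actual file contents):

# learning/__init__.py
from .gbrt import GradientBoostingQuantileRegressor

__all__ = ("GradientBoostingQuantileRegressor",)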

@betatim
Member Author

betatim commented Apr 20, 2016

Travis is happy, Tim is happy, @glouppe happy too?

X = np.expand_dims(X, axis=0)

# low and high are assumed to be the 16% and 84% quantiles
low, mu, high = surrogate.predict(X).T
Member

What do you think about adding a return_std option to Quantile.predict that returns just the mean and std? That way we can easily refactor this into the existing implementation and add "LCB" support.

Member Author

Yes, because reimplementing the acquisition functions is silly. No, because we could do something smarter when you can estimate the quantile directly instead of having to assume things are Gaussian.

Member

Oh, we can keep the existing implementation by setting return_std to False by default, if that worries you.

Member

Just to clarify, I'm suggesting keeping both implementations: one that returns the quantiles by default and one that returns the mean and std when return_std is set to True.

If you want, I can do that refactoring separately after this has been merged.

Member Author

Aha! I hadn't even thought of such a behaviour. It is a good idea. We'd have to check that the user asked for the right quantiles to be estimated when they use return_std=True.

I was preoccupied with the thought that for LCB you'd only need to estimate one quantile instead of all three.

New PR?
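
To make the proposal concrete, here is a rough sketch of the return_std behaviour being discussed, assuming the regressors were fitted for the 16%, 50% and 84% quantiles in that order (this is a sketch, not the code that was merged):

import numpy as np

def predict(self, X, return_std=False):
    # default: one column per fitted quantile, as in the current implementation
    quantiles = np.asarray([rgr.predict(X) for rgr in self.regressors_]).T
    if not return_std:
        return quantiles
    # with return_std=True: median as the mean, std derived from the 16-84% band
    low, mu, high = quantiles.T
    return mu, (high - low) / 2.0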

lower_bounds, upper_bounds, n_points=n_start, random_state=rng)
best_x = Xi[:n_start].ravel()
yi[:n_start] = [func(xi) for xi in Xi[:n_start]]
best_y = np.min(yi[:n_start])
Member

best_x should be Xi[np.argmin(yi[:n_start])], right?

@MechCoder
Member

We can merge after Travis passes, but here are some thoughts. For things like lbfgs we need smooth functions; otherwise, as in this case, lbfgs degrades to random sampling since the gradient is zero almost everywhere.

Here is what they do in the SMAC paper, and they claim it gives smooth decision surfaces, though I don't immediately grasp the intuition. Say a split point is x(j, k); any point between x(j, k) and the next point x(j, k+1) works as a split point, so they sample the split point uniformly between x(j, k) and x(j, k+1).

This is described on page 17 of the paper (http://arxiv.org/pdf/1211.0906v2.pdf), just in case my interpretation is wrong.
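
If that reading is right, the core trick reduces to something like the following tiny (hypothetical) helper: any threshold between two consecutive sorted values of a feature induces the same partition of the training data, so the threshold is sampled uniformly in that interval instead of being placed exactly at a data point, which smooths the predicted surface.

import numpy as np

def sample_split_threshold(feature_values, k, random_state=None):
    # pick the split threshold uniformly between the k-th and (k+1)-th sorted
    # values rather than exactly at a training point (sketch of the SMAC idea)
    rng = np.random.RandomState(random_state)
    x_sorted = np.sort(np.asarray(feature_values, dtype=float))
    return rng.uniform(x_sorted[k], x_sorted[k + 1])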

@MechCoder
Member

Merging for now! Thanks :-)

@MechCoder MechCoder merged commit a6570e4 into scikit-optimize:master Apr 20, 2016
@betatim betatim deleted the tree-minimise branch April 20, 2016 14:43
@betatim
Member Author

betatim commented Apr 20, 2016

Thanks for all the comments and help!

@glouppe
Member

glouppe commented Apr 20, 2016

🍻
