Conversation
``bounds[i][0]`` should give the lower bound of each parameter and
``bounds[i][1]`` should give the upper bound of each parameter.

base_estimator: a GradientBoostingQuantileRegressor
Is gradient boosting strictly a tree-based approach? We just fit the negative gradient to the data in every iteration, right?
I would mention gradient boosting somewhere in the name of the function.
I will add it to L45. For the name, how about `gbt_minimize`?
this is addressed in 757b0b9
Are gradient boosting methods popular in Bayesian optimization? As I asked before, we could maybe write a paper if we find out that they are useful in some scenarios compared to GP or RF based approaches. cc: @glouppe
I have actually never seen them used, but I don't see why they would not work. In general, I believe quite a few things can be explored based on tree-based methods (of any kind). SMAC is the only example I know and it is a very successful proof-of-concept, despite the rather simple usage of RF. I expect that it can be improved in many interesting ways :)
No idea. To be honest, I picked gradient boosting because in scikit-learn it supports quantile regression out of the box... and I like gradient boosting. No deeper reason. To me it seems like it shouldn't matter what you use (trees, GPs, ...): all you need is a regression model to act as the surrogate function, plus a measure of how (un)certain you are about its value. If your method can do this, you are cooking with gas 🔥. With non-parametric models like trees you probably lose the "Bayesian" part, at least it becomes harder/impossible to specify a prior. The advantage is that you don't care so much about high dimensions, categorical variables, weird ranges, etc.
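To make the "regression model plus uncertainty" point concrete, here is a minimal sketch of a quantile-based GBRT surrogate built from scikit-learn's `GradientBoostingRegressor` with `loss="quantile"`. The quantile values, data, and variable names are illustrative assumptions, not the PR's actual defaults.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# One GradientBoostingRegressor per quantile, so the ensemble gives both a
# central prediction and an uncertainty band.  Quantiles are illustrative.
quantiles = [0.16, 0.50, 0.84]
regressors = [
    GradientBoostingRegressor(loss="quantile", alpha=q) for q in quantiles
]

rng = np.random.RandomState(0)
X = rng.rand(50, 1)
y = np.sin(5 * X).ravel() + 0.1 * rng.randn(50)

for rgr in regressors:
    rgr.fit(X, y)

# Stack predictions column-wise: shape (n_samples, n_quantiles).
pred = np.asarray([rgr.predict(X) for rgr in regressors]).T
low, mu, high = pred.T
```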
I can comment in a less-blunt manner after the weekend, but it would be great to have a set of benchmarks and see what improvements can be done to the existing GP methods, and whether the new GBRT methods are better in different scenarios. I hope to contribute to that in the coming week.
2e9512c to 757b0b9
Would like to get this to the point where it passes all of the tests that the GP passes.
Related to #41: I kept finding myself having to do …
Could you take care of #40 as well? :)
Added to the todo.
@@ -0,0 +1,139 @@
"""Gradient boosted trees based minimization algorithms."""
For consistency, can we rename that file `gbrt_opt.py`?
We should also discuss which parts are worth factoring out into a shared place for GP, trees, and future methods. It seems silly to have EI and friends duplicated as long as they don't do anything clever specialised to the method. Yet right now adapting the minor differences would probably take about as much code as each function itself. So maybe a starting point would be a common set of tests that ensures the various acquisition functions compute the same thing.
return lower + rng.rand(num_params) * delta

def gbt_minimize(func, bounds, base_estimator=None, maxiter=100,
gbt -> gbrt (this is a more common acronym in ML)
Also, as part of #38, I added an `n_start` argument. Surely, this should be helpful here as well.
Notes:
# do not want EI below zero, happens for points that are (much)
# worse than ``best_y``
ei = np.clip(ei, 0., np.max(ei))
Adding this doesn't actually help/do anything, which confuses me. All benchmarks pass (with the same settings as for GP), except for …
Switched to using random sampling to find the minimum of the acquisition function for the moment; can't think of a smarter way to do it right now. Updated the todo list.
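For context, a hedged sketch of what the EI computation with the clip could look like; the names `mu`, `std`, and `best_y` are assumptions, not the PR's exact code. It also shows why the clip is effectively a no-op: EI written this way is already non-negative.

```python
import numpy as np
from scipy.stats import norm

def _expected_improvement_sketch(mu, std, best_y):
    """Sketch of EI for a minimization problem.

    ``mu`` and ``std`` are the surrogate's mean and standard deviation
    at the candidate points; ``best_y`` is the best observed value.
    """
    std = np.maximum(std, 1e-12)        # avoid division by zero
    gamma = (best_y - mu) / std
    ei = std * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
    # EI is non-negative by construction, so the clip discussed above
    # should only matter if numerical noise pushes a value below zero.
    return np.clip(ei, 0.0, None)
```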
# Default estimator
if base_estimator is None:
    base_estimator = GradientBoostingQuantileRegressor()
Should we keep a different default? The default `max_depth` should be smaller, right, since we need weak learners?
Ooh, I just realized that the default `max_depth` is 3. Scratch this comment.
OK, so I ran a simple example to see if we are able to approximate the original function. This is what I get for the default values.
The graph does not change much by varying …
I'll be able to have a closer look during the weekend hopefully...
Reading your email just now, I realise that you were worried about this plot. I misunderstood you because I thought "oh ja, nice plot" and moved on. Which part of it makes you think there is something up?
I was just afraid that the …
Something like this scikit-learn/scikit-learn#6166 (comment) would be better.
The piecewise constant scikit-learn/scikit-learn#6166 (comment) still has small areas of constant prediction. I was wondering how "big" a constant prediction is in terms of … Do you know what arXiv:1211.0906 does about this problem?
@@ -50,4 +50,4 @@ def fit(self, X, y):

 def predict(self, X):
     """Predictions for each quantile."""
-    return np.vstack([rgr.predict(X) for rgr in self.regressors_])
+    return np.asarray([rgr.predict(X) for rgr in self.regressors_]).T
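To illustrate the shape change in this diff (with stand-in arrays, not the real regressors): `np.vstack` gives one row per quantile, while the transposed `np.asarray` gives one row per sample, which is what the `low, mu, high = surrogate.predict(X).T` unpacking elsewhere in the PR expects.

```python
import numpy as np

# Stand-ins for rgr.predict(X) from three quantile regressors, 5 samples each.
preds = [np.zeros(5), np.ones(5), 2 * np.ones(5)]

vstacked = np.vstack(preds)        # shape (3, 5): one row per quantile
transposed = np.asarray(preds).T   # shape (5, 3): one row per sample
low, mu, high = transposed.T       # unpacks back into the three quantiles
```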
Can you rename this file to `gbrt.py`?
✅
return ei

def _random_point(lower, upper, n_points=1, random_state=None):
Is this function really necessary? We inlined it in `gp_minimize` by doing `X = lb + (ub - lb) * rng.rand(n_points, n_params)` instead.
Needed it at several points, and was planning ahead for when we have categorical variables as well, so I made a function out of it.
Should be `_random_points` in any case.
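For reference, a hedged sketch of what such a `_random_points` helper might look like; the exact signature and the use of `check_random_state` are my assumptions, not the PR's code. It is essentially the inlined expression from `gp_minimize` wrapped in a function.

```python
import numpy as np
from sklearn.utils import check_random_state

def _random_points_sketch(lower, upper, n_points=1, random_state=None):
    """Uniform sampling within box bounds (illustrative sketch).

    ``lower`` and ``upper`` are arrays of per-dimension bounds.
    """
    rng = check_random_state(random_state)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return lower + (upper - lower) * rng.rand(n_points, lower.size)
```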
Renamed.
Create …
Travis is happy, Tim is happy, @glouppe happy too?
X = np.expand_dims(X, axis=0)

# low and high are assumed to be the 16% and 84% quantiles
low, mu, high = surrogate.predict(X).T
What do you think about adding a `return_std` option to `Quantile.predict` that returns just the `mean` and `std`? That way we can easily refactor this into the existing implementation and add "LCB" support.
Yes, because reimplementing the acquisition functions is silly. No, because we could do something smarter when you can estimate the quantile directly, instead of having to make an assumption about things being Gaussian.
Oh, we can keep the existing implementation by setting `return_std` to False by default, if that worries you.
Just to clarify, I'm suggesting to keep both implementations: one that returns the quantiles by default, and the other that returns the `mean` and `std` when `return_std` is set to True. If you want, I can do that refactoring separately after this has been merged.
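A minimal sketch of what such a `return_std`-style helper could compute; the function name is hypothetical, and the columns are assumed to be the 16%, 50% and 84% quantiles. Under a Gaussian assumption the 16/84 quantiles sit roughly one standard deviation either side of the mean, so `std ≈ (high - low) / 2`.

```python
import numpy as np

def predict_mean_and_std_sketch(quantile_preds):
    """Illustrative helper: map stacked quantile predictions to mean and std.

    ``quantile_preds`` has shape (n_samples, 3), columns assumed to be the
    16%, 50% and 84% quantiles.
    """
    low, mu, high = quantile_preds.T
    std = (high - low) / 2.0
    return mu, std
```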
Aha! I hadn't even thought of such a behaviour. It is a good idea. We'd have to check that the user asked for the right quantiles to be estimated when they use `return_std=True`.
I was preoccupied with the thought that for LCB you'd only need to estimate one quantile instead of all three.
New PR?
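To illustrate the "one quantile is enough for LCB" point, a hedged sketch with stand-in data and names: fit a single low-quantile regressor and minimize its prediction directly, rather than deriving `mu - kappa * std` from three quantiles under a Gaussian assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
Xi = rng.rand(30, 2)                                   # observed points
yi = np.sin(5 * Xi[:, 0]) + 0.1 * rng.randn(30)        # observed values

# A single 16%-quantile regressor acts as the LCB acquisition directly.
lcb = GradientBoostingRegressor(loss="quantile", alpha=0.16).fit(Xi, yi)

candidates = rng.rand(1000, 2)
x_next = candidates[np.argmin(lcb.predict(candidates))]
```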
lower_bounds, upper_bounds, n_points=n_start, random_state=rng)
best_x = Xi[:n_start].ravel()
yi[:n_start] = [func(xi) for xi in Xi[:n_start]]
best_y = np.min(yi[:n_start])
`best_x` should be `Xi[np.argmin(yi[:n_start])]`, right?
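A small self-contained illustration of the suggested fix, with `Xi` and `yi` as stand-ins for the arrays in the diff above: the best starting point is the one whose objective value is smallest, not necessarily the first one.

```python
import numpy as np

rng = np.random.RandomState(0)
n_start = 5
Xi = rng.rand(n_start, 2)                   # random starting points
yi = np.array([x.sum() for x in Xi])        # stand-in objective values

best_idx = np.argmin(yi[:n_start])
best_y = yi[best_idx]
best_x = Xi[best_idx]                       # location of the best value
```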
We can merge after Travis passes. But here are some thoughts. For things like …
Here is what they do in the SMAC paper, and they claim it gives smooth decision surfaces, but I don't grasp the intuition quickly. Let us say a split point is …
This is described on page 17 of the paper (http://arxiv.org/pdf/1211.0906v2.pdf), just in case my interpretation is wrong.
Merging for now! Thanks :-)
Thanks for all the comments and help!
🍻
Fixes #24.

This is a minimal implementation of a tree based minimiser. Right now it only supports EI, and can't deal with categorical variables.

The `# XXX` comments are where I am not sure if it is worth doing something now (for this PR) or leaving it for later. It looks like `_expected_improvement` should be refactored so it can be shared with the GP model. After a brief discussion with @glouppe I think it is worth keeping it like this for the moment. In the longer term we might be able to do something smarter/more appropriate that is specialised to the tree based minimizer. `_random_point` could also be shared, and in the future will need to get smarter for tree based models to deal with categorical variables.

Todo:

- ~~`bfgs` to optimise tree ensembles~~ can we be smarter than random sampling to minimise EI? punting this to later
- `GBT.predict()` API (GradientBoostingQuantileRegressor.predict API #41)
- `learning` submodule (#40): move GBT to `skopt.learning` (~~maybe~~ definitely)
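For readers landing here, a very rough, hedged sketch of the overall loop this PR describes (random starting design, quantile GBRT surrogate, EI maximized over random candidates). The function name, defaults, and structure are assumptions for illustration, not the merged implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import GradientBoostingRegressor

def gbrt_minimize_sketch(func, bounds, maxiter=30, n_start=5,
                         n_candidates=1000, random_state=None):
    """Illustrative sketch of GBRT-based minimization (not the PR's code)."""
    rng = np.random.RandomState(random_state)
    bounds = np.asarray(bounds, dtype=float)
    lower, upper = bounds[:, 0], bounds[:, 1]
    n_params = len(bounds)

    # Random starting design, as discussed for ``n_start``.
    Xi = lower + (upper - lower) * rng.rand(n_start, n_params)
    yi = np.array([func(x) for x in Xi])

    for _ in range(maxiter - n_start):
        # Fit one quantile regressor each for the 16/50/84 quantiles.
        candidates = lower + (upper - lower) * rng.rand(n_candidates, n_params)
        preds = []
        for q in (0.16, 0.50, 0.84):
            rgr = GradientBoostingRegressor(loss="quantile", alpha=q)
            rgr.fit(Xi, yi)
            preds.append(rgr.predict(candidates))
        low, mu, high = preds
        std = np.maximum((high - low) / 2.0, 1e-12)

        # Expected improvement for minimization; evaluate the best candidate.
        best_y = np.min(yi)
        gamma = (best_y - mu) / std
        ei = std * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
        x_next = candidates[np.argmax(ei)]

        Xi = np.vstack([Xi, x_next])
        yi = np.append(yi, func(x_next))

    best = np.argmin(yi)
    return Xi[best], yi[best]

# Usage on a toy 1-D function:
# x_best, y_best = gbrt_minimize_sketch(lambda x: (x[0] - 0.3) ** 2,
#                                       bounds=[(0.0, 1.0)], random_state=0)
```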