MRG: Gradient boosting #448
Conversation
benchmarks/bench_tree.py
I've included bench_tree.py by accident - I'll kick it out in the next commit
Peter, amazing work! Thank you very much for the contribution. Paolo
I'd appreciate your feedback on the monitor argument. The main intention is to have a way to introspect the model while it's being trained. The monitor is basically something that can be called with the current state of the model (i.e. the current model at iteration i). Such a feature can be useful for: a) custom termination criteria (e.g. error/deviance on a held-out set increases/stalls) and b) introspecting the model for model selection.
An alternative to the monitor would be to store the progress of the model in dedicated attributes (such as self.train_deviance) and do introspection / early stopping after the model has been trained. AFAIK that's the way R's gbm package does it.
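For illustration, a rough sketch of how such a monitor callable might look. The `monitor(i, model)` signature, the `fit(..., monitor=...)` keyword, and the squared-error check are assumptions of this sketch, not the PR's actual API:

```python
import numpy as np

def make_monitor(X_val, y_val, patience=5):
    """Hypothetical early-stopping monitor based on a held-out set."""
    state = {"best": np.inf, "stalled": 0}

    def monitor(i, model):
        # assumed to be called after each boosting iteration with the
        # iteration index and the partially fitted model
        val_error = np.mean((y_val - model.predict(X_val)) ** 2)
        if val_error < state["best"]:
            state["best"], state["stalled"] = val_error, 0
        else:
            state["stalled"] += 1
        return state["stalled"] >= patience  # True -> request early termination

    return monitor

# hypothetical usage:
# est.fit(X_train, y_train, monitor=make_monitor(X_val, y_val))
```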
We indeed need such a pattern in scikit-learn (I needed something similar to debug convergence issues in minibatch k-means and power iteration clustering too, for instance). I will try to give it a deeper look this weekend.
This is great work @pprett! I look forward to reviewing it in greater detail.
I don't have much time now but I'll review your work as best as I can starting from next week.
This is really useful, and we might consider moving it to the tree.py module. One of my colleagues has been working on autonomous driving, and the trouble he found with using the standard mean predictor (bagging) is that it tends to smooth the output and underestimate extremal values. Using a median predictor went a long way towards resolving this.
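As a side note, the mean-vs-median combination referred to here can be illustrated with a few lines of NumPy. This is just a sketch over a list of already-fitted trees, not code from this PR:

```python
import numpy as np

def aggregate_predictions(trees, X, use_median=False):
    """Combine per-tree predictions by mean (the bagging default) or median."""
    all_pred = np.asarray([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
    # a median combiner is less prone to smoothing away extremal values
    return np.median(all_pred, axis=0) if use_median else all_pred.mean(axis=0)
```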
@pprett can you please either rebase or merge master into this branch so that GitHub marks it as green? Better to review something that does not conflict with the current master.
I did some more benchmarks against R's gbm package. I measured error rate, training time and testing time on four classification datasets, and MSE, training time and testing time on four regression datasets.

Classification - decision stumps - 200 iterations: when using decision stumps, the implementation in this PR is competitive with R's gbm for binary classification. gbm has a hard time on arcene - maybe some numerical issues (it spills out warnings too).

Classification - depth 3 trees - 200 iterations: if we use more complex weak learners, the training time of this PR grows compared to gbm.

Regression - stumps - 100 iterations (normalized such that GBM's score is 1.0): for regression gbm is much faster than this PR. The internal data structure used by gbm's tree implementation might be more efficient than our sample mask approach, but this shouldn't affect decision stumps too much.

Regression - depth 4 trees - 100 iterations (normalized such that GBM's score is 1.0): interestingly, there are differences in effectiveness; for stumps the results were pretty much equal, but not for deeper trees - they might use some heuristics to control tree growing...

Here's the album link:
Thanks for the awesome work Peter! I was not familiar with Gradient boosting so I had a quick look at the Wikipedia page. It reminds me a lot of the matching pursuit family of algorithms (but replacing the residuals by the pseudo residuals so as to support arbitrary loss functions) and of forward stepwise regression algorithms like LARS. Quick questions:
@mblondel: thanks!

@step-length: gradient tree boosting (as introduced by Friedman [1]) involves two "learning rates". The first belongs to the "functional gradient descent" (i.e. fitting the pseudo-residuals) - this is where the line search comes in; the second is a regularization heuristic proposed by Friedman to tackle overfitting. In some cases the line search can be done in closed form. With regression trees as base learners you actually add J separate basis functions in each iteration instead of a single one (where J is the number of leaves), so you have to do J line searches in each iteration. This is implemented in the loss functions.

[1] Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting Machine." (February 1999)

@learner-agnostic: This is correct; on the other hand, trees are AFAIK the most popular base learners, and gradient tree boosting has some distinct characteristics (e.g. it fits J basis functions instead of just one). Furthermore, by assuming trees as base learners we can gain some efficiency because they can share certain data structures (e.g. the sorted features array). R's gbm package is also hard-wired to trees as base learners and is quite efficient because of that. R's mboost package [2] supports both trees and generalized linear models, but I have to admit I don't know how mboost compares to gbm in terms of efficiency.

[2] Bühlmann, P. and Hothorn, T. "Boosting Algorithms: Regularization, Prediction and Model Fitting"
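To make the "J line searches per iteration" point concrete, here is a schematic sketch of one boosting iteration. The `loss` object with `negative_gradient`/`line_search` methods is a stand-in for illustration, not the classes in this PR:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_step(X, y, current_pred, loss, learn_rate=0.1, max_depth=3):
    """One illustrative iteration of gradient tree boosting."""
    # 1. functional gradient descent: fit a tree to the pseudo-residuals
    residuals = loss.negative_gradient(y, current_pred)
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, residuals)

    # 2. one line search per terminal region (J leaves -> J line searches)
    leaves = tree.apply(X)
    update = np.empty_like(current_pred)
    for leaf in np.unique(leaves):
        mask = leaves == leaf
        update[mask] = loss.line_search(y[mask], current_pred[mask])

    # 3. shrinkage: Friedman's regularization heuristic (the second "learning rate")
    return current_pred + learn_rate * update, tree
```

For squared loss the line search has a closed-form solution: the mean of the residuals in the leaf.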
I hadn't realized that you handle the loss-specific line search code in the loss functions; I wanted to suggest using the closed-form solution for the squared loss. For other smooth loss functions, the number of Newton-Raphson steps could be an option (defaulting to 1).
I just realized that sklearn's … I'll push an update in the days to come which is able to build trees with exactly J terminal regions.
Hi Peter, I have started reviewing your code. This looks great! To make things easier, I will directly pull my changes to your branch during the sprint.
Hi Gilles, thanks! BTW: I had a hard time figuring out how to build a tree with exactly J leaves (e.g. depth-first search, breadth-first search, some greedy heuristic based on improvement?) so I simply provide a …
Peter, sorry for replying so late. As you may have seen, I have two pending pull requests concerning the tree and the forest module. I would like to have them merged first before helping you with your branch.

I have had the opportunity to read your request though, and I think we should aim at making a boosting algorithm that is more generic first, in the spirit of @jaberg's recent PR, and then optimize for trees, but not the other way around. Among other things, I have for instance added in my PRs a trick similar to the one you have used to avoid the recomputation of …

Anyway, don't take it wrong, this is not my intention. I really want to help you with boosting. This is one of the most important algorithms in machine learning and I really want to have it in the project. I will address your PR in more detail with more direct suggestions on my return to Belgium (I have to catch my plane in a few hours).
@glouppe no worries - this sounds convincing! I definitely agree that if we have both a generic gradient boosting and a specialized gradient tree boosting, the latter should be a subclass of the former.

@glouppe @jaberg We should definitely combine our efforts - maybe the best starting point is to make a list of requirements: what functionality (e.g. loss functions, parameters) and quality attributes (performance, extensibility, ...) should the component have? I have to admit that I'm a bit biased towards GBM; it's a very well crafted implementation - both flexible (various loss functions for regression and classification) and performant - and having something competitive in the scikit would be great!
I'm interested in seeing this PR merged soon. What remains to be done?
@mblondel I did some performance enhancements yesterday - basically reducing the difference to R's gbm (it is now competitive for classification but still a bit slower for (least-squares) regression). I started working on the narrative documentation. There are some open issues:
Still, I haven't merged it with James' initial draft for generic functional gradient boosting.
OK - now here's the updated version. I've added two benchmark scripts. Here's a summary of the changes I needed to make in other modules:
I'll continue with the narrative documentation...
sklearn/tree/tree.py
Do you have any reference for that method? It seems counter-intuitive to me to not weight by the number of samples in the node. This indeed means that a feature used near the leaves might be as important as the feature used at the root, even if the former is used to separate significantly less samples (recall that best_error and init_error are already normalized by the number of samples in the node).
Bump.
@glouppe I coded up the feature importance procedure from ESLII (2nd Edition). Your argument sounds convincing though - I'll look into the formulas more carefully - maybe I missed a normalization step. Could you pass me the reference for the 'gini' method (is it Breiman's random forest paper?)
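For comparison, a small sketch of the two aggregation variants under discussion, i.e. summing each split's error reduction per feature with and without weighting by the number of samples in the node. The flat per-node arrays are assumptions for illustration, not the actual tree internals:

```python
import numpy as np

def feature_importances(feature, init_error, best_error, n_node_samples,
                        n_features, weight_by_samples=True):
    """Sum the error reduction of every split, accumulated per feature."""
    importances = np.zeros(n_features)
    for node in range(len(feature)):
        if feature[node] < 0:  # assume a negative index marks a leaf
            continue
        reduction = init_error[node] - best_error[node]
        if weight_by_samples:
            # weight splits near the root more heavily, as argued above
            reduction *= n_node_samples[node]
        importances[feature[node]] += reduction
    return importances / importances.sum()
```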
When you get time, could you merge master, @pprett? I wanted to play with this branch but it currently has merge conflicts. (It's low priority, I just wanted to play around :)) Regarding the generic and tree-specialized implementations, instead of using inheritance, could we just enable the tree-optimized routines if …
sklearn/tree/tree.py
node_id seems to be discarded, so maybe just leave it out?
good catch - thx
Could you explain this a bit more? I would have thought the class prior is np.mean(y) (if y is 0 or 1) and if I understand the code correctly,
self.prior is what I would call the log-odds or log probability ratio.
Do I misunderstand something or are we just using different words?
In the multi-class case it seems to do something different, though.
Also in the multi-class case, wouldn't it be easier to use LabelBinarizer and np.mean?
You are correct - I used to use np.mean(y) but I changed to the same init as gbm - I forgot to update the docstring.
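For the record, a tiny NumPy illustration of the two binary-class initializations being discussed (purely illustrative, not code from the branch):

```python
import numpy as np

y = np.array([0, 0, 1, 1, 1])  # toy binary labels

prior = y.mean()                          # class prior: P(y=1) = 0.6
log_odds = np.log(prior / (1.0 - prior))  # gbm-style init, ~0.405

# the initial model either predicts the prior probability itself
# or its log-odds, depending on which initialization is chosen
```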
I'll look into the multi-class init tomorrow and I have to double check with Scott.
btw: thanks for your thorough review - I really appreciate that!
You're very welcome. Well it got more superficial in the end as I got a bit tired ;)
Btw I'm leaving Vienna in ~2 weeks. Wanna go for a drink before?
Oh and if you like I'd appreciate if you looked at the 5 lines of code I deleted here: #707 ;)
Absolutely - I'll send you an email immediately.
regarding the PR - sure thing - I had it on my radar but I was a bit busy lately.
Is there a particular reason to make this a function instead of a static member variable?
nope - should be a member (inherited from MulticlassLossFunction?) - I'll look into that tomorrow.
Thanks for addressing my comments. Pretty impressive work :) I have not fully gone through the algorithm but I'm pretty sure it's correct by now ;) About the label binarization: I think you cannot avoid a … Thanks for all the good work!
Oh, I saw you just edited the docstrings for the ClassPriorPredictors :)
@amueller regarding … BTW: I renamed all *Predictor classes to *Estimator.
Oh I missed the relabeling of classes last night.
Here are the new benchmark results: sklearn vs. GBM using decision stumps and 250 base learners, for both classification and regression. Values are normalized such that GBM is 1.0; lower is better.
- sklearn is competitive with GBM for classification, but for least-squares regression GBM is significantly faster (don't know why, to be honest...)
- AFAIK GBM does not support multi-class classification
- test times for sklearn are usually better (except for the Boston dataset)
Pardon my stupidity, but how do you explain the perf difference? Is …
Don't be too harsh on yourself Alex ;-) |
I am fighting against my ego :) btw @pprett congrats on the amazing job!
@agramfort I don't think it's due to any overhead w.r.t. Python itself - I think it's due to a number of subtle differences that "add up". One interesting aspect though is the effect of the number of features on performance. The datasets differ considerably w.r.t. the number of features: it seems that the more features, the better off we are - e.g. GBM completely fails on …
WOOOT!!! :)
Great work! Just a minor suggestion: shouldn't the classifier and regressor be named …






This is a PR for Gradient Boosted Regression Trees [1] (aka Gradient Boosting, MART, TreeNet).
GBRTs have been advertised as one of the best off-the-shelf data-mining procedures; they share many properties with random forests, including little need for tuning and data preprocessing as well as high predictive accuracy. GBRTs have been used very successfully in areas such as learning of ranking functions and ecology.
This should be an alternative to R's 'gbm' package [2]. Currently, it features three loss functions (binary classification, least-squares regression and robust regression), stochastic gradient boosting, and variable importance.
I've benchmarked the code against R's gbm package (via rpy2) using a variety of datasets (about 4 classification and 3 regression datasets) - the results are remarkably similar; gbm, however, is usually a bit faster for least-squares regression.
Some features are still on my TODO list:
I haven't benchmarked it against OpenCV's implementation.
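For anyone who wants to try the branch, here is a hypothetical usage sketch; the class name, module location, and parameter names (`n_estimators`, `learning_rate`, `subsample`, `max_depth`) are assumptions and may differ from what is actually in this PR:

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier  # name/location assumed

X, y = make_hastie_10_2(n_samples=4000, random_state=0)
X_train, y_train = X[:2000], y[:2000]
X_test, y_test = X[2000:], y[2000:]

# 250 decision stumps with shrinkage and subsampling (stochastic gradient boosting)
clf = GradientBoostingClassifier(n_estimators=250, max_depth=1,
                                 learning_rate=0.1, subsample=0.5)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```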
[1] http://en.wikipedia.org/wiki/Gradient_boosting
[2] http://cran.r-project.org/web/packages/gbm/index.html