[MRG] Faster Gradient Boosting Decision Trees with binned features #12807
This PR proposes a new implementation for Gradient Boosting Decision Trees. This isn't meant to be a replacement of the current sklearn implementation but rather an addition.
This addresses the second bullet point from #8231.
Algorithm details and refs
The main differences with the current sklearn implementation are:
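One of the key differences is the use of binned features: each continuous feature is mapped once, up front, to a small integer bin index (at most 256 bins), so split finding scans bin boundaries instead of sorting raw values at every node. A minimal sketch of quantile-based binning (the function name and bin count here are illustrative, not the PR's actual code):

```python
import numpy as np

def bin_feature(values, n_bins=256):
    """Map continuous values to uint8 bin indices using quantile thresholds.

    Illustrative sketch only; the PR's binner handles edge cases
    (missing values, ties, small n_samples) that are omitted here.
    """
    # Interior quantiles serve as bin thresholds.
    quantiles = np.percentile(values, np.linspace(0, 100, n_bins + 1)[1:-1])
    thresholds = np.unique(quantiles)
    return np.searchsorted(thresholds, values).astype(np.uint8)

rng = np.random.RandomState(0)
x = rng.normal(size=10_000)
binned = bin_feature(x)
# Split finding now evaluates at most 255 candidate thresholds per
# feature, instead of sorting all 10 000 values at every tree node.
```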
Notes to reviewers
This is going to be a lot of work to review, so please feel free to tell me if there's anything I can do / add that could ease reviewing.
Here's a list of things that probably need to be discussed at some point or that are worth pointing out.
API differences with current implementation:
Happy to discuss these points of course. In general I tried to match the parameter names with those of the current GBDTs.
Changed parameters and attributes:
Unsupported parameters and attributes:
Future improvement, for later PRs (no specific order):
Benchmarks were run on my laptop: Intel i5 7th gen, 4 cores, 8 GB RAM.
Comparison between proposed PR and current estimators:
on binary classification only; I don't think more is needed since the performance difference is striking. Note that for larger sample sizes the current estimators simply cannot run, because the sorting step never terminates. I don't provide the benchmark code; it's exactly the same as that of
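The speedup comes from accumulating gradients into per-bin histograms over the binned features, rather than sorting samples at each node. A minimal sketch of the histogram step (names are illustrative, not the PR's actual internals):

```python
import numpy as np

def gradient_histogram(binned_feature, gradients, n_bins=256):
    """Sum gradients per bin for one feature.

    Split gain is then evaluated at each of the (at most) n_bins - 1
    bin boundaries, independently of n_samples.
    """
    hist = np.zeros(n_bins)
    # Unbuffered scatter-add: repeated bin indices accumulate correctly.
    np.add.at(hist, binned_feature, gradients)
    return hist

rng = np.random.RandomState(0)
binned = rng.randint(0, 256, size=100_000).astype(np.uint8)
grads = rng.normal(size=100_000)
hist = gradient_histogram(binned, grads)
# Histogram subtraction trick: a child's histogram equals the parent's
# minus the sibling's, so only the smaller child needs a fresh pass.
```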
Comparison between proposed PR and LightGBM / XGBoost:
On the empty
I thought we don't need an empty
Maybe it'd be cleaner to move the
Also, I think at least one
Regarding examples, I'm not sure how useful one would really be, for now. Looking at the existing examples for the current GBDTs, they all rely on some non-implemented feature like plotting the validation loss at each iteration (requires
If that were true for all possible examples showing the benefits of this method, then we wouldn't be trying to merge this PR, would we? The mere fact that it's much faster on larger datasets while predictive performance doesn't degrade deserves a simple example by itself. That said, we do need to keep the examples fast. Is it feasible to have an example which runs fast, compares this implementation with the old one and/or other ensembles, and yet shows the speedup?
I could make an example reproducing the first benchmark? The thing is, it will be either slow or not super interesting, since the comparison is interesting precisely when the current implementation starts to be slow.
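One way to illustrate the point without a slow end-to-end fit is a micro-benchmark of the two split-search primitives themselves: per node and per feature, the sort-based search is O(n log n) while the histogram-based search is O(n) to build plus O(n_bins) to scan. This sketch is purely illustrative (it is not the PR's example code):

```python
from time import perf_counter

import numpy as np

rng = np.random.RandomState(0)
n = 1_000_000
x = rng.normal(size=n)                                  # raw feature values
binned = rng.randint(0, 256, size=n).astype(np.uint8)   # pre-binned indices
g = rng.normal(size=n)                                  # per-sample gradients

# Sort-based approach: sort the feature at every node.
tic = perf_counter()
np.sort(x)
sort_time = perf_counter() - tic

# Histogram-based approach: one pass accumulating gradients into 256 bins.
tic = perf_counter()
hist = np.bincount(binned, weights=g, minlength=256)
hist_time = perf_counter() - tic

print(f"sort: {sort_time:.4f}s  histogram: {hist_time:.4f}s")
```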
I tried to come up with other examples that would be interesting, but I haven't found anything convincing so far.
For example, I thought it'd be nice to illustrate the impact of the
Really the only reason one would want to use this new implementation is that it's (much) faster.