
[RFC] Gradient Boosting improvement #8231

Closed
2 tasks
glemaitre opened this issue Jan 24, 2017 · 10 comments

Comments

@glemaitre
Member

glemaitre commented Jan 24, 2017

Current situation

The speed performance of the scikit-learn gradient-boosting is relatively slow compared to the other implementations: xgboost and lightgbm.

Which bottlenecks have been identified

During some benchmarking, we identified a major difference between scikit-learn and xgboost.

For xgboost, all samples for a given feature are scanned. Each sample is used to update the impurity statistic of the node to which it belongs (cf. here).
On the contrary, in scikit-learn only the samples belonging to the node being split are selected (cf. here or here). This involves repeated scans of the data which are not necessary (see the sketch below).
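
To make the difference concrete, here is a minimal schematic sketch in plain Python. It is not the actual Cython code of either library; `nodes_to_split`, `node_of_sample` and the gradient-sum statistic are illustrative assumptions that only show the scanning pattern, not the threshold search itself.

```python
def per_node_scans(X_col, grad, nodes_to_split, node_of_sample):
    """scikit-learn-style: one pass over the feature column per node to
    split, selecting only the samples that belong to that node."""
    stats = {}
    for node in nodes_to_split:                 # repeated scans of the column
        total = 0.0
        for i in range(len(X_col)):
            if node_of_sample[i] == node:       # keep only this node's samples
                total += grad[i]
        stats[node] = total
    return stats


def single_scan_all_nodes(X_col, grad, nodes_to_split, node_of_sample):
    """xgboost-style: a single pass over the feature column, routing each
    sample to the statistics of the node it currently belongs to."""
    stats = {node: 0.0 for node in nodes_to_split}
    for i in range(len(X_col)):                 # one scan serves all nodes
        node = node_of_sample[i]
        if node in stats:
            stats[node] += grad[i]
    return stats
```

Both functions compute the same per-node statistics; the second touches each sample only once per feature, regardless of how many nodes are waiting to be split.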

What has been tested

In an effort to speed up the gradient boosting, @ogrisel suggested subsampling at each split, lowering the cost of sorting and computing the statistics. The results obtained did not yield a satisfactory time/accuracy trade-off able to compete with the other implementations.

Proposal

I would like to contribute with two changes:

  • Propose a prototype of gradient boosting that modifies the sample scanning and splitting strategies. It will require a change of the decision tree algorithm (that's why we want a prototype first). This change should allow us to match the performance of xgboost while using the exact method. [WIP] Alternative architecture for regression tree #8458
  • Implement an approximate splitter by binning the data, similarly to lightgbm and xgboost. This change recently allowed xgboost to match lightgbm's performance (cf. here and here). A rough sketch of the binning idea follows this list.
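
For the second item, here is a rough sketch of what histogram-based ("binned") split finding looks like, assuming an xgboost-style gain based on gradient/hessian sums. The 256-bin default, the regularisation term and all names here are assumptions for illustration, not an existing scikit-learn API.

```python
import numpy as np

def bin_feature(x, n_bins=256):
    """Map raw feature values to small integer bin indices via quantiles."""
    cut_points = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(cut_points, x).astype(np.uint8), cut_points


def best_split_on_binned(binned, grad, hess, n_bins=256, reg_lambda=1.0):
    """Accumulate gradient/hessian histograms, then scan the bins
    (instead of all sorted samples) to pick the best threshold."""
    g_hist = np.zeros(n_bins)
    h_hist = np.zeros(n_bins)
    np.add.at(g_hist, binned, grad)
    np.add.at(h_hist, binned, hess)

    g_total, h_total = g_hist.sum(), h_hist.sum()
    best_gain, best_bin = -np.inf, None
    g_left = h_left = 0.0
    for b in range(n_bins - 1):                  # O(n_bins), not O(n_samples)
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = (g_left ** 2 / (h_left + reg_lambda)
                + g_right ** 2 / (h_right + reg_lambda)
                - g_total ** 2 / (h_total + reg_lambda))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

Scanning a few hundred bins instead of all sorted samples is what makes the "hist" approach fast, at the price of slightly approximate thresholds.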

Related issue

#5212

@GaelVaroquaux @glouppe @ogrisel @jmschrei @raghavrv If you have any comments, I will be happy to hear them.

@raghavrv
Member

@glouppe was very interested in the approximate splitter. I also think it would be a nice addition that would help speed things up a lot... I'm also pinging @jnothman and @nelson-liu who might be interested in reviewing this...

@glouppe
Contributor

glouppe commented Jan 24, 2017

For xgboost, all samples for a given feature are scanned. Each sample is used to update the impurity statistic of the node to which it belongs (cf. here).
On the contrary, in scikit-learn only the samples belonging to the node being split are selected (cf. here or here). This involves repeated scans of the data which are not necessary.

Propose a prototype of gradient boosting that modifies the sample scanning and splitting strategies. It will require a change of the decision tree algorithm (that's why we want a prototype first). This change should allow us to match the performance of xgboost while using the exact method.

Can you elaborate on this?

In our case, sample_mask is only used with presorting, which we can get rid of if benchmarks show it is indeed slower than the standard implementation.

How does our standard implementation (without presorting) compare with xgboost, both in terms of algorithm and performance?

@glouppe
Contributor

glouppe commented Jan 24, 2017

Implement an approximate splitter by binning the data

+1 for binning

@jmschrei
Member

I'm also a bit confused about what you mean by repetitive scans. We have two different sorting mechanisms: presorting and not. Presorting works better on smaller datasets with a smaller number of features, whereas not presorting works better on larger amounts of data. XGBoost uses presorting (when I last looked at the code) and we do not, but we provide the option allowing the user to do so.

@glouppe
Contributor

glouppe commented Jan 24, 2017

In our case, sample_mask is only used with presorting, which we can get rid of if benchmarks show it is indeed slower than the standard implementation.

Ok, I think I understand. We could get rid of the sample_mask logic by looking for splits across all nodes in parallel -- this is what I think they do in xgboost by maintaining several Criterion-like objects.
Is that more or less what you have in mind?

@glemaitre
Member Author

@glouppe In xgboost, sample_mask stores the ID of the node each sample currently belongs to and, as you said, allows maintaining several criteria.
This makes a single scan of sample_mask possible, whereas in scikit-learn a scan is performed for each node to be split. (See the rough sketch below.)
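
A rough illustration of that bookkeeping (hypothetical names, not the actual sample_mask code): a single integer array maps each sample to the node it currently sits in, so applying a round of splits is again one pass over the samples.

```python
import numpy as np

def apply_splits(X, node_of_sample, splits):
    """`splits` maps a node id to (feature, threshold, left_id, right_id).
    A single scan reassigns every sample of a split node to the matching
    child; samples sitting in leaves are left untouched."""
    for i in range(X.shape[0]):
        node = node_of_sample[i]
        if node in splits:
            feature, threshold, left_id, right_id = splits[node]
            node_of_sample[i] = (left_id if X[i, feature] <= threshold
                                 else right_id)
    return node_of_sample

# e.g. node 0 splits on feature 1 at threshold 0.5 into children 1 and 2
X = np.array([[0.2, 0.1], [0.9, 0.8], [0.4, 0.3]])
node_of_sample = np.zeros(3, dtype=np.intp)
apply_splits(X, node_of_sample, {0: (1, 0.5, 1, 2)})   # -> [1, 2, 1]
```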

@glouppe
Contributor

glouppe commented Jan 25, 2017

@glouppe In xgboost, sample_mask stores the ID of the node each sample currently belongs to and, as you said, allows maintaining several criteria.
This makes a single scan of sample_mask possible, whereas in scikit-learn a scan is performed for each node to be split.

Yup, looks like something promising.

+1 for both changes.

(However, to avoid never-ending PRs, please propose the changes separately, ideally with the fewest possible changes to other parts of the codebase.)

@glemaitre
Member Author

@glouppe Yep, I am gonna make separate PRs.

@lorentzenchr
Member

Can we close this now that #12807 (HistGradientBoostingClassifier and HistGradientBoostingRegressor) is merged?

@glemaitre
Member Author

Yes this is done.
