RandomForest: Different subsampling per tree for faster training #10668

Closed
labodyn opened this issue Feb 21, 2018 · 1 comment

Comments

labodyn commented Feb 21, 2018

Description: Feature request for RandomForest classes.

On very large datasets, the training data must be subsampled to keep training times reasonable (training time grows faster than O(n_samples)). However, a single global subsample throws away a lot of the data. A smarter approach would be to draw a different subsample for each tree, controlled by an extra parameter that sets the fraction of observations used per tree. This way more of the data is used overall while training times stay low.

I wrote a quick implementation that does just this and found that the same prediction performance can be reached in about a third of the training time.
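That implementation isn't attached here, but a minimal sketch of the idea with the scikit-learn version listed below could use `BaggingClassifier`, whose existing `max_samples` parameter already draws a different random subsample for each estimator. Wrapping a random-forest-style decision tree reproduces the requested behaviour; the parameter values are illustrative only:

```python
# Sketch only: not the reporter's patch. BaggingClassifier draws a
# different random subsample per estimator via max_samples, so bagging
# decision trees with per-split feature sampling approximates a
# RandomForest where each tree sees its own fraction of the data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),  # forest-style splits
    n_estimators=100,
    max_samples=0.1,   # each tree trains on a different 10% subsample
    bootstrap=True,
    n_jobs=-1,
    random_state=0,
)
clf.fit(X, y)
```

The feature request is essentially to expose an equivalent fraction-of-samples parameter directly on the RandomForest classes, so this behaviour is available without switching to the bagging meta-estimator.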

Versions

Linux-4.13.0-32-generic-x86_64-with-debian-stretch-sid
Python 3.6.3 |Anaconda, Inc.| (default, Oct 27 2017, 19:41:01)
[GCC 7.2.0]
NumPy 1.13.3
SciPy 0.19.1
Scikit-Learn 0.19.1


jnothman (Member) commented Feb 21, 2018 via email
