Description: Feature request for the RandomForest classes.
On very large datasets, the training data must be subsampled to keep training times reasonable, since training time grows faster than O(n_samples). However, a single global subsample throws away a lot of the data. A smarter approach is to draw a different subsample for each tree, controlled by an extra parameter giving the fraction of observations used to train each tree. This way more of the data is used while training times stay low.
I did a quick implementation that does just that and found that the same prediction performance can be achieved in about 1/3 of the training time.
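The idea can already be approximated with the tools available in this scikit-learn version: a minimal sketch, assuming you bag plain decision trees with `BaggingClassifier` and set `max_samples` below 1.0 so each tree sees its own small random draw of the data (`max_features='sqrt'` on the tree recovers RandomForest-style feature subsampling). The dataset here is synthetic and purely illustrative.

```python
# Sketch: per-tree subsampling via BaggingClassifier.
# Each of the 50 trees fits on its own random 10% of the rows, so the
# ensemble collectively covers far more data than one global subsample
# would, while each individual fit stays cheap.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

forest = BaggingClassifier(
    DecisionTreeClassifier(max_features='sqrt'),  # RF-style feature subsampling
    n_estimators=50,
    max_samples=0.1,   # fraction of observations drawn per tree
    bootstrap=True,
    random_state=0,
    n_jobs=-1,
)
forest.fit(X, y)
print(forest.score(X, y))
```

For reference, later scikit-learn releases (0.22+) added a `max_samples` parameter directly on the forest classes, which implements exactly this kind of per-tree subsampling of the bootstrap draw.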
Versions
Linux-4.13.0-32-generic-x86_64-with-debian-stretch-sid
Python 3.6.3 |Anaconda, Inc.| (default, Oct 27 2017, 19:41:01)
[GCC 7.2.0]
NumPy 1.13.3
SciPy 0.19.1
Scikit-Learn 0.19.1