Allow RandomForest* and ExtraTrees* to have a max_samples higher than 1.0 when bootstrap=True #28507
Comments
I don't know if this is YAGNI or not. By definition, having
Could you provide a bit more background? It might be interesting context to consider for a decision.
I assume you meant Bootstrap is defined as sampling with replacement. Let
Sure! My collaborators and I commonly do research on random forests because of their desirable theoretical guarantees. One of the interesting recent advances has to do with what we can do with out-of-bag samples. The number of out-of-bag samples is inversely related to the number of in-bag samples, and right now sklearn constrains the trade-off you're allowed to make here.

More concretely, we are interested in hypothesis testing with random forests, using a test statistic estimated on out-of-bag samples. It is possible to leverage out-of-bag samples to estimate a test statistic per tree and average across trees. However, there is a trade-off between how good your estimate of the test statistic is (# of out-of-bag samples) and how well your tree is fit (# of in-bag samples). By default, sklearn upper-bounds how many out-of-bag samples you are allowed (i.e. 1.0 - 0.63 = 0.37). However, we want to estimate out-of-bag statistics on 20% of the data, not 37% of the data. This requires one to bootstrap sample

Note we're setting bootstrap to True, so there is no way to hack this by using say

Let me know if I can elaborate further!
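To make the trade-off above concrete, here is a minimal hand-rolled sketch of the per-tree OOB pattern described in the comment. It does not use the forest classes (since sklearn's forests cap `max_samples` at 1.0); instead it draws bootstrap indices manually with `max_samples=1.6` so each tree sees roughly 80% of the unique samples in-bag and leaves roughly 20% out-of-bag. The data, tree settings, and the choice of MSE as the per-tree statistic are all illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] + 0.1 * rng.normal(size=200)

n = len(X)
max_samples = 1.6                    # > 1.0 is what this issue asks to allow
n_draws = int(max_samples * n)       # bootstrap size per tree

oob_stats, oob_fracs = [], []
for _ in range(25):                  # a small hand-rolled "forest"
    inbag = rng.integers(0, n, size=n_draws)   # sample with replacement
    oob = np.setdiff1d(np.arange(n), inbag)    # indices never drawn in-bag
    tree = DecisionTreeRegressor(random_state=0).fit(X[inbag], y[inbag])
    # per-tree test statistic estimated on the OOB samples (here: MSE)
    oob_stats.append(np.mean((tree.predict(X[oob]) - y[oob]) ** 2))
    oob_fracs.append(len(oob) / n)

# With max_samples=1.6, each tree is expected to leave e^-1.6 ~ 0.20 of
# the data out-of-bag, matching the 80/20 split described above.
print(f"mean OOB fraction: {np.mean(oob_fracs):.2f}")
print(f"averaged OOB statistic: {np.mean(oob_stats):.3f}")
```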
Describe the workflow you want to enable
Currently, random/extra forests can bootstrap-sample the data with `max_samples` in (0.0, 1.0]. This enables an out-of-bag (OOB) estimate in forests. However, this only allows you to draw, in expectation, at most ~63% unique samples in-bag, leaving ~37% of samples for out-of-bag estimation. You should be able to raise this proportion. For instance, perhaps I want to leverage 80% of my data to fit each tree and 20% to estimate OOB performance. This requires setting `max_samples=1.6`.

Beyond that, no paper suggests that 63% is a required cutoff for bootstrapping the samples in Random/Extra forests. I am happy to submit a PR if the core-dev team thinks the proposed solution is simple and reasonable.
See https://stats.stackexchange.com/questions/126107/expected-proportion-of-the-sample-when-bootstrapping for a good reference and explanation.
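The 63%/37% figures come from the expected fraction of unique samples when drawing `max_samples * n` points with replacement from `n` points, which for large `n` is `1 - e^(-max_samples)`. A quick check of the numbers used above (the helper name is mine, not sklearn API):

```python
import math

def expected_unique_fraction(max_samples: float) -> float:
    """Expected fraction of unique (in-bag) samples when drawing
    max_samples * n points with replacement from n points, large-n limit."""
    return 1.0 - math.exp(-max_samples)

# Classic bootstrap (max_samples=1.0): ~63% in-bag, ~37% out-of-bag.
print(round(expected_unique_fraction(1.0), 3))   # ~0.632

# To leave only ~20% out-of-bag, solve 1 - e^(-m) = 0.8, i.e. m = ln(5) ~ 1.609,
# which is why the issue asks for max_samples ~ 1.6.
print(round(expected_unique_fraction(1.6), 3))   # ~0.798
```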
Describe your proposed solution
The proposed solution is actually backward-compatible and adds minimal complexity to the codebase.
scikit-learn/sklearn/ensemble/_forest.py, lines 95 to 125 at 38c8cc3
Note, we probably want some reasonable control over how large `max_samples` can be relative to `n_samples`. For instance, if `max_samples = 10*n_samples`, this results in pretty much sampling all unique samples per tree and almost no samples for OOB computation. Thus a reasonable cutoff is to always allow at least 1 sample to be OOB:

- if `max_samples` is an integer, then it must be that `n_samples * exp(-max_samples / n_samples) >= 1`
- if `max_samples` is a float, then it must be that `n_samples * exp(-max_samples) >= 1`

(i.e. you are expected to have at least 1 sample out of bag). Alternatively, we can impose a reasonable heuristic of at least 5 samples. I think it works for most use-cases either way, because people would typically want to change the in-bag percentage from 63% to say 80% or 90% at most, but not 99.99%.
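The proposed validation could be sketched roughly as follows. This is not sklearn's actual `_get_n_samples_bootstrap` implementation; the function name, error message, and "at least one expected OOB sample" rule are illustrative, following the integer/float distinction described above:

```python
import math

def check_max_samples(max_samples, n_samples):
    """Hypothetical relaxed validation for max_samples > 1.0:
    accept any positive value as long as at least one sample is
    expected to remain out-of-bag. Not sklearn API."""
    if isinstance(max_samples, int):
        frac = max_samples / n_samples       # integer: draws relative to n
    else:
        frac = max_samples                   # float: already a multiple of n
    expected_oob = n_samples * math.exp(-frac)   # expected # of OOB samples
    if expected_oob < 1:
        raise ValueError(
            f"max_samples={max_samples} leaves fewer than one sample "
            f"out-of-bag on average (expected {expected_oob:.2f})"
        )
    return int(round(frac * n_samples))      # bootstrap size per tree

# max_samples=1.6 with 100 samples: 160 draws, ~20 OOB samples expected.
print(check_max_samples(1.6, 100))
```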
Describe alternatives you've considered, if relevant
There is no other way to enable this functionality without forking the code.
Additional context
This adds flexibility to how the trees are fit and may help support other issues that require more fine-grained control over what is in-bag vs. OOB, such as #19710.