
Allow RandomForest* and ExtraTrees* to have a higher max_samples than 1.0 when bootstrap=True #28507

Open
adam2392 opened this issue Feb 22, 2024 · 2 comments

adam2392 commented Feb 22, 2024

Describe the workflow you want to enable

Currently, random/extra-trees forests can bootstrap-sample the data with `max_samples` in the interval `(0.0, 1.0]`. This enables out-of-bag (OOB) estimation in forests.

However, this means that, in expectation, at most ~63% of the samples are unique in-bag samples, leaving ~37% of the samples for out-of-bag estimation. You should be able to push this proportion higher. For instance, perhaps I want to leverage 80% of my data to fit each tree and 20% to estimate OOB performance; this requires setting `max_samples=1.6`.
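As a quick sanity check on those numbers (this is just the standard `1 - e^(-c)` approximation for sampling with replacement, not scikit-learn code):

```python
import numpy as np

# Drawing c * n samples with replacement from n samples leaves, in expectation,
# a fraction 1 - exp(-c) of unique in-bag samples and exp(-c) out-of-bag.
for c in (1.0, 1.6):
    in_bag = 1 - np.exp(-c)
    print(f"max_samples={c}: ~{in_bag:.0%} in-bag, ~{1 - in_bag:.0%} out-of-bag")
# max_samples=1.0: ~63% in-bag, ~37% out-of-bag
# max_samples=1.6: ~80% in-bag, ~20% out-of-bag
```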

Beyond that, no paper suggests that 63% is a required cutoff for bootstrapping the samples in random/extra-trees forests. I am happy to submit a PR if the core-dev team thinks the proposed solution is simple and reasonable.

See https://stats.stackexchange.com/questions/126107/expected-proportion-of-the-sample-when-bootstrapping for a good reference and explanation.
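For completeness, the derivation behind these numbers (a standard result, summarized from the reference above): the probability that a given sample is never drawn in $M$ draws with replacement from $N$ samples is

$$
\left(1 - \frac{1}{N}\right)^{M} \approx e^{-M/N},
$$

so with $M = c \cdot N$ the expected out-of-bag fraction is $e^{-c}$ and the expected unique in-bag fraction is $1 - e^{-c}$; for $c = 1$ this gives $1 - e^{-1} \approx 0.632$.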

Describe your proposed solution

The proposed solution is actually backwards-compatible and adds minimal complexity to the codebase.

1. We change

```python
def _get_n_samples_bootstrap(n_samples, max_samples):
    """
    Get the number of samples in a bootstrap sample.

    Parameters
    ----------
    n_samples : int
        Number of samples in the dataset.
    max_samples : int or float
        The maximum number of samples to draw from the total available:
            - if float, this indicates a fraction of the total and should be
              in the interval `(0.0, 1.0]`;
            - if int, this indicates the exact number of samples;
            - if None, this indicates the total number of samples.

    Returns
    -------
    n_samples_bootstrap : int
        The total number of samples to draw for the bootstrap sample.
    """
    if max_samples is None:
        return n_samples

    if isinstance(max_samples, Integral):
        if max_samples > n_samples:
            msg = "`max_samples` must be <= n_samples={} but got value {}"
            raise ValueError(msg.format(n_samples, max_samples))
        return max_samples

    if isinstance(max_samples, Real):
        return max(round(n_samples * max_samples), 1)
```

to the following:

```python
def _get_n_samples_bootstrap(n_samples, max_samples):
    """
    Get the number of samples in a bootstrap sample.

    The expected total number of unique samples in a bootstrap sample is
    required to be at most ``n_samples - 1``.
    This is equivalent to the expected number of out-of-bag samples being at
    least 1.

    Parameters
    ----------
    n_samples : int
        Number of samples in the dataset.
    max_samples : int or float
        The maximum number of samples to draw from the total available:
            - if float, this indicates a fraction of the total;
            - if int, this indicates the exact number of samples;
            - if None, this indicates the total number of samples.

    Returns
    -------
    n_samples_bootstrap : int
        The total number of samples to draw for the bootstrap sample.
    """
    if max_samples is None:
        return n_samples

    if isinstance(max_samples, Integral):
        expected_unique_samples = (1 - np.exp(-max_samples / n_samples)) * n_samples
        if expected_unique_samples >= n_samples - 1:
            raise ValueError(
                "The expected number of unique samples in the bootstrap sample"
                f" must be at most {n_samples - 1}. It is: {expected_unique_samples}"
            )
        return max_samples

    if isinstance(max_samples, Real):
        expected_unique_samples = (1 - np.exp(-max_samples)) * n_samples
        if expected_unique_samples >= n_samples - 1:
            raise ValueError(
                "The expected number of unique samples in the bootstrap sample"
                f" must be at most {n_samples - 1}. It is: {expected_unique_samples}"
            )
        return max(round(n_samples * max_samples), 1)
```
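Under this change, a call like the following would be accepted (a rough sketch using the function above with made-up values, assuming `numpy as np` and `numbers.Integral`/`Real` are imported as in `sklearn/ensemble/_forest.py`):

```python
# With n_samples=1000 and max_samples=1.6, each tree draws 1600 bootstrap
# samples; the expected number of unique in-bag samples is
# (1 - exp(-1.6)) * 1000 ≈ 798 < 999, so no error is raised.
n_samples_bootstrap = _get_n_samples_bootstrap(n_samples=1000, max_samples=1.6)
print(n_samples_bootstrap)  # 1600
```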

Note that we probably want some reasonable control over how large `max_samples` can be relative to `n_samples`. For instance, `max_samples = 10 * n_samples` results in essentially every unique sample being drawn for each tree and almost no samples being left for OOB computation. Thus a reasonable cutoff is to always require at least 1 expected OOB sample:

  • if `max_samples` is an integer, then it must satisfy `(1 - e^(-max_samples / n_samples)) * n_samples <= n_samples - 1`;
  • if `max_samples` is a float, then it must satisfy `(1 - e^(-max_samples)) * n_samples <= n_samples - 1` (i.e. at least 1 sample is expected to be out of bag).

Alternatively, we could impose a slightly stricter heuristic of at least 5 expected OOB samples. Either way, this works for most use cases, because people would typically want to raise the in-bag percentage from 63% to, say, 80% or 90% at most, not 99.99%.
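For a concrete sense of the bound: the float constraint above is equivalent to `max_samples <= ln(n_samples)`, since `(1 - e^(-c)) * n <= n - 1` rearranges to `e^(-c) >= 1/n`. A small illustration (not part of the proposed diff):

```python
import numpy as np

for n_samples in (100, 1_000, 100_000):
    # Largest float max_samples that still leaves >= 1 expected OOB sample.
    max_allowed = np.log(n_samples)
    in_bag_fraction = 1 - np.exp(-max_allowed)  # equals 1 - 1 / n_samples
    print(
        f"n_samples={n_samples}: max_samples <= {max_allowed:.2f} "
        f"(~{in_bag_fraction:.3%} expected in-bag)"
    )
```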

Describe alternatives you've considered, if relevant

There is no other way of allowing this functionality without forking the code.

Additional context

This allows flexibility in terms of the trees and may help in supporting other issues that require more fine-grained control over what is in-bag vs oob such as #19710.

adam2392 added the Needs Triage and New Feature labels on Feb 22, 2024
glemaitre removed the Needs Triage label on Mar 11, 2024
@glemaitre
Member

I don't know if this is YAGNI or not. By definition, having max_feature > 1 and bootstrap=True would not be a bootstrap anymore. Is there any concrete example where not having this feature is detrimental?

This allows flexibility in terms of the trees and may help in supporting other issues that require more fine-grained control over what is in-bag vs oob such as #19710.

Could you provide a bit more background? It might be interesting context to consider for a decision.

glemaitre added the Needs Decision label on Mar 11, 2024

adam2392 commented Mar 11, 2024

I don't know if this is YAGNI or not. By definition, having max_feature > 1 and bootstrap=True would not be a bootstrap anymore. Is there any concrete example where not having this feature is detrimental?

I assume you meant `max_samples` here(?)

Bootstrapping is defined as sampling with replacement. Let M be the number of bootstrap draws and N be the number of samples in your dataset. It just happens to be the case that when M == N, you get ~63% unique samples in your bootstrap set in expectation. However, you can draw M > N samples, and there is no fundamental reason that wouldn't be considered a bootstrap sample.

Could you provide a bit more background? It might be interesting context to consider for a decision.

Sure! My collaborators and I commonly do research on random forests because of their desirable theoretical guarantees. One of the interesting recent advances has to do with what we can do with out-of-bag samples. The number of out-of-bag samples is inversely related to the number of in-bag samples, and right now sklearn constrains the trade-off you're allowed to make here.

More concretely, we are interested in hypothesis testing with random forests by using a test statistic estimated on out-of-bag samples. It is possible to leverage out-of-bag samples to estimate a test statistic per tree and average across trees. However, there is a trade-off between how good your estimate of the test statistic is (number of out-of-bag samples) and how well your tree is fit (number of in-bag samples). By default, sklearn effectively lower-bounds the out-of-bag fraction at roughly 1.0 - 0.63 = 0.37. However, we want to estimate out-of-bag statistics on 20% of the data, not 37% of the data. This requires bootstrap sampling M = 1.6 * N times, which in expectation results in out-of-bag estimates on 20% of the data and in-bag training of each tree on 80% of the data.
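To make the intended workflow concrete, this is roughly what the end-user code would look like once the proposal is in place (hypothetical: `max_samples=1.6` is currently rejected by parameter validation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# With the proposed change, each tree draws 1.6 * n_samples bootstrap samples,
# so ~80% of the data is in-bag per tree and ~20% is left out-of-bag for the
# per-tree test statistic.
forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,
    oob_score=True,
    max_samples=1.6,  # hypothetical: only valid under this proposal
    random_state=0,
)
forest.fit(X, y)
print(forest.oob_score_)
```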

Note that we're setting `bootstrap=True`, so there is no way to hack around this with, say, a `StratifiedShuffleSplit` CV object, because that would give each forest (not each tree) different in-bag/out-of-bag data. We want different in-bag/out-of-bag data per tree, so the easiest thing to do is a simple change in sklearn. I've sketched out above that only a few lines of code need to change (mostly for error checking).

Lmk if I can elaborate further!
