
[MRG] FIX raise error for max_samples if no bootstrap & optimize forest tests #21295

Merged

11 commits merged into scikit-learn:main from PSSF23:bootstrap on Nov 25, 2021

Conversation

Contributor

@PSSF23 PSSF23 commented Oct 9, 2021

Reference Issues/PRs

Close #21294
Close #21299

What does this implement/fix? Explain your changes.

  • Raise ValueError with message saying max_samples is only available if bootstrap=True.
  • Fix forest tests by adding bootstrap=True parameter.
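A minimal sketch of the kind of validation this PR adds (the helper name and exact message wording here are illustrative assumptions, not scikit-learn's actual code):

```python
# Illustrative sketch of the check introduced by this PR; the helper name
# and the error message wording are assumptions, not the exact source.
def check_max_samples(bootstrap, max_samples):
    """Reject max_samples when bootstrapping is disabled."""
    if not bootstrap and max_samples is not None:
        raise ValueError(
            "`max_samples` cannot be set if `bootstrap=False`. "
            "Either switch to `bootstrap=True` or set `max_samples=None`."
        )
```

The check fires at fit time, so constructing the estimator still succeeds and only fitting with the contradictory combination errors out.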

Any other comments?

Contributor

@simonandras simonandras left a comment

Nice enhancement! The behaviour before your PR was indeed ambiguous: in your example, after RandomForestClassifier(n_estimators=100, bootstrap=False, max_samples=0.5), the bootstrap sample size was still computed, because at line 378 of _forest.py the _get_n_samples_bootstrap function sets n_samples_bootstrap to 0.5 * n_samples regardless of whether bootstrap is True or False.
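To make the ambiguity concrete, here is a simplified paraphrase of the pre-fix _get_n_samples_bootstrap logic (not the exact scikit-learn source): note that it never consults the bootstrap flag, so max_samples=0.5 took effect even with bootstrap=False.

```python
# Simplified paraphrase of the pre-fix _get_n_samples_bootstrap behaviour;
# the bootstrap flag is never consulted anywhere in this computation.
def get_n_samples_bootstrap(n_samples, max_samples):
    if max_samples is None:
        return n_samples          # use the full sample
    if isinstance(max_samples, float):
        return round(n_samples * max_samples)  # fraction of the sample
    return max_samples            # assume an absolute int count
```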

Once the tests are written, LGTM.

@PSSF23 PSSF23 changed the title FIX raise error for max_samples if no bootstrap FIX raise error for max_samples if no bootstrap & optimize forest tests Oct 10, 2021
Contributor Author

@PSSF23 PSSF23 left a comment

While checking the test errors (#21299), I found that some of the tests for max_samples do not specify the bootstrap parameter. Since the classifiers/regressors under test may default to bootstrap=False, those test results would be invalid.

The max_samples and bootstrap parameters in IsolationForest and the bagging estimators seem to follow different logic, so I didn't change those tests.
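As a sketch of why the explicit parameter matters, the post-fix contract can be modelled like this (FakeForest and its method are hypothetical stand-ins for illustration, not scikit-learn code):

```python
# Hypothetical stand-in modelling the post-fix contract: max_samples only
# takes effect when bootstrapping is enabled, and is rejected otherwise.
class FakeForest:
    def __init__(self, bootstrap=False, max_samples=None):
        self.bootstrap = bootstrap
        self.max_samples = max_samples

    def effective_subsample(self, n_samples):
        if not self.bootstrap:
            if self.max_samples is not None:
                raise ValueError("max_samples requires bootstrap=True")
            return n_samples
        if self.max_samples is None:
            return n_samples
        return round(self.max_samples * n_samples)
```

A test that sets max_samples without also setting bootstrap=True on an estimator that defaults to bootstrap=False would therefore exercise nothing.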

@PSSF23 PSSF23 changed the title FIX raise error for max_samples if no bootstrap & optimize forest tests [MRG] FIX raise error for max_samples if no bootstrap & optimize forest tests Oct 11, 2021
@adrinjalali
Member

pinging @glemaitre and @NicolasHug on this one.

Member

@NicolasHug NicolasHug left a comment

Thanks @PSSF23 , I made some minor comments but overall this LGTM

sklearn/ensemble/_forest.py (review thread resolved)
sklearn/ensemble/tests/test_forest.py (review thread resolved)
Contributor Author

@PSSF23 PSSF23 left a comment

@NicolasHug I changed the error message, but setting bootstrap=True explicitly is still required in the checks to achieve the same results. Otherwise, the fitting would not randomize the sample indices:

if forest.bootstrap:
    n_samples = X.shape[0]
    if sample_weight is None:
        curr_sample_weight = np.ones((n_samples,), dtype=np.float64)
    else:
        curr_sample_weight = sample_weight.copy()

    indices = _generate_sample_indices(
        tree.random_state, n_samples, n_samples_bootstrap
    )
    sample_counts = np.bincount(indices, minlength=n_samples)
    curr_sample_weight *= sample_counts

    if class_weight == "subsample":
        with catch_warnings():
            simplefilter("ignore", DeprecationWarning)
            curr_sample_weight *= compute_sample_weight("auto", y, indices=indices)
    elif class_weight == "balanced_subsample":
        curr_sample_weight *= compute_sample_weight("balanced", y, indices=indices)

    tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
else:
    tree.fit(X, y, sample_weight=sample_weight, check_input=False)
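For context, _generate_sample_indices draws the bootstrap indices from the tree's seed; a rough paraphrase (not the exact scikit-learn source) is:

```python
import numpy as np

# Rough paraphrase of _generate_sample_indices: draw n_samples_bootstrap
# indices with replacement, seeded by the tree's random state.
def generate_sample_indices(random_state, n_samples, n_samples_bootstrap):
    rng = np.random.RandomState(random_state)
    return rng.randint(0, n_samples, n_samples_bootstrap)
```

When bootstrap is False this branch is skipped entirely, which is why the tests must opt in with bootstrap=True for the indices to be randomized at all.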

Contributor Author

@PSSF23 PSSF23 left a comment

@simonandras @adrinjalali @NicolasHug @glemaitre Anything else needed for this PR?

@adrinjalali
Member

I kinda feel like we need a deprecation cycle here, but not entirely sure.

@glemaitre
Member

I agree with @adrinjalali. We should raise a FutureWarning for 2 versions before raising an error. The warning can be educational, explaining that max_samples is indeed being discarded and that, to avoid this ambiguity, an error will be raised in the future.

@NicolasHug
Member

I think we tend to consider these to be bugfixes (hence no deprecation), but I don't mind going through the deprecation cycle to be extra nice.

@glemaitre
Member

But here, the parameter was only ignored, which was more or less in line with the docstring, from what I understand.

Contributor Author

@PSSF23 PSSF23 left a comment

@glemaitre The current code calculates n_samples_bootstrap based on max_samples regardless of whether bootstrap is True or False. In the process, it might raise unnecessary errors if the max_samples format is wrong. Since the value is eventually ignored, why check it?

n_samples_bootstrap = _get_n_samples_bootstrap(
    n_samples=X.shape[0], max_samples=self.max_samples
)

Also, from a naive user's perspective, I would assume that any parameter I pass reflects how the program actually runs. For example, I would expect ExtraTreesClassifier(max_samples=0.5) to run with half the sample size.

In addition, for consistency: since oob_score raises an error when bootstrap is False, max_samples should do the same thing.

if not self.bootstrap and self.oob_score:
    raise ValueError("Out of bag estimation only available if bootstrap=True")

@glemaitre
Member

> @glemaitre The current code will calculate n_samples_bootstrap based on max_samples no matter bootstrap is True or False. In that process it might raise unnecessary errors if the max_samples format is wrong.
> Also in a naive user perspective, I would assume any parameter entered reflects how the program actually runs. So for example, I would think that ExtraTreesClassifier(max_samples=0.5) runs with half sample sizes.

OK, so let's consider it a bug fix and directly raise an error. Had there been an if bootstrap: safeguard beforehand that silently ignored max_samples, it would have been different.

> In addition, for consistency, as oob_score raises errors when bootstrap is False, max_samples should do the same thing.

That is a different case, because there is no way to have an OOB sample without bootstrapping, while you could conceivably still get a smaller sample size with max_samples=0.5.

Contributor Author

@PSSF23 PSSF23 left a comment

@glemaitre I modified the code as you suggested. Let me know if there are other improvements needed!

Member

@glemaitre glemaitre left a comment

Otherwise it looks good.

sklearn/ensemble/_forest.py (two review threads, resolved)
Contributor Author

@PSSF23 PSSF23 left a comment

@glemaitre Changes implemented~

Member

@glemaitre glemaitre left a comment

LGTM. Before merging, I would like a last review from @adrinjalali.

Member

@adrinjalali adrinjalali left a comment

LGTM, happy with it being a fix

@adrinjalali adrinjalali merged commit f96ce58 into scikit-learn:main Nov 25, 2021
@PSSF23 PSSF23 deleted the bootstrap branch November 25, 2021 13:45
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021
…ts (scikit-learn#21295)

* FIX raise error for max_samples if no bootstrap

* EHN move check position by suggestion

* FIX add bootstrap parameter to tests

* FIX resolve test error & DOC add log

* FIX add test coverage for bootstrap check

* DOC optimize error message

* FIX resolve test error

* ENH restrict sample bootstrap

* ENH optimize bootstrap conditions
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021
glemaitre pushed a commit that referenced this pull request Dec 25, 2021
mathijs02 pushed a commit to mathijs02/scikit-learn that referenced this pull request Dec 27, 2022