
[MRG] FIX raise error for max_samples if no bootstrap & optimize forest tests #21295

Merged

11 commits merged into scikit-learn:main from PSSF23:bootstrap on Nov 25, 2021

Conversation

Contributor

@PSSF23 PSSF23 commented Oct 9, 2021

Reference Issues/PRs

Close #21294
Close #21299

What does this implement/fix? Explain your changes.

  • Raise ValueError with message saying max_samples is only available if bootstrap=True.
  • Fix forest tests by adding bootstrap=True parameter.
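A minimal sketch of the kind of validation this PR adds (the helper name and exact message wording here are illustrative assumptions, not scikit-learn's actual code):

```python
# Illustrative sketch of the check introduced by this PR; the helper name
# and the error message wording are assumptions, not the exact source.
def check_max_samples(bootstrap, max_samples):
    """Reject max_samples when bootstrapping is disabled."""
    if not bootstrap and max_samples is not None:
        raise ValueError(
            "`max_samples` cannot be set if `bootstrap=False`. "
            "Either switch to `bootstrap=True` or set `max_samples=None`."
        )
```

The check fires at fit time, so constructing the estimator still succeeds and only fitting with the contradictory combination errors out.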

Any other comments?

Contributor

@simonandras simonandras left a comment

Nice enhancement! The behaviour before your PR was indeed ambiguous: in your example, after RandomForestClassifier(n_estimators=100, bootstrap=False, max_samples=0.5), the bootstrap sample size was still computed, because at line 378 of _forest.py the _get_n_samples_bootstrap function sets n_samples_bootstrap to 0.5 * n_samples regardless of whether bootstrap is True or False.
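To make the ambiguity concrete, here is a simplified paraphrase of the pre-fix _get_n_samples_bootstrap logic (not the exact scikit-learn source): note that it never consults the bootstrap flag, so max_samples=0.5 took effect even with bootstrap=False.

```python
# Simplified paraphrase of the pre-fix _get_n_samples_bootstrap behaviour;
# the bootstrap flag is never consulted anywhere in this computation.
def get_n_samples_bootstrap(n_samples, max_samples):
    if max_samples is None:
        return n_samples          # use the full sample
    if isinstance(max_samples, float):
        return round(n_samples * max_samples)  # fraction of the sample
    return max_samples            # assume an absolute int count
```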

Once the tests are written, LGTM.

@PSSF23 PSSF23 changed the title FIX raise error for max_samples if no bootstrap FIX raise error for max_samples if no bootstrap & optimize forest tests Oct 10, 2021
Contributor Author

@PSSF23 PSSF23 left a comment

While checking the test errors (#21299), I found that some of the tests for max_samples do not specify the bootstrap parameter. Since the classifiers/regressors under test may default to bootstrap=False, those test results would be invalid.

The max_samples and bootstrap parameters in IsolationForest and the bagging estimators seem to follow different logic, so I didn't change those tests.
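As a sketch of why the explicit parameter matters, the post-fix contract can be modelled like this (FakeForest and its method are hypothetical stand-ins for illustration, not scikit-learn code):

```python
# Hypothetical stand-in modelling the post-fix contract: max_samples only
# takes effect when bootstrapping is enabled, and is rejected otherwise.
class FakeForest:
    def __init__(self, bootstrap=False, max_samples=None):
        self.bootstrap = bootstrap
        self.max_samples = max_samples

    def effective_subsample(self, n_samples):
        if not self.bootstrap:
            if self.max_samples is not None:
                raise ValueError("max_samples requires bootstrap=True")
            return n_samples
        if self.max_samples is None:
            return n_samples
        return round(self.max_samples * n_samples)
```

A test that sets max_samples without also setting bootstrap=True on an estimator that defaults to bootstrap=False would therefore exercise nothing.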

@PSSF23 PSSF23 changed the title FIX raise error for max_samples if no bootstrap & optimize forest tests [MRG] FIX raise error for max_samples if no bootstrap & optimize forest tests Oct 11, 2021
@adrinjalali
Member

pinging @glemaitre and @NicolasHug on this one.

Member

@NicolasHug NicolasHug left a comment

Thanks @PSSF23 , I made some minor comments but overall this LGTM

sklearn/ensemble/_forest.py (review thread resolved)
sklearn/ensemble/tests/test_forest.py (review thread resolved)
Contributor Author

@PSSF23 PSSF23 left a comment

@NicolasHug I changed the error message, but setting bootstrap=True explicitly is still required in the checks to achieve the same results. Otherwise, the fitting would not randomize the sample indices:

if forest.bootstrap:
    n_samples = X.shape[0]
    if sample_weight is None:
        curr_sample_weight = np.ones((n_samples,), dtype=np.float64)
    else:
        curr_sample_weight = sample_weight.copy()

    indices = _generate_sample_indices(
        tree.random_state, n_samples, n_samples_bootstrap
    )
    sample_counts = np.bincount(indices, minlength=n_samples)
    curr_sample_weight *= sample_counts

    if class_weight == "subsample":
        with catch_warnings():
            simplefilter("ignore", DeprecationWarning)
            curr_sample_weight *= compute_sample_weight("auto", y, indices=indices)
    elif class_weight == "balanced_subsample":
        curr_sample_weight *= compute_sample_weight("balanced", y, indices=indices)

    tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
else:
    tree.fit(X, y, sample_weight=sample_weight, check_input=False)
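For context, _generate_sample_indices draws the bootstrap indices from the tree's seed; a rough paraphrase (not the exact scikit-learn source) is:

```python
import numpy as np

# Rough paraphrase of _generate_sample_indices: draw n_samples_bootstrap
# indices with replacement, seeded by the tree's random state.
def generate_sample_indices(random_state, n_samples, n_samples_bootstrap):
    rng = np.random.RandomState(random_state)
    return rng.randint(0, n_samples, n_samples_bootstrap)
```

When bootstrap is False this branch is skipped entirely, which is why the tests must opt in with bootstrap=True for the indices to be randomized at all.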

Contributor Author

@PSSF23 PSSF23 left a comment

@simonandras @adrinjalali @NicolasHug @glemaitre Anything else needed for this PR?

@adrinjalali
Member

I kinda feel like we need a deprecation cycle here, but not entirely sure.

@glemaitre
Member

I agree with @adrinjalali. We should raise a FutureWarning for 2 versions before raising an error. The warning can be educational, explaining that max_samples is indeed being discarded and that, to avoid this ambiguity, an error will be raised in the future.

@NicolasHug
Member

I think we tend to consider these to be bugfixes (hence no deprecation), but I don't mind going through the deprecation cycle to be extra nice.

@glemaitre
Member

But here, the parameter was only ignored, which was more or less in line with the docstring, from what I understand.

Contributor Author

@PSSF23 PSSF23 left a comment

@glemaitre The current code calculates n_samples_bootstrap based on max_samples regardless of whether bootstrap is True or False. In the process, it might raise unnecessary errors if the max_samples format is wrong. Since the value is eventually ignored, why check it?

n_samples_bootstrap = _get_n_samples_bootstrap(
    n_samples=X.shape[0], max_samples=self.max_samples
)

Also, from a naive user's perspective, I would assume that any parameter I pass reflects how the program actually runs. For example, I would expect ExtraTreesClassifier(max_samples=0.5) to run with half the sample size.

In addition, for consistency: since oob_score raises an error when bootstrap is False, max_samples should do the same thing.

if not self.bootstrap and self.oob_score:
    raise ValueError("Out of bag estimation only available if bootstrap=True")

@glemaitre
Member

> @glemaitre The current code will calculate n_samples_bootstrap based on max_samples no matter bootstrap is True or False. In that process it might raise unnecessary errors if the max_samples format is wrong.
> Also in a naive user perspective, I would assume any parameter entered reflects how the program actually runs. So for example, I would think that ExtraTreesClassifier(max_samples=0.5) runs with half sample sizes.

OK, so let's consider it a bug fix and directly raise an error. Had there been an if bootstrap: safeguard beforehand that silently ignored max_samples, it would have been different.

> In addition, for consistency, as oob_score raises errors when bootstrap is False, max_samples should do the same thing.

That is a different case, because there is no way to have an OOB sample without bootstrapping, while you could conceivably still get a smaller sample size with max_samples=0.5.

Contributor Author

@PSSF23 PSSF23 left a comment

@glemaitre I modified the code as you suggested. Let me know if there are other improvements needed!

Member

@glemaitre glemaitre left a comment

Otherwise it looks good.

sklearn/ensemble/_forest.py (two review threads, resolved)
Contributor Author

@PSSF23 PSSF23 left a comment

@glemaitre Changes implemented~

Member

@glemaitre glemaitre left a comment

LGTM. Before merging, I would like a last review from @adrinjalali.

Member

@adrinjalali adrinjalali left a comment

LGTM, happy with it being a fix

@adrinjalali adrinjalali merged commit f96ce58 into scikit-learn:main Nov 25, 2021
@PSSF23 PSSF23 deleted the bootstrap branch November 25, 2021 13:45
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021
…ts (scikit-learn#21295)

* FIX raise error for max_samples if no bootstrap

* EHN move check position by suggestion

* FIX add bootstrap parameter to tests

* FIX resolve test error & DOC add log

* FIX add test coverage for bootstrap check

* DOC optimize error message

* FIX resolve test error

* ENH restrict sample bootstrap

* ENH optimize bootstrap conditions
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021
glemaitre pushed a commit that referenced this pull request Dec 25, 2021
mathijs02 pushed a commit to mathijs02/scikit-learn that referenced this pull request Dec 27, 2022