[MRG] FIX raise error for max_samples if no bootstrap & optimize forest tests #21295
Conversation
Nice enhancement! Actually, the behaviour here before your PR was ambiguous: in your example, after `RandomForestClassifier(n_estimators=100, bootstrap=False, max_samples=0.5)`, the bootstrap sample size was still computed, because at line 378 of `_forest.py` the `n_samples_bootstrap` parameter is set to `0.5 * n_samples` by the `_get_n_samples_bootstrap` function whether `bootstrap` is `False` or not.
After writing the tests, LGTM.
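The flag-blindness described above can be seen in a minimal standalone sketch. This is a simplified reimplementation for illustration, not scikit-learn's actual code:

```python
# Simplified sketch of the behaviour of _get_n_samples_bootstrap as
# described above: it derives a bootstrap sample count from max_samples
# alone and never consults the `bootstrap` flag, so the count is computed
# (and later ignored) even when bootstrap=False.
def get_n_samples_bootstrap(n_samples, max_samples):
    if max_samples is None:
        return n_samples  # default: use the full dataset
    if isinstance(max_samples, int):
        return max_samples  # absolute number of samples
    if isinstance(max_samples, float):
        return round(n_samples * max_samples)  # fraction of the dataset
    raise TypeError("max_samples should be int, float, or None")

# With max_samples=0.5 the count is half the dataset, regardless of
# whether bootstrapping is actually enabled downstream.
print(get_n_samples_bootstrap(n_samples=100, max_samples=0.5))  # 50
```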
#21299 When checking the test errors, I found out that some of the tests for `max_samples` do not specify the `bootstrap` parameter. Yet the classifiers/regressors tested might have `bootstrap=False` as the default, so the test results would be invalid.
The `max_samples` and `bootstrap` parameters in `IsolationForest` and the bagging estimators seem to follow different logic, so I didn't change those tests.
Pinging @glemaitre and @NicolasHug on this one.
Thanks @PSSF23 , I made some minor comments but overall this LGTM
@NicolasHug I changed the error message, but setting `bootstrap=True` explicitly is still required in the checks to achieve the same results. Otherwise, the fitting would not randomize the sample indices:
scikit-learn/sklearn/ensemble/_forest.py
Lines 164 to 186 in 28567a5
```python
if forest.bootstrap:
    n_samples = X.shape[0]
    if sample_weight is None:
        curr_sample_weight = np.ones((n_samples,), dtype=np.float64)
    else:
        curr_sample_weight = sample_weight.copy()

    indices = _generate_sample_indices(
        tree.random_state, n_samples, n_samples_bootstrap
    )
    sample_counts = np.bincount(indices, minlength=n_samples)
    curr_sample_weight *= sample_counts

    if class_weight == "subsample":
        with catch_warnings():
            simplefilter("ignore", DeprecationWarning)
            curr_sample_weight *= compute_sample_weight("auto", y, indices=indices)
    elif class_weight == "balanced_subsample":
        curr_sample_weight *= compute_sample_weight("balanced", y, indices=indices)

    tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
else:
    tree.fit(X, y, sample_weight=sample_weight, check_input=False)
```
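The weighting trick in the snippet above can be reproduced in isolation. The sketch below uses illustrative names, not scikit-learn internals: it draws indices with replacement (as `_generate_sample_indices` does) and folds the draw into per-sample weights via `bincount`, so a sample drawn twice gets weight 2 and an undrawn sample gets weight 0:

```python
import numpy as np

def bootstrap_sample_weight(n_samples, n_samples_bootstrap, seed=0):
    # Draw n_samples_bootstrap indices with replacement, then count how
    # often each sample was drawn; the counts act as per-sample weights
    # passed to the tree's fit.
    rng = np.random.RandomState(seed)
    indices = rng.randint(0, n_samples, n_samples_bootstrap)
    return np.bincount(indices, minlength=n_samples).astype(np.float64)

# The weights always sum to n_samples_bootstrap; with bootstrap=False this
# whole branch is skipped and every sample keeps its original weight.
weights = bootstrap_sample_weight(n_samples=6, n_samples_bootstrap=3)
```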
@simonandras @adrinjalali @NicolasHug @glemaitre Anything else needed for this PR?
I kinda feel like we need a deprecation cycle here, but not entirely sure.
I agree with @adrinjalali. We should raise a …
I think we tend to consider these to be bugfixes (hence no deprecation), but I don't mind going through the deprecation cycle to be extra nice.
But here, the parameter was simply ignored, which was more or less in line with the docstring, from what I understand.
@glemaitre The current code calculates `n_samples_bootstrap` based on `max_samples` whether `bootstrap` is `True` or `False`. In that process it might raise unnecessary errors if the `max_samples` format is wrong. As it's eventually ignored, why check it?
scikit-learn/sklearn/ensemble/_forest.py
Lines 384 to 386 in 3626a34
```python
n_samples_bootstrap = _get_n_samples_bootstrap(
    n_samples=X.shape[0], max_samples=self.max_samples
)
```
Also, from a naive user's perspective, I would assume any parameter entered reflects how the program actually runs. For example, I would think that `ExtraTreesClassifier(max_samples=0.5)` runs with half the sample size.
In addition, for consistency: since `oob_score` raises an error when `bootstrap` is `False`, `max_samples` should do the same thing.
scikit-learn/sklearn/ensemble/_forest.py
Lines 407 to 408 in 3626a34
```python
if not self.bootstrap and self.oob_score:
    raise ValueError("Out of bag estimation only available if bootstrap=True")
```
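The fix discussed here can be sketched as a guard mirroring that `oob_score` check. The function name and message wording below are illustrative, not scikit-learn's exact code:

```python
# Hypothetical guard analogous to the oob_score check above: reject a
# max_samples value that would otherwise be silently ignored.
def check_sampling_params(bootstrap, max_samples):
    if not bootstrap and max_samples is not None:
        raise ValueError(
            "`max_samples` cannot be set if `bootstrap=False`. "
            "Either switch to `bootstrap=True` or set `max_samples=None`."
        )

check_sampling_params(bootstrap=True, max_samples=0.5)    # OK
check_sampling_params(bootstrap=False, max_samples=None)  # OK
```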
OK, so let's consider it a bug fix and directly raise an error. It would have been better to have a safeguard beforehand.
This is a different case here because there is no way to have an OOB sample without bootstrapping.
@glemaitre I modified the code as you suggested. Let me know if there are other improvements needed!
Otherwise it looks good.
@glemaitre Changes implemented~
LGTM. Before merging, I would like to have a last review from @adrinjalali.
LGTM, happy with it being a fix
…ts (scikit-learn#21295)
* FIX raise error for max_samples if no bootstrap
* EHN move check position by suggestion
* FIX add bootstrap parameter to tests
* FIX resolve test error & DOC add log
* FIX add test coverage for bootstrap check
* DOC optimize error message
* FIX resolve test error
* ENH restrict sample bootstrap
* ENH optimize bootstrap conditions
Reference Issues/PRs
Close #21294
Close #21299
What does this implement/fix? Explain your changes.
Raise a `ValueError` with a message saying `max_samples` is only available if `bootstrap=True`. Update the affected tests to set the `bootstrap=True` parameter explicitly.
Any other comments?