[MRG] FIX keep at least one feature when max_features is small fraction #12388

Merged
merged 11 commits into from Oct 27, 2018
@ghost ghost commented Oct 15, 2018

Reference Issues/PRs

Fixes #12386.

What does this implement/fix? Explain your changes.

The max_features parameter of a Bagging estimator is often set as a float, representing the fraction of features to use. The fraction is currently converted to an integer as follows:
max_features = int(self.max_features * self.n_features_)

However, this leads to a ValueError whenever the product is truncated to zero. This can happen when the number of features is not known in advance (for example, because of hyperparameter tuning in an earlier stage).

This PR ensures a minimum of one feature is kept in this situation:
max_features = max(1, int(self.max_features * self.n_features_) )
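
To make the truncation concrete, here is a minimal sketch. The values (a fraction of 0.1 and 5 features) are made up for illustration, and the two helper functions are hypothetical: they only wrap the two expressions quoted above and are not part of scikit-learn.

```python
def n_features_before(max_features, n_features):
    # Expression used before this PR: int() truncates toward zero.
    return int(max_features * n_features)


def n_features_after(max_features, n_features):
    # Expression proposed in this PR: never fewer than one feature.
    return max(1, int(max_features * n_features))


print(n_features_before(0.1, 5))  # 0
print(n_features_after(0.1, 5))   # 1
print(n_features_after(0.5, 10))  # 5
```

With the old expression, 0.1 * 5 = 0.5 truncates to 0, which then fails the range check; with the new one it is clamped to 1.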

Any other comments?

I would be grateful if someone could check that the unit test is implemented in the right place and in an appropriate manner. I've tried to be consistent with the other tests.

I've tried to find the cleanest implementation that still raises a ValueError if max_features is negative, zero, too large, or neither an int nor a float.

Member

@jnothman jnothman left a comment

Otherwise LGTM

- else:  # float
-     max_features = int(self.max_features * self.n_features_)
+ elif isinstance(self.max_features, (numbers.Real, np.float)):
+     if not self.max_features > 0.0:
Member

Can't you use <= instead of not?

        raise ValueError("max_features must be in (0, n_features]")
    max_features = max(1, int(self.max_features * self.n_features_))
else:
    raise ValueError("max_features must be int or float")

if not (0 < max_features <= self.n_features_):
Member

Put this before the above test and you can simplify the code. Don't worry about validating that it is numeric. Comparing to a number is good enough for that unexpected case.

Author

@ghost ghost Oct 16, 2018

Thanks for the review, @jnothman! I'm struggling to find a simpler implementation that still handles unexpected cases. Could you expand on your comment above?

Comparing self.max_features to a number without first casting it to an integer raises an unexpected TypeError if it is a string: TypeError: unorderable types: int() < str().

However, casting max_features to an integer means that 0.1 would be rounded down to 0, and hence a ValueError is raised (which is the behaviour this PR is trying to avoid).

Some relevant existing unit tests:

# Test max_features
assert_raises(ValueError,
              BaggingClassifier(base, max_features=-1).fit, X, y)
assert_raises(ValueError,
              BaggingClassifier(base, max_features=0.0).fit, X, y)
assert_raises(ValueError,
              BaggingClassifier(base, max_features=2.0).fit, X, y)
assert_raises(ValueError,
              BaggingClassifier(base, max_features=5).fit, X, y)
assert_raises(ValueError,
              BaggingClassifier(base, max_features="foobar").fit, X, y)

Member

Well... TypeError might really be the more appropriate error anyway, but let's not quibble with the tests. Why not:

if not numeric:
    raise ValueError
if real:
    max_features = max_features * n_features
if not 0 < max_features <= n_features:
    raise ValueError
max_features = int(max_features)

but perhaps that logic is no less complicated than the present?
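
One possible reading of the pseudocode above, written out as a runnable sketch. The function name is hypothetical, and the use of the standard-library `numbers` classes for the "numeric" and "real" checks is an assumption; the pseudocode does not say how those checks would be done.

```python
import numbers


def pseudocode_max_features(max_features, n_features):
    # Hypothetical helper following the pseudocode above; not scikit-learn code.
    if not isinstance(max_features, numbers.Real):
        raise ValueError("max_features must be int or float")
    if not isinstance(max_features, numbers.Integral):
        # A non-integer real is read as a fraction of the features.
        max_features = max_features * n_features
    if not 0 < max_features <= n_features:
        raise ValueError("max_features must be in (0, n_features]")
    return int(max_features)


print(pseudocode_max_features(0.5, 4))  # 2
print(pseudocode_max_features(0.1, 5))  # 0
```

Under this reading, a small fraction such as 0.1 with 5 features passes the range check (0.5 lies in (0, 5]) but is still truncated to zero by the final int() call, so the zero-feature case the PR targets is not yet handled.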

Author

Good point: I could just modify that test to expect a TypeError if a string is input rather than a ValueError.

Here's a suggested modification of the logic that ensures 0.1 is not rounded down to zero:

if isinstance(self.max_features, (numbers.Integral, np.integer)):
    max_features = self.max_features
else:  # float
    max_features = self.max_features * self.n_features_

if not (0 < max_features <= self.n_features_):
    raise ValueError

max_features = max(1, int(max_features))
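
The suggested logic can be exercised outside the estimator with a small standalone sketch. The function name, the error message, and the feature count of 4 are assumptions made for illustration; only the branching mirrors the snippet above, and the rejected inputs are the numeric cases from the existing unit tests.

```python
import numbers


def resolve_max_features(max_features, n_features):
    # Hypothetical standalone version of the snippet above.
    if isinstance(max_features, numbers.Integral):
        n_selected = max_features
    else:  # assumed to be a float fraction
        n_selected = max_features * n_features
    if not (0 < n_selected <= n_features):
        raise ValueError("max_features must be in (0, n_features]")
    # Clamp so that a small positive fraction keeps at least one feature.
    return max(1, int(n_selected))


n_features = 4  # assumed feature count for illustration
print(resolve_max_features(0.1, n_features))  # 1
print(resolve_max_features(0.5, n_features))  # 2
for bad in (-1, 0.0, 2.0, 5):
    try:
        resolve_max_features(bad, n_features)
    except ValueError:
        print(bad, "raises ValueError")
```

Because the range check runs on the unrounded product, zero and negative inputs are still rejected, while the final clamp only affects fractions whose product lies strictly between 0 and 1.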

Author

ghost commented Oct 16, 2018

Ah, it looks like 'foobar' > 0 raises a TypeError in Python 3, but evaluates to True in Python 2.7. Who'd have guessed. I'll add a line to explicitly raise a ValueError if the input is not numeric.
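
The Python 3 behaviour is easy to reproduce, and an explicit type check is one way to turn it into a ValueError. The sketch below is only an illustration: `check_numeric` is a hypothetical name, and testing against `numbers.Real` is an assumption about how "not numeric" might be detected, not necessarily the line that was added in the PR.

```python
import numbers

# In Python 3, ordering a str against an int is an error.
try:
    "foobar" > 0
except TypeError as exc:
    print(type(exc).__name__)  # TypeError


def check_numeric(max_features):
    # Hypothetical guard: reject non-numeric input before any comparison.
    if not isinstance(max_features, numbers.Real):
        raise ValueError("max_features must be int or float")


check_numeric(0.1)  # passes silently
try:
    check_numeric("foobar")
except ValueError as exc:
    print(exc)  # max_features must be int or float
```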

Member

@rth rth left a comment

LGTM, thanks. Please add a what's new entry to doc/whats_new/v0.20.rst under the 0.20.1 section, mentioning all estimators that are affected by this fix (excluding the BaseBagging class itself).

Author

ghost commented Oct 26, 2018

Thanks for the review, @rth. I've added a comment to the doc as requested.

@rth rth merged commit 5cef1df into scikit-learn:master Oct 27, 2018
Member

rth commented Oct 27, 2018

Thanks! (Fixed the formatting in what's new a bit).

thoo pushed a commit to thoo/scikit-learn that referenced this pull request Nov 14, 2018
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Nov 14, 2018
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Successfully merging this pull request may close these issues.

max_features often rounded down to zero, leading to ValueError