Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Nearest Centroid Divide by Zero Fix #18370

Merged
merged 12 commits into from Oct 2, 2020

Conversation

trewaite
Copy link
Contributor

Reference Issues/PRs

#18324

What does this implement/fix? Explain your changes.

Checks to see if X features are all 0 variance. If so, raises value error and will avoid divide by 0.

Any other comments?

Not sure if this is desired behavior but seems like best way to handle, let me know if the error message is specific enough.

@cmarmo
Copy link
Member

cmarmo commented Sep 21, 2020

Thanks @Trevor-Waite for your pull request. Do you mind adding a test checking that the error is thrown when necessary?
When you are ready for review, please change [WIP] to [MRG] in the title of your PR. Thanks!

@trewaite trewaite changed the title [WIP] Nearest Centroid Divide by Zero Fix [MRG] Nearest Centroid Divide by Zero Fix Sep 24, 2020
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @Trevor-Waite !

if np.sum(variance) == 0:
raise ValueError("All features have zero variance. "
"Division by zero.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would been to be careful here with using variance directly:

import numpy as np

X = np.empty((10, 2))
X[:, 0] = -0.13725701
X[:, 1] = -0.9853293

X_means = X.mean(axis=0)
var = (X - means)**2
var = var.sum(axis=0)

var
# array([7.70371978e-33, 1.23259516e-31])

We can use np.ptp instead to look for constant features:

np.ptp(X, axis=0)
# array([0., 0.])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem! Good catch, changed logic to np.ptp(X).sum() == 0 and changed test case to match your example.

@@ -154,16 +154,16 @@ def fit(self, X, y):
self.centroids_[cur_class] = X[center_mask].mean(axis=0)

if self.shrink_threshold:
if np.ptp(X).sum() == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we still need the axis=0 to do ptp for each feature:

Suggested change
if np.ptp(X).sum() == 0:
if np.all(np.ptp(X, axis=0) == 0):

Also avoiding the sum avoids more numerical instability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, pushed before I noticed your review...will change to np.all methodology.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an entry to the change log at doc/whats_new/v0.24.rst with tag |Fix|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

trewaite and others added 2 commits September 30, 2020 14:20
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @trewaite !

@rth rth merged commit 2538489 into scikit-learn:master Oct 2, 2020
5 checks passed
amrcode pushed a commit to amrcode/scikit-learn that referenced this pull request Oct 19, 2020
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants