[MRG] Nearest Centroid Divide by Zero Fix #18370
Conversation
Thanks @Trevor-Waite for your pull request. Do you mind adding a test checking that the error is thrown when necessary?
Thank you for the PR @Trevor-Waite !
if np.sum(variance) == 0:
    raise ValueError("All features have zero variance. "
                     "Division by zero.")
We would need to be careful here with using the variance directly:
import numpy as np
X = np.empty((10, 2))
X[:, 0] = -0.13725701
X[:, 1] = -0.9853293
X_means = X.mean(axis=0)
var = (X - X_means) ** 2
var = var.sum(axis=0)
var
# array([7.70371978e-33, 1.23259516e-31])
We can use np.ptp instead to look for constant features:
np.ptp(X, axis=0)
# array([0., 0.])
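Putting the two checks side by side (a minimal runnable sketch of the point above, assuming NumPy): for a constant feature, the computed variance can come out as a tiny positive float because of rounding in the mean, while the peak-to-peak range is exactly zero.

```python
import numpy as np

# Constant-feature matrix from the example above.
X = np.empty((10, 2))
X[:, 0] = -0.13725701
X[:, 1] = -0.9853293

# Variance-style check: rounding in the mean can leave tiny nonzero residuals.
var = ((X - X.mean(axis=0)) ** 2).sum(axis=0)

# Peak-to-peak (max - min) per feature: exactly 0.0 for a constant column.
ptp = np.ptp(X, axis=0)

print(var)  # tiny but possibly nonzero, e.g. ~1e-33
print(ptp)  # array([0., 0.])
```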
No problem! Good catch; I changed the logic to np.ptp(X).sum() == 0 and updated the test case to match your example.
@@ -154,16 +154,16 @@ def fit(self, X, y):
            self.centroids_[cur_class] = X[center_mask].mean(axis=0)

        if self.shrink_threshold:
            if np.ptp(X).sum() == 0:
I suspect we still need the axis=0 to do ptp for each feature:
- if np.ptp(X).sum() == 0:
+ if np.all(np.ptp(X, axis=0) == 0):
Also, avoiding the sum avoids further numerical instability.
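To see why the axis matters (a small illustrative example, not taken from the PR itself): without axis, np.ptp flattens the array and returns the global range, so a single varying feature masks constant ones.

```python
import numpy as np

# First column is constant, second varies.
X = np.array([[0.0, 1.0],
              [0.0, 3.0]])

print(np.ptp(X))           # 3.0: global range over the flattened array
print(np.ptp(X, axis=0))   # [0., 2.]: per-feature ranges
print(np.all(np.ptp(X, axis=0) == 0))  # False: not all features constant
```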
Apologies, I pushed before I noticed your review; I will change to the np.all methodology.
Please add an entry to the change log at doc/whats_new/v0.24.rst with tag |Fix|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Thank you @trewaite !
Reference Issues/PRs
#18324
What does this implement/fix? Explain your changes.
Checks whether all features of X have zero variance. If so, raises a ValueError, avoiding a divide by zero.
Any other comments?
Not sure if this is the desired behavior, but it seems like the best way to handle it; let me know if the error message is specific enough.
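A standalone sketch of the guard this PR adds (the helper name here is hypothetical; in the PR the check lives inline in NearestCentroid.fit):

```python
import numpy as np

def check_nonconstant_features(X):
    """Raise if every feature of X is constant.

    Hypothetical standalone version of the guard added in
    NearestCentroid.fit: np.ptp (max - min per feature) is exactly
    zero for constant columns, unlike a floating-point variance.
    """
    if np.all(np.ptp(X, axis=0) == 0):
        raise ValueError("All features have zero variance. "
                         "Division by zero.")

# A matrix with at least one varying feature passes silently.
check_nonconstant_features(np.array([[0.0, 1.0], [0.0, 3.0]]))
```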