[MRG] Nearest Centroid Divide by Zero Fix #18370

Merged
merged 12 commits into from Oct 2, 2020
3 changes: 3 additions & 0 deletions sklearn/neighbors/_nearest_centroid.py
@@ -161,6 +161,9 @@ def fit(self, X, y):
# Calculate deviation using the standard deviation of centroids.
variance = (X - self.centroids_[y_ind]) ** 2
variance = variance.sum(axis=0)
if np.sum(variance) == 0:
    raise ValueError("All features have zero variance. "
                     "Division by zero.")
Member

We would need to be careful about using variance directly, because floating-point rounding can leave constant features with a tiny nonzero variance:

import numpy as np

# Two constant features: every row holds the same pair of values.
X = np.empty((10, 2))
X[:, 0] = -0.13725701
X[:, 1] = -0.9853293

X_means = X.mean(axis=0)
var = (X - X_means)**2
var = var.sum(axis=0)

var
# array([7.70371978e-33, 1.23259516e-31])

We can use np.ptp instead to look for constant features:

np.ptp(X, axis=0)
# array([0., 0.])
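To make the contrast concrete, here is a minimal sketch that computes both quantities side by side. The helper name `compare_constant_checks` is hypothetical and not part of scikit-learn; it only wraps the two expressions from the comment above.

```python
import numpy as np


def compare_constant_checks(X):
    """Return (summed squared deviations, per-feature range) for X.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    # Summed squared deviations from the column means: rounding in the
    # mean can leave a tiny nonzero residue for constant columns.
    sq_dev = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    # Range (max - min) per column: identical values subtract to exactly 0.
    value_range = np.ptp(X, axis=0)
    return sq_dev, value_range


X = np.empty((10, 2))
X[:, 0] = -0.13725701
X[:, 1] = -0.9853293

sq_dev, value_range = compare_constant_checks(X)
print(sq_dev)       # tiny values; may or may not be exactly zero
print(value_range)  # exactly zero for constant columns
```

The range check avoids the mean entirely, so it does not depend on how the summation rounds.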

Contributor Author

No problem! Good catch. I changed the logic to np.ptp(X).sum() == 0 and changed the test case to match your example.
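One detail worth checking with this kind of guard: `np.ptp(X)` without an `axis` argument takes the range over the flattened array, so it is zero only when every entry of `X` is identical. The sketch below assumes the intent is a per-feature range (`axis=0`), as in the reviewer's suggestion; `raise_if_all_constant` is a hypothetical name, and this is not necessarily the code that was merged.

```python
import numpy as np


def raise_if_all_constant(X):
    """Raise ValueError when every feature (column) of X is constant.

    Hypothetical helper; assumes a per-feature range is what is wanted.
    """
    X = np.asarray(X, dtype=float)
    # ptp is never negative, so the sum is zero only if every column's
    # range is zero.
    if np.ptp(X, axis=0).sum() == 0:
        raise ValueError("All features have zero variance. "
                         "Division by zero.")


X = np.empty((10, 2))
X[:, 0] = -0.13725701
X[:, 1] = -0.9853293

# Range over the flattened array: nonzero here, because the two
# constant columns hold different values.
flat_range = np.ptp(X)

try:
    raise_if_all_constant(X)
    raised = False
except ValueError:
    raised = True

print(flat_range > 0, raised)
```

On the reviewer's example, the flattened range would not flag the input, while the per-feature range does.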

s = np.sqrt(variance / (n_samples - n_classes))
s += np.median(s) # To deter outliers from affecting the results.
mm = m.reshape(len(m), 1) # Reshape to allow broadcasting.
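# Why the guard matters, as a rough standalone sketch. The function below
# re-creates only the lines visible in this hunk; `pooled_std` and the way
# the centroids are built are assumptions made for illustration, not the
# library's implementation.

```python
import numpy as np


def pooled_std(X, y):
    """Recompute `s` from the hunk above for dense X and labels y."""
    X = np.asarray(X, dtype=float)
    classes, y_ind = np.unique(y, return_inverse=True)
    n_samples, n_classes = X.shape[0], classes.size
    # Assumed here: each centroid is the per-class mean of the rows.
    centroids = np.array([X[y_ind == k].mean(axis=0)
                          for k in range(n_classes)])
    variance = ((X - centroids[y_ind]) ** 2).sum(axis=0)
    s = np.sqrt(variance / (n_samples - n_classes))
    s += np.median(s)  # stays zero when every entry of s is zero
    return s


X = np.zeros((5, 2))
y = np.array([1, 0, 0, 1, 0])
s = pooled_std(X, y)
print(s)  # all zeros, so any later division by s is undefined
```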
10 changes: 10 additions & 0 deletions sklearn/neighbors/tests/test_nearest_centroid.py
@@ -146,3 +146,13 @@ def test_manhattan_metric():
clf.fit(X_csr, y)
assert_array_equal(clf.centroids_, dense_centroid)
assert_array_equal(dense_centroid, [[-1, -1], [1, 1]])


def test_features_zero_var():
    # Test that features with 0 variance throw error

    X = np.array([[0, 0], [0, 0], [0, 0], [0, 0], [0, 0]])
    y = np.array([1, 0, 0, 1, 0])
    clf = NearestCentroid(shrink_threshold=0.1)
    with assert_raises(ValueError):
        clf.fit(X, y)