ENH NearestNeighbors-like classes with metric="nan_euclidean" do not actually support NaN values (#25330)
Conversation
glemaitre
left a comment
Thanks @vitaliset, a first pass on the PR.
Suggested change — replace the comment `# Test input containing NaN.` with a docstring:

    def test_nan_euclidean_support():
        """Check that the different neighbor estimators are lenient towards `nan`
        values if using `metric="nan_euclidean"`.
        """
We should also make sure to check the output of predict. I would also add an additional test for a sample made entirely of nan values, to see how it fails.
Hello @glemaitre, thanks for the review! :D
Regarding your last comment, the all-nan input leads to something similar to a constant X (the first indices are returned as nearest neighbors) and is independent of weights (I was expecting something to happen when we weighted by distance).
X = [[np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan]]
y = [1, 2, 3, 4]
model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='uniform')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])
model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='distance')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])
X = [[0, 0], [0, 0], [0, 0], [0, 0]]
model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='distance')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])

Results are similar for other classifiers/estimators such as RadiusNeighborsClassifier.
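A possible explanation for the "first indices" behavior (an assumption on my part, not checked against the implementation): when every pairwise distance is nan, the sorting step has no ordering information, so a stable sort keeps the original sample order, which matches the output above.

```python
import numpy as np

# All-nan distances carry no ordering information: every comparison
# involving nan is False, so a stable argsort leaves the original
# order untouched and the first fitted samples come out as the
# "nearest" neighbors.
dist = np.array([np.nan, np.nan, np.nan, np.nan])
order = np.argsort(dist, kind="stable")
print(order)  # [0 1 2 3]
```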
For KNeighborsTransformer and RadiusNeighborsTransformer we get:
model = neighbors.KNeighborsTransformer(metric="nan_euclidean", n_neighbors=2)
model.fit_transform(X).toarray()
>>> array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
       [0., 0., 0., 0.]])

For LocalOutlierFactor we get:
model = neighbors.LocalOutlierFactor(metric="nan_euclidean", n_neighbors=1)
model.fit_predict(X)
>>> array([1, 1, 1, 1])

No errors are raised in any case. I think this makes sense for the estimators/neighbors search, but I'm unsure if we want this, especially in the transformers. What do you think?
On the other hand, the pairwise_distances family of functions gives us nan distances:
pairwise_distances(X, X, metric="nan_euclidean")
>>> array([[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]])
pairwise_distances_argmin_min(X, X, metric="nan_euclidean")
>>> (array([0, 0, 0, 0], dtype=int64), array([nan, nan, nan, nan]))

Do you want me to do something about this?
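For context, nan_euclidean distances only degenerate to nan when a pair of samples shares no observed coordinate at all; partially missing pairs skip the missing coordinates and rescale the rest. A minimal sketch, using `sklearn.metrics.pairwise.nan_euclidean_distances`:

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[0.0, 0.0], [np.nan, 3.0]])

# One coordinate is missing in the pair, so only the second is used
# and the squared distance is rescaled by n_features / n_present = 2/1:
d = nan_euclidean_distances(X[:1], X[1:])
# sqrt(2 / 1 * (0 - 3) ** 2) == sqrt(18) ~= 4.243

# When no coordinate of a pair is observed, the distance is nan:
all_nan = np.array([[np.nan, np.nan]])
print(nan_euclidean_distances(X[:1], all_nan))  # [[nan]]
```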
Apart from possible new asserts/tests, like the one suggested here, I made the revisions you requested, so I'm asking for a new review. :D Thanks for the first comments.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
glemaitre
left a comment
It looks good. I would still add a test for the full constant feature to ensure that we don't change the behaviour in the future.
    # (neighbors.RadiusNeighborsRegressor, {}),
    # (neighbors.RadiusNeighborsClassifier, {}),
    (neighbors.NearestCentroid, {}),
    # (neighbors.KNeighborsTransformer, {"n_neighbors": 2}),
Thanks for the new comments, @glemaitre. I tried to address them in my last commit.
Regarding the test for all-nan inputs, when I tested it the first time I must have done something wrong in the notebook, and didn't realize that the behavior is a bit strange for some of the classes:
from sklearn import __version__
__version__
>>> '1.3.dev0'
import numpy as np
from sklearn import neighbors
X = [[np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan]]
y = [1, 2, 3, 4]
print(neighbors.RadiusNeighborsRegressor(metric='nan_euclidean').fit(X, y).predict(X))
>>> [-2147483648 -2147483648 -2147483648 -2147483648]
print(neighbors.RadiusNeighborsClassifier(metric="nan_euclidean").fit(X, y).predict(X))
>>> ValueError: No neighbors found for test samples array([0, 1, 2, 3], dtype=int64), you can try using larger radius, giving a label for outliers, or considering removing them from your dataset.
for n in range(1, 5):
    model = neighbors.KNeighborsTransformer(metric="nan_euclidean", n_neighbors=n)
    print(f"n_neighbors={n}")
    print(model.fit_transform(X).toarray())
>>> n_neighbors=1
>>> [[nan nan 0. 0.]
[nan nan 0. 0.]
[nan nan 0. 0.]
[nan nan 0. 0.]]
>>> n_neighbors=2
>>> [[nan nan nan 0.]
[nan nan nan 0.]
[nan nan nan 0.]
[nan nan nan 0.]]
>>> n_neighbors=3
>>> [[nan nan nan nan]
[nan nan nan nan]
[nan nan nan nan]
[nan nan nan nan]]
>>> n_neighbors=4
>>> ValueError: Expected n_neighbors <= n_samples, but n_samples = 4, n_neighbors = 5
# Note that n_samples = 4, and n_neighbors should actually be 4 as well

It seems that RadiusNeighborsRegressor and RadiusNeighborsClassifier cannot find any neighbor within the radius (and they behave differently from each other). As for KNeighborsTransformer, I don't know how to explain what's happening... 😵
What should we do?
Personally, I think it would be fine to raise an error for this case, but I don't know if we can easily detect it.
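Detecting the degenerate case itself looks cheap; a minimal sketch of a check (the helper name is hypothetical, not part of scikit-learn) that the fit/query path could use to raise early:

```python
import numpy as np

def has_fully_missing_rows(X):
    # Hypothetical helper: True when at least one sample has every
    # feature missing, which makes all of its nan_euclidean distances
    # to other samples undefined.
    X = np.asarray(X, dtype=float)
    return bool(np.isnan(X).all(axis=1).any())

print(has_fully_missing_rows([[np.nan, np.nan], [1.0, 2.0]]))  # True
print(has_fully_missing_rows([[1.0, np.nan], [1.0, 2.0]]))     # False
```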
I solved the conflicts. I'm not sure anymore that we should care about the constant case, because if everything in X is constant, this would be undefined behaviour. So I think that already checking the way the pairwise distances work would be good enough.

@glemaitre I added the

Thanks @adrinjalali and @glemaitre for updating this PR with the latest main changes and for reviewing this PR. :)

Yep, it is cleaner using the tag in this manner.
This PR fixes #25319.
As suggested by @glemaitre, I changed the `X, y` validation of `._fit`, and then of `.kneighbors` and `.radius_neighbors`, when `metric="nan_euclidean"` in `RadiusNeighborsMixin`, `KNeighborsMixin`, and `NeighborsBase`. Consequently, this changes the behavior of the classes inheriting from them (`KNeighborsTransformer`, `RadiusNeighborsTransformer`, `KNeighborsClassifier`, `RadiusNeighborsClassifier`, `LocalOutlierFactor`, `KNeighborsRegressor`, `RadiusNeighborsRegressor`, `NearestNeighbors`).

I also updated the `NearestCentroid` class to follow this new validation. To make it work, I had to change the validation of `sklearn.metrics.pairwise_distances_argmin` and `sklearn.metrics.pairwise_distances_argmin_min` as well (updating the docs now that they support `metric="nan_euclidean"`).

As `KernelDensity` uses kd_tree or ball_tree to build its index:

scikit-learn/sklearn/neighbors/_kde.py
Line 48 in 98cf537

it does not support `metric="nan_euclidean"`, and I made no changes to it.

Also, I added a test with the code used to report the issue, checking the behavior of the above classes.