Conversation

@vitaliset (Contributor) commented Jan 8, 2023

This PR fixes #25319.

As suggested by @glemaitre, I changed the X, y validation of ._fit, and of .kneighbors and .radius_neighbors when metric="nan_euclidean", in RadiusNeighborsMixin, KNeighborsMixin, and NeighborsBase. This consequently changes the behavior of the classes that inherit from them (KNeighborsTransformer, RadiusNeighborsTransformer, KNeighborsClassifier, RadiusNeighborsClassifier, LocalOutlierFactor, KNeighborsRegressor, RadiusNeighborsRegressor, NearestNeighbors).

I also updated the NearestCentroid class to follow this new validation. To make it work, I had to change the validation of sklearn.metrics.pairwise_distances_argmin and sklearn.metrics.pairwise_distances_argmin_min as well (updating the docs now that they support metric='nan_euclidean').

Since KernelDensity uses kd_tree or ball_tree to build its index:

algorithm : {'kd_tree', 'ball_tree', 'auto'}, default='auto'

it does not support metric='nan_euclidean', and I made no changes to it:

from sklearn.neighbors import VALID_METRICS
for key in VALID_METRICS.keys():
    print(f"'nan_euclidean' in {key}:", 'nan_euclidean' in VALID_METRICS[key])
>>> 'nan_euclidean' in ball_tree: False
>>> 'nan_euclidean' in kd_tree: False
>>> 'nan_euclidean' in brute: True
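For reference, nan_euclidean ignores coordinates where either value is NaN and rescales the squared distance by the fraction of coordinates actually used. Below is a minimal NumPy sketch of these semantics (an illustration mirroring what sklearn.metrics.pairwise.nan_euclidean_distances documents, not the actual implementation):

```python
import numpy as np

def nan_euclidean(x, y):
    # Keep only coordinates where both values are present.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    present = ~(np.isnan(x) | np.isnan(y))
    n_present = present.sum()
    if n_present == 0:
        # No usable coordinate: the distance is undefined (NaN),
        # which is what happens for all-NaN samples.
        return np.nan
    sq_dist = np.sum((x[present] - y[present]) ** 2)
    # Rescale by n_features / n_present so distances computed on
    # different numbers of coordinates stay comparable.
    return np.sqrt(x.size / n_present * sq_dist)

print(nan_euclidean([0.0, np.nan], [1.0, 1.0]))   # sqrt(2/1 * 1) ≈ 1.414
print(nan_euclidean([np.nan, np.nan], [1.0, 1.0]))  # nan
```

Because this per-pair masking cannot be baked into a pre-built spatial index, only the brute backend supports it, consistent with the VALID_METRICS output above.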

I also added a test, based on the code used to report the issue, that checks the behavior of the classes above.

@vitaliset vitaliset changed the title [WIP] FIX NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values FIX NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values Jan 10, 2023
@glemaitre glemaitre self-requested a review January 14, 2023 13:14
@glemaitre (Member) left a comment

Thanks @vitaliset, a first pass on the PR.



def test_nan_euclidean_support():
# Test input containing NaN.
@glemaitre (Member):

Suggested change
# Test input containing NaN.
"""Check that the different neighbor estimators are lenient towards `nan`
values if using `metric="nan_euclidean"`.
"""

@glemaitre (Member):

We should also make sure to check the output of predict. I would also add a test with a sample made entirely of NaN values, to see how it fails.

@vitaliset (Contributor, Author) commented Feb 4, 2023

Hello @glemaitre, thanks for the review! :D

Regarding your last comment, an all-NaN input behaves like a constant X (the first indices come back as nearest neighbors) and is independent of weights (I was expecting something to happen when weighting by distance).

X = [[np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan]]
y = [1, 2, 3, 4]

model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='uniform')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])

model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='distance')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])

X = [[0, 0], [0, 0], [0, 0], [0, 0]]
model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='distance')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])

Results are similar for other classifiers/estimators such as RadiusNeighborsClassifier.

For KNeighborsTransformer and RadiusNeighborsTransformer we get:

model = neighbors.KNeighborsTransformer(metric="nan_euclidean", n_neighbors=2)
model.fit_transform(X).toarray()
>>> array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

For LocalOutlierFactor we get:

model = neighbors.LocalOutlierFactor(metric="nan_euclidean", n_neighbors=1)
model.fit_predict(X)
>>> array([1, 1, 1, 1])

No errors are raised in any case. I think this makes sense for the estimators/neighbors search, but I'm unsure if we want this, especially in the transformers. What do you think?

On the other hand, the pairwise_distances family of functions gives us nan distances:

pairwise_distances(X, X, metric="nan_euclidean")
>>> array([[nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan]])

pairwise_distances_argmin_min(X, X, metric="nan_euclidean")
>>> (array([0, 0, 0, 0], dtype=int64), array([nan, nan, nan, nan]))
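A plausible explanation for the index-0 result (an aside, not verified against the scikit-learn internals): NumPy's argmin returns the position of the first NaN when one is present, so an all-NaN distance row always points at the first sample:

```python
import numpy as np

# Comparisons with NaN evaluate to False, so np.argmin returns the index of
# the first NaN it encounters; for an all-NaN row that is index 0, matching
# the "first training sample is always nearest" behavior above.
all_nan_row = np.array([np.nan, np.nan, np.nan, np.nan])
print(np.argmin(all_nan_row))                    # -> 0
print(np.argmin(np.array([1.0, np.nan, 0.5])))   # -> 1, the NaN wins
```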

Do you want me to do something about this?

@vitaliset (Contributor, Author):

Apart from possible new asserts/tests, like the one suggested here, I made the revisions you asked for, so I'm requesting a new review. :D Thanks for the first comments.

@vitaliset vitaliset requested a review from glemaitre February 4, 2023 19:49
@vitaliset vitaliset changed the title FIX NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values ENH NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values Feb 17, 2023
@glemaitre (Member) left a comment

It looks good. I would still add a test for the full constant feature to ensure that we don't change the behaviour in the future.

vitaliset and others added 3 commits February 23, 2023 00:22
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Comment on lines 2302 to 2305
# (neighbors.RadiusNeighborsRegressor, {}),
# (neighbors.RadiusNeighborsClassifier, {}),
(neighbors.NearestCentroid, {}),
# (neighbors.KNeighborsTransformer, {"n_neighbors": 2}),
@vitaliset (Contributor, Author):

Thanks for the new comments, @glemaitre. I tried to address them in my last commit.

Regarding the test for all-NaN inputs, when I tested it the first time I must have done something wrong in the notebook and didn't realize that the behavior is a bit strange for some of the classes:

from sklearn import __version__
__version__
>>> '1.3.dev0'

import numpy as np
from sklearn import neighbors

X = [[np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan]]
y = [1, 2, 3, 4]

print(neighbors.RadiusNeighborsRegressor(metric='nan_euclidean').fit(X, y).predict(X))
>>> [-2147483648 -2147483648 -2147483648 -2147483648]
print(neighbors.RadiusNeighborsClassifier(metric="nan_euclidean").fit(X, y).predict(X))
>>> ValueError: No neighbors found for test samples array([0, 1, 2, 3], dtype=int64), you can try using larger radius, giving a label for outliers, or considering removing them from your dataset.

for n in range(1, 5):
    model = neighbors.KNeighborsTransformer(metric="nan_euclidean", n_neighbors=n)
    print(f"n_neighbors={n}")
    print(model.fit_transform(X).toarray())
>>> n_neighbors=1
>>> [[nan nan  0.  0.]
 [nan nan  0.  0.]
 [nan nan  0.  0.]
 [nan nan  0.  0.]]
>>> n_neighbors=2
>>> [[nan nan nan  0.]
 [nan nan nan  0.]
 [nan nan nan  0.]
 [nan nan nan  0.]]
>>> n_neighbors=3
>>> [[nan nan nan nan]
 [nan nan nan nan]
 [nan nan nan nan]
 [nan nan nan nan]]
>>> n_neighbors=4
>>> ValueError: Expected n_neighbors <= n_samples,  but n_samples = 4, n_neighbors = 5
# Note that n_samples = 4, so the message should also say n_neighbors = 4

It seems that RadiusNeighborsRegressor and RadiusNeighborsClassifier cannot find any neighbor within the radius (and they behave differently from each other). As for KNeighborsTransformer, I don't know how to explain what's happening... 😵

What should we do?

@glemaitre (Member):

Personally, I think it would be fine to raise an error in this case, but I don't know whether we can easily detect it.

github-actions bot commented May 21, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c69e867.

@glemaitre (Member):

I solved the conflicts. I'm not sure anymore that we should care about the constant case, because if everything in X is constant, this would be undefined behaviour.

So I think that checking the way the pairwise distances work would already be good enough.

@glemaitre glemaitre added this to the 1.6 milestone May 21, 2024
@adrinjalali (Member):

@glemaitre I added the allow_nan tag here. So I'll let you review and merge.
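For context on what the tag does: allow_nan tells scikit-learn's common estimator checks that the estimator tolerates NaN input, so they stop asserting that it raises on NaN. A dependency-free, hypothetical sketch of the idea (the real hook has changed across scikit-learn versions — a _more_tags dict in older releases, __sklearn_tags__ in newer ones — so treat this as illustration, not the PR's actual diff):

```python
# Hypothetical estimator advertising NaN tolerance via a tags dict.
class NaNFriendlyEstimator:
    def _more_tags(self):
        # In scikit-learn, a dict like this is merged over the default
        # tags, where "allow_nan" is False.
        return {"allow_nan": True}

def skips_nan_error_check(estimator):
    # Sketch of how a common test could branch on the tag before
    # asserting that fit() raises on NaN input.
    tags = getattr(estimator, "_more_tags", lambda: {})()
    return tags.get("allow_nan", False)

print(skips_nan_error_check(NaNFriendlyEstimator()))  # -> True
```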

@vitaliset (Contributor, Author):

Thanks @adrinjalali and @glemaitre for updating this PR with the latest main changes and for reviewing this PR. :)

@glemaitre (Member) commented Nov 5, 2024

Yep, it is cleaner using the tag in this manner.

@glemaitre glemaitre merged commit d2f1ea7 into scikit-learn:main Nov 5, 2024
30 checks passed
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

KNeighborsRegressor with metric="nan_euclidean" does not actually support NaN values
3 participants