Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'cosine' metric computation bug #21939

Closed
BrandonLiang opened this issue Dec 10, 2021 · 10 comments
Closed

'cosine' metric computation bug #21939

BrandonLiang opened this issue Dec 10, 2021 · 10 comments
Labels
Documentation Easy Well-defined and straightforward way to resolve

Comments

@BrandonLiang
Copy link
Contributor

BrandonLiang commented Dec 10, 2021

Describe the bug

In my unit test for a feature using sklearn.neighbors.NearestNeighbors and cosine as the metric, i have a test to assert that the nearest neighbor of a datapoint itself is itself. So I would expect the return similarity to be 1. However, that's not the case (in fact, it's 0 for high-dimensional features).

Steps/Code to Reproduce

from sklearn.neighbors import NearestNeighbors                                                                                                                    
import numpy as np

X = [[1, 0, 1]] 
X = np.array(X)
neigh = NearestNeighbors(n_neighbors=1, metric="cosine")
neigh.fit(X)
distances, ids = neigh.kneighbors([[1, 0, 1]])
print(distances)

Expected Results

I would expect [[1]]

Actual Results

[[2.22044605e-16]]

which is approximately 0

Versions

>>> import sklearn; sklearn.show_versions()

System:
    python: 3.7.4 (default, Aug 13 2019, 20:35:49)  [GCC 7.3.0]
executable: /home/xliang147/my_mbig/miniconda3/bin/python3
   machine: Linux-3.10.0-957.35.2.el7.x86_64-x86_64-with-redhat-7.6-Maipo

Python dependencies:
          pip: 21.3.1
   setuptools: 58.2.0
      sklearn: 1.0.1
        numpy: 1.18.5
        scipy: 1.7.3
       Cython: None
       pandas: 1.3.4
   matplotlib: 3.4.3
       joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True
@MaxwellLZH
Copy link
Contributor

That's because the output is cosine distance, which is defined as 1.0 minus the cosine similarity

@BrandonLiang
Copy link
Contributor Author

That's because the output is cosine distance, which is defined as 1.0 minus the cosine similarity

Is this documented somewhere?

@thomasjpfan
Copy link
Member

thomasjpfan commented Dec 10, 2021

I do not think we officially support cosine distance as a metric in NearestNeighbors, since is not a true mathematical metric. Cosine distance fails "identity of indiscernibles" and "triangle inequality".

Edit Looking at the code, I think it's undocumented.

@BrandonLiang
Copy link
Contributor Author

I do not think we officially support cosine distance as a metric in NearestNeighbors, since is not a true mathematical metric. Cosine distance fails "identity of indiscernibles" and "triangle inequality".

then why is metric="cosine" supported in constructor

@thomasjpfan
Copy link
Member

thomasjpfan commented Dec 10, 2021

then why is metric="cosine" supported in constructor

Looking closely at the code, I was mistaken. I think we have support "cosine" and we currently test for it here:

for metric in VALID_METRICS["brute"]:

In this case, we need to update the documentation for NearestNeighbors to say we support cosine. Furthermore there are other metrics that we support but not listed: cityblock, correlation, nan_euclidean.

@BrandonLiang Are you interested in opening a PR to update the docstring for NearestNeighbors.metrics and add cosine?

@thomasjpfan thomasjpfan added Documentation Easy Well-defined and straightforward way to resolve and removed Bug: triage labels Dec 10, 2021
@BrandonLiang
Copy link
Contributor Author

Yes I can make a PR and will request your review

@BrandonLiang
Copy link
Contributor Author

@thomasjpfan can you point me to the documentation file for https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html#sklearn.neighbors.DistanceMetric (I think it's better to add the documentation here)

@thomasjpfan
Copy link
Member

Since DistanceMetric does not implement cosine, I do not think 'cosine' would belong there. Interally, NearestNeighbors will end up calling pairwise_distances_chunked, which supports metrics listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS, which includes cosine.

For NearestNeighbors, the quick doc fix would go here:

metric. For a list of available metrics, see the documentation of
:class:`~sklearn.metrics.DistanceMetric`.

where we state: "metrics in pairwise.PAIRWISE_DISTANCE_FUNCTIONS are valid".

There will still be the issue where it is unclear that cosine is using distances. Without looking at the source, the only way to know is to run:

import sklearn.metrics.pairwise as pairwise
pairwise.PAIRWISE_DISTANCE_FUNCTIONS['cosine']
<function sklearn.metrics.pairwise.cosine_distances(X, Y=None)>

and see that cosine_distances is used, instead of cosine_similarity.

@BrandonLiang
Copy link
Contributor Author

@thomasjpfan just created #22073, don't have permission to add you as reviewer

@thomasjpfan
Copy link
Member

Closing because #22073 resolves this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation Easy Well-defined and straightforward way to resolve
Projects
None yet
Development

No branches or pull requests

3 participants