
cosine_similarity function produces results greater than 1.0 #18122

Open
wwz58 opened this issue Aug 8, 2020 · 12 comments
Labels
Needs Decision

Comments

@wwz58

wwz58 commented Aug 8, 2020

def cosine_similarity(X, Y=None, dense_output=True):

To reproduce the greater-than-1.0 case:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

a = np.array([[ 1,  2],
       [ 1,  4],
       [ 1,  0],
       [10,  2],
       [10,  4],
       [10,  0]])
b = np.array([[10.,  2.],
       [ 1.,  2.]])
cosine_similarity(a, b)[3][0]
1.0000000000000002

centos-7.2
python: 3.6.10
sklearn: 0.23.2
numpy: 1.19.1
scipy: 1.5.2

Maybe we should compute K = safe_sparse_dot(X, Y.T) / sqrt(||X||^2 * ||Y||^2) instead of normalizing X and Y first.
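
A rough dense-only sketch of that alternative (the helper name is made up for illustration; dividing by the product of norms is still subject to rounding and is not guaranteed to stay within [-1, 1] either):

import numpy as np
from sklearn.utils.extmath import safe_sparse_dot

def cosine_similarity_alt(X, Y):
    # K[i, j] = <X[i], Y[j]> / sqrt(||X[i]||^2 * ||Y[j]||^2)
    X = np.asarray(X, dtype=np.float64)
    Y = np.asarray(Y, dtype=np.float64)
    x_sq = (X ** 2).sum(axis=1)  # squared row norms of X
    y_sq = (Y ** 2).sum(axis=1)  # squared row norms of Y
    return safe_sparse_dot(X, Y.T) / np.sqrt(np.outer(x_sq, y_sq))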

@jnothman
Member

jnothman commented Aug 8, 2020 via email

@wwz58
Author

wwz58 commented Aug 8, 2020

Can someone explain what caused this problem?

@wwz58 wwz58 changed the title cosine_similarity function produces results more than 1.0 cosine_similarity function produces results greater than 1.0 Aug 8, 2020
@thomasjpfan
Member

Floating point arithmetic can lead to these results.
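
For illustration, here is a minimal sketch of the normalize-then-dot computation, where each step rounds to the nearest float64, so even a vector compared with itself can land a couple of ulps above 1.0:

import numpy as np
from sklearn.preprocessing import normalize

a = np.array([[10.0, 2.0]])
a_normalized = normalize(a)           # a / ||a||, rounded to float64
print(a_normalized @ a_normalized.T)  # may print [[1.0000000000000002]] instead of [[1.]]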

@rth
Member

rth commented Aug 10, 2020

If we start clipping, this would apply to any metric that should be in [0, 1] or any other interval with integer bounds.

I would say this is the expected behavior, since the error is within a few machine epsilons for float64:

>>> 1.0000000000000002 - 1
2.220446049250313e-16
>>> np.finfo(np.float64).eps
2.220446049250313e-16

+1 to close as "won't fix"

@rth rth added the Needs Decision label Aug 10, 2020
@jnothman
Member

jnothman commented Aug 11, 2020 via email

@rth
Member

rth commented Aug 11, 2020

I meant that most of the scores (accuracy_score, etc.) are also bounded by integers and can show this behavior.

@jnothman
Member

jnothman commented Aug 11, 2020 via email

@rth
Member

rth commented Aug 11, 2020

OK. We are actually already clipping cosine_distances to [0, 2]. So it might make sense to move that np.clip with [-1, 1] bounds into cosine_similarity (which is also used by cosine_distances).
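
A minimal sketch of what that clipping could look like from user code today (the wrapper name is hypothetical; it only covers the dense case, see the discussion of sparse output further down):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def clipped_cosine_similarity(X, Y=None):
    # Clamp rounding noise so the similarities stay within [-1, 1].
    K = cosine_similarity(X, Y, dense_output=True)
    np.clip(K, -1.0, 1.0, out=K)
    return K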

@mapio

mapio commented Oct 13, 2022

I would say this is the expected behavior since the error is within a few epsilon for float64,

I'm not sure it is. I have two vectors of 100 elements that give a distance of 1.1700635166989035, which caused a bug that was hard to spot in my code.

Vectors are

x = array([ 1.00960141e-01, -4.01809782e-02, -2.53052902e-02, -3.83598426e-02,
       -1.84514546e-02, -1.01344371e-02, -1.08593392e-02, -1.15668668e-02,
        1.38312853e-02, -4.68102711e-03, -8.92247190e-03,  1.34186163e-02,
       -2.02401316e-02,  1.97277158e-02,  4.84512549e-03, -2.46691672e-02,
       -1.43678337e-02, -1.08396096e-02, -1.66045133e-02, -1.87338203e-04,
        1.77866272e-02, -1.29115470e-03, -2.39823208e-03,  7.06287462e-03,
       -3.56051338e-03,  1.33550563e-02, -1.41241862e-03,  4.54950362e-03,
        4.61987170e-03, -4.62260326e-04, -2.25454665e-02,  3.17050421e-02,
        2.59202662e-02,  3.03234615e-03,  2.98438669e-02, -3.74388567e-02,
       -1.02059536e-03,  2.19098720e-02, -2.83177246e-03, -1.24929645e-02,
       -9.54568059e-03, -1.27019096e-03, -1.26543314e-03, -7.08116030e-03,
       -1.86535521e-02,  9.04097856e-03,  1.54570539e-02, -1.93908853e-02,
       -7.75302959e-03,  3.16961159e-02,  8.41512851e-03,  2.06787745e-02,
       -2.35410647e-02, -3.02719392e-02,  1.86432058e-02,  6.87715473e-03,
       -5.22584753e-02,  4.21931210e-02,  9.13354063e-03, -2.40448892e-03,
        1.67934159e-02,  2.57438216e-02,  4.81727041e-03, -5.89206481e-02,
       -1.57569599e-04,  3.90460641e-02, -2.21131258e-02, -1.61892075e-02,
        6.00268269e-02, -3.82423066e-02,  4.29680057e-02,  3.52531687e-02,
       -6.03344827e-02,  3.59303944e-02, -1.68842604e-02, -4.02963261e-02,
        3.93567168e-02,  1.76632723e-02, -4.36065930e-02,  4.10331271e-03,
        2.50906174e-03,  4.87005150e-02, -2.46634526e-02,  3.83284114e-02,
        9.80695532e-03, -1.20585818e-02, -8.86564999e-02, -5.56599786e-02,
       -6.17286390e-02,  5.27780188e-03, -3.37788466e-02, -1.93486034e-02,
       -7.68252159e-02, -6.58106279e-02, -1.80238122e-02, -8.06371077e-02,
       -1.63344850e-01, -1.31490290e-01, -4.32950452e-02, -7.88187113e-02])

and

y = array([ 0.14547191, -0.07930009, -0.02698467,  0.00195985,  0.0063571 ,
       -0.0406509 ,  0.10264891, -0.0016326 ,  0.07884686,  0.0015746 ,
       -0.01817623,  0.00044795, -0.00437634, -0.01084891, -0.01855354,
        0.00257071, -0.02612611, -0.02764799, -0.00324183, -0.0013933 ,
       -0.00085558, -0.011029  , -0.0112539 , -0.02094967,  0.01518229,
        0.007486  ,  0.01591899, -0.01279154, -0.01077763, -0.01469155,
       -0.02087268, -0.00257903, -0.00752398,  0.00423674,  0.03864373,
        0.00284597,  0.00111167,  0.00524912,  0.00654447,  0.00189658,
        0.01175984,  0.02741782,  0.01889893,  0.018088  , -0.00300969,
        0.05002532, -0.01701955, -0.02548884, -0.0212687 , -0.00066898,
       -0.00137306,  0.01724564, -0.01435975, -0.02329919,  0.0442525 ,
       -0.00248004, -0.01821848,  0.0276741 ,  0.0217583 , -0.01027377,
       -0.06657801,  0.05199507, -0.03284782,  0.01022881, -0.00326401,
        0.01791922,  0.04573005, -0.07247989,  0.03182124, -0.07898516,
        0.03394183, -0.04119511, -0.05330145,  0.08151271,  0.06096885,
        0.01446256, -0.10069813, -0.00086334, -0.00224785,  0.10457657,
       -0.03048727,  0.08424894, -0.1153678 ,  0.02225815,  0.03754586,
       -0.07457668,  0.3032167 ,  0.01503319, -0.10581953,  0.15549258,
        0.09289552, -0.00456573,  0.07744588, -0.03183186, -0.00823961,
        0.01484506,  0.23307718,  0.14797703,  0.02567997, -0.07370993])

Then pairwise_distances([x,y], metric = 'cosine') returns

array([[2.22044605e-16, 1.17006352e+00],
       [1.17006352e+00, 0.00000000e+00]])

That 1.17 seems to be quite far off from 1.

@wwz58
Author

wwz58 commented Oct 13, 2022 via email

@lucyleeow
Member

We are actually already clipping cosine_distances to [0, 2]. So it might make sense to move that np.clip with [-1, 1] bounds into cosine_similarity

The complexity here is that cosine_similarity can return a dense or a sparse result, whereas cosine_distances only returns a dense array.

np.clip is dense only. I don't think we want to do any sparse -> dense -> sparse conversion here. The sparse package includes a clip function, but it is not an existing dependency.

I think we could close this as "won't fix", since we already clip cosine_distances.
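
For the sparse case, one possible user-side workaround (a sketch, not a proposal for the library): zeros are already inside [-1, 1], so clipping only the stored entries of the sparse result is enough and avoids the sparse -> dense -> sparse round trip.

import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import cosine_similarity

X = sp.csr_matrix(np.array([[10.0, 2.0], [1.0, 2.0], [1.0, 0.0]]))
K = cosine_similarity(X, dense_output=False)  # sparse input + dense_output=False -> sparse result
np.clip(K.data, -1.0, 1.0, out=K.data)        # clip the stored (nonzero) entries in place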

@lucyleeow
Member

@mapio pairwise_distances([x, y], metric='cosine') uses cosine_distances, which is defined as 1.0 minus the cosine similarity. Its range is [0, 2], so 1.17 is not unusual.
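
A small self-contained example of the same effect (vectors chosen just for illustration, not taken from the report above):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

u = np.array([[1.0, 0.0]])
v = np.array([[-1.0, 1.0]])
print(cosine_similarity(u, v))  # approximately -0.7071 (negative similarity)
print(cosine_distances(u, v))   # approximately  1.7071, well within [0, 2]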

@scikit-learn scikit-learn deleted a comment from wwz58 Apr 26, 2024