
cosine_similarity function produces results greater than 1.0 #18122

Open
wwz58 opened this issue Aug 8, 2020 · 12 comments
Labels
Needs Decision

Comments

@wwz58

wwz58 commented Aug 8, 2020

def cosine_similarity(X, Y=None, dense_output=True):

To reproduce the greater-than-1.0 case:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

a = np.array([[ 1,  2],
       [ 1,  4],
       [ 1,  0],
       [10,  2],
       [10,  4],
       [10,  0]])
b = np.array([[10.,  2.],
       [ 1.,  2.]])
cosine_similarity(a, b)[3][0]
1.0000000000000002

centos-7.2
python: 3.6.10
sklearn: 0.23.2
numpy: 1.19.1
scipy: 1.5.2

Maybe we should compute K = safe_sparse_dot(X, Y.T) / sqrt(||X||^2 * ||Y||^2) instead of normalizing X and Y first.
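
A rough dense-only sketch of that alternative (the helper name is made up for illustration; dividing by the product of norms is still subject to rounding and is not guaranteed to stay within [-1, 1] either):

import numpy as np
from sklearn.utils.extmath import safe_sparse_dot

def cosine_similarity_alt(X, Y):
    # K[i, j] = <X[i], Y[j]> / sqrt(||X[i]||^2 * ||Y[j]||^2)
    X = np.asarray(X, dtype=np.float64)
    Y = np.asarray(Y, dtype=np.float64)
    x_sq = (X ** 2).sum(axis=1)  # squared row norms of X
    y_sq = (Y ** 2).sum(axis=1)  # squared row norms of Y
    return safe_sparse_dot(X, Y.T) / np.sqrt(np.outer(x_sq, y_sq))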

@jnothman
Member

jnothman commented Aug 8, 2020 via email

@wwz58
Author

wwz58 commented Aug 8, 2020

Can someone explain what caused this problem?

@wwz58 wwz58 changed the title cosine_similarity function produces results more than 1.0 cosine_similarity function produces results greater than 1.0 Aug 8, 2020
@thomasjpfan
Member

Floating point arithmetic can lead to these results.
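
For illustration, here is a minimal sketch of the normalize-then-dot computation, where each step rounds to the nearest float64, so even a vector compared with itself can land a couple of ulps above 1.0:

import numpy as np
from sklearn.preprocessing import normalize

a = np.array([[10.0, 2.0]])
a_normalized = normalize(a)           # a / ||a||, rounded to float64
print(a_normalized @ a_normalized.T)  # may print [[1.0000000000000002]] instead of [[1.]]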

@rth
Member

rth commented Aug 10, 2020

If we start clipping, this would apply to any metric that should be in [0, 1] or any other interval with integer bounds.

I would say this is the expected behavior, since the error is within a few machine epsilons for float64:

>>> 1.0000000000000002 - 1
2.220446049250313e-16
>>> np.finfo(np.float64).eps
2.220446049250313e-16

+1 to close as "won't fix"

@rth rth added the Needs Decision label Aug 10, 2020
@jnothman
Member

jnothman commented Aug 11, 2020 via email

@rth
Member

rth commented Aug 11, 2020

I meant that most of the scores (accuracy_score, etc.) are also bounded by integers and can show this behavior.

@jnothman
Member

jnothman commented Aug 11, 2020 via email

@rth
Member

rth commented Aug 11, 2020

OK. We are actually already clipping cosine_distances to [0, 2]. So it might make sense to move that np.clip with [-1, 1] bounds into cosine_similarity (which is also used by cosine_distances).
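
A minimal sketch of what that clipping could look like from user code today (the wrapper name is hypothetical; it only covers the dense case, see the discussion of sparse output further down):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def clipped_cosine_similarity(X, Y=None):
    # Clamp rounding noise so the similarities stay within [-1, 1].
    K = cosine_similarity(X, Y, dense_output=True)
    np.clip(K, -1.0, 1.0, out=K)
    return K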

@mapio

mapio commented Oct 13, 2022

I would say this is the expected behavior since the error is within a few epsilon for float64,

I'm not sure it is. I have two vectors of 100 elements that give a distance of 1.1700635166989035, which caused a bug that was hard to spot in my code.

Vectors are

x = array([ 1.00960141e-01, -4.01809782e-02, -2.53052902e-02, -3.83598426e-02,
       -1.84514546e-02, -1.01344371e-02, -1.08593392e-02, -1.15668668e-02,
        1.38312853e-02, -4.68102711e-03, -8.92247190e-03,  1.34186163e-02,
       -2.02401316e-02,  1.97277158e-02,  4.84512549e-03, -2.46691672e-02,
       -1.43678337e-02, -1.08396096e-02, -1.66045133e-02, -1.87338203e-04,
        1.77866272e-02, -1.29115470e-03, -2.39823208e-03,  7.06287462e-03,
       -3.56051338e-03,  1.33550563e-02, -1.41241862e-03,  4.54950362e-03,
        4.61987170e-03, -4.62260326e-04, -2.25454665e-02,  3.17050421e-02,
        2.59202662e-02,  3.03234615e-03,  2.98438669e-02, -3.74388567e-02,
       -1.02059536e-03,  2.19098720e-02, -2.83177246e-03, -1.24929645e-02,
       -9.54568059e-03, -1.27019096e-03, -1.26543314e-03, -7.08116030e-03,
       -1.86535521e-02,  9.04097856e-03,  1.54570539e-02, -1.93908853e-02,
       -7.75302959e-03,  3.16961159e-02,  8.41512851e-03,  2.06787745e-02,
       -2.35410647e-02, -3.02719392e-02,  1.86432058e-02,  6.87715473e-03,
       -5.22584753e-02,  4.21931210e-02,  9.13354063e-03, -2.40448892e-03,
        1.67934159e-02,  2.57438216e-02,  4.81727041e-03, -5.89206481e-02,
       -1.57569599e-04,  3.90460641e-02, -2.21131258e-02, -1.61892075e-02,
        6.00268269e-02, -3.82423066e-02,  4.29680057e-02,  3.52531687e-02,
       -6.03344827e-02,  3.59303944e-02, -1.68842604e-02, -4.02963261e-02,
        3.93567168e-02,  1.76632723e-02, -4.36065930e-02,  4.10331271e-03,
        2.50906174e-03,  4.87005150e-02, -2.46634526e-02,  3.83284114e-02,
        9.80695532e-03, -1.20585818e-02, -8.86564999e-02, -5.56599786e-02,
       -6.17286390e-02,  5.27780188e-03, -3.37788466e-02, -1.93486034e-02,
       -7.68252159e-02, -6.58106279e-02, -1.80238122e-02, -8.06371077e-02,
       -1.63344850e-01, -1.31490290e-01, -4.32950452e-02, -7.88187113e-02])

and

y = array([ 0.14547191, -0.07930009, -0.02698467,  0.00195985,  0.0063571 ,
       -0.0406509 ,  0.10264891, -0.0016326 ,  0.07884686,  0.0015746 ,
       -0.01817623,  0.00044795, -0.00437634, -0.01084891, -0.01855354,
        0.00257071, -0.02612611, -0.02764799, -0.00324183, -0.0013933 ,
       -0.00085558, -0.011029  , -0.0112539 , -0.02094967,  0.01518229,
        0.007486  ,  0.01591899, -0.01279154, -0.01077763, -0.01469155,
       -0.02087268, -0.00257903, -0.00752398,  0.00423674,  0.03864373,
        0.00284597,  0.00111167,  0.00524912,  0.00654447,  0.00189658,
        0.01175984,  0.02741782,  0.01889893,  0.018088  , -0.00300969,
        0.05002532, -0.01701955, -0.02548884, -0.0212687 , -0.00066898,
       -0.00137306,  0.01724564, -0.01435975, -0.02329919,  0.0442525 ,
       -0.00248004, -0.01821848,  0.0276741 ,  0.0217583 , -0.01027377,
       -0.06657801,  0.05199507, -0.03284782,  0.01022881, -0.00326401,
        0.01791922,  0.04573005, -0.07247989,  0.03182124, -0.07898516,
        0.03394183, -0.04119511, -0.05330145,  0.08151271,  0.06096885,
        0.01446256, -0.10069813, -0.00086334, -0.00224785,  0.10457657,
       -0.03048727,  0.08424894, -0.1153678 ,  0.02225815,  0.03754586,
       -0.07457668,  0.3032167 ,  0.01503319, -0.10581953,  0.15549258,
        0.09289552, -0.00456573,  0.07744588, -0.03183186, -0.00823961,
        0.01484506,  0.23307718,  0.14797703,  0.02567997, -0.07370993])

Then pairwise_distances([x,y], metric = 'cosine') returns

array([[2.22044605e-16, 1.17006352e+00],
       [1.17006352e+00, 0.00000000e+00]])

That 1.17 seems to be quite far off from 1.

@wwz58
Author

wwz58 commented Oct 13, 2022 via email

@lucyleeow
Member

We are actually already clipping cosine_distances to [0, 2]. So it might make sense to move that np.clip with [-1, 1] bounds into cosine_similarity

The complexity here is that cosine_similarity can return a dense or a sparse result, whereas cosine_distances only returns a dense array.

np.clip is dense only. I don't think we want to do any sparse -> dense -> sparse conversion here. The sparse package includes a clip function, but it is not an existing dependency.

I think we could close this as "won't fix", since we already clip cosine_distances.
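
For the sparse case, one possible user-side workaround (a sketch, not a proposal for the library): zeros are already inside [-1, 1], so clipping only the stored entries of the sparse result is enough and avoids the sparse -> dense -> sparse round trip.

import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import cosine_similarity

X = sp.csr_matrix(np.array([[10.0, 2.0], [1.0, 2.0], [1.0, 0.0]]))
K = cosine_similarity(X, dense_output=False)  # sparse input + dense_output=False -> sparse result
np.clip(K.data, -1.0, 1.0, out=K.data)        # clip the stored (nonzero) entries in place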

@lucyleeow
Member

@mapio pairwise_distances([x, y], metric='cosine') uses cosine_distances, which is defined as 1.0 minus the cosine similarity. Its range is [0, 2], so 1.17 is not unusual.
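
A small self-contained example of the same effect (vectors chosen just for illustration, not taken from the report above):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

u = np.array([[1.0, 0.0]])
v = np.array([[-1.0, 1.0]])
print(cosine_similarity(u, v))  # approximately -0.7071 (negative similarity)
print(cosine_distances(u, v))   # approximately  1.7071, well within [0, 2]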

@scikit-learn scikit-learn deleted a comment from wwz58 Apr 26, 2024