cosine_similarity function produces results greater than 1.0 #18122
Comments
Maybe we should just clip the output
Can someone explain what caused this problem?
Floating point arithmetic can lead to these results.
If we start clipping, this would apply to any metric that should be in [0, 1] or any other bounded interval. I would say this is the expected behavior, since the error is within a few epsilon for float64:

>>> 1.0000000000000002 - 1
2.220446049250313e-16
>>> np.finfo(np.float64).eps
2.220446049250313e-16

+1 to close as "won't fix"
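The rounding argument can be seen in a small sketch (pure NumPy, not sklearn's code): the self-similarity of a normalized vector should be exactly 1.0, but the normalizing division and each term of the dot product round, so the result can land a few ULPs to either side of 1.

```python
import numpy as np

# A vector whose norm is not exactly representable in float64.
x = np.array([0.1, 0.2, 0.3])
x_unit = x / np.linalg.norm(x)

# Mathematically this is exactly 1.0; numerically it is only
# guaranteed to be within a few float64 ULPs of 1.0.
sim = x_unit @ x_unit
print(abs(sim - 1.0) <= 4 * np.finfo(np.float64).eps)  # True
```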
Which other distances that we support have an upper bound? Others like
hamming and jaccard probably don't do floating point subtraction or
addition, so I don't immediately buy that kind of slippery slope argument.
I meant that most of the scores (
Fair, but the properties of those metrics are rarely relied upon in
algorithms as much as distances.
OK. We are actually already clipping cosine_distances to [0, 2]. So it might make sense to move that np.clip with [-1, 1] bounds into cosine_similarity (which is also used by cosine_distances).
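A minimal sketch of that idea in plain NumPy (`cosine_similarity_clipped` is a hypothetical helper for illustration, not sklearn's implementation):

```python
import numpy as np

def cosine_similarity_clipped(X, Y):
    """Hypothetical helper: cosine similarity with a final clip,
    mirroring the np.clip that cosine_distances already applies
    to [0, 2], but with [-1, 1] bounds on the similarity itself."""
    # Normalize rows, take all pairwise dot products, then clamp
    # any floating-point overshoot back into the valid range.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return np.clip(Xn @ Yn.T, -1.0, 1.0)

rng = np.random.RandomState(0)
X = rng.rand(20, 100)
S = cosine_similarity_clipped(X, X)
print(S.max() <= 1.0 and S.min() >= -1.0)  # True
```

The clip costs one extra pass over the output but guarantees the documented bounds regardless of rounding.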
Not sure it is. I have two vectors of 100 elements that give a distance of […]. Vectors are […] and […]. Then […]. That […]
The complexity here is that […]
I think we could close this as "won't fix", as we clip […]
@mapio
scikit-learn/sklearn/metrics/pairwise.py
Line 1144 in 0fb307b
To reproduce the greater-than-1.0 case:
OS: CentOS 7.2
python: 3.6.10
sklearn: 0.23.2
numpy: 1.19.1
scipy: 1.5.2
Maybe we should use K = safe_sparse_dot(X, Y) / sqrt(||X||^2 * ||Y||^2) instead of normalizing X and Y first.
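That suggestion can be sketched in dense NumPy (`cosine_similarity_alt` is illustrative, not sklearn code). Note that the division by the product of norms also rounds, so values can still land a few ULPs outside [-1, 1]; this changes where the rounding happens rather than eliminating it.

```python
import numpy as np

def cosine_similarity_alt(X, Y):
    # Alternative order of operations from the suggestion above:
    # compute the raw dot products first, then divide by the
    # outer product of the row norms, instead of normalizing
    # X and Y before the multiplication.
    K = X @ Y.T
    norms = (np.linalg.norm(X, axis=1)[:, None]
             * np.linalg.norm(Y, axis=1)[None, :])
    return K / norms

rng = np.random.RandomState(0)
X = rng.rand(5, 100)
S = cosine_similarity_alt(X, X)
print(np.allclose(np.diag(S), 1.0))  # True
```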