Cosine distance of vector to self returns small non-zero answer when using cdist, pdist but not when using cosine #17754

Closed
wpdonders opened this issue Jan 9, 2023 · 3 comments
Labels
query A question or suggestion that requires further information scipy.spatial

@wpdonders

The beginning

I was doing some sanity checks by testing whether a vector's cosine distance to itself would be zero. This did not check out when I used the cdist function.

import numpy as np
from scipy.spatial.distance import cdist, pdist, cosine

# Randomly sized vector of ones
A = np.ones(shape=(1, 5))

# Distances
d_cosine = cosine(A[0], A[0])  # returns 0, as expected
d_cdist = cdist(A, A, 'cosine')  # returns 2.220446e-16, unexpected!

Dimension-dependency

So I did some further checking and it turns out that this only happens sometimes. For example, it depends on the number of dimensions of the input vector. It doesn't happen for dimensions 1, 3 and 4, for instance (I got "lucky" with my randomly sized vector of five). So I wrote a loop and checked all the ones-vectors with dimensions between 1 and 32 (inclusive):

for n in range(1, 33):
    A = np.ones(shape=(1, n))

    # Distance to self
    d_cosine = cosine(A[0], A[0])
    d_cdist = cdist(A, A, 'cosine')

    if (cdist_distance := d_cdist[0][0]) != 0:
        print(f"{n = :02}, {cdist_distance = }")
    if (cosine_distance := d_cosine) != 0:
        print(f"{n = :02}, {cosine_distance = }")

Output:

n = 02, cdist_distance = 2.220446049250313e-16
n = 05, cdist_distance = 2.220446049250313e-16
n = 07, cdist_distance = 1.1102230246251565e-16
n = 08, cdist_distance = 2.220446049250313e-16
n = 10, cdist_distance = 2.220446049250313e-16
n = 15, cdist_distance = 1.1102230246251565e-16
n = 19, cdist_distance = 2.220446049250313e-16
n = 20, cdist_distance = 2.220446049250313e-16
n = 28, cdist_distance = 1.1102230246251565e-16
n = 32, cdist_distance = 2.220446049250313e-16

Which was kind of weird.

Value dependency

I went on to check what happens if my test vectors do not just contain ones, but some other value. I nested a loop in which I multiply my generated test vector by a factor between 1 and 32 (inclusive).

import pandas as pd

def records(N=32):
    for n in range(1, N + 1):
        for factor in range(1, N+1):
            A = np.ones(shape=(1, n)) * factor

            d_cosine = cosine(A[0], A[0])
            d_cdist = cdist(A, A, 'cosine')
            d_pdist = pdist(np.concatenate([A, A]), 'cosine')

            yield (n, factor, d_cosine, d_cdist[0][0], d_pdist[0])


df = pd.DataFrame.from_records(records(), columns=("n", "factor", "cosine", "cdist", "pdist"))

Output:

This allowed me to create a kind of "error heatmap". The rows indicate the dimension of the test vector, with a vector of dimension 1 at the top and 32 at the bottom. The columns indicate the factor by which I multiplied the test vector, going from 1 (left) to 32 (right). So at row 5 and column 12, you would have a test vector equal to np.ones(shape=(1, 5)) * 12.

import seaborn as sns
sns.set_context('talk')
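# DataFrame.pivot takes keyword arguments (positional use was removed in pandas 2.0)
sns.heatmap(df.pivot(index="n", columns="factor", values="cdist"), vmin=0, cmap=sns.light_palette("#ff0000", reverse=False, as_cmap=True))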

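[Image: heatmap of the cdist self-distance, with vector dimension n on the rows and factor on the columns]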

What's weird is that errors can pop up at some particular combination of dimension and factor, but it is never the case that a particular dimension consistently gives errors (independent of factor; you would see a horizontal red line in the heat map) or that a particular factor consistently gives errors (independent of dimension; you would see a vertical red line in the heat map).

Error values

The error values that I found so far were consistently either 1.11022302e-16 or 2.22044605e-16. I checked a larger (N=10000) and more random sample (vector dimensions between 1 and 10000):

from numpy.random import default_rng
from collections import Counter

N = 10_000
rng = default_rng()

# Sample matrix
A = rng.uniform(0, 1, size=(N, N))
dimensions = rng.integers(low=1, high=N, endpoint=True, size=N)

errors = np.zeros(N)
for i in range(0, N):
    d = dimensions[i]
    a = A[i][:d]
    
    errors[i] = cdist(np.array([a]), np.array([a]), 'cosine')[0][0]
Counter(errors)

Output

Yep, just these two error values.

Counter({1.1102230246251565e-16: 1534, 0.0: 7618, 2.220446049250313e-16: 848})

?

So does anyone know what's going on, and why the cosine function gives the correct value when calculating the distance of a vector to itself but cdist (and pdist) doesn't?

@wpdonders wpdonders changed the title from "Cosine distance of vector to self returns small non-zero answer when using cdist, pdist but when using cosine for some" to "Cosine distance of vector to self returns small non-zero answer when using cdist, pdist but not when using cosine" Jan 9, 2023
@j-bowhay j-bowhay added the labels scipy.spatial and query (A question or suggestion that requires further information) Jan 9, 2023
@tupui
Member

tupui commented Jan 9, 2023

Hi @wpdonders, thank you for reporting (and the quality of your reporting 👏). I think this is ok since the resolution of float64 would be 1e-15.
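For reference, the two non-zero values reported above are exactly the double-precision machine epsilon and half of it. A minimal check using only the standard library (the variable names are illustrative, not from the report):

```python
import sys

# Machine epsilon: the gap between 1.0 and the next larger float64.
eps = sys.float_info.epsilon
print(eps)      # 2.220446049250313e-16
print(eps / 2)  # 1.1102230246251565e-16

# Number of decimal digits a float64 can represent faithfully.
digits = sys.float_info.dig
print(digits)   # 15
```

In other words, the reported distances are off from zero by at most one unit in the last place of a number close to 1.0.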

The difference comes from the fact that cosine is a Python function while cdist is calling a C implementation. If you follow the code, an output array is initialised with np.empty(expected_shape, dtype=dtype) and here dtype = np.double.
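To illustrate how two mathematically equivalent formulas can round differently, here is a pure-Python sketch. It is an assumption that the two code paths differ in roughly this way. Both function names are hypothetical and neither is copied from the SciPy source.

```python
import math

def cosine_distance_norm_product(u, v):
    """Hypothetical variant: divide by the product of two separately rounded norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def cosine_distance_sqrt_of_product(u, v):
    """Hypothetical variant: take a single square root of the product of squared norms."""
    dot = sum(a * b for a, b in zip(u, v))
    uu = sum(a * a for a in u)
    vv = sum(b * b for b in v)
    return 1.0 - dot / math.sqrt(uu * vv)

u = [1.0] * 5

# sqrt(5) is not exactly representable, so sqrt(5) * sqrt(5) is slightly off from 5.
d_norm_product = cosine_distance_norm_product(u, u)
# sqrt(5 * 5) = sqrt(25) is exact, so the ratio is exactly 1.
d_sqrt_of_product = cosine_distance_sqrt_of_product(u, u)

print(d_norm_product)     # tiny but non-zero
print(d_sqrt_of_product)  # 0.0
```

Both results are correct to within floating-point rounding. They differ only in where the rounding happens.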

@czgdp1807 is there anything we can/should do here?

@czgdp1807
Member

czgdp1807 commented Jan 11, 2023

> @czgdp1807 is there anything we can/should do here?

I don’t think the accuracy can be improved further. I would ignore any errors smaller than 1e-12 for np.float64. Here it’s <= 1e-15, so I personally wouldn’t worry much.
I would recommend waiting for more people to report or comment on this issue before investigating further. For now I would put this on hold until more requests come in or a use case comes up which needs 100% accurate results.
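In practice, this advice amounts to comparing against zero with an absolute tolerance instead of `!= 0`. A minimal sketch using the 1e-12 threshold mentioned above and the values observed in the report (the helper name `is_zero_distance` is hypothetical):

```python
import math

ABS_TOL = 1e-12  # threshold suggested in the comment above for float64

def is_zero_distance(d, abs_tol=ABS_TOL):
    """Treat a distance as zero if it is within abs_tol of 0.0."""
    # rel_tol must be 0.0: a relative tolerance is meaningless when comparing to zero.
    return math.isclose(d, 0.0, rel_tol=0.0, abs_tol=abs_tol)

# Values observed in the report for a vector's distance to itself.
observed = [0.0, 1.1102230246251565e-16, 2.220446049250313e-16]
results = [is_zero_distance(d) for d in observed]
print(results)  # [True, True, True]
```

A genuinely different pair of vectors, with a distance of say 1e-6, would still be reported as non-zero under this threshold.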

@tupui
Member

tupui commented Jan 11, 2023

Thank you Gagan, let's close it for now then.

@tupui tupui closed this as not planned Jan 11, 2023
@tupui tupui added this to the 1.11.0 milestone Jan 11, 2023