Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pdist and cdist disagree for 'seuclidean' and 'mahalanobis' metrics #9555

Open
jeremiedbb opened this issue Nov 30, 2018 · 6 comments
Open
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.spatial

Comments

@jeremiedbb
Copy link
Contributor

jeremiedbb commented Nov 30, 2018

When XB==XA, cdist does not give the same result as pdist for 'seuclidean' and 'mahalanobis' metrics, if metrics params are left to None.

Reproducible example:

import numpy as np
from scipy.spatial.distance import pdist, cdist, squareform
X = np.random.random_sample((1000,10))

>>> np.allclose(squareform(pdist(X,metric='mahalanobis')), cdist(X,X,metric='mahalanobis'))
False

>>> np.allclose(squareform(pdist(X,metric='seuclidean')), cdist(X,X,metric='seuclidean'))
False

This is probably due to the way the metrics params V and VI are precomputed in pdist and cdist.

@rgommers
Copy link
Member

rgommers commented Dec 2, 2018

I can reproduce this. The differences are small, but significant:

>>> np.allclose(squareform(pdist(X,metric='mahalanobis')), cdist(X,X,metric='mahalanobis'),
...             rtol=3e-4, atol=1e-8)
True
>>> np.allclose(squareform(pdist(X,metric='mahalanobis')), cdist(X,X,metric='mahalanobis'), 
...             rtol=2e-4, atol=1e-8)
False

@rgommers rgommers added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Dec 2, 2018
@Samthos
Copy link

Samthos commented Dec 3, 2018

I looked at the documentation and source for cdist and pdist. cdist uses both inputs arrays to estimate the covariance, i.e., cov(vstack([XA, XB].T)), when the mahalanobis metric is requested while pdist uses cov(XA.T) to estimate the covariance. Since np.cov sets ddof=1 by default, it makes sense that the results are close but different.

Perhaps cdist could raise a warning stating that pdist is a more appropriate routine if XA is XB. I could implement this if it is a reasonable fix.

@jeremiedbb
Copy link
Contributor Author

Perhaps cdist could raise a warning stating that pdist is a more appropriate routine if XA is XB

But it won't raise if XB equals XA and XB is not XA, and it would be too costly to check element-wise equality between XA and XB.

I'm not sure a warning is enough. They should return the same, don't they ? Maybe ddof should be 0 by default ?

@rgommers
Copy link
Member

rgommers commented Dec 5, 2018

Maybe ddof should be 0 by default ?

ddof=1 seems right. and changing that would be a much larger change than is appropriate given that it's not clear that this is a bug or expected.

The convention for seuclidean that it's var(ddof=1) is explicitly documented. So I'm inclined to say that they're not expected to be the same.

Anyone have another implementation (R, Matlab, ...) that they can check this for?

@jeremiedbb
Copy link
Contributor Author

So I'm inclined to say that they're not expected to be the same.

After more thoughts and discussions, I agree. For cdist(X,X) X and X are two sets of samples from a distribution which happens to take the same values, so var and cov should be estimated on (X,X).

However, from a statistical point of vue, maybe a special case could be done in cdist when XB is XA, returning squareform(pdist(XA)), because when XB is XA, XB and XA are the same set of sample from the distribution and therefore var and cov should be estimated on XA only.

@rgommers
Copy link
Member

rgommers commented Dec 6, 2018

I'm fine with adding a note to the documentation (e.g. in the Notes section of cdist), but special-casing XA is XB isn't desirable, that will just lead to harder to maintain code and other corner cases. E.g then cdist(X, X) isn't equal to cdist(X, X.copy()).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.spatial
Projects
None yet
Development

No branches or pull requests

4 participants