pdist and cdist disagree for 'seuclidean' and 'mahalanobis' metrics #9555

jeremiedbb · 2018-11-30T13:58:34Z

When XB == XA, cdist does not give the same result as pdist for the 'seuclidean' and 'mahalanobis' metrics if the metric parameters are left at their default of None.

Reproducible example:

import numpy as np
from scipy.spatial.distance import pdist, cdist, squareform
X = np.random.random_sample((1000,10))

>>> np.allclose(squareform(pdist(X,metric='mahalanobis')), cdist(X,X,metric='mahalanobis'))
False

>>> np.allclose(squareform(pdist(X,metric='seuclidean')), cdist(X,X,metric='seuclidean'))
False

This is probably due to the way the metric parameters V and VI are precomputed in pdist and cdist.


rgommers · 2018-12-02T03:42:58Z

I can reproduce this. The differences are small, but significant:

>>> np.allclose(squareform(pdist(X,metric='mahalanobis')), cdist(X,X,metric='mahalanobis'),
...             rtol=3e-4, atol=1e-8)
True
>>> np.allclose(squareform(pdist(X,metric='mahalanobis')), cdist(X,X,metric='mahalanobis'), 
...             rtol=2e-4, atol=1e-8)
False

Samthos · 2018-12-03T02:01:21Z

I looked at the documentation and source for cdist and pdist. When the mahalanobis metric is requested, cdist uses both input arrays to estimate the covariance, i.e., cov(vstack([XA, XB]).T), while pdist uses cov(XA.T). Since np.cov sets ddof=1 by default, it makes sense that the results are close but different.
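The size of the discrepancy follows from this explanation and can be checked numerically. The sketch below assumes the default behaviour described above (pdist estimating the covariance from X, cdist from the stacked [X, X]); the names n, d, shrink, and factor are illustrative only. Stacking X on itself keeps the mean and doubles the sum of squared deviations, while the ddof=1 divisor goes from n - 1 to 2n - 1, so the covariance shrinks by 2(n - 1)/(2n - 1) and the Mahalanobis distances grow by the square root of the inverse.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist, squareform

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.random((n, d))

cov_single = np.cov(X.T)                    # estimate pdist is described as using
cov_stacked = np.cov(np.vstack([X, X]).T)   # estimate cdist(X, X) is described as using

# Duplicating the rows doubles the sum of squares, while the ddof=1
# divisor changes from n - 1 to 2n - 1.
shrink = 2 * (n - 1) / (2 * n - 1)
cov_matches = np.allclose(cov_stacked, shrink * cov_single)

# Mahalanobis distances scale with the inverse square root of the covariance.
factor = np.sqrt(1 / shrink)
D_p = squareform(pdist(X, metric='mahalanobis'))
D_c = cdist(X, X, metric='mahalanobis')
dist_matches = np.allclose(D_c, factor * D_p)
print(cov_matches, dist_matches)
```

For n = 1000 the factor is sqrt(1999/1998), roughly 1 + 2.5e-4, which is consistent with the comparison passing at rtol=3e-4 and failing at rtol=2e-4.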

Perhaps cdist could raise a warning stating that pdist is a more appropriate routine if XA is XB. I could implement this if it is a reasonable fix.

jeremiedbb · 2018-12-04T13:31:35Z

> Perhaps cdist could raise a warning stating that pdist is a more appropriate routine if XA is XB

But the warning won't be raised when XB equals XA element-wise yet is a different object (XB is not XA), and checking element-wise equality between XA and XB would be too costly.

I'm not sure a warning is enough. They should return the same result, shouldn't they? Maybe ddof should be 0 by default?
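The effect of ddof on a duplicated sample can be checked directly with NumPy. This is a small sketch, assuming the stacked array XX stands in for what cdist(X, X) is described as using; the variable names are illustrative. With ddof=0 the divisor is the sample count, so duplicating every row cancels out, whereas with ddof=1 the divisors n - 1 and 2n - 1 no longer cancel.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 10))
XX = np.vstack([X, X])  # X stacked on itself

# ddof=0 divides by the number of rows, so duplication leaves the variance unchanged.
same_ddof0 = np.allclose(np.var(X, axis=0, ddof=0), np.var(XX, axis=0, ddof=0))

# ddof=1 divides by n - 1 versus 2n - 1, so the two estimates differ slightly.
same_ddof1 = np.allclose(np.var(X, axis=0, ddof=1), np.var(XX, axis=0, ddof=1))
print(same_ddof0, same_ddof1)
```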

rgommers · 2018-12-05T05:23:29Z

> Maybe ddof should be 0 by default ?

ddof=1 seems right, and changing that would be a much larger change than is appropriate, given that it's not clear whether this is a bug or expected behavior.

The convention that seuclidean uses var(ddof=1) is explicitly documented. So I'm inclined to say that they're not expected to be the same.

Anyone have another implementation (R, Matlab, ...) that they can check this for?

jeremiedbb · 2018-12-06T09:55:23Z

> So I'm inclined to say that they're not expected to be the same.

After more thought and discussion, I agree. For cdist(X, X), the two arguments are treated as two sets of samples from a distribution that happen to take the same values, so var and cov should be estimated on (X, X).

However, from a statistical point of view, maybe cdist could special-case XB is XA and return squareform(pdist(XA)): when XB is XA, XB and XA are the same set of samples from the distribution, and therefore var and cov should be estimated on XA only.

rgommers · 2018-12-06T18:26:28Z

I'm fine with adding a note to the documentation (e.g. in the Notes section of cdist), but special-casing XA is XB isn't desirable; that will just lead to harder-to-maintain code and other corner cases. E.g., cdist(X, X) would then no longer be equal to cdist(X, X.copy()).
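For callers who need the two functions to agree, one option that sidesteps the default estimates is to compute V and VI once and pass them explicitly. This is a sketch of a possible workaround, not something proposed in the thread; it relies only on the existing V and VI keyword arguments, and estimating them from X alone is an assumption about what the caller wants.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist, squareform

rng = np.random.default_rng(0)
X = rng.random((200, 5))

# Estimate the metric parameters once, from X alone, and share them.
V = np.var(X, axis=0, ddof=1)
VI = np.linalg.inv(np.cov(X.T))

se_agree = np.allclose(squareform(pdist(X, metric='seuclidean', V=V)),
                       cdist(X, X, metric='seuclidean', V=V))
# Passing a copy shows the result does not depend on object identity.
ma_agree = np.allclose(squareform(pdist(X, metric='mahalanobis', VI=VI)),
                       cdist(X, X.copy(), metric='mahalanobis', VI=VI))
print(se_agree, ma_agree)
```

Because the parameters are supplied by the caller, the result no longer depends on how each function stacks its inputs.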

jeremiedbb mentioned this issue Nov 30, 2018

KNeighborsRegressor gives different results for different n_jobs values scikit-learn/scikit-learn#12672

Closed

tylerjereddy added the scipy.spatial label Dec 1, 2018

rgommers added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Dec 2, 2018
