pdist and cdist disagree for 'seuclidean' and 'mahalanobis' metrics #9555
Comments
I can reproduce this. The differences are small, but significant:
I looked at the documentation and source for cdist and pdist. When the mahalanobis metric is requested, cdist uses both input arrays to estimate the covariance, i.e., cov(vstack([XA, XB]).T), while pdist uses cov(XA.T). Since np.cov sets ddof=1 by default, it makes sense that the results are close but different. Perhaps cdist could raise a warning stating that pdist is the more appropriate routine if XA is XB. I could implement this if it is a reasonable fix.
But the warning won't be raised if XB equals XA element-wise while XB is not the same object as XA, and checking element-wise equality between XA and XB would be too costly. I'm not sure a warning is enough. They should return the same result, shouldn't they? Maybe ddof should be 0 by default?
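To make the ddof point concrete, here is a minimal NumPy-only sketch of the two covariance estimates discussed above. It assumes the estimates are exactly cov(XA.T) and cov(vstack([XA, XB]).T) as described in the earlier comment; the array shape and random seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 3                      # arbitrary sample count and dimension
X = rng.standard_normal((n, d))

# Covariance as described for pdist: estimated from XA alone.
cov_single = np.cov(X.T)
# Covariance as described for cdist when XB equals XA: estimated from the stacked data.
cov_stacked = np.cov(np.vstack([X, X]).T)

# With the default ddof=1, duplicating the samples doubles the scatter matrix
# but changes the divisor from (n - 1) to (2n - 1), so the two estimates
# differ by the constant factor 2(n - 1) / (2n - 1).
ratio = 2 * (n - 1) / (2 * n - 1)
print(np.allclose(cov_stacked, ratio * cov_single))

# With ddof=0 the divisor scales with the sample count, so both estimates coincide.
cov_single0 = np.cov(X.T, ddof=0)
cov_stacked0 = np.cov(np.vstack([X, X]).T, ddof=0)
print(np.allclose(cov_single0, cov_stacked0))
```

Under these assumptions, Mahalanobis distances computed from the stacked estimate are larger by a factor of sqrt((2n - 1) / (2(n - 1))), which tends to 1 as n grows. That is consistent with the differences being small but not negligible.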
The convention for Does anyone have another implementation (R, Matlab, ...) that they can check this against?
After more thought and discussion, I agree. For However, from a statistical point of view, maybe a special case could be made in cdist when
I'm fine with adding a note to the documentation (e.g. in the Notes section of
When XB==XA, cdist does not give the same result as pdist for the 'seuclidean' and 'mahalanobis' metrics if the metric parameters are left as None.
Reproducible example:
This is probably due to the way the metric parameters V and VI are precomputed in pdist and cdist.
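One way to sidestep the precomputation difference is to pass V and VI explicitly, so that both routines use the same parameters. The following is a sketch rather than a recommended fix: estimating V and VI from X alone with ddof=1 is an assumption made here for illustration, and the data are random.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))  # arbitrary illustrative data

# Assumed convention: estimate the parameters once, from X only, with ddof=1.
V = np.var(X, axis=0, ddof=1)          # per-feature variances for 'seuclidean'
VI = np.linalg.inv(np.cov(X.T))        # inverse covariance for 'mahalanobis'

# With explicit parameters, neither routine has to estimate anything itself.
se_pdist = squareform(pdist(X, 'seuclidean', V=V))
se_cdist = cdist(X, X, 'seuclidean', V=V)

ma_pdist = squareform(pdist(X, 'mahalanobis', VI=VI))
ma_cdist = cdist(X, X, 'mahalanobis', VI=VI)

print(np.allclose(se_pdist, se_cdist))
print(np.allclose(ma_pdist, ma_cdist))
```

Because the parameters are supplied by the caller, the result no longer depends on how each routine would have estimated them from its inputs.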