Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement all scipy.spatial.distance metrics in _distance_pybind #14077

Open
7 of 23 tasks
peterbell10 opened this issue May 17, 2021 · 4 comments
Open
7 of 23 tasks

Implement all scipy.spatial.distance metrics in _distance_pybind #14077

peterbell10 opened this issue May 17, 2021 · 4 comments
Labels
enhancement A new feature or improvement scipy.spatial task A straightforward change, verification or fix.

Comments

@peterbell10
Copy link
Member

peterbell10 commented May 17, 2021

This issue is to track progress of migrating scipy.spatial.distance per-metric cdist/pdist implementations from the original _distance_wrap module to the new _distance_pybind module and C++ framework (gh-13907).

Some of these metrics operate on bool input, which isn't supported by the framework yet. But for most, it's just a case of writing the core distance calculation.

Metrics

@h-vetinari
Copy link
Member

h-vetinari commented Aug 15, 2021

The following comment has been carried around in various distance issues for a while. I was wondering to open specific issues for that, but it seems that if a rewrite happens, the points would naturally fit already...?

Quote of comment for reference
  1. hamming doesn't compute the Hamming distance, but scales it dividing by the vector dimension. They are closely related, but are not quite the same. I would like to have matching and sokalmichener return the old value, but make hamming return the proper unscaled Hamming distance. Can this be happily done, or is it best to first add some form of deprecation warning?

  2. sokalmichener is also wrong in the current implementation. It returns the same as rogerstanimoto, but it should return the same as matching. I think there was an open issue complaining about this already.

  3. wminkowski is also computing a slightly erroneous version of the weighted Minkowski distance. According to the documentation, components are subtracted, their absolute value raised to the p-th power, and then get multiplied by the corresponding weight. This is consistent with the few definitions I have seen in the literature. But the actual implementation multiplies by the weight before raising to the p-th power. Can this be changed happily?

  4. minkowski has a check for p >= 1. That condition is usually enforced for the Minkowski distance to be a metric, i.e. for the triangle inequality to hold. This is a very restrictive condition, because many of the other distances are not proper metrics. Furthermore, the condition is not enforced either in pdist or cdist, which will happily take ps smaller than 1. I would like to drop this as well.

  5. kulsinski returns a value based on a formula that I have found in a paper, but which is not consistent with most of the literature. I would like to either add a kulczynski1 function implementing the proper Kulczynski 1 distance, and eventually remove the current kulsinski, or directly replace the current implementation with the new one. Thoughts on this?

AFAICT [this has been edited over time],

  1. Tracked (most closely) in Hamming distance returns unexpected value for string comparisons #11991, WIP PR in ENH: hamming distance for string comparison #12027; still tangentially related to 'sokalmichener' and 'rogerstanimoto' distances are the same (Trac #1486) #2011
  2. Tracked in 'sokalmichener' and 'rogerstanimoto' distances are the same (Trac #1486) #2011, though possibly obsoleted by The calculation using sokalmichener with normalized weight and unnormalized weight are different #9052 BUG: sokalmichener appears to incorrectly apply weights #13693 BUG: spatial: fix weight handling of distance.sokalmichener. #13944
  3. Was tracked in Wrong computation of weighted minkowski distance in cdist #5718, removed in ENH: weighted metrics #7630, restored in REV: restore wminkowski #7905, deprecated in DEP: Deprecate distance.wminkowski #12374, removed in MAINT: Remove wminkowski #15054
  4. Not tracked (edit: now tracked in ENH: allow p>0 for minkowski metric #14710), see also this comment; done in ENH: allow p>0 for Minkowski distance #15055
  5. Tracked in "Kulsinski" dissimilarity seems wrong (Trac #1484) #2009, implemented in ENH: Added kulczynski1 to scipy.spatial.distance #14588; deprecation tracked in Deprecate scipy.spatial.distance.kulsinski #15125, PR DEP: deprecate spatial.distance.kulsinski #15180

@czgdp1807
Copy link
Member

@peterbell10 @h-vetinari Seems like there is a lot to do here. Can I help with some of the distance metrics?

@czgdp1807
Copy link
Member

czgdp1807 commented Aug 25, 2021

Following distance measure might be tricky to fit in the current C++ pybind framework,

  1. mahalanobis - It has a different API and requires matrix multiplication. Accumulating results isn't required.
  2. seucledian - Different API. Needs V (component variances) as extra input.
  3. wminkowski - Slightly different API. Needs p as an extra input. Though, I think this should be easy to adjust into the current framework.
  4. jensenshannon - A different concept. Needs, m ((p + q)/2) to be computed. Though then the accumulation stuff can be applied i.e., by using _transform_reduce_2d thing.
  5. cosine - Needs row wise norms which are used in the project step of the framework.
  6. correlation - Need row wise averages and norms.

Rest of the distance metrics can be easily added into the pybind framework. For the above 6 we need to figure out whether to add them separately or factor out some common parts among these metrics and then change the framework. It might be a tricky task.

cc @peterbell10 @rgommers

@h-vetinari
Copy link
Member

Following distance measure might be tricky to fit in the current C++ pybind framework,
[...]
3. wminkowski - Slightly different API. Needs p as an extra input. Though, I think this should be easy to adjust into the current framework.

wminkowski is deprecated since #12374 and slated for removal in scipy 1.8; should be no need to change anything (other than remove it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.spatial task A straightforward change, verification or fix.
Projects
None yet
Development

No branches or pull requests

3 participants