Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: spatial: Port cosine to _distance_pybind #14155

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

afozk95
Copy link

@afozk95 afozk95 commented May 30, 2021

Reference issue

Part of gh-14077.

What does this implement/fix?

This implements the Cosine distance in _distance_pybind and uses it in cdist/pdist.

Benchmarks

I tried to run benchmarks and results are given below. But it does not make much sense to me as benchmarks of distance functions other than cosine also changed significantly. So take the results with grain of salt. Any help regarding how to run benchmarks correctly would be appreciated.

       before           after         ratio
     [59b4a295]       [ad9fda2b]
     <master>                   
+     18.8±0.09μs         23.9±3μs     1.27  spatial.XdistWeighted.time_pdist(10, 'hamming')
+     45.9±0.04μs        57.3±10μs     1.25  spatial.XdistWeighted.time_cdist(100, 'chebyshev')
+     1.68±0.04ms      2.07±0.01ms     1.23  spatial.Xdist.time_pdist(1000, 'cosine')
+     2.61±0.03ms      3.10±0.08ms     1.19  spatial.Xdist.time_pdist(1000, 'canberra')
+      56.2±0.1μs       66.1±0.4μs     1.18  spatial.Xdist.time_cdist(100, 'canberra')
+     3.53±0.02ms      4.10±0.05ms     1.16  spatial.Xdist.time_cdist(1000, 'cosine')
+      5.37±0.2ms      6.05±0.03ms     1.13  spatial.Xdist.time_cdist(1000, 'canberra')
+      68.6±0.1μs         76.7±7μs     1.12  spatial.XdistWeighted.time_cdist(100, 'canberra')
+     57.0±0.04μs         62.8±6μs     1.10  spatial.XdistWeighted.time_cdist(100, 'braycurtis')
+      8.15±0.2μs       8.87±0.3μs     1.09  spatial.Xdist.time_cdist(10, 'dice')
+     36.4±0.07μs       39.4±0.2μs     1.08  spatial.Xdist.time_cdist(100, 'braycurtis')
+     13.1±0.04μs         14.1±2μs     1.07  spatial.Xdist.time_cdist(10, 'jensenshannon')
+     4.18±0.07μs       4.40±0.4μs     1.05  spatial.XdistWeighted.time_pdist(10, 'euclidean')
-         782±5μs          742±2μs     0.95  spatial.Xdist.time_pdist(1000, 'sqeuclidean')
-     1.75±0.04ms         1.64±0ms     0.94  spatial.Xdist.time_pdist(1000, 'correlation')
-     2.88±0.07ms         2.68±0ms     0.93  spatial.Xdist.time_pdist(1000, 'sokalsneath')
-      5.90±0.6ms      5.33±0.02ms     0.90  spatial.Xdist.time_cdist(1000, 'dice')
-      2.60±0.4ms         2.26±0ms     0.87  spatial.Xdist.time_pdist(1000, 'sokalmichener')
-      6.15±0.3μs       3.42±0.1μs     0.56  spatial.Xdist.time_pdist(10, 'cosine')
-     7.40±0.04μs      3.78±0.01μs     0.51  spatial.Xdist.time_cdist(10, 'cosine')
-     2.63±0.02ms       4.75±0.4μs     0.00  spatial.XdistWeighted.time_pdist(10, 'cosine')
-      5.75±0.3ms      4.79±0.07μs     0.00  spatial.XdistWeighted.time_cdist(10, 'cosine')
-      10.9±0.2ms         6.65±1μs     0.00  spatial.XdistWeighted.time_pdist(20, 'cosine')
-        25.0±2ms      7.52±0.04μs     0.00  spatial.XdistWeighted.time_cdist(20, 'cosine')
-         290±3ms         42.7±2μs     0.00  spatial.XdistWeighted.time_pdist(100, 'cosine')
-        595±70ms         76.2±1μs     0.00  spatial.XdistWeighted.time_cdist(100, 'cosine')

@peterbell10
Copy link
Member

peterbell10 commented Jun 2, 2021

Unfortunately, my benchmarks show this as a performance regression:

       before           after         ratio
     [59b4a295]       [ad9fda2b]
     <spatial-distance-cosine~1>       <spatial-distance-cosine>
+     2.82±0.02ms      4.92±0.06ms     1.74  spatial.Xdist.time_cdist(1000, 'cosine')
+      38.9±0.3μs       56.0±0.4μs     1.44  spatial.Xdist.time_cdist(100, 'cosine')
-      8.48±0.1μs      4.07±0.05μs     0.48  spatial.Xdist.time_cdist(10, 'cosine')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

The current cosine distance pre-calculates the norms of each vector in x and y. That way it only has to do O(n + m) sqrt operations instead of O(n*m).

@afozk95
Copy link
Author

afozk95 commented Jun 2, 2021

Thanks @peterbell10 for the explanation.

The current cosine distance pre-calculates the norms of each vector in x and y.

What would be the best way to replicate the same behavior within _distance_pybind? If that is a sensible way to you, I can try to implement that.

@peterbell10
Copy link
Member

It basically needs to have it's own versions of cdist_impl and cdist_weighted_impl which aren't customization points at the moment. Perhaps you could add a new version of cdist/pdist where the _impl functions are a templated argument.

@czgdp1807
Copy link
Member

N.B. - #14611 (in progress).

@j-bowhay j-bowhay added the needs-work Items that are pending response from the author label Nov 6, 2022
@melissawm
Copy link
Contributor

Hi folks - what is the status of this PR? What would it take to move this forward?

@lucascolley lucascolley changed the title ENH: Port cosine to _distance_pybind ENH: spatial: Port cosine to _distance_pybind Jun 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement needs-work Items that are pending response from the author scipy.spatial
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants