ENH: spatial: faster Chebyshev distance #20570

Open

lozanodorian wants to merge 1 commit into main
Conversation

lozanodorian

ENH: spatial: faster Chebychev distance
Implements issue #20561.

Replaced max with np.max at the end of chebyshev.
The np.max function is faster than the built-in max for large arrays.
For (very) short arrays (len < 40), max is faster than np.max, but the computation times are very short in absolute terms.
For very short arrays, if the parameter w (weights) is specified, the computation times of max and np.max are nearly comparable.
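
A minimal sketch (not part of the PR) of how the max vs. np.max gap described above can be measured; the array sizes and iteration count here are illustrative, not the ones used in the benchmarks below:

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)

for n in (10, 1_000, 100_000):
    d = np.abs(rng.random(n) - rng.random(n))  # |u - v| for random u, v
    t_builtin = timeit.timeit(lambda: max(d), number=1_000)
    t_numpy = timeit.timeit(lambda: np.max(d), number=1_000)
    print(f"n={n:>7}: builtin max {t_builtin:.4f}s   np.max {t_numpy:.4f}s")
```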

Reference issue

See #20561.

What does this implement/fix?

Faster computation of the Chebyshev distance for large arrays.

Additional information

Two additional benchmarks are provided. Both compare the computation times of three implementations of chebyshev; the only difference between the three functions (sketched in code below) is the final reduction:

  • Cheb max returns max(abs(u - v)) <-- current SciPy
  • Cheb np.max returns np.max(abs(u - v)) <-- proposed change
  • Cheb cond returns max(abs(u - v)) if len(u) < 40 else np.max(abs(u - v))
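
Written out, the three variants look roughly like this (a sketch with hypothetical names cheb_max, cheb_npmax, and cheb_cond; the real scipy.spatial.distance.chebyshev also validates inputs and handles weights):

```python
import numpy as np

def cheb_max(u, v):
    return max(abs(u - v))          # current SciPy: Python built-in max

def cheb_npmax(u, v):
    return np.max(abs(u - v))       # proposed: NumPy reduction

def cheb_cond(u, v, threshold=40):
    # Hybrid: built-in max for short arrays, np.max otherwise.
    d = abs(u - v)
    return max(d) if len(u) < threshold else np.max(d)
```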

Each point on the plots is the time taken by 5 computations of chebyshev.
Inputs (u, v -- and w for the "weights" plot) are drawn with np.random.random.
"no weights" means that w was set to None; "weights" means that an array w (also drawn with np.random.random) was passed.

Cheb cond provides a good trade-off, but the threshold used here (40) may depend on the machine.

[Two benchmark plots: computation time vs. array length, without and with weights]

@github-actions github-actions bot added scipy.spatial enhancement A new feature or improvement labels Apr 24, 2024
@lozanodorian lozanodorian changed the title ENH: spatial: faster Chebychev distance ENH: spatial: faster Chebyshev distance Apr 24, 2024
@@ -1077,7 +1077,7 @@ def chebyshev(u, v, w=None):
         if has_weight.sum() < w.size:
             u = u[has_weight]
             v = v[has_weight]
-    return max(abs(u - v))
+    return np.max(abs(u - v))
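
For context, a sketch of the surrounding function reconstructed from the diff hunk above (the real scipy.spatial.distance.chebyshev also validates u, v, and w with SciPy-internal helpers before this point):

```python
import numpy as np

def chebyshev(u, v, w=None):
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    if w is not None:
        w = np.asarray(w, dtype=float)
        has_weight = w > 0
        if has_weight.sum() < w.size:  # drop zero-weight coordinates
            u = u[has_weight]
            v = v[has_weight]
    return np.max(abs(u - v))  # the line changed by this PR
```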
Contributor

This is really only making things faster for high-dimensionality data, which may not really be the primary use case (3 dimensions are pretty common obviously). Most of the workhorse usage of the distance metrics probably happens through pdist and cdist where multiple points can be compared.

In fact, our formal asv benchmarks probably reflect this PR being a step back for the common low-dimensions scenario. Need to check with asv continuous -e -b "SingleDist.*" main enh_20561 or similar (wasn't working for me locally.. need to open an issue for that..).

If the argument is performance, then perhaps the middle-ground approach combined with adjustment of our asv benchmarks to probe performance in both dimensionality regimes would make sense.

There may be some array API argument for xp.max approach. But then I'm not sure we should really be framing this as performance focused, since > 40 dimensions is quite a lot for practical use I think. If other devs are happy with the array API argument then I probably care less about benchmarking with regular NumPy arrays, but I do want us to be clear on the purpose.
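
If the array API route were taken, the reduction might look something like this (a sketch only, assuming SciPy's internal array_namespace helper; not part of this PR):

```python
from scipy._lib._array_api import array_namespace  # assumed internal helper

def chebyshev_xp(u, v):
    # Dispatch the reduction to whichever array library u and v come
    # from (NumPy, CuPy, PyTorch, ...), via the array API standard.
    xp = array_namespace(u, v)
    return xp.max(xp.abs(u - v))
```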

Contributor

Not surprising, but here is the current output of asv continuous -e -v -b "SingleDist.*" main enh_20561 showing slower performance. This probably isn't worth too much debate over a single line change, but maybe folks should pick a reason to go in one direction or another.

| Change   | Before [85736e16] <main>   | After [91770f5c] <enh_20561>   |   Ratio | Benchmark (Parameter)                                      |
|----------|----------------------------|--------------------------------|---------|------------------------------------------------------------|
| +        | 1.47±0.01μs                | 2.22±0.01μs                    |    1.51 | spatial.SingleDist.time_dist('chebyshev')                  |
| +        | 5.27±0.02μs                | 6.06±0.01μs                    |    1.15 | spatial.SingleDistWeighted.time_dist_weighted('chebyshev') |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
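
For reference, an asv benchmark of the shape cited above looks roughly like this (a hypothetical sketch, not the actual SingleDist class from SciPy's benchmark suite, which may use different sizes and metrics):

```python
import numpy as np
from scipy.spatial import distance

class SingleDist:
    # asv timing benchmark: one distance computation per call.
    params = ['chebyshev']
    param_names = ['metric']

    def setup(self, metric):
        rng = np.random.default_rng(0)
        self.u = rng.random(3)  # low-dimensional, the common case
        self.v = rng.random(3)

    def time_dist(self, metric):
        distance.chebyshev(self.u, self.v)
```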

Contributor

To be more precise: we should avoid merging this and then, six months from now, having someone revert it citing this benchmark in our suite. So maybe add a comment explaining the trade-offs, if folks agree on a reason for the change.

Labels
enhancement (A new feature or improvement), scipy.spatial

2 participants