ENH: spatial: faster Chebyshev distance #20570

Open

lozanodorian wants to merge 1 commit into main
Conversation

lozanodorian

ENH: spatial: faster Chebychev distance
Implements issue #20561.

Replaced max with np.max at the end of chebyshev.
The np.max function is faster than the built-in max for large arrays.
For (very) short arrays (len < 40), max is faster than np.max, but the computation times are very short in absolute terms.
For very short arrays, if the parameter w (weights) is specified, the computation times of max and np.max are nearly comparable.
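
A minimal sketch (not part of the PR) of how the max vs. np.max gap described above can be measured; the array sizes and iteration count here are illustrative, not the ones used in the benchmarks below:

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)

for n in (10, 1_000, 100_000):
    d = np.abs(rng.random(n) - rng.random(n))  # |u - v| for random u, v
    t_builtin = timeit.timeit(lambda: max(d), number=1_000)
    t_numpy = timeit.timeit(lambda: np.max(d), number=1_000)
    print(f"n={n:>7}: builtin max {t_builtin:.4f}s   np.max {t_numpy:.4f}s")
```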

Reference issue

See #20561.

What does this implement/fix?

Faster computation of the Chebyshev distance for large arrays.

Additional information

Two additional benchmarks are provided. Both compare the computation times of three implementations of chebyshev; the only difference between the three functions (sketched in code below) is the final reduction:

  • Cheb max returns max(abs(u - v)) <-- current SciPy
  • Cheb np.max returns np.max(abs(u - v)) <-- proposed change
  • Cheb cond returns max(abs(u - v)) if len(u) < 40 else np.max(abs(u - v))
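
Written out, the three variants look roughly like this (a sketch with hypothetical names cheb_max, cheb_npmax, and cheb_cond; the real scipy.spatial.distance.chebyshev also validates inputs and handles weights):

```python
import numpy as np

def cheb_max(u, v):
    return max(abs(u - v))          # current SciPy: Python built-in max

def cheb_npmax(u, v):
    return np.max(abs(u - v))       # proposed: NumPy reduction

def cheb_cond(u, v, threshold=40):
    # Hybrid: built-in max for short arrays, np.max otherwise.
    d = abs(u - v)
    return max(d) if len(u) < threshold else np.max(d)
```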

Each point on the plots is the time taken by 5 computations of chebyshev.
Inputs (u, v -- and w for the "weights" plot) are drawn with np.random.random.
"no weights" means that w was set to None; "weights" means that an array w (also drawn with np.random.random) was passed.

Cheb cond provides a good trade-off, but the threshold used here (40) may depend on the machine.

[Two benchmark plots: computation time vs. array length, without and with weights]

@github-actions github-actions bot added scipy.spatial enhancement A new feature or improvement labels Apr 24, 2024
@lozanodorian lozanodorian changed the title ENH: spatial: faster Chebychev distance ENH: spatial: faster Chebyshev distance Apr 24, 2024
@@ -1077,7 +1077,7 @@ def chebyshev(u, v, w=None):
         if has_weight.sum() < w.size:
             u = u[has_weight]
             v = v[has_weight]
-    return max(abs(u - v))
+    return np.max(abs(u - v))
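
For context, a sketch of the surrounding function reconstructed from the diff hunk above (the real scipy.spatial.distance.chebyshev also validates u, v, and w with SciPy-internal helpers before this point):

```python
import numpy as np

def chebyshev(u, v, w=None):
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    if w is not None:
        w = np.asarray(w, dtype=float)
        has_weight = w > 0
        if has_weight.sum() < w.size:  # drop zero-weight coordinates
            u = u[has_weight]
            v = v[has_weight]
    return np.max(abs(u - v))  # the line changed by this PR
```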
Contributor

This is really only making things faster for high-dimensionality data, which may not really be the primary use case (3 dimensions are pretty common obviously). Most of the workhorse usage of the distance metrics probably happens through pdist and cdist where multiple points can be compared.

In fact, our formal asv benchmarks probably reflect this PR being a step back for the common low-dimensions scenario. Need to check with asv continuous -e -b "SingleDist.*" main enh_20561 or similar (wasn't working for me locally.. need to open an issue for that..).

If the argument is performance, then perhaps the middle-ground approach combined with adjustment of our asv benchmarks to probe performance in both dimensionality regimes would make sense.

There may be some array API argument for xp.max approach. But then I'm not sure we should really be framing this as performance focused, since > 40 dimensions is quite a lot for practical use I think. If other devs are happy with the array API argument then I probably care less about benchmarking with regular NumPy arrays, but I do want us to be clear on the purpose.
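
If the array API route were taken, the reduction might look something like this (a sketch only, assuming SciPy's internal array_namespace helper; not part of this PR):

```python
from scipy._lib._array_api import array_namespace  # assumed internal helper

def chebyshev_xp(u, v):
    # Dispatch the reduction to whichever array library u and v come
    # from (NumPy, CuPy, PyTorch, ...), via the array API standard.
    xp = array_namespace(u, v)
    return xp.max(xp.abs(u - v))
```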

Contributor

Not surprising, but here is the current output of asv continuous -e -v -b "SingleDist.*" main enh_20561 showing slower performance. This probably isn't worth too much debate over a single line change, but maybe folks should pick a reason to go in one direction or another.

| Change   | Before [85736e16] <main>   | After [91770f5c] <enh_20561>   |   Ratio | Benchmark (Parameter)                                      |
|----------|----------------------------|--------------------------------|---------|------------------------------------------------------------|
| +        | 1.47±0.01μs                | 2.22±0.01μs                    |    1.51 | spatial.SingleDist.time_dist('chebyshev')                  |
| +        | 5.27±0.02μs                | 6.06±0.01μs                    |    1.15 | spatial.SingleDistWeighted.time_dist_weighted('chebyshev') |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
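
For reference, an asv benchmark of the shape cited above looks roughly like this (a hypothetical sketch, not the actual SingleDist class from SciPy's benchmark suite, which may use different sizes and metrics):

```python
import numpy as np
from scipy.spatial import distance

class SingleDist:
    # asv timing benchmark: one distance computation per call.
    params = ['chebyshev']
    param_names = ['metric']

    def setup(self, metric):
        rng = np.random.default_rng(0)
        self.u = rng.random(3)  # low-dimensional, the common case
        self.v = rng.random(3)

    def time_dist(self, metric):
        distance.chebyshev(self.u, self.v)
```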

Contributor

To be more precise: we should avoid merging this and then, six months from now, having someone revert it citing this benchmark in our suite. So maybe add a comment explaining the trade-offs, if folks agree on a reason for the change.

Labels
enhancement (A new feature or improvement), scipy.spatial

2 participants