perf: aggregate min/max #8061
Open
joseph-isaacs wants to merge 5 commits into
Adds a divan benchmark exercising the min/max aggregation over primitive arrays (i32/i64/f64, with and without nulls) so we can measure and inspect the codegen of the max reduction path.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The all-valid primitive min/max path used `itertools::minmax_by` with a `total_compare` closure preceded by a NaN filter, which the autovectorizer could not lower to packed min/max, leaving a scalar cmov reduction.

Route the all-true mask case for integer ptypes through a plain reduction. Integers have no NaNs, so the NaN filter is unnecessary and LLVM vectorizes the loop (pmaxub/pmaxsw, and pcmpgtd-based blends for i32/i64). Floats keep the existing NaN-aware path.

Benchmarked over 1M elements: i32 all-valid ~2.93ms -> ~0.36ms, i64 ~3.02ms -> ~0.55ms.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
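As a rough illustration of the "plain reduction" described above, here is a minimal sketch. It is not the code from this PR: `min_max` is a hypothetical helper, and the `Ord` bound is an assumption used to keep floats (and therefore NaNs) out of this path. It assumes the caller has already established that every element is valid.

```rust
/// Hypothetical helper: min and max of an all-valid integer slice.
/// The `Ord` bound excludes f32/f64, so no NaN filter is needed.
/// Returns `None` for an empty slice.
fn min_max<T: Copy + Ord>(values: &[T]) -> Option<(T, T)> {
    let (&first, rest) = values.split_first()?;
    let mut min = first;
    let mut max = first;
    // Straight-line loop with no early exit and no closure-based
    // comparator, which gives the autovectorizer a simple shape to work with.
    for &v in rest {
        min = min.min(v);
        max = max.max(v);
    }
    Some((min, max))
}
```

For example, `min_max(&[3i32, -1, 7, 0])` returns `Some((-1, 7))`. Whether a given build actually emits packed min/max instructions depends on the target features and compiler version, so the generated assembly is worth inspecting rather than assuming.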
Merging this PR will improve performance by 14.84%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| 🆕 | Simulation | max_f64 | N/A | 1.1 ms | N/A |
| 🆕 | Simulation | max_i32 | N/A | 223.3 µs | N/A |
| 🆕 | Simulation | max_i64 | N/A | 486.1 µs | N/A |
| 🆕 | Simulation | sum_i32 | N/A | 222.3 µs | N/A |
| 🆕 | Simulation | sum_i64 | N/A | 600.7 µs | N/A |
| 🆕 | Simulation | sum_u32 | N/A | 269.6 µs | N/A |
| ❌ | Simulation | chunked_varbinview_canonical_into[(100, 100)] | 273.1 µs | 308 µs | -11.32% |
| ⚡ | Simulation | encode_primitives[u8, (10000, 2)] | 313.9 µs | 278 µs | +12.9% |
| ⚡ | Simulation | encode_primitives[u8, (10000, 32)] | 318.4 µs | 282.3 µs | +12.81% |
| ⚡ | Simulation | encode_primitives[u8, (10000, 4)] | 314.3 µs | 278.2 µs | +12.95% |
| ⚡ | Simulation | encode_primitives[u8, (10000, 512)] | 335.2 µs | 299 µs | +12.09% |
| ⚡ | Simulation | encode_primitives[u8, (10000, 8)] | 315.1 µs | 279 µs | +12.93% |
| ⚡ | Simulation | for_compress_i32 | 753.4 µs | 443.8 µs | +69.76% |
| ⚡ | Simulation | take_10k_contiguous | 309.7 µs | 280.6 µs | +10.38% |
Comparing claude/great-edison-jrGY0 (8b98b5d) with develop (ae19fe7)
Keep a single all-valid bench for i32, i64, and f64 instead of the per-type all-valid/half-null pairs.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
robert3005 approved these changes May 22, 2026
robert3005 reviewed May 22, 2026
    .with_inputs(|| PrimitiveArray::from_iter(data.iter().copied()).into_array())
    .bench_refs(|a| {
        a.statistics()
            .compute_max::<i32>(&mut LEGACY_SESSION.create_execution_ctx())
Contributor
can you create a local session here?
The all-valid integer sum did a per-element `checked_add`, whose overflow early-return branch blocked autovectorization, leaving a scalar loop.

Sum narrower-than-64-bit integers in chunks of 65536 into a widened 64-bit accumulator with no per-element check: a chunk of <64-bit values cannot overflow the 64-bit accumulator (2^16 * (2^32-1) < 2^64), so only one checked add per chunk is needed. This lets the inner loop vectorize to packed widening adds (paddq + unpck). 64-bit inputs keep the per-element checked path since a chunk of 64-bit values could itself overflow.

This observes overflow at chunk boundaries rather than per element, so a signed sum whose running total transiently leaves i64 range but ends in range now returns the true total instead of null. The final result is unchanged whenever the existing per-batch combine did not already overflow.

Benchmarked over 100k elements: sum_i32 ~19us, sum_u32 ~15us, sum_i64 ~51us.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
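The chunked accumulation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `sum_i32_chunked` and the `CHUNK` constant are hypothetical names, and only the i32 case is shown.

```rust
/// Hypothetical chunk length, matching the 65536 quoted in the commit message.
const CHUNK: usize = 1 << 16;

/// Hypothetical helper: sums all-valid i32 values into an i64.
/// Returns `None` if the running total overflows i64 at a chunk boundary.
fn sum_i32_chunked(values: &[i32]) -> Option<i64> {
    let mut total: i64 = 0;
    for chunk in values.chunks(CHUNK) {
        // No per-element overflow check: at most 2^16 values, each of
        // magnitude at most 2^31, sum to at most 2^47 in magnitude,
        // which is well inside the i64 range.
        let partial: i64 = chunk.iter().map(|&v| i64::from(v)).sum();
        // One checked add per chunk instead of one per element.
        total = total.checked_add(partial)?;
    }
    Some(total)
}
```

For example, `sum_i32_chunked(&[1, 2, 3])` returns `Some(6)`. Because the branch on overflow now sits outside the inner loop, the per-chunk sum has no early exit, which is the property the commit message relies on for vectorization.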