perf: aggregate min/max #8061

Open
joseph-isaacs wants to merge 5 commits into develop from claude/great-edison-jrGY0

@joseph-isaacs

Adds a divan benchmark exercising the min/max aggregation over primitive
arrays (i32/i64/f64, with and without nulls) so we can measure and inspect
the codegen of the max reduction path.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

claude added 2 commits May 22, 2026 11:13
Adds a divan benchmark exercising the min/max aggregation over primitive
arrays (i32/i64/f64, with and without nulls) so we can measure and inspect
the codegen of the max reduction path.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The all-valid primitive min/max path used `itertools::minmax_by` with a
`total_compare` closure preceded by a NaN filter, which the autovectorizer
could not lower to packed min/max, leaving a scalar cmov reduction.

Route the all-true mask case for integer ptypes through a plain reduction.
Integers have no NaNs, so the NaN filter is unnecessary and LLVM vectorizes
the loop (pmaxub/pmaxsw, and pcmpgtd-based blends for i32/i64). Floats keep
the existing NaN-aware path.

Benchmarked over 1M elements: i32 all-valid ~2.93ms -> ~0.36ms, i64
~3.02ms -> ~0.55ms.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
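For readers following along, here is a minimal sketch of the two reduction shapes the commit message above contrasts. It is not the code in this PR: `max_plain` and `max_nan_aware` are hypothetical names, and the real implementation works on array buffers rather than plain slices. Whether the integer loop actually lowers to packed max instructions depends on the target and the LLVM version.

```rust
/// Plain max reduction over an all-valid integer slice (hypothetical helper).
/// There is no NaN filter and no comparator closure, only a compare-and-select
/// per element, which is a shape the autovectorizer can usually handle.
fn max_plain<T: Copy + Ord>(values: &[T]) -> Option<T> {
    values
        .iter()
        .copied()
        .reduce(|acc, v| if v > acc { v } else { acc })
}

/// NaN-aware max for floats (hypothetical helper): drop NaNs first, then
/// order the remaining values with `f64::total_cmp`.
fn max_nan_aware(values: &[f64]) -> Option<f64> {
    values
        .iter()
        .copied()
        .filter(|v| !v.is_nan())
        .max_by(|a, b| a.total_cmp(b))
}
```

Both helpers return `None` when there is nothing to reduce: an empty slice, or a float slice that contains only NaNs.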
@joseph-isaacs joseph-isaacs changed the title from "Add aggregate max divan benchmark" to "[claude] Add aggregate max divan benchmark" May 22, 2026
@codspeed-hq

codspeed-hq Bot commented May 22, 2026

Merging this PR will improve performance by 14.84%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 7 improved benchmarks
❌ 1 regressed benchmark
✅ 1243 untouched benchmarks
🆕 6 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| 🆕 | Simulation | max_f64 | N/A | 1.1 ms | N/A |
| 🆕 | Simulation | max_i32 | N/A | 223.3 µs | N/A |
| 🆕 | Simulation | max_i64 | N/A | 486.1 µs | N/A |
| 🆕 | Simulation | sum_i32 | N/A | 222.3 µs | N/A |
| 🆕 | Simulation | sum_i64 | N/A | 600.7 µs | N/A |
| 🆕 | Simulation | sum_u32 | N/A | 269.6 µs | N/A |
| | Simulation | chunked_varbinview_canonical_into[(100, 100)] | 273.1 µs | 308 µs | -11.32% |
| | Simulation | encode_primitives[u8, (10000, 2)] | 313.9 µs | 278 µs | +12.9% |
| | Simulation | encode_primitives[u8, (10000, 32)] | 318.4 µs | 282.3 µs | +12.81% |
| | Simulation | encode_primitives[u8, (10000, 4)] | 314.3 µs | 278.2 µs | +12.95% |
| | Simulation | encode_primitives[u8, (10000, 512)] | 335.2 µs | 299 µs | +12.09% |
| | Simulation | encode_primitives[u8, (10000, 8)] | 315.1 µs | 279 µs | +12.93% |
| | Simulation | for_compress_i32 | 753.4 µs | 443.8 µs | +69.76% |
| | Simulation | take_10k_contiguous | 309.7 µs | 280.6 µs | +10.38% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/great-edison-jrGY0 (8b98b5d) with develop (ae19fe7)

claude added 2 commits May 22, 2026 13:18
Keep a single all-valid bench for i32, i64, and f64 instead of the
per-type all-valid/half-null pairs.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs requested a review from robert3005 May 22, 2026 14:30
@joseph-isaacs joseph-isaacs changed the title from "[claude] Add aggregate max divan benchmark" to "perf: aggregate min/max" May 22, 2026
@joseph-isaacs joseph-isaacs added the changelog/performance (A performance improvement) label May 22, 2026
        .with_inputs(|| PrimitiveArray::from_iter(data.iter().copied()).into_array())
        .bench_refs(|a| {
            a.statistics()
                .compute_max::<i32>(&mut LEGACY_SESSION.create_execution_ctx())
can you create a local session here?

The all-valid integer sum did a per-element `checked_add`, whose overflow
early-return branch blocked autovectorization, leaving a scalar loop.

Sum narrower-than-64-bit integers in chunks of 65536 into a widened 64-bit
accumulator with no per-element check: a chunk of <64-bit values cannot
overflow the 64-bit accumulator (2^16 * (2^32-1) < 2^64), so only one
checked add per chunk is needed. This lets the inner loop vectorize to
packed widening adds (paddq + unpck). 64-bit inputs keep the per-element
checked path since a chunk of 64-bit values could itself overflow.

This observes overflow at chunk boundaries rather than per element, so a
signed sum whose running total transiently leaves i64 range but ends in
range now returns the true total instead of null. The final result is
unchanged whenever the existing per-batch combine did not already overflow.

Benchmarked over 100k elements: sum_i32 ~19us, sum_u32 ~15us, sum_i64 ~51us.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
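To make the chunking argument above concrete, here is a minimal sketch under stated assumptions. It is not the code in this PR: `sum_i32_chunked`, `sum_i64_checked`, and `CHUNK` are hypothetical names, and the sketch works on plain slices. With a signed 64-bit accumulator the per-chunk bound is 2^16 * 2^31 = 2^47, well inside i64 range, so the inner loop needs no overflow check.

```rust
/// Chunk length from the commit message: 65536 elements.
const CHUNK: usize = 1 << 16;

/// Sum i32 values into an i64 with one checked add per chunk (hypothetical helper).
fn sum_i32_chunked(values: &[i32]) -> Option<i64> {
    let mut total: i64 = 0;
    for chunk in values.chunks(CHUNK) {
        // At most 2^16 values, each with magnitude <= 2^31, so the partial
        // sum has magnitude <= 2^47 and cannot overflow an i64.
        let partial: i64 = chunk.iter().map(|&v| i64::from(v)).sum();
        // Overflow can only happen when combining chunk totals.
        total = total.checked_add(partial)?;
    }
    Some(total)
}

/// Per-element checked sum kept for 64-bit inputs (hypothetical helper):
/// a chunk of i64 values could overflow on its own, so every add is checked.
fn sum_i64_checked(values: &[i64]) -> Option<i64> {
    values.iter().try_fold(0i64, |acc, &v| acc.checked_add(v))
}
```

The unchecked inner loop is the part the optimizer is free to vectorize; the `checked_add` runs once per 65536 elements rather than once per element.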