perf: branchless boolean zip kernel #8275

Open
joseph-isaacs wants to merge 2 commits into develop from claude/bool-branchless-zip

@joseph-isaacs
Contributor

Summary

Adds a dedicated, branchless ZipKernel for Bool. Booleans are bit-packed, so selecting if_true where the mask is set and if_false where it is not reduces to a single bitwise blend over the packed words, (true & mask) | (false & !mask), where true and false stand for the packed if_true and if_false words. This replaces the generic per-run builder, which degrades to per-element work on fragmented masks. The cost of the blend depends only on the array length, not on the shape of the mask.
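To make the blend concrete, here is a minimal sketch over plain `u64` slices. It is not the kernel from this PR: `blend_words` is a hypothetical name, and it assumes all three buffers are already aligned, equal-length runs of 64-bit words.

```rust
/// Blend two bit-packed boolean buffers under a bit-packed mask.
/// Hypothetical helper for illustration; not the Vortex kernel itself.
/// For every bit position, the result takes the bit from `if_true`
/// where `mask` is set and from `if_false` where it is clear.
fn blend_words(if_true: &[u64], if_false: &[u64], mask: &[u64]) -> Vec<u64> {
    assert_eq!(if_true.len(), mask.len());
    assert_eq!(if_false.len(), mask.len());
    if_true
        .iter()
        .zip(if_false)
        .zip(mask)
        // No branch on the mask contents: the same three bitwise ops
        // run for every word, whatever the mask looks like.
        .map(|((&t, &f), &m)| (t & m) | (f & !m))
        .collect()
}
```

With `if_true = 0b1111_0000`, `if_false = 0b1010_1010`, and `mask = 0b1100_1100`, the single output word is `0b1110_0010`.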

It also introduces a shared zip_validity helper in the zip module: the result validity is itself a zip over the two boolean validity bitmaps, so it's built as a (lazy) zip array reusing this kernel. This gives the per-encoding zip kernels one shared validity-selection path, and the recursion terminates immediately because validity bitmaps are always Bool(NonNullable).
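The validity reuse can be sketched the same way. The example below is an eager simplification, whereas the PR describes the validity as a lazy zip array. `NullableBits`, `blend`, and `zip_nullable` are hypothetical names, and an all-valid input is assumed to have been materialised as an all-ones bitmap.

```rust
/// Hypothetical stand-in for a nullable, bit-packed boolean array.
struct NullableBits {
    /// Packed values, one bit per element.
    values: Vec<u64>,
    /// Packed validity, one bit per element (set = valid).
    /// It carries no validity of its own, so zipping it cannot recurse further.
    validity: Vec<u64>,
}

/// Word-wise blend: take `t` where the mask bit is set, `f` otherwise.
fn blend(t: &[u64], f: &[u64], m: &[u64]) -> Vec<u64> {
    assert!(t.len() == m.len() && f.len() == m.len());
    t.iter()
        .zip(f)
        .zip(m)
        .map(|((&t, &f), &m)| (t & m) | (f & !m))
        .collect()
}

/// Zip two nullable boolean arrays: values and validity go through the
/// same blend, using the same mask.
fn zip_nullable(if_true: &NullableBits, if_false: &NullableBits, mask: &[u64]) -> NullableBits {
    NullableBits {
        values: blend(&if_true.values, &if_false.values, mask),
        validity: blend(&if_true.validity, &if_false.validity, mask),
    }
}
```

The point of the sketch is only that the validity selection needs no separate code path: it is the value blend applied to a different pair of bitmaps.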

This lands first; the primitive and listview branchless zip kernels build on it (their nullable validity selection becomes a fast bool zip rather than the slow generic builder).

Changes

  • vortex-array/src/arrays/bool/compute/zip.rs (new): the kernel.
  • vortex-array/src/arrays/bool/compute/mod.rs, .../vtable/kernel.rs: register it.
  • vortex-array/src/scalar_fn/fns/zip/mod.rs: shared pub(crate) fn zip_validity.
  • vortex-array/benches/bool_zip.rs (new): a small divan bench.

Performance (divan, 65,536 bools, median)

| case | time |
|---|---|
| nonnull | ~7 µs |
| nullable | ~14 µs |

Testing

  • Unit tests for non-nullable and Validity::Array inputs spanning the 64-bit mask chunk boundary + remainder.
  • Full vortex-array lib suite passes (2933 tests); cargo +nightly fmt and clippy -D warnings (default + all-features) clean.
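A boundary test of the kind listed above might look roughly like the following sketch. It is not the test code from this PR. It assumes least-significant-bit-first packing into `u64` words, and `pack`, `get`, `blend`, and `matches_reference` are hypothetical helpers. A length such as 130 covers two full 64-bit words plus a 2-bit remainder.

```rust
/// Pack bools into u64 words, least significant bit first (an assumption).
fn pack(bits: &[bool]) -> Vec<u64> {
    let mut words = vec![0u64; (bits.len() + 63) / 64];
    for (i, &b) in bits.iter().enumerate() {
        if b {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Read bit `i` back out of the packed words.
fn get(words: &[u64], i: usize) -> bool {
    (words[i / 64] >> (i % 64)) & 1 == 1
}

/// Word-wise blend: take `t` where the mask bit is set, `f` otherwise.
fn blend(t: &[u64], f: &[u64], m: &[u64]) -> Vec<u64> {
    t.iter()
        .zip(f)
        .zip(m)
        .map(|((&t, &f), &m)| (t & m) | (f & !m))
        .collect()
}

/// Compare the packed blend against a per-element reference for `len` bits.
fn matches_reference(len: usize) -> bool {
    let if_true: Vec<bool> = (0..len).map(|i| i % 3 == 0).collect();
    let if_false: Vec<bool> = (0..len).map(|i| i % 5 == 0).collect();
    let mask: Vec<bool> = (0..len).map(|i| i % 2 == 0).collect();
    let out = blend(&pack(&if_true), &pack(&if_false), &pack(&mask));
    (0..len).all(|i| {
        let expected = if mask[i] { if_true[i] } else { if_false[i] };
        get(&out, i) == expected
    })
}
```

Calling `matches_reference` with lengths on either side of a word boundary (for example 63, 64, 65, and 130) exercises both the full-word path and the remainder.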

https://claude.ai/code/session_01N5ivPiCJy7dGQjMEP7ips9


Generated by Claude Code

Add a dedicated `ZipKernel for Bool` that blends the two value bitmaps with the
mask in a single bitwise pass -- `(true & mask) | (false & !mask)` -- instead of
the generic per-run builder, so boolean zips are branch-free and mask-shape
independent.

Also add a shared `zip_validity` helper to the zip module that builds the result
validity as a (lazy) zip over the two boolean validity bitmaps, reusing this
kernel. This gives the per-encoding zip kernels one shared validity-selection
path; the recursion terminates immediately because validity bitmaps are
non-nullable.

Adds a small `bool_zip` divan benchmark (nonnull ~7us, nullable ~14us).

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs requested a review from a team June 5, 2026 16:37
@joseph-isaacs joseph-isaacs added the changelog/performance A performance improvement label Jun 5, 2026 — with Claude
The doc comment on the public ZipKernel impl linked to the pub(crate)
zip_validity, which -D rustdoc::private-intra-doc-links rejects. Use a plain
code span instead.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@codspeed-hq

codspeed-hq Bot commented Jun 5, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 4 improved benchmarks
❌ 6 regressed benchmarks
✅ 1503 untouched benchmarks
🆕 2 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | chunked_bool_canonical_into[(1000, 10)] | 31.6 µs | 46.6 µs | -32.16% |
| Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 161.9 µs | 198.2 µs | -18.33% |
| Simulation | compare[15] | 119.9 µs | 145.8 µs | -17.78% |
| Simulation | chunked_varbinview_into_canonical[(1000, 10)] | 177.2 µs | 213.5 µs | -17.02% |
| Simulation | compare[14] | 117.5 µs | 141.5 µs | -16.97% |
| Simulation | compare[13] | 115.5 µs | 137.7 µs | -16.1% |
| Simulation | bitwise_not_vortex_buffer_mut[128] | 275.3 ns | 216.9 ns | +26.89% |
| Simulation | bitwise_not_vortex_buffer_mut[1024] | 336.9 ns | 278.6 ns | +20.94% |
| Simulation | bitwise_not_vortex_buffer_mut[2048] | 400.6 ns | 342.2 ns | +17.05% |
| Simulation | compare[5] | 76.9 µs | 69.2 µs | +11.16% |
| 🆕 Simulation | nonnull | N/A | 77.5 µs | N/A |
| 🆕 Simulation | nullable | N/A | 135.4 µs | N/A |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/bool-branchless-zip (7ad4b18) with develop (e06d80b)


Labels

changelog/performance A performance improvement

2 participants