fastlanes: bit-packed compare-constant fast path + bitpack_constant kernel by joseph-isaacs · Pull Request #8013 · vortex-data/vortex

joseph-isaacs · 2026-05-18T17:04:24Z

Summary

Stacked on #8012. Speeds up the bitpack_compare bench from the parent PR with two complementary optimizations driven by the same observation — a bit-packed lane holds values in [0, 2^bit_width - 1], so a constant outside that range can be answered analytically without touching the packed buffer.

Compare-constant fast path (`compute/compare.rs`)

Register a CompareKernel for BitPacked that short-circuits when the RHS constant c is outside [0, 2^bit_width - 1]. For each operator the answer is a constant boolean modulo patches and validity:

| Operator | Outside range result |
| --- | --- |
| `Eq` / `NotEq` | `false` / `true` everywhere |
| `Lt` / `Lte` / `Gt` / `Gte` | constant once `c` is on either side of the range |

Detecting the range is an O(1) i128 check via the new BitPackedData::value_fits_bit_width helper. With no patches and no nulls the kernel returns a ConstantArray<bool> (also O(1)); otherwise it allocates a BitBuffer, fills it with the constant result, and overlays the per-position outcome at each patch index. In-range constants fall through to the canonical decompress + Arrow compare path; tests exercise both fall-throughs.
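The short-circuit logic can be sketched in plain Rust as follows. This is a minimal illustration, not the PR's code: `fits_bit_width`, `out_of_range_result`, and `compare_with_patches` are hypothetical names, a `Vec<bool>` stands in for the `BitBuffer`, patches are modelled as `(index, value)` pairs, and validity is ignored.

```rust
/// The six comparison operators, read as `value <op> c`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Operator { Eq, NotEq, Lt, Lte, Gt, Gte }

/// Hypothetical stand-in for the range check: is `c` in [0, 2^bit_width - 1]?
/// Widening to i128 keeps `1 << bit_width` from overflowing for widths up to 64.
fn fits_bit_width(c: i128, bit_width: u32) -> bool {
    assert!(bit_width <= 64);
    c >= 0 && c < (1i128 << bit_width)
}

/// Returns the constant answer shared by every packed value when `c` is
/// out of range, or `None` when `c` is in range and the slow path is needed.
fn out_of_range_result(op: Operator, c: i128, bit_width: u32) -> Option<bool> {
    if fits_bit_width(c, bit_width) {
        return None;
    }
    // Out of range and positive means `c` is above every packed value.
    let c_above = c > 0;
    Some(match op {
        Operator::Eq => false,
        Operator::NotEq => true,
        Operator::Lt | Operator::Lte => c_above,
        Operator::Gt | Operator::Gte => !c_above,
    })
}

/// Fill the output with the constant answer, then recompute the outcome
/// at each patch index, since patch values may lie outside the packed range.
fn compare_with_patches(
    op: Operator,
    c: i128,
    bit_width: u32,
    len: usize,
    patches: &[(usize, i128)],
) -> Option<Vec<bool>> {
    let fill = out_of_range_result(op, c, bit_width)?;
    let mut out = vec![fill; len];
    for &(idx, v) in patches {
        out[idx] = match op {
            Operator::Eq => v == c,
            Operator::NotEq => v != c,
            Operator::Lt => v < c,
            Operator::Lte => v <= c,
            Operator::Gt => v > c,
            Operator::Gte => v >= c,
        };
    }
    Some(out)
}
```

For `bit_width = 4` the packed range is `[0, 15]`, so `c = 16` makes `Lt` true everywhere except at patches whose value is 16 or more, while `c = 15` returns `None` and falls through to the decompress path.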

`bitpack_constant` analytical encoder (`array/bitpack_compress.rs`)

Add a constant-only pack kernel that builds the FastLanes bit pattern for a [constant; len] input without calling BitPacking::pack. For constant input every lane produces the same bit_width output words; we compute those words analytically — each output word's j-th bit is bit (k * T_bits + j) mod bit_width of c — then memset each word LANES times into a stack chunk template and memcpy the template into every full chunk. The standard packer is only invoked for the partial tail (zero-padded past len). bitpack_encode_constant wraps the buffer up as a BitPackedArray. A bitwise-equivalence rstest covers byte-identity with BitPacking::pack across lengths, widths, and constants.
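The bit formula above can be checked with a small standalone sketch for `T = u32` (so `T_bits = 32`). It assumes `LANES = 32` for `u32`, that `c` already fits in `bit_width` bits, and that a packed chunk stores each of the `bit_width` word rows as `LANES` contiguous words; the function names are illustrative and are not taken from the PR. The naive version packs 32 copies of `c` back to back within one lane, which is what the analytical version must reproduce.

```rust
const T_BITS: usize = 32; // bits per u32 word
const LANES: usize = 32; // assumed lane count for u32 (1024 values / 32 bits)

/// Analytical form: bit `j` of output word `k` is bit
/// `(k * T_BITS + j) % bit_width` of the constant `c`.
fn constant_lane_words(c: u32, bit_width: usize) -> Vec<u32> {
    assert!((1..=32).contains(&bit_width));
    (0..bit_width)
        .map(|k| {
            let mut word = 0u32;
            for j in 0..T_BITS {
                let bit = (c >> ((k * T_BITS + j) % bit_width)) & 1;
                word |= bit << j;
            }
            word
        })
        .collect()
}

/// Reference form: append 32 copies of `c`, `bit_width` bits each,
/// to one lane's bit stream and read the stream back as u32 words.
fn naive_lane_words(c: u32, bit_width: usize) -> Vec<u32> {
    let mut words = vec![0u32; bit_width];
    for i in 0..T_BITS {
        for b in 0..bit_width {
            let p = i * bit_width + b; // position in the lane's bit stream
            words[p / T_BITS] |= ((c >> b) & 1) << (p % T_BITS);
        }
    }
    words
}

/// Build one chunk template by repeating each word `LANES` times,
/// mirroring the "memset each word LANES times" step.
fn chunk_template(words: &[u32]) -> Vec<u32> {
    words
        .iter()
        .flat_map(|&w| std::iter::repeat(w).take(LANES))
        .collect()
}
```

Because every lane sees the same constant, the order in which FastLanes assigns values to positions within a lane does not affect the result, which is why a single per-lane word list suffices.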

Benches

`bitpack_compare` (added in #8012, "bench: bit-packed compare-constant baseline") now exercises the fast path on this branch; at bit_width ∈ {4, 16}, len ∈ {1024, 65536} it runs in ~1.4–1.5 µs vs 8–125 µs for the decompress + Arrow baseline.
The new `bitpack_constant` bench compares the analytical kernel against the full `bitpack_encode` pipeline on uniform-constant input; at 64 K u32 elements the analytical kernel is roughly 23–62× faster.

Plan doc (`docs/inrange_compare_plan.md`)

Documents the follow-up plan to accelerate in-range ordering comparisons: compare the packed array against the packed constant via SWAR less-than per supported bit width (Routes A/B/C, including Knuth broadword with rotation tables for widths that straddle word boundaries), derive the four ordering operators from one Lt primitive, and benchmark against the canonical SIMD baseline before landing.
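As background for the SWAR idea, here is a generic sketch of a per-field unsigned less-than on eight 8-bit fields held in a `u64`, with the other three ordering operators derived from it. It illustrates the general technique only; it is not the plan's Routes A/B/C, it assumes fields aligned to byte boundaries, and all names are hypothetical.

```rust
/// High bit of each 8-bit field.
const H: u64 = 0x8080_8080_8080_8080;

/// Per-field unsigned `x < y`. The high bit of each result field is set
/// where the comparison holds; all other bits are zero.
fn swar_lt(x: u64, y: u64) -> u64 {
    // Per field: (x_low + 128) - y_low. Forcing the high bit on the left and
    // clearing it on the right means no field ever borrows from its neighbour.
    // The high bit of each field of `z` is set iff x_low >= y_low.
    let z = (x | H).wrapping_sub(y & !H);
    // x < y iff (x's high bit is 0 and y's is 1), or
    // (the high bits agree and x_low < y_low).
    ((!x & y) | (!(x ^ y) & !z)) & H
}

/// The remaining ordering operators, derived from the one `Lt` primitive.
fn swar_gt(x: u64, y: u64) -> u64 {
    swar_lt(y, x)
}

fn swar_gte(x: u64, y: u64) -> u64 {
    swar_lt(x, y) ^ H // not (x < y)
}

fn swar_lte(x: u64, y: u64) -> u64 {
    swar_lt(y, x) ^ H // not (y < x)
}
```

Deriving `Gt`, `Gte`, and `Lte` by swapping operands or flipping the result bits means only one primitive needs to be specialised per bit width.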

Test plan

cargo nextest run -p vortex-fastlanes --all-features → 265/265 pass locally
cargo check -p vortex-fastlanes --benches --all-features
cargo bench -p vortex-fastlanes --bench bitpack_compare shows the fast-path speedup vs the baseline from #8012 ("bench: bit-packed compare-constant baseline")
cargo bench -p vortex-fastlanes --bench bitpack_constant shows the analytical encoder speedup
./scripts/public-api.sh agrees with the committed lock file
cargo clippy --all-targets --all-features

Supersedes #8011 (split into bench + speedup).

🤖 Generated with Claude Code

Add a `bitpack_compare` divan bench in vortex-fastlanes that compares a binary `Operator::Eq` / `Operator::Lt` with an out-of-range constant on a `BitPackedData` array against an explicit "decompress, then Arrow compare" baseline that materialises the unpacked `PrimitiveArray` first.

The constant is chosen as `1 << BW`, i.e. just past the packable range, so a future kernel that recognises out-of-range constants can short-circuit it. Today both arms decompress; the benchmark establishes a baseline for that upcoming optimization to land against.

Sized small (`len ∈ {1024, 65536}`, `bit_width ∈ {4, 16}`, Eq + Lt) so it finishes quickly. Run with `cargo bench -p vortex-fastlanes --bench bitpack_compare`.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>


codspeed-hq · 2026-05-18T17:12:57Z

Merging this PR will improve performance by ×2.6

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️

Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 12 improved benchmarks
❌ 1 regressed benchmark
✅ 1224 untouched benchmarks
🆕 8 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
| --- | --- | --- | --- | --- | --- |
| ⚡ | Simulation | `chunked_varbinview_canonical_into[(1000, 10)]` | 197.9 µs | 162 µs | +22.19% |
| ⚡ | Simulation | `chunked_varbinview_into_canonical[(100, 100)]` | 358.4 µs | 323.5 µs | +10.78% |
| ⚡ | Simulation | `chunked_varbinview_into_canonical[(1000, 10)]` | 211.2 µs | 175.8 µs | +20.11% |
| ⚡ | Simulation | `chunked_varbinview_opt_canonical_into[(1000, 10)]` | 224.8 µs | 188.6 µs | +19.23% |
| 🆕 | Simulation | `full_encode[4, 1024]` | N/A | 19.2 µs | N/A |
| 🆕 | Simulation | `fast_encode[4, 65536]` | N/A | 30.5 µs | N/A |
| 🆕 | Simulation | `full_encode[16, 1024]` | N/A | 17.3 µs | N/A |
| 🆕 | Simulation | `full_encode[16, 65536]` | N/A | 358 µs | N/A |
| 🆕 | Simulation | `full_encode[4, 65536]` | N/A | 313.6 µs | N/A |
| ⚡ | Simulation | `fast_eq_out_of_range[4, 1024]` | 67 µs | 26.9 µs | ×2.5 |
| ❌ | Simulation | `baseline_lt[4, 1024]` | 64.1 µs | 79 µs | -18.86% |
| ⚡ | Simulation | `fast_eq_out_of_range[16, 1024]` | 67.7 µs | 26.8 µs | ×2.5 |
| ⚡ | Simulation | `fast_eq_out_of_range[4, 65536]` | 246 µs | 35.2 µs | ×7 |
| ⚡ | Simulation | `fast_lt_out_of_range[4, 1024]` | 87.5 µs | 32.8 µs | ×2.7 |
| 🆕 | Simulation | `fast_encode[16, 65536]` | N/A | 81.5 µs | N/A |
| ⚡ | Simulation | `fast_lt_out_of_range[16, 1024]` | 67.8 µs | 25.7 µs | ×2.6 |
| ⚡ | Simulation | `fast_eq_out_of_range[16, 65536]` | 291.1 µs | 35.6 µs | ×8.2 |
| ⚡ | Simulation | `fast_lt_out_of_range[4, 65536]` | 262 µs | 35.1 µs | ×7.5 |
| 🆕 | Simulation | `fast_encode[4, 1024]` | N/A | 11.6 µs | N/A |
| ⚡ | Simulation | `fast_lt_out_of_range[16, 65536]` | 306.3 µs | 35.2 µs | ×8.7 |
| ... | ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing `claude/bitpack-compare-speedup-KGPS3` (3b1b8cf) with `develop` (7b47788)


joseph-isaacs added 2 commits May 18, 2026 17:53

joseph-isaacs mentioned this pull request May 18, 2026

Fast-path comparison and constant encoding for bit-packed arrays #8011

Closed

Base automatically changed from claude/bitpack-compare-bench-KGPS3 to develop May 18, 2026 17:26

joseph-isaacs added 2 commits May 18, 2026 18:28

Merge remote-tracking branch 'origin/develop' into claude/bitpack-com…

64284d2

…pare-speedup-KGPS3 Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> # Conflicts: # encodings/fastlanes/Cargo.toml

u

3b1b8cf

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

joseph-isaacs wants to merge 4 commits into `develop` from `claude/bitpack-compare-speedup-KGPS3`
