fastlanes: bit-packed compare-constant fast path + bitpack_constant kernel #8013

Open
joseph-isaacs wants to merge 4 commits into develop from claude/bitpack-compare-speedup-KGPS3

Conversation

@joseph-isaacs
Contributor

Summary

Stacked on #8012. Speeds up the bitpack_compare bench from the parent PR with two complementary optimizations driven by the same observation — a bit-packed lane holds values in [0, 2^bit_width - 1], so a constant outside that range can be answered analytically without touching the packed buffer.

Compare-constant fast path (compute/compare.rs)

Register a CompareKernel for BitPacked that short-circuits when the RHS constant c is outside [0, 2^bit_width - 1]. For each operator the answer is a constant boolean modulo patches and validity:

| Operator | Outside-range result |
|---|---|
| Eq / NotEq | false / true everywhere |
| Lt / Lte / Gt / Gte | constant once c is on either side of the range |

Detecting the range is an O(1) i128 check via the new BitPackedData::value_fits_bit_width helper. With no patches and no nulls the kernel returns a ConstantArray<bool> (also O(1)); otherwise it allocates a BitBuffer, fills it with the constant result, and overlays the per-position outcome at each patch index. In-range constants fall through to the canonical decompress + Arrow compare path; tests exercise both fall-throughs.
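
The out-of-range decision described above can be sketched in a few lines. This is an illustration only, not the code in `compute/compare.rs`: the `Operator` enum, `fits_bit_width`, and `out_of_range_result` are hypothetical stand-ins, and patches and validity are ignored.

```rust
/// Comparison operators used by this sketch (hypothetical stand-in type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Operator {
    Eq,
    NotEq,
    Lt,
    Lte,
    Gt,
    Gte,
}

/// True when `c` lies in `[0, 2^bit_width - 1]`. The constant is widened to
/// i128 so any 64-bit signed or unsigned value can be checked without overflow.
pub fn fits_bit_width(c: i128, bit_width: u32) -> bool {
    assert!(bit_width <= 64);
    c >= 0 && c < (1i128 << bit_width)
}

/// Result of `packed_value <op> c` for every unpatched, valid position, or
/// `None` when `c` is in range and the packed data has to be inspected.
pub fn out_of_range_result(op: Operator, c: i128, bit_width: u32) -> Option<bool> {
    if fits_bit_width(c, bit_width) {
        return None;
    }
    // An out-of-range constant is either negative (below every packed value)
    // or at least 2^bit_width (above every packed value).
    let above = c > 0;
    Some(match op {
        Operator::Eq => false,
        Operator::NotEq => true,
        // value < c (or <=) holds everywhere exactly when c is above the range.
        Operator::Lt | Operator::Lte => above,
        // value > c (or >=) holds everywhere exactly when c is below the range.
        Operator::Gt | Operator::Gte => !above,
    })
}
```

With `bit_width = 4`, the bench constant `1 << 4 = 16` yields `Some(false)` for `Eq` and `Some(true)` for `Lt`, while `15` yields `None` and would take the decompress path.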

bitpack_constant analytical encoder (array/bitpack_compress.rs)

Add a constant-only pack kernel that builds the FastLanes bit pattern for a [constant; len] input without calling BitPacking::pack. For constant input every lane produces the same bit_width output words, so we compute those words analytically: bit j of output word k (0 ≤ k < bit_width, with T_bits the width of the element type T in bits) is bit (k * T_bits + j) mod bit_width of c. We then memset each word LANES times into a stack chunk template and memcpy the template into every full chunk. The standard packer is only invoked for the partial tail (zero-padded past len). bitpack_encode_constant wraps the buffer up as a BitPackedArray. A bitwise-equivalence rstest covers byte-identity with BitPacking::pack across lengths, widths, and constants.
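
A minimal sketch of the analytical word computation for `T = u32`, checked against a naive per-lane packer. It assumes that values within a lane are laid out back to back, least-significant bit first, and that a 1024-element chunk of `u32` has 32 lanes. All function names here are hypothetical and this is not the kernel in `array/bitpack_compress.rs`.

```rust
/// Lanes per 1024-element chunk for u32 (1024 / 32); an assumption of this sketch.
pub const LANES: usize = 32;

/// The `bit_width` output words one lane produces when every input equals `c`.
/// Bit `j` of word `k` is bit `(k * 32 + j) % bit_width` of `c`.
pub fn constant_lane_words(c: u32, bit_width: u32) -> Vec<u32> {
    assert!((1..=32).contains(&bit_width));
    assert!(bit_width == 32 || c >> bit_width == 0, "c must fit in bit_width bits");
    (0..bit_width)
        .map(|k| {
            let mut word = 0u32;
            for j in 0..32u32 {
                let src_bit = (k * 32 + j) % bit_width;
                word |= ((c >> src_bit) & 1) << j;
            }
            word
        })
        .collect()
}

/// Reference: write 32 copies of `c` into the lane's bit stream one bit at a time.
pub fn naive_lane_words(c: u32, bit_width: u32) -> Vec<u32> {
    let mut words = vec![0u32; bit_width as usize];
    for row in 0..32u32 {
        for b in 0..bit_width {
            let pos = row * bit_width + b;
            words[(pos / 32) as usize] |= ((c >> b) & 1) << (pos % 32);
        }
    }
    words
}

/// Chunk template: each lane word repeated `LANES` times, so word `k` of
/// lane `l` sits at index `k * LANES + l`.
pub fn constant_chunk(c: u32, bit_width: u32) -> Vec<u32> {
    constant_lane_words(c, bit_width)
        .into_iter()
        .flat_map(|w| std::iter::repeat(w).take(LANES))
        .collect()
}
```

A full-length encoder would copy the result of `constant_chunk` once per full 1024-element chunk and hand only the partial tail to the general packer, as the description states.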

Benches

  • bitpack_compare (added in #8012, "bench: bit-packed compare-constant baseline") on this branch now exercises the fast path; at bit_width ∈ {4, 16}, len ∈ {1024, 65536} it runs in ~1.4–1.5 µs vs 8–125 µs for the decompress + Arrow baseline.
  • New bitpack_constant bench compares the analytical kernel against the full bitpack_encode pipeline on uniform-constant input; at 64 K u32 elements the analytical kernel is roughly 23–62× faster.

Plan doc (docs/inrange_compare_plan.md)

Documents the follow-up plan to accelerate in-range ordering comparisons: compare the packed array against the packed constant via SWAR less-than per supported bit width (Routes A/B/C, including Knuth broadword with rotation tables for widths that straddle word boundaries), derive the four ordering operators from one Lt primitive, and benchmark against the canonical SIMD baseline before landing.
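
For orientation, a generic SWAR unsigned less-than over fields packed in a `u64`, with the other three ordering operators derived from it. This is a textbook construction offered as a sketch. It is not necessarily any of Routes A/B/C in the plan doc, it handles only field widths that divide 64 (so no fields straddling word boundaries), and all function names are hypothetical.

```rust
/// Mask with the top bit of every `w`-bit field set.
pub fn high_bits(w: u32) -> u64 {
    assert!(w >= 1 && 64 % w == 0, "sketch only handles widths that divide 64");
    let mut h = 0u64;
    let mut i = w - 1;
    while i < 64 {
        h |= 1u64 << i;
        i += w;
    }
    h
}

/// Per-field unsigned `x < y`. The top bit of each field in the result is set
/// where the field of `x` is less than the field of `y`; all other bits are 0.
pub fn swar_lt(x: u64, y: u64, w: u32) -> u64 {
    let h = high_bits(w);
    // Compare the low w-1 bits of each field. Forcing the top bit of the
    // minuend to 1 keeps every field-wise difference positive, so no borrow
    // crosses a field boundary; the top bit of `d` stays set iff x_low >= y_low.
    let d = ((x & !h) | h).wrapping_sub(y & !h);
    // x < y if the top bits differ with x's clear and y's set, or if the top
    // bits agree and the low bits of x are smaller.
    ((!x & y) | (!(x ^ y) & !d)) & h
}

/// x > y is y < x.
pub fn swar_gt(x: u64, y: u64, w: u32) -> u64 {
    swar_lt(y, x, w)
}

/// x >= y is not (x < y).
pub fn swar_gte(x: u64, y: u64, w: u32) -> u64 {
    !swar_lt(x, y, w) & high_bits(w)
}

/// x <= y is not (y < x).
pub fn swar_lte(x: u64, y: u64, w: u32) -> u64 {
    !swar_lt(y, x, w) & high_bits(w)
}
```

To compare against a constant, `y` would hold the constant replicated into every field, which is the kind of bit pattern a constant packer produces.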

Test plan

  • cargo nextest run -p vortex-fastlanes --all-features → 265/265 pass locally
  • cargo check -p vortex-fastlanes --benches --all-features
  • cargo bench -p vortex-fastlanes --bench bitpack_compare shows the fast-path speedup vs the baseline from #8012 ("bench: bit-packed compare-constant baseline")
  • cargo bench -p vortex-fastlanes --bench bitpack_constant shows the analytical encoder speedup
  • ./scripts/public-api.sh agrees with the committed lock file
  • cargo clippy --all-targets --all-features

Supersedes #8011 (split into bench + speedup).

🤖 Generated with Claude Code

Add a `bitpack_compare` divan bench in vortex-fastlanes that pits a binary
`Operator::Eq` / `Operator::Lt` comparison of a `BitPackedData` array with an
out-of-range constant against an explicit "decompress, then Arrow compare"
baseline that materialises the unpacked `PrimitiveArray` first.

The constant is chosen as `1 << BW`, i.e. just past the packable range, so a
future kernel that recognises out-of-range constants can short-circuit it.
Today both arms decompress; the benchmark establishes a baseline for that
upcoming optimization to land against. Sized small (`len ∈ {1024, 65536}`,
`bit_width ∈ {4, 16}`, Eq + Lt) so it finishes quickly.

Run with `cargo bench -p vortex-fastlanes --bench bitpack_compare`.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ernel

Speeds up the `bitpack_compare` bench from the parent commit with two
independent optimizations driven by the same observation — a bit-packed lane
holds values in `[0, 2^bit_width - 1]`, so a constant outside that range can
be answered analytically without touching the packed buffer.

**Compare-constant fast path (`compute/compare.rs`)**

Register a `CompareKernel` for `BitPacked` that short-circuits when the RHS
constant `c` is outside `[0, 2^bit_width - 1]`. For each operator the answer
is a constant boolean modulo patches and validity:

  Eq/NotEq           - false / true everywhere
  Lt/Lte/Gt/Gte      - constant once `c` is on either side of the range

Detecting the range is an `O(1)` `i128` check via the new
`BitPackedData::value_fits_bit_width` helper. With no patches and no nulls the
kernel returns a `ConstantArray<bool>` (also `O(1)`); otherwise it allocates a
`BitBuffer`, fills it with the constant result, and overlays the per-position
outcome at each patch index. In-range constants fall through to the canonical
decompress + Arrow compare path; tests exercise both fall-throughs.

**`bitpack_constant` analytical encoder (`array/bitpack_compress.rs`)**

Add a constant-only pack kernel that builds the FastLanes bit pattern for a
`[constant; len]` input without calling `BitPacking::pack`. For constant input
every lane produces the same `bit_width` output words; we compute those words
analytically (bit `j` of output word `k` is bit `(k * T_bits + j) mod
bit_width` of `c`, where `T_bits` is the width of `T` in bits), then `memset`
each word `LANES` times into a stack chunk
template and `memcpy` the template into every full chunk. The standard packer
is only invoked for the partial tail (zero-padded past `len`).
`bitpack_encode_constant` wraps the buffer up as a `BitPackedArray`. A
bitwise-equivalence rstest covers byte-identity with `BitPacking::pack` across
lengths, widths, and constants.

**Benches**

* `bitpack_compare` (added in the parent commit) on this branch now exercises
  the fast path; at `bit_width ∈ {4, 16}`, `len ∈ {1024, 65536}` it runs in
  ~1.4-1.5 µs vs 8-125 µs for the decompress + Arrow baseline.
* New `bitpack_constant` bench compares the analytical kernel against the
  full `bitpack_encode` pipeline on uniform-constant input; at 64 K u32
  elements the analytical kernel is roughly 23-62x faster.

**Plan doc (`docs/inrange_compare_plan.md`)**

Document the follow-up plan to accelerate *in-range* ordering comparisons:
compare the packed array against the packed constant via SWAR less-than per
supported bit width (Routes A/B/C, including Knuth broadword with rotation
tables for widths that straddle word boundaries), derive the four ordering
operators from one `Lt` primitive, and benchmark against the canonical SIMD
baseline before landing.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@codspeed-hq

codspeed-hq Bot commented May 18, 2026

Merging this PR will improve performance by ×2.6

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 12 improved benchmarks
❌ 1 regressed benchmark
✅ 1224 untouched benchmarks
🆕 8 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| | Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 197.9 µs | 162 µs | +22.19% |
| | Simulation | chunked_varbinview_into_canonical[(100, 100)] | 358.4 µs | 323.5 µs | +10.78% |
| | Simulation | chunked_varbinview_into_canonical[(1000, 10)] | 211.2 µs | 175.8 µs | +20.11% |
| | Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] | 224.8 µs | 188.6 µs | +19.23% |
| 🆕 | Simulation | full_encode[4, 1024] | N/A | 19.2 µs | N/A |
| 🆕 | Simulation | fast_encode[4, 65536] | N/A | 30.5 µs | N/A |
| 🆕 | Simulation | full_encode[16, 1024] | N/A | 17.3 µs | N/A |
| 🆕 | Simulation | full_encode[16, 65536] | N/A | 358 µs | N/A |
| 🆕 | Simulation | full_encode[4, 65536] | N/A | 313.6 µs | N/A |
| | Simulation | fast_eq_out_of_range[4, 1024] | 67 µs | 26.9 µs | ×2.5 |
| | Simulation | baseline_lt[4, 1024] | 64.1 µs | 79 µs | -18.86% |
| | Simulation | fast_eq_out_of_range[16, 1024] | 67.7 µs | 26.8 µs | ×2.5 |
| | Simulation | fast_eq_out_of_range[4, 65536] | 246 µs | 35.2 µs | ×7 |
| | Simulation | fast_lt_out_of_range[4, 1024] | 87.5 µs | 32.8 µs | ×2.7 |
| 🆕 | Simulation | fast_encode[16, 65536] | N/A | 81.5 µs | N/A |
| | Simulation | fast_lt_out_of_range[16, 1024] | 67.8 µs | 25.7 µs | ×2.6 |
| | Simulation | fast_eq_out_of_range[16, 65536] | 291.1 µs | 35.6 µs | ×8.2 |
| | Simulation | fast_lt_out_of_range[4, 65536] | 262 µs | 35.1 µs | ×7.5 |
| 🆕 | Simulation | fast_encode[4, 1024] | N/A | 11.6 µs | N/A |
| | Simulation | fast_lt_out_of_range[16, 65536] | 306.3 µs | 35.2 µs | ×8.7 |
| ... | ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/bitpack-compare-speedup-KGPS3 (3b1b8cf) with develop (7b47788)

Base automatically changed from claude/bitpack-compare-bench-KGPS3 to develop May 18, 2026 17:26
…pare-speedup-KGPS3

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

# Conflicts:
#	encodings/fastlanes/Cargo.toml
u
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>