fastlanes: bit-packed compare-constant fast path + bitpack_constant kernel #8013
joseph-isaacs wants to merge 4 commits
Add a `bitpack_compare` divan bench in vortex-fastlanes that pits a binary
`Operator::Eq` / `Operator::Lt` comparison between a `BitPackedData` array and
an out-of-range constant against an explicit "decompress, then Arrow compare"
baseline that materialises the unpacked `PrimitiveArray` first.
The constant is chosen as `1 << BW`, i.e. just past the packable range, so a
future kernel that recognises out-of-range constants can short-circuit it.
Today both arms decompress; the benchmark establishes a baseline for that
upcoming optimization to land against. Sized small (`len ∈ {1024, 65536}`,
`bit_width ∈ {4, 16}`, Eq + Lt) so it finishes quickly.
Run with `cargo bench -p vortex-fastlanes --bench bitpack_compare`.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ernel
Speeds up the `bitpack_compare` bench from the parent commit with two
independent optimizations driven by the same observation — a bit-packed lane
holds values in `[0, 2^bit_width - 1]`, so a constant outside that range can
be answered analytically without touching the packed buffer.
**Compare-constant fast path (`compute/compare.rs`)**
Register a `CompareKernel` for `BitPacked` that short-circuits when the RHS
constant `c` is outside `[0, 2^bit_width - 1]`. For each operator the answer
is a constant boolean modulo patches and validity:
Eq/NotEq - false / true everywhere
Lt/Lte/Gt/Gte - constant once `c` is on either side of the range
Detecting the range is an `O(1)` `i128` check via the new
`BitPackedData::value_fits_bit_width` helper. With no patches and no nulls the
kernel returns a `ConstantArray<bool>` (also `O(1)`); otherwise it allocates a
`BitBuffer`, fills it with the constant result, and overlays the per-position
outcome at each patch index. In-range constants fall through to the canonical
decompress + Arrow compare path; tests exercise both fall-throughs.
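
The range check and the per-operator constant can be sketched with the standard library alone. This is a minimal illustration, not the kernel from the PR: `Operator` is redefined locally, the signature of `value_fits_bit_width` is an assumption, `constant_compare_result` and `compare_with_patches` are hypothetical names, patches are modelled as `(index, value)` pairs, and validity is ignored.

```rust
/// Local stand-in for the comparison operators named in the commit message.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Operator { Eq, NotEq, Lt, Lte, Gt, Gte }

/// Assumed shape of the range helper: true when `c` lies in
/// `[0, 2^bit_width - 1]`. Widening to i128 keeps the check free of overflow
/// for every 64-bit integer type.
fn value_fits_bit_width(c: i128, bit_width: u32) -> bool {
    c >= 0 && c < (1i128 << bit_width)
}

/// Hypothetical helper: the boolean shared by every unpatched position when
/// `c` is out of range, or `None` when `c` is in range and the caller must
/// fall through to decompress + compare.
fn constant_compare_result(op: Operator, c: i128, bit_width: u32) -> Option<bool> {
    if value_fits_bit_width(c, bit_width) {
        return None;
    }
    let below = c < 0; // out of range, so `c` is either below or above every value
    Some(match op {
        Operator::Eq => false,
        Operator::NotEq => true,
        // `value < c` and `value <= c` hold only when `c` sits above the range.
        Operator::Lt | Operator::Lte => !below,
        // `value > c` and `value >= c` hold only when `c` sits below the range.
        Operator::Gt | Operator::Gte => below,
    })
}

/// Plain scalar comparison, used for patched positions.
fn compare_scalar(op: Operator, v: i128, c: i128) -> bool {
    match op {
        Operator::Eq => v == c,
        Operator::NotEq => v != c,
        Operator::Lt => v < c,
        Operator::Lte => v <= c,
        Operator::Gt => v > c,
        Operator::Gte => v >= c,
    }
}

/// Hypothetical patched path: fill with the constant result, then overlay the
/// real outcome at each patch index (patch values may exceed the bit width).
fn compare_with_patches(
    op: Operator,
    c: i128,
    bit_width: u32,
    len: usize,
    patches: &[(usize, i128)],
) -> Option<Vec<bool>> {
    let fill = constant_compare_result(op, c, bit_width)?;
    let mut out = vec![fill; len];
    for &(idx, value) in patches {
        out[idx] = compare_scalar(op, value, c);
    }
    Some(out)
}
```

With `bit_width = 4` and `c = 16` (the `1 << BW` constant the bench uses), `Eq` is false at every unpatched position and `Lt` is true, while `c = 15` returns `None` and takes the fall-through path.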
**`bitpack_constant` analytical encoder (`array/bitpack_compress.rs`)**
Add a constant-only pack kernel that builds the FastLanes bit pattern for a
`[constant; len]` input without calling `BitPacking::pack`. For constant input
every lane produces the same `bit_width` output words; we compute those words
analytically - each output word's `j`-th bit is bit `(k * T_bits + j) mod
bit_width` of `c` - then `memset` each word `LANES` times into a stack chunk
template and `memcpy` the template into every full chunk. The standard packer
is only invoked for the partial tail (zero-padded past `len`).
`bitpack_encode_constant` wraps the buffer up as a `BitPackedArray`. A
bitwise-equivalence rstest covers byte-identity with `BitPacking::pack` across
lengths, widths, and constants.
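
The bit formula above can be checked in isolation. The sketch below assumes `u32` words (`T_bits = 32`, so `LANES = 1024 / 32`) and that each lane's values are concatenated LSB-first into its output words; because every value is the same constant, the row order within a lane does not affect the result. The function names are hypothetical and the code does not call `BitPacking::pack` or any vortex API.

```rust
/// Word width of the packed type in bits (u32 in this sketch).
const T_BITS: usize = 32;
/// Lanes per 1024-value chunk for u32 words.
const LANES: usize = 1024 / T_BITS;

/// Analytically compute the `bit_width` output words that one lane produces
/// when every input value equals `c`: bit `j` of word `k` is bit
/// `(k * T_BITS + j) % bit_width` of `c`.
fn constant_lane_words(c: u32, bit_width: usize) -> Vec<u32> {
    assert!(bit_width >= 1 && bit_width <= T_BITS);
    assert!(bit_width == T_BITS || c < (1u32 << bit_width));
    (0..bit_width)
        .map(|k| {
            let mut word = 0u32;
            for j in 0..T_BITS {
                let src_bit = (k * T_BITS + j) % bit_width;
                word |= ((c >> src_bit) & 1) << j;
            }
            word
        })
        .collect()
}

/// Reference packer for one lane: append `T_BITS` copies of `c`, LSB-first,
/// to a bit stream and cut the stream into `u32` words.
fn reference_lane_words(c: u32, bit_width: usize) -> Vec<u32> {
    let mut words = vec![0u32; bit_width];
    for row in 0..T_BITS {
        for b in 0..bit_width {
            let pos = row * bit_width + b;
            words[pos / T_BITS] |= ((c >> b) & 1) << (pos % T_BITS);
        }
    }
    words
}

/// Build one full chunk template: each lane word repeated `LANES` times,
/// which corresponds to the "memset each word LANES times" step.
fn constant_chunk(c: u32, bit_width: usize) -> Vec<u32> {
    constant_lane_words(c, bit_width)
        .into_iter()
        .flat_map(|w| std::iter::repeat(w).take(LANES))
        .collect()
}
```

A full encoder would copy the template returned by `constant_chunk` into every complete 1024-value chunk and hand only the partial tail to the ordinary packer, as the commit message describes.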
**Benches**
* `bitpack_compare` (added in the parent commit) on this branch now exercises
the fast path; at `bit_width ∈ {4, 16}`, `len ∈ {1024, 65536}` it runs in
~1.4-1.5 µs vs 8-125 µs for the decompress + Arrow baseline.
* New `bitpack_constant` bench compares the analytical kernel against the
full `bitpack_encode` pipeline on uniform-constant input; at 64 K u32
elements the analytical kernel is roughly 23-62x faster.
**Plan doc (`docs/inrange_compare_plan.md`)**
Document the follow-up plan to accelerate *in-range* ordering comparisons:
compare the packed array against the packed constant via SWAR less-than per
supported bit width (Routes A/B/C, including Knuth broadword with rotation
tables for widths that straddle word boundaries), derive the four ordering
operators from one `Lt` primitive, and benchmark against the canonical SIMD
baseline before landing.
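
For orientation, one common broadword formulation of a per-field unsigned less-than is sketched below. It is not taken from the plan doc and is not claimed to match Routes A/B/C: it assumes fields of width `w` packed contiguously in a `u64` with `w` dividing 64, so it does not cover widths that straddle word boundaries or the FastLanes lane layout. All function names are hypothetical. The result carries one flag per field, stored in that field's top bit.

```rust
/// Mask with the top bit of every `w`-bit field set (`w` must divide 64).
fn high_bits(w: u32) -> u64 {
    assert!(w >= 1 && w <= 64 && 64 % w == 0);
    let mut h = 0u64;
    let mut shift = w - 1;
    while shift < 64 {
        h |= 1u64 << shift;
        shift += w;
    }
    h
}

/// Repeat the `w`-bit constant `c` in every field of a word.
fn broadcast(c: u64, w: u32) -> u64 {
    assert!(w == 64 || c < (1u64 << w));
    let mut out = 0u64;
    let mut shift = 0;
    while shift < 64 {
        out |= c << shift;
        shift += w;
    }
    out
}

/// Per-field unsigned `x < y`. Setting the top bit of each field of `x` and
/// clearing it in `y` keeps every field-wise difference positive, so no
/// borrow crosses a field boundary; the top bit of `t` is then clear exactly
/// when the low `w - 1` bits of `x` are less than those of `y`.
fn swar_lt(x: u64, y: u64, w: u32) -> u64 {
    let h = high_bits(w);
    let t = (x | h) - (y & !h);
    // Less-than if the top bits already decide it, or they tie and the low bits decide.
    ((!x & y) | (!(x ^ y) & !t)) & h
}

/// The other three ordering operators derived from the single `Lt` primitive.
fn swar_gt(x: u64, y: u64, w: u32) -> u64 { swar_lt(y, x, w) }
fn swar_gte(x: u64, y: u64, w: u32) -> u64 { !swar_lt(x, y, w) & high_bits(w) }
fn swar_lte(x: u64, y: u64, w: u32) -> u64 { !swar_lt(y, x, w) & high_bits(w) }
```

Comparing a packed word against a constant then amounts to `swar_lt(word, broadcast(c, w), w)`, which is the shape of "compare the packed array against the packed constant" described above.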
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Merging this PR will improve performance by ×2.6
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 197.9 µs | 162 µs | +22.19% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(100, 100)] | 358.4 µs | 323.5 µs | +10.78% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(1000, 10)] | 211.2 µs | 175.8 µs | +20.11% |
| ⚡ | Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] | 224.8 µs | 188.6 µs | +19.23% |
| 🆕 | Simulation | full_encode[4, 1024] | N/A | 19.2 µs | N/A |
| 🆕 | Simulation | fast_encode[4, 65536] | N/A | 30.5 µs | N/A |
| 🆕 | Simulation | full_encode[16, 1024] | N/A | 17.3 µs | N/A |
| 🆕 | Simulation | full_encode[16, 65536] | N/A | 358 µs | N/A |
| 🆕 | Simulation | full_encode[4, 65536] | N/A | 313.6 µs | N/A |
| ⚡ | Simulation | fast_eq_out_of_range[4, 1024] | 67 µs | 26.9 µs | ×2.5 |
| ❌ | Simulation | baseline_lt[4, 1024] | 64.1 µs | 79 µs | -18.86% |
| ⚡ | Simulation | fast_eq_out_of_range[16, 1024] | 67.7 µs | 26.8 µs | ×2.5 |
| ⚡ | Simulation | fast_eq_out_of_range[4, 65536] | 246 µs | 35.2 µs | ×7 |
| ⚡ | Simulation | fast_lt_out_of_range[4, 1024] | 87.5 µs | 32.8 µs | ×2.7 |
| 🆕 | Simulation | fast_encode[16, 65536] | N/A | 81.5 µs | N/A |
| ⚡ | Simulation | fast_lt_out_of_range[16, 1024] | 67.8 µs | 25.7 µs | ×2.6 |
| ⚡ | Simulation | fast_eq_out_of_range[16, 65536] | 291.1 µs | 35.6 µs | ×8.2 |
| ⚡ | Simulation | fast_lt_out_of_range[4, 65536] | 262 µs | 35.1 µs | ×7.5 |
| 🆕 | Simulation | fast_encode[4, 1024] | N/A | 11.6 µs | N/A |
| ⚡ | Simulation | fast_lt_out_of_range[16, 65536] | 306.3 µs | 35.2 µs | ×8.7 |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Comparing claude/bitpack-compare-speedup-KGPS3 (3b1b8cf) with develop (7b47788)
…pare-speedup-KGPS3
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
# Conflicts:
# encodings/fastlanes/Cargo.toml
Summary
Stacked on #8012. Speeds up the `bitpack_compare` bench from the parent PR with two complementary optimizations driven by the same observation: a bit-packed lane holds values in `[0, 2^bit_width - 1]`, so a constant outside that range can be answered analytically without touching the packed buffer.

**Compare-constant fast path (`compute/compare.rs`)**
Register a `CompareKernel` for `BitPacked` that short-circuits when the RHS constant `c` is outside `[0, 2^bit_width - 1]`. For each operator the answer is a constant boolean modulo patches and validity:
* `Eq`/`NotEq`: `false`/`true` everywhere
* `Lt`/`Lte`/`Gt`/`Gte`: constant once `c` is on either side of the range

Detecting the range is an `O(1)` `i128` check via the new `BitPackedData::value_fits_bit_width` helper. With no patches and no nulls the kernel returns a `ConstantArray<bool>` (also `O(1)`); otherwise it allocates a `BitBuffer`, fills it with the constant result, and overlays the per-position outcome at each patch index. In-range constants fall through to the canonical decompress + Arrow compare path; tests exercise both fall-throughs.

**`bitpack_constant` analytical encoder (`array/bitpack_compress.rs`)**
Add a constant-only pack kernel that builds the FastLanes bit pattern for a `[constant; len]` input without calling `BitPacking::pack`. For constant input every lane produces the same `bit_width` output words; we compute those words analytically, with each output word's `j`-th bit being bit `(k * T_bits + j) mod bit_width` of `c`, then `memset` each word `LANES` times into a stack chunk template and `memcpy` the template into every full chunk. The standard packer is only invoked for the partial tail (zero-padded past `len`). `bitpack_encode_constant` wraps the buffer up as a `BitPackedArray`. A bitwise-equivalence rstest covers byte-identity with `BitPacking::pack` across lengths, widths, and constants.

**Benches**
* `bitpack_compare` (added in #8012) on this branch now exercises the fast path; at `bit_width ∈ {4, 16}`, `len ∈ {1024, 65536}` it runs in ~1.4–1.5 µs vs 8–125 µs for the decompress + Arrow baseline.
* `bitpack_constant` bench compares the analytical kernel against the full `bitpack_encode` pipeline on uniform-constant input; at 64 K u32 elements the analytical kernel is roughly 23–62× faster.

**Plan doc (`docs/inrange_compare_plan.md`)**
Documents the follow-up plan to accelerate in-range ordering comparisons: compare the packed array against the packed constant via SWAR less-than per supported bit width (Routes A/B/C, including Knuth broadword with rotation tables for widths that straddle word boundaries), derive the four ordering operators from one `Lt` primitive, and benchmark against the canonical SIMD baseline before landing.

Test plan
* `cargo nextest run -p vortex-fastlanes --all-features` → 265/265 pass locally
* `cargo check -p vortex-fastlanes --benches --all-features`
* `cargo bench -p vortex-fastlanes --bench bitpack_compare` shows the fast-path speedup vs the baseline from #8012
* `cargo bench -p vortex-fastlanes --bench bitpack_constant` shows the analytical encoder speedup
* `./scripts/public-api.sh` agrees with the committed lock file
* `cargo clippy --all-targets --all-features`

Supersedes #8011 (split into bench + speedup).
🤖 Generated with Claude Code