Fast-path comparison and constant encoding for bit-packed arrays #8011
joseph-isaacs wants to merge 2 commits into
A bit-packed lane holds values in `[0, 2^bit_width - 1]`. When the RHS constant sits outside that range, no packed lane can equal it, so:

  Eq    -> false everywhere
  NotEq -> true everywhere

modulo patches (which carry the real value) and validity.

Detecting the range is an `O(1)` `i128` check on the constant alone, strictly cheaper than encoding `c` into the bit-packed representation.

Register a `CompareKernel` for `BitPacked` that short-circuits this case. With no patches and no nulls it returns a `ConstantArray<bool>` (also `O(1)`); otherwise it allocates a `BitBuffer`, fills it with the constant result, and overlays the per-position outcome at each patch index.

Ordering operators (`Lt`/`Lte`/`Gt`/`Gte`) and in-range constants fall through to the canonical decompress + Arrow compare path; tests exercise both fall-throughs.

Signed-off-by: Claude <noreply@anthropic.com>
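The range check and the Eq/NotEq collapse described in this commit can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: the function bodies, the `(index, value)` patch representation, the `negate` flag, and the `Vec<bool>` result are all assumptions, and validity handling is left out.

```rust
/// Hypothetical stand-in for the range check: true when `c` lies in
/// `[0, 2^bit_width - 1]`. Uses i128 so any 64-bit constant fits.
fn value_fits_bit_width(c: i128, bit_width: u32) -> bool {
    debug_assert!(bit_width <= 64);
    c >= 0 && c < (1i128 << bit_width)
}

/// Eq (`negate = false`) or NotEq (`negate = true`) against a constant.
/// `patches` is an assumed list of (index, real value) exceptions.
/// Returns `None` when `c` is in range, i.e. the caller must fall through.
fn eq_out_of_range(
    len: usize,
    bit_width: u32,
    patches: &[(usize, i128)],
    c: i128,
    negate: bool,
) -> Option<Vec<bool>> {
    if value_fits_bit_width(c, bit_width) {
        return None;
    }
    // No packed lane can equal `c`: Eq is false everywhere, NotEq is true.
    let mut out = vec![negate; len];
    // Patches carry real values, so they are compared individually.
    for &(idx, value) in patches {
        out[idx] = (value == c) != negate;
    }
    Some(out)
}
```

With no patches the filled vector is uniform, which is the case the commit represents as a `ConstantArray<bool>` instead of a buffer.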
…kernel and benches, plan in-range ordering
Ordering operators (Lt/Lte/Gt/Gte) now use the same out-of-range short-circuit
as Eq/NotEq: when `c` lies outside `[0, 2^bit_width - 1]`, every packed lane has
the same `Ordering` relative to `c`, so each of the six operators collapses to
a constant boolean (modulo patches and validity).
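To make the collapse concrete, here is a minimal sketch of how one shared `Ordering` could determine all six operators. The `Operator` enum and both function names are hypothetical, and patches and validity are ignored.

```rust
use std::cmp::Ordering;

/// Hypothetical operator enum mirroring the six comparisons named above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Operator { Eq, NotEq, Lt, Lte, Gt, Gte }

/// How every packed lane orders relative to `c`, or `None` if `c` is in
/// `[0, 2^bit_width - 1]` and no single answer exists.
fn lanes_vs_constant(c: i128, bit_width: u32) -> Option<Ordering> {
    let max = (1i128 << bit_width) - 1;
    if c < 0 {
        Some(Ordering::Greater) // every lane is >= 0 > c
    } else if c > max {
        Some(Ordering::Less) // every lane is <= max < c
    } else {
        None
    }
}

/// Constant boolean result of `lane <op> c` for an out-of-range `c`.
fn constant_result(op: Operator, c: i128, bit_width: u32) -> Option<bool> {
    let ord = lanes_vs_constant(c, bit_width)?;
    Some(match op {
        // `ord` is never `Equal` here, so Eq/NotEq are fixed.
        Operator::Eq => false,
        Operator::NotEq => true,
        Operator::Lt | Operator::Lte => ord == Ordering::Less,
        Operator::Gt | Operator::Gte => ord == Ordering::Greater,
    })
}
```

For example, with `bit_width = 4`, `constant_result(Operator::Lt, 16, 4)` yields `Some(true)`, while an in-range constant such as 15 yields `None` and would take the fall-through path.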
Add a constant-only pack kernel `bitpack_constant` that builds the FastLanes
bit pattern for a `[constant; len]` input without calling `BitPacking::pack`.
For constant input every lane produces the same `bit_width` output words; we
compute those words analytically — each output word's `j`-th bit is bit
`(k * T_bits + j) mod bit_width` of `c` — then `memset` each word `LANES` times
into a stack chunk template and `memcpy` the template into every full chunk.
The standard packer is only invoked for the partial tail (zero-padded past
`len`). `bitpack_encode_constant` wraps the buffer up as a `BitPackedArray`.
A bitwise equivalence rstest covers byte-identity with `BitPacking::pack`
across lengths, widths, and constants.
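The bit formula above can be checked in isolation. The sketch below assumes `u32` lanes (`T_bits = 32`) and an LSB-first bit stream within a lane; the signature of `constant_lane_words` and the reference packer `naive_pack_lane` are illustrative assumptions, not the PR's implementation, and the chunk-level `memset`/`memcpy` replication is not shown.

```rust
/// Analytical output words for one lane of a constant `u32` input:
/// bit `j` of word `k` is bit `(k * T_BITS + j) mod bit_width` of `c`.
fn constant_lane_words(c: u32, bit_width: u32) -> Vec<u32> {
    assert!((1..=32).contains(&bit_width));
    const T_BITS: u32 = 32;
    (0..bit_width)
        .map(|k| {
            (0..T_BITS).fold(0u32, |word, j| {
                let src = (k * T_BITS + j) % bit_width;
                word | (((c >> src) & 1) << j)
            })
        })
        .collect()
}

/// Reference packer (assumed layout): write 32 copies of the low
/// `bit_width` bits of `c` back to back, LSB first, into `bit_width` words.
fn naive_pack_lane(c: u32, bit_width: u32) -> Vec<u32> {
    let mut words = vec![0u32; bit_width as usize];
    for i in 0..32u32 {
        for b in 0..bit_width {
            let pos = i * bit_width + b;
            let bit = (c >> b) & 1;
            words[(pos / 32) as usize] |= bit << (pos % 32);
        }
    }
    words
}
```

As a small worked case, `c = 0b01` with `bit_width = 2` gives two identical words `0x5555_5555`, since the two-bit pattern repeats with a period that divides 32.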
Bench `bitpack_constant` (analytical vs full `bitpack_encode`) on a small,
fast grid: at 64 K u32 elements the analytical kernel is roughly 23-62x faster
than the full encoder, since it skips the histogram, min-scan, patches gather,
and per-chunk SIMD pack call.
Bench `bitpack_compare` (out-of-range fast path vs explicit
"decompress + Arrow compare" baseline): 1.4-1.5 µs constant-array setup vs
8-125 µs for the baseline across `bit_width ∈ {4, 16}`, `len ∈ {1024, 65536}`
and Eq/Lt.
Add a `value_fits_bit_width` helper on `BitPackedData` exposing the same O(1)
range check used internally.
Plan how to accelerate **in-range** ordering comparisons in
`encodings/fastlanes/docs/inrange_compare_plan.md`: compare the packed array
against the packed constant via SWAR less-than per supported bit width, derive
the four ordering operators from one `Lt` primitive, and benchmark against
the canonical SIMD baseline before landing.
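For readers unfamiliar with SWAR comparison, the following sketch shows one well-known per-field unsigned less-than on a `u64`. It is not taken from the plan document: it assumes fields of `w` bits aligned so that `w` divides 64, which is simpler than the FastLanes layout, and all function names are hypothetical.

```rust
/// Mask with the top bit of every `w`-bit field set (`w` must divide 64).
fn high_bits(w: u32) -> u64 {
    assert!(w >= 1 && w <= 64 && 64 % w == 0);
    let mut h = 0u64;
    let mut pos = w - 1;
    while pos < 64 {
        h |= 1u64 << pos;
        pos += w;
    }
    h
}

/// Per-field unsigned `x < y`. The top bit of each field in the result is
/// set where the comparison holds; all other bits are zero.
fn swar_lt(x: u64, y: u64, w: u32) -> u64 {
    let h = high_bits(w);
    // Per field: (2^(w-1) + x_low) - y_low, which never borrows across
    // fields. Its top bit stays set exactly when x_low >= y_low.
    let t = (x | h) - (y & !h);
    // x < y when the top bits are (0, 1), or when they agree and x_low < y_low.
    ((!x & y) | (!(x ^ y) & !t)) & h
}

/// Scalar reference used to check the SWAR version field by field.
fn scalar_lt(x: u64, y: u64, w: u32) -> u64 {
    let mask = if w == 64 { u64::MAX } else { (1u64 << w) - 1 };
    let mut out = 0u64;
    let mut shift = 0u32;
    while shift < 64 {
        if ((x >> shift) & mask) < ((y >> shift) & mask) {
            out |= 1u64 << (shift + w - 1);
        }
        shift += w;
    }
    out
}
```

Given such an `Lt` primitive, the other ordering operators follow by swapping or negating: `Gt` is `swar_lt(y, x, w)`, `Gte` is the complement of `swar_lt(x, y, w)` within the high-bit mask, and `Lte` is the complement of `swar_lt(y, x, w)`.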
Signed-off-by: Claude <noreply@anthropic.com>
Merging this PR will not alter performance
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] | 188 µs | 224.7 µs | -16.33% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(100, 100)] | 358.4 µs | 323.3 µs | +10.86% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(1000, 10)] | 211.9 µs | 176.3 µs | +20.18% |
| 🆕 | Simulation | fast_encode[4, 1024] | N/A | 11.5 µs | N/A |
| 🆕 | Simulation | full_encode[16, 65536] | N/A | 358 µs | N/A |
| 🆕 | Simulation | full_encode[16, 1024] | N/A | 17.3 µs | N/A |
| 🆕 | Simulation | fast_encode[4, 65536] | N/A | 30.6 µs | N/A |
| 🆕 | Simulation | full_encode[4, 1024] | N/A | 19.2 µs | N/A |
| 🆕 | Simulation | baseline_eq[4, 1024] | N/A | 64 µs | N/A |
| 🆕 | Simulation | baseline_lt[4, 1024] | N/A | 79 µs | N/A |
| 🆕 | Simulation | fast_lt_out_of_range[4, 65536] | N/A | 35 µs | N/A |
| 🆕 | Simulation | fast_encode[16, 65536] | N/A | 81.5 µs | N/A |
| 🆕 | Simulation | full_encode[4, 65536] | N/A | 313.6 µs | N/A |
| 🆕 | Simulation | baseline_eq[16, 65536] | N/A | 288.1 µs | N/A |
| 🆕 | Simulation | baseline_lt[16, 1024] | N/A | 65.1 µs | N/A |
| 🆕 | Simulation | baseline_eq[16, 1024] | N/A | 64.6 µs | N/A |
| 🆕 | Simulation | baseline_eq[4, 65536] | N/A | 243.2 µs | N/A |
| 🆕 | Simulation | fast_lt_out_of_range[16, 65536] | N/A | 35.1 µs | N/A |
| 🆕 | Simulation | fast_eq_out_of_range[16, 65536] | N/A | 35.6 µs | N/A |
| 🆕 | Simulation | baseline_lt[16, 65536] | N/A | 275.8 µs | N/A |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Comparing claude/optimize-bitpack-comparison-KGPS3 (a20f09d) with develop (52e26d1)
Benchmarks: PolarSignals Profiling
Vortex (geomean): 1.166x ❌
datafusion / vortex-file-compressed (1.166x ❌, 0↑ 8↓)
File Sizes: PolarSignals Profiling
No file size changes detected.
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.143x ❌, 0↑ 4↓)
datafusion / vortex-compact (1.215x ❌, 0↑ 7↓)
datafusion / parquet (1.090x ➖, 0↑ 3↓)
duckdb / vortex-file-compressed (1.051x ➖, 1↑ 3↓)
duckdb / vortex-compact (1.153x ❌, 0↑ 5↓)
duckdb / parquet (1.091x ➖, 0↑ 4↓)
File Sizes: FineWeb NVMe
No file size changes detected.
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.033x ➖, 0↑ 2↓)
datafusion / vortex-compact (1.039x ➖, 0↑ 1↓)
datafusion / parquet (0.990x ➖, 0↑ 0↓)
datafusion / arrow (0.999x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.045x ➖, 1↑ 4↓)
duckdb / vortex-compact (1.105x ❌, 2↑ 12↓)
duckdb / parquet (0.979x ➖, 2↑ 0↓)
duckdb / duckdb (1.016x ➖, 0↑ 0↓)
File Sizes: TPC-H SF=1 on NVME
No file size changes detected.
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.105x ❌, 0↑ 51↓)
datafusion / vortex-compact (0.972x ➖, 9↑ 5↓)
datafusion / parquet (1.138x ❌, 0↑ 53↓)
duckdb / vortex-file-compressed (1.057x ➖, 6↑ 24↓)
duckdb / vortex-compact (1.108x ❌, 6↑ 52↓)
duckdb / parquet (0.983x ➖, 1↑ 2↓)
duckdb / duckdb (0.975x ➖, 5↑ 0↓)
File Sizes: TPC-DS SF=1 on NVME
No file size changes detected.
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.832x ➖, 2↑ 0↓)
datafusion / vortex-compact (0.931x ➖, 1↑ 0↓)
datafusion / parquet (1.059x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (0.993x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.966x ➖, 0↑ 0↓)
duckdb / parquet (0.993x ➖, 0↑ 0↓)
Benchmarks: Random Access
Vortex (geomean): 1.213x ❌
unknown / unknown (1.287x ❌, 0↑ 34↓)
Benchmarks: Statistical and Population Genetics
Verdict: Likely regression (medium confidence)
duckdb / vortex-file-compressed (1.323x ❌, 1↑ 7↓)
duckdb / vortex-compact (1.466x ❌, 1↑ 7↓)
duckdb / parquet (0.987x ➖, 0↑ 0↓)
File Sizes: Statistical and Population Genetics
No file size changes detected.
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.060x ➖, 0↑ 4↓)
datafusion / vortex-compact (1.111x ❌, 0↑ 8↓)
datafusion / parquet (1.019x ➖, 0↑ 1↓)
datafusion / arrow (1.035x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.094x ➖, 2↑ 10↓)
duckdb / vortex-compact (1.108x ❌, 2↑ 10↓)
duckdb / parquet (1.017x ➖, 0↑ 0↓)
duckdb / duckdb (1.003x ➖, 0↑ 0↓)
File Sizes: TPC-H SF=10 on NVME
No file size changes detected.
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.965x ➖, 1↑ 1↓)
datafusion / parquet (0.963x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.989x ➖, 0↑ 1↓)
duckdb / parquet (0.982x ➖, 2↑ 0↓)
duckdb / duckdb (0.981x ➖, 3↑ 0↓)
File Sizes: Clickbench on NVME
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Superseded by a stacked split:
Closing this one to consolidate review on the split. The branch |
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.969x ➖, 1↑ 0↓)
datafusion / vortex-compact (0.982x ➖, 0↑ 1↓)
datafusion / parquet (0.945x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.056x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.087x ➖, 0↑ 2↓)
duckdb / parquet (1.006x ➖, 0↑ 0↓)
Benchmarks: Compression
Vortex (geomean): 1.011x ➖
unknown / unknown (0.994x ➖, 8↑ 5↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.915x ➖, 5↑ 2↓)
datafusion / vortex-compact (1.050x ➖, 0↑ 3↓)
datafusion / parquet (0.939x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.004x ➖, 1↑ 1↓)
duckdb / vortex-compact (1.078x ➖, 1↑ 3↓)
duckdb / parquet (1.048x ➖, 0↑ 0↓)
Summary

This PR adds two complementary optimizations for bit-packed arrays:

- Out-of-range constant comparison (`compare.rs`): When comparing a `BitPackedArray` against a constant that falls outside the packable range `[0, 2^bit_width - 1]`, every packed lane has the same ordering relation to the constant. This collapses the comparison to a constant boolean result (modulo patches and validity), reducing the result to a `ConstantArray<bool>` in the hot path or a `BitBuffer` with per-position overlays at patched indices. The range check is `O(1)` on the constant alone and strictly cheaper than encoding it into the bit-packed representation.
- Constant-only bit-packing (`bitpack_constant` and `bitpack_encode_constant`): When encoding a uniform-constant input, the standard FastLanes packing kernel is unnecessary. Instead, we analytically compute the bit pattern that each lane produces (a periodic stream of the constant's low bits), replicate it across chunks via `memset`/`memcpy`, and fall back to a single standard pack call only for the trailing partial chunk if needed. This avoids calling `BitPacking::pack` for the common case of constant-filled arrays.

Both optimizations are layout-aware and integrate cleanly with the existing encode/compare pipelines. In-range constants in comparisons fall through to the canonical decompress-then-compare path; the plan for accelerating those cases is documented in `docs/inrange_compare_plan.md`.

Changes

- `encodings/fastlanes/src/bitpacking/compute/compare.rs` (new): Implements `CompareKernel` for `BitPacked` with fast-path handling for out-of-range constants. Includes comprehensive unit tests covering patches, nullability, and edge cases.
- `encodings/fastlanes/src/bitpacking/array/bitpack_compress.rs`: Adds `bitpack_constant()` (low-level buffer synthesis), `constant_lane_words()` (analytical bit-pattern computation), and `bitpack_encode_constant()` (public API). Includes roundtrip tests verifying correctness against the standard encode path.
- `encodings/fastlanes/src/bitpacking/array/mod.rs`: Adds `value_fits_bit_width()` helper for `O(1)` range checking.
- `encodings/fastlanes/docs/inrange_compare_plan.md` (new): Detailed plan for accelerating in-range constant comparisons via SWAR and bit-sliced techniques, with benchmarking guidance.
- `encodings/fastlanes/benches/bitpack_constant.rs` (new): Benchmarks constant encoding against the standard pipeline.
- `encodings/fastlanes/benches/bitpack_compare.rs` (new): Benchmarks out-of-range comparisons against the decompress-then-compare baseline.
- `encodings/fastlanes/src/bitpacking/vtable/kernels.rs`: Registers the new `CompareKernel`.
- `encodings/fastlanes/src/bitpacking/compute/mod.rs`: Declares the new `compare` module.
- `encodings/fastlanes/Cargo.toml`: Adds benchmark targets.
- `encodings/fastlanes/public-api.lock`: Updated for new public APIs.

Testing

- `compare.rs`: Cover out-of-range comparisons (above and below range), patches, nullability, and fallthrough for in-range constants across all six comparison operators.
- `bitpack_compress.rs`: Verify that `bitpack_constant()` produces identical packed buffers to the standard encode path and that roundtrip unpacking recovers the original constant values.

All existing tests pass; the new kernels return `Ok(None)` for cases they don't accelerate, so the canonical paths remain the correctness fallback.

https://claude.ai/code/session_0156Z1mXHNghcuT1yX3pscE8