
perf(fastlanes): fuse bit-packed compare into a transposed mask + untranspose #8239

Open
joseph-isaacs wants to merge 1 commit into claude/confident-hamilton-mZIEo-benches from claude/confident-hamilton-mZIEo

Conversation

@joseph-isaacs
Contributor

Summary

Stacked on #8238 (the benchmark) so the change lands as a CodSpeed diff.

Replaces the unpack-then-compare streaming kernel for compare-against-constant with the FastLanes fused unpack_cmp:

  • compare each value as it is unpacked, accumulating results straight into a transposed 1024-bit mask ([u64; 16], one register-resident word per lane — no [bool; 1024]/[T; 1024] scratch),
  • a single SIMD untranspose_bits per block rotates the mask into logical row order, copied directly into the output bit buffer,
  • inline patches are spliced in afterwards; sliced (offset != 0) arrays fall back to the scalar streaming predicate.
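
A minimal scalar sketch of the steps above, under loudly stated assumptions: the names (`pack`, `unpack_one`, `fused_cmp_transposed`, `untranspose`, `splice_patches`) are hypothetical and are not the FastLanes or Vortex API, the packed layout is a plain contiguous LSB-first one rather than the real FastLanes interleaved layout, the lane mapping (`lane = i % 16`, `row = i / 16`) is a simplification, and plain loops stand in for the SIMD kernels. It only illustrates the data flow: compare while unpacking into a `[u64; 16]` transposed mask, untranspose once per block, then overwrite the patched positions.

```rust
/// Values per block.
const BLOCK: usize = 1024;
/// Lanes in this simplified model: one 64-bit mask word per lane.
const LANES: usize = 16;
/// 64-bit words in the logical-order output mask for one block.
const WORDS: usize = BLOCK / 64;

/// Pack values contiguously, LSB-first, `width` bits each (1..=32).
/// Assumed layout for illustration only; each value must fit in `width` bits.
fn pack(values: &[u32; BLOCK], width: usize) -> Vec<u64> {
    let mut out = vec![0u64; BLOCK * width / 64];
    for (i, &v) in values.iter().enumerate() {
        let bit = i * width;
        let (word, shift) = (bit / 64, bit % 64);
        out[word] |= (v as u64) << shift;
        if shift + width > 64 {
            // The value straddles a word boundary: spill the high bits.
            out[word + 1] |= (v as u64) >> (64 - shift);
        }
    }
    out
}

/// Unpack the single value at logical index `i`.
fn unpack_one(packed: &[u64], width: usize, i: usize) -> u32 {
    let bit = i * width;
    let (word, shift) = (bit / 64, bit % 64);
    let mut v = packed[word] >> shift;
    if shift + width > 64 {
        v |= packed[word + 1] << (64 - shift);
    }
    (v & ((1u64 << width) - 1)) as u32
}

/// Compare each value as it is unpacked and record the result in a
/// transposed mask: word = lane, bit = row. No unpacked scratch array is kept.
fn fused_cmp_transposed(
    packed: &[u64],
    width: usize,
    pred: impl Fn(u32) -> bool,
) -> [u64; LANES] {
    let mut mask = [0u64; LANES];
    for i in 0..BLOCK {
        let (lane, row) = (i % LANES, i / LANES);
        if pred(unpack_one(packed, width, i)) {
            mask[lane] |= 1u64 << row;
        }
    }
    mask
}

/// Rotate the transposed mask into logical row order (bit `i` = element `i`).
fn untranspose(mask: &[u64; LANES]) -> [u64; WORDS] {
    let mut out = [0u64; WORDS];
    for i in 0..BLOCK {
        let bit = (mask[i % LANES] >> (i / LANES)) & 1;
        out[i / 64] |= bit << (i % 64);
    }
    out
}

/// Overwrite the result at each patched index with the predicate evaluated
/// on the patch value. Patches are modeled as (logical index, value) pairs.
fn splice_patches(
    out: &mut [u64; WORDS],
    patches: &[(usize, u32)],
    pred: impl Fn(u32) -> bool,
) {
    for &(i, v) in patches {
        let bit = 1u64 << (i % 64);
        if pred(v) {
            out[i / 64] |= bit;
        } else {
            out[i / 64] &= !bit;
        }
    }
}
```

In this model a caller would run `fused_cmp_transposed`, pass the result to `untranspose`, and then call `splice_patches` on the logical-order mask. The result should match a naive unpack-then-compare loop bit for bit.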

FastLanes dependency

Requires the in-development FastLanes (spiraldb/fastlanes#141 fused [u64;16] mask + spiraldb/fastlanes#145 width-generic BMI2/VBMI untranspose), pinned via a [patch.crates-io] git rev until a release is cut. This pin must be replaced with a published version bump before merge.

Benchmark (bitpack_compare_sweep, 64Ki elements, all types × all bit widths)

Fused beats the streaming baseline for every type and width (CodSpeed will show the diff vs #8238):

| type | speedup |
| --- | --- |
| i8 / u8 | ~6.2–7.7× |
| i16 / u16 | ~4.5–6.0× |
| i32 / u32 | ~1.9–4.3× |
| i64 / u64 | ~1.2–1.9× |

Checks

  • cargo build -p vortex-fastlanes ✅
  • cargo test -p vortex-fastlanes compare tests: 16 passed (type/width sweep, signed-with-patches, nullable) ✅
  • cargo clippy clean ✅
  • cargo +nightly fmt ✅

Verified locally against the FastLanes branch via a path patch; the committed git rev pin is functionally identical.

🤖 Generated with Claude Code



@joseph-isaacs joseph-isaacs added the changelog/performance A performance improvement label Jun 3, 2026 — with Claude
@codspeed-hq

codspeed-hq Bot commented Jun 3, 2026

Merging this PR will improve performance by 46%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 240 improved benchmarks
❌ 26 regressed benchmarks
✅ 1241 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Simulation | pushdown_compare[(1000, 16, 4)] | 141.6 µs | 345.8 µs | -59.04% |
| Simulation | pushdown_compare[(1000, 4, 4)] | 142.5 µs | 345.3 µs | -58.74% |
| Simulation | pushdown_compare[(1000, 64, 4)] | 142.7 µs | 345 µs | -58.64% |
| Simulation | pushdown_compare[(1000, 4, 8)] | 145.9 µs | 349.6 µs | -58.27% |
| Simulation | pushdown_compare[(1000, 64, 8)] | 148 µs | 351.3 µs | -57.89% |
| Simulation | pushdown_compare[(1000, 16, 8)] | 154.4 µs | 357.3 µs | -56.78% |
| Simulation | pushdown_compare[(10000, 64, 4)] | 214.2 µs | 417 µs | -48.64% |
| Simulation | pushdown_compare[(10000, 64, 8)] | 221.6 µs | 424.2 µs | -47.75% |
| Simulation | pushdown_compare[(10000, 4, 4)] | 221.2 µs | 418.1 µs | -47.1% |
| Simulation | pushdown_compare[(10000, 16, 4)] | 221.4 µs | 418.3 µs | -47.08% |
| Simulation | pushdown_compare[(10000, 4, 8)] | 227.1 µs | 423.6 µs | -46.39% |
| Simulation | pushdown_compare[(10000, 16, 8)] | 263.8 µs | 459.6 µs | -42.61% |
| Simulation | eq_pushdown_low_match | 955.2 µs | 1,152.4 µs | -17.12% |
| Simulation | eq_pushdown_high_match | 1.1 ms | 1.2 ms | -15.7% |
| WallTime | cuda/bitpacked_u8/unpack/3bw[100M] | 298.8 µs | 350.9 µs | -14.84% |
| Simulation | decompress_fsst[(10000, 16, 4)] | 509.3 µs | 579.9 µs | -12.17% |
| Simulation | fsst_decompress_string | 3.1 ms | 3.5 ms | -11.95% |
| Simulation | chunked_into_canonical[(10, 10000, 16, 4)] | 5.2 ms | 5.9 ms | -11.93% |
| Simulation | chunked_canonicalize_into[(10, 10000, 16, 4)] | 5.2 ms | 5.9 ms | -11.89% |
| Simulation | decompress_fsst[(10000, 16, 8)] | 561.9 µs | 631.9 µs | -11.07% |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.



Comparing claude/confident-hamilton-mZIEo (48da899) with claude/confident-hamilton-mZIEo-benches (10939a6)


…ranspose

Replace the unpack-then-compare streaming kernel for compare-against-constant
with the FastLanes fused `unpack_cmp`: compare each value as it is unpacked,
accumulating results straight into a transposed 1024-bit mask (`[u64; 16]`,
one register-resident word per lane - no `[bool; 1024]`/`[T; 1024]` scratch),
then a single SIMD `untranspose_bits` per block rotates the mask into logical
row order, copied directly into the output bit buffer. Inline patches are
spliced in afterwards; sliced (offset != 0) arrays fall back to the scalar
streaming predicate.

This requires the in-development FastLanes (PR #141 fused mask + PR #145
width-generic BMI2/VBMI untranspose), pinned via a git patch until released.

Benchmarked end-to-end through the public compare path (`bitpack_compare_sweep`,
64Ki elements, all integer types and bit widths): fused beats the streaming
baseline for every type and width -

  i8/u8   ~6.2-7.7x
  i16/u16 ~4.5-6.0x
  i32/u32 ~1.9-4.3x
  i64/u64 ~1.2-1.9x

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs force-pushed the claude/confident-hamilton-mZIEo branch from e27f5f4 to 48da899 on June 3, 2026 17:00