perf(fastlanes): fuse bit-packed compare into a transposed mask + untranspose by joseph-isaacs · Pull Request #8239 · vortex-data/vortex

joseph-isaacs · 2026-06-03T16:48:02Z

Summary

Stacked on #8238 (the benchmark) so the change lands as a CodSpeed diff.

Replaces the unpack-then-compare streaming kernel for compare-against-constant with the FastLanes fused unpack_cmp:

- compare each value as it is unpacked, accumulating results straight into a transposed 1024-bit mask (`[u64; 16]`, one register-resident word per lane, with no `[bool; 1024]`/`[T; 1024]` scratch),
- a single SIMD `untranspose_bits` per block rotates the mask into logical row order, and the result is copied directly into the output bit buffer,
- inline patches are spliced in afterwards; sliced (offset != 0) arrays fall back to the scalar streaming predicate.
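The three steps above can be sketched in scalar Rust as follows. This is a minimal illustration under stated assumptions, not the FastLanes kernel: the bit layout (row `i` stored in lane `i % 16`, bit `i / 16`), the LSB-first packing, and the helpers `pack`, `unpack_one`, and `splice_patches` are all hypothetical. The names `unpack_cmp_transposed` and `untranspose_bits` only echo the names used in this PR and do not reproduce their signatures, and the real untranspose is SIMD rather than a bit-by-bit loop.

```rust
const BLOCK: usize = 1024; // values per block
const LANES: usize = 16; // mask words per block: 16 * 64 bits = 1024 bits

/// Hypothetical helper: pack `values` (each < 2^width) LSB-first into u64 words.
fn pack(values: &[u64], width: usize) -> Vec<u64> {
    let mut packed = vec![0u64; (values.len() * width + 63) / 64];
    for (i, &v) in values.iter().enumerate() {
        let bit = i * width;
        let (word, shift) = (bit / 64, bit % 64);
        packed[word] |= v << shift;
        if shift + width > 64 {
            // The value straddles a word boundary: spill the high bits.
            packed[word + 1] |= v >> (64 - shift);
        }
    }
    packed
}

/// Hypothetical helper: read the `i`-th `width`-bit value (1 <= width <= 64).
fn unpack_one(packed: &[u64], width: usize, i: usize) -> u64 {
    let bit = i * width;
    let (word, shift) = (bit / 64, bit % 64);
    let mut v = packed[word] >> shift;
    if shift + width > 64 {
        v |= packed[word + 1] << (64 - shift);
    }
    v & (u64::MAX >> (64 - width))
}

/// Fused step: compare each value as it is unpacked and set one bit in a
/// transposed mask, without materialising a `[T; 1024]` or `[bool; 1024]`.
/// Assumed layout: row `i` lives in lane `i % 16`, bit `i / 16`.
fn unpack_cmp_transposed(
    packed: &[u64],
    width: usize,
    pred: impl Fn(u64) -> bool,
) -> [u64; LANES] {
    let mut mask = [0u64; LANES];
    for i in 0..BLOCK {
        let hit = pred(unpack_one(packed, width, i)) as u64;
        mask[i % LANES] |= hit << (i / LANES);
    }
    mask
}

/// Rotate the transposed mask into logical row order (bit `i` = row `i`).
/// Scalar stand-in for the SIMD untranspose described in the PR.
fn untranspose_bits(mask: &[u64; LANES]) -> [u64; LANES] {
    let mut out = [0u64; LANES];
    for i in 0..BLOCK {
        let bit = (mask[i % LANES] >> (i / LANES)) & 1;
        out[i / 64] |= bit << (i % 64);
    }
    out
}

/// Overwrite the result bits of patched rows with the predicate evaluated on
/// the patch values; `patches` holds hypothetical `(row index, value)` pairs.
fn splice_patches(
    out: &mut [u64; LANES],
    patches: &[(usize, u64)],
    pred: impl Fn(u64) -> bool,
) {
    for &(idx, value) in patches {
        let (word, bit) = (idx / 64, idx % 64);
        out[word] = (out[word] & !(1u64 << bit)) | ((pred(value) as u64) << bit);
    }
}
```

In this sketch a caller would pack one block, call `unpack_cmp_transposed` with the comparison as a closure, pass the result through `untranspose_bits`, and then apply `splice_patches`. The point of the transposed accumulator is that each lane's result word can stay in a register for the whole block, so the only per-block reshuffle is the single untranspose at the end.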

FastLanes dependency

Requires the in-development FastLanes (spiraldb/fastlanes#141 fused [u64;16] mask + spiraldb/fastlanes#145 width-generic BMI2/VBMI untranspose), pinned via a [patch.crates-io] git rev until a release is cut. This pin must be replaced with a published version bump before merge.

Benchmark (`bitpack_compare_sweep`, 64Ki elements, all types × all bit widths)

Fused beats the streaming baseline for every type and width (CodSpeed will show the diff vs #8238):

type	speedup
i8 / u8	~6.2–7.7×
i16 / u16	~4.5–6.0×
i32 / u32	~1.9–4.3×
i64 / u64	~1.2–1.9×

Checks

cargo build -p vortex-fastlanes ✅ · cargo test -p vortex-fastlanes compare tests: 16 passed (type/width sweep, signed-with-patches, nullable) ✅ · cargo clippy clean ✅ · cargo +nightly fmt ✅ (verified locally against the FastLanes branch via a path patch; the committed git rev pin is functionally identical).

🤖 Generated with Claude Code

codspeed-hq · 2026-06-03T16:57:14Z

Merging this PR will improve performance by 46%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 240 improved benchmarks
❌ 26 regressed benchmarks
✅ 1241 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	Simulation	`pushdown_compare[(1000, 16, 4)]`	141.6 µs	345.8 µs	-59.04%
❌	Simulation	`pushdown_compare[(1000, 4, 4)]`	142.5 µs	345.3 µs	-58.74%
❌	Simulation	`pushdown_compare[(1000, 64, 4)]`	142.7 µs	345 µs	-58.64%
❌	Simulation	`pushdown_compare[(1000, 4, 8)]`	145.9 µs	349.6 µs	-58.27%
❌	Simulation	`pushdown_compare[(1000, 64, 8)]`	148 µs	351.3 µs	-57.89%
❌	Simulation	`pushdown_compare[(1000, 16, 8)]`	154.4 µs	357.3 µs	-56.78%
❌	Simulation	`pushdown_compare[(10000, 64, 4)]`	214.2 µs	417 µs	-48.64%
❌	Simulation	`pushdown_compare[(10000, 64, 8)]`	221.6 µs	424.2 µs	-47.75%
❌	Simulation	`pushdown_compare[(10000, 4, 4)]`	221.2 µs	418.1 µs	-47.1%
❌	Simulation	`pushdown_compare[(10000, 16, 4)]`	221.4 µs	418.3 µs	-47.08%
❌	Simulation	`pushdown_compare[(10000, 4, 8)]`	227.1 µs	423.6 µs	-46.39%
❌	Simulation	`pushdown_compare[(10000, 16, 8)]`	263.8 µs	459.6 µs	-42.61%
❌	Simulation	`eq_pushdown_low_match`	955.2 µs	1,152.4 µs	-17.12%
❌	Simulation	`eq_pushdown_high_match`	1.1 ms	1.2 ms	-15.7%
❌	WallTime	`cuda/bitpacked_u8/unpack/3bw[100M]`	298.8 µs	350.9 µs	-14.84%
❌	Simulation	`decompress_fsst[(10000, 16, 4)]`	509.3 µs	579.9 µs	-12.17%
❌	Simulation	`fsst_decompress_string`	3.1 ms	3.5 ms	-11.95%
❌	Simulation	`chunked_into_canonical[(10, 10000, 16, 4)]`	5.2 ms	5.9 ms	-11.93%
❌	Simulation	`chunked_canonicalize_into[(10, 10000, 16, 4)]`	5.2 ms	5.9 ms	-11.89%
❌	Simulation	`decompress_fsst[(10000, 16, 8)]`	561.9 µs	631.9 µs	-11.07%
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
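For reading the table: the Efficiency values appear consistent with `BASE / HEAD - 1` rather than with the relative increase in time. This is an inference from the displayed numbers, not a statement of CodSpeed's documented formula. The hypothetical helpers below show the arithmetic for the first row (141.6 µs to 345.8 µs), which works out to roughly -59% under that assumption and corresponds to HEAD taking about 2.4 times as long as BASE.

```rust
/// Assumed formula for the Efficiency column: BASE / HEAD - 1, as a percentage.
/// Inferred from the table values, not taken from CodSpeed documentation.
fn efficiency_pct(base_us: f64, head_us: f64) -> f64 {
    (base_us / head_us - 1.0) * 100.0
}

/// How many times longer HEAD takes than BASE.
fn slowdown_factor(base_us: f64, head_us: f64) -> f64 {
    head_us / base_us
}
```

Because the displayed times are rounded, values recomputed this way can differ from the reported percentage in the last digit.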

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing claude/confident-hamilton-mZIEo (48da899) with claude/confident-hamilton-mZIEo-benches (10939a6)

…ranspose

Replace the unpack-then-compare streaming kernel for compare-against-constant with the FastLanes fused `unpack_cmp`: compare each value as it is unpacked, accumulating results straight into a transposed 1024-bit mask (`[u64; 16]`, one register-resident word per lane - no `[bool; 1024]`/`[T; 1024]` scratch), then a single SIMD `untranspose_bits` per block rotates the mask into logical row order, copied directly into the output bit buffer. Inline patches are spliced in afterwards; sliced (offset != 0) arrays fall back to the scalar streaming predicate.

This requires the in-development FastLanes (PR #141 fused mask + PR #145 width-generic BMI2/VBMI untranspose), pinned via a git patch until released.

Benchmarked end-to-end through the public compare path (`bitpack_compare_sweep`, 64Ki elements, all integer types and bit widths): fused beats the streaming baseline for every type and width -

i8/u8 ~6.2-7.7x
i16/u16 ~4.5-6.0x
i32/u32 ~1.9-4.3x
i64/u64 ~1.2-1.9x

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

joseph-isaacs added the changelog/performance A performance improvement label Jun 3, 2026 — with Claude

joseph-isaacs force-pushed the claude/confident-hamilton-mZIEo branch from e27f5f4 to 48da899 on June 3, 2026 17:00

joseph-isaacs wants to merge 1 commit into claude/confident-hamilton-mZIEo-benches from claude/confident-hamilton-mZIEo
