
perf: optimize shift implementations#574

Open
DaniPopes wants to merge 3 commits into recmo:main from DaniPopes:dani/better-shifts

Conversation

@DaniPopes
Contributor

Optimize shift left/right for better codegen:

  • Replace the two-step carry trick (x >> (64 - bits - 1)) >> 1 with x.unbounded_shr(64 - bits) in the generic loop, exposing the funnel shift pattern to LLVM.
  • Add specialized fast paths for u64, u128, and u256 in wrapping_shl/wrapping_shr via the as_primitives! macro.
  • The u256 path decomposes into two u128 halves with branchless mask blending, producing shld/shrd + cmov with zero branches in the hot path.
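The generic-loop change can be illustrated with a minimal sketch (assumed names and layout, not the crate's actual code): `x.unbounded_shr(64 - bits)` returns 0 when the shift amount is 64, so it replaces the old two-step `(x >> (64 - bits - 1)) >> 1` carry trick in one recognizable funnel-shift idiom. `unbounded_shl`/`unbounded_shr` are stable since Rust 1.87.

```rust
/// Sketch of a limb-loop shift-left over little-endian u64 limbs.
/// Assumes `shift < limbs.len() * 64`; names are illustrative only.
fn wrapping_shl_limbs(limbs: &mut [u64], shift: usize) {
    let words = shift / 64;
    let bits = (shift % 64) as u32;
    let n = limbs.len();
    // Walk from the top limb down so lower limbs are still unmodified
    // when they are read as the shift source and carry source.
    for i in (0..n).rev() {
        let lo = if i >= words { limbs[i - words] } else { 0 };
        // Funnel-shift carry: unbounded_shr(64) yields 0 when bits == 0,
        // so no special-casing of the zero sub-limb shift is needed.
        let carry = if i > words {
            limbs[i - words - 1].unbounded_shr(64 - bits)
        } else {
            0
        };
        limbs[i] = lo.unbounded_shl(bits) | carry;
    }
}

fn main() {
    // 1 << 65 over two limbs: the bit lands in the high limb as 2.
    let mut l = [1u64, 0];
    wrapping_shl_limbs(&mut l, 65);
    assert_eq!(l, [0, 2]);
    println!("ok");
}
```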

llvm-mca (Zen 4) for U256::wrapping_shl:

  • 57 instructions, 73 uOps, ~13.1 cycles latency, IPC 4.35
  • Compared to native shl i256: 36 instructions, 37 uOps, ~15.1 cycles latency, IPC 2.39
  • Our version has better latency due to avoiding store-forwarding penalties; native shl i256 has better throughput due to fewer instructions.
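The two-u128-half decomposition measured above can be sketched as follows. This is an assumed standalone illustration (the `wrapping_shl_u256` name and `(lo, hi)` representation are not the crate's API): both arms are computed eagerly, which is safe because `unbounded_shl`/`unbounded_shr` return 0 for out-of-range amounts, and the final select is a mask blend that LLVM can lower to cmov.

```rust
/// Sketch of a branchless 256-bit shift-left built from two u128 halves.
/// Illustrative only; not the PR's actual implementation.
fn wrapping_shl_u256(lo: u128, hi: u128, shift: u32) -> (u128, u128) {
    let s = shift & 255; // wrapping semantics for a 256-bit value
    // Arm 1 (s < 128): funnel-shift carry from lo into hi.
    // unbounded_shr(128) yields 0 when s == 0, so no zero-shift branch.
    let a_lo = lo.unbounded_shl(s);
    let a_hi = hi.unbounded_shl(s) | lo.unbounded_shr(128u32.wrapping_sub(s));
    // Arm 2 (s >= 128): low half becomes zero, lo moves into hi.
    let b_hi = lo.unbounded_shl(s.wrapping_sub(128));
    // Branchless select: all-ones mask when s >= 128.
    let mask = ((s >= 128) as u128).wrapping_neg();
    (a_lo & !mask, (a_hi & !mask) | (b_hi & mask))
}

fn main() {
    // 1 << 128 moves the single set bit into the high half.
    assert_eq!(wrapping_shl_u256(1, 0, 128), (0, 1));
    assert_eq!(wrapping_shl_u256(u128::MAX, 0, 1), (u128::MAX - 1, 1));
    println!("ok");
}
```

The out-of-range arm computes garbage (its shift amount wraps), but the mask discards it, so the hot path stays free of branches.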

@DaniPopes DaniPopes requested a review from prestwich as a code owner April 23, 2026 23:26
@codspeed-hq

codspeed-hq Bot commented Apr 23, 2026

Merging this PR will degrade performance by 14.14%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 5 improved benchmarks
❌ 1 regressed benchmark
✅ 380 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- |
| most_significant_bits/4096/4096 | 29 µs | 26.1 µs | +11.11% |
| pow/128 | 64.9 µs | 75.6 µs | -14.14% |
| to/f32/4096 | 29.1 µs | 26.2 µs | +11.34% |
| to/f64/4096 | 29.2 µs | 26.3 µs | +11.06% |
| wrapping_shl/128 | 281.9 µs | 249.1 µs | +13.16% |
| wrapping_shr/128 | 292.9 µs | 249 µs | +17.62% |

Comparing DaniPopes:dani/better-shifts (441e078) with main (bff85c8)

Open in CodSpeed

@DaniPopes
Contributor Author

Only the shr benchmarks are impacted by this change; the rest is noise: #573 (comment)
