
perf: optimize shift implementations#574

Open
DaniPopes wants to merge 3 commits into recmo:main from DaniPopes:dani/better-shifts

Conversation

@DaniPopes
Contributor

Optimize shift left/right for better codegen:

  • Replace the two-step carry trick (x >> (64 - bits - 1)) >> 1 with x.unbounded_shr(64 - bits) in the generic loop, exposing the funnel shift pattern to LLVM.
  • Add specialized fast paths for u64, u128, and u256 in wrapping_shl/wrapping_shr via the as_primitives! macro.
  • The u256 path decomposes into two u128 halves with branchless mask blending, producing shld/shrd + cmov with zero branches in the hot path.
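The generic-loop change can be illustrated with a minimal sketch (assumed names and layout, not the crate's actual code): `x.unbounded_shr(64 - bits)` returns 0 when the shift amount is 64, so it replaces the old two-step `(x >> (64 - bits - 1)) >> 1` carry trick in one recognizable funnel-shift idiom. `unbounded_shl`/`unbounded_shr` are stable since Rust 1.87.

```rust
/// Sketch of a limb-loop shift-left over little-endian u64 limbs.
/// Assumes `shift < limbs.len() * 64`; names are illustrative only.
fn wrapping_shl_limbs(limbs: &mut [u64], shift: usize) {
    let words = shift / 64;
    let bits = (shift % 64) as u32;
    let n = limbs.len();
    // Walk from the top limb down so lower limbs are still unmodified
    // when they are read as the shift source and carry source.
    for i in (0..n).rev() {
        let lo = if i >= words { limbs[i - words] } else { 0 };
        // Funnel-shift carry: unbounded_shr(64) yields 0 when bits == 0,
        // so no special-casing of the zero sub-limb shift is needed.
        let carry = if i > words {
            limbs[i - words - 1].unbounded_shr(64 - bits)
        } else {
            0
        };
        limbs[i] = lo.unbounded_shl(bits) | carry;
    }
}

fn main() {
    // 1 << 65 over two limbs: the bit lands in the high limb as 2.
    let mut l = [1u64, 0];
    wrapping_shl_limbs(&mut l, 65);
    assert_eq!(l, [0, 2]);
    println!("ok");
}
```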

llvm-mca (Zen 4) for U256::wrapping_shl:

  • 57 instructions, 73 uOps, ~13.1 cycles latency, IPC 4.35
  • Compared to native shl i256: 36 instructions, 37 uOps, ~15.1 cycles latency, IPC 2.39
  • Our version has better latency due to avoiding store-forwarding penalties; native shl i256 has better throughput due to fewer instructions.
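The two-u128-half decomposition measured above can be sketched as follows. This is an assumed standalone illustration (the `wrapping_shl_u256` name and `(lo, hi)` representation are not the crate's API): both arms are computed eagerly, which is safe because `unbounded_shl`/`unbounded_shr` return 0 for out-of-range amounts, and the final select is a mask blend that LLVM can lower to cmov.

```rust
/// Sketch of a branchless 256-bit shift-left built from two u128 halves.
/// Illustrative only; not the PR's actual implementation.
fn wrapping_shl_u256(lo: u128, hi: u128, shift: u32) -> (u128, u128) {
    let s = shift & 255; // wrapping semantics for a 256-bit value
    // Arm 1 (s < 128): funnel-shift carry from lo into hi.
    // unbounded_shr(128) yields 0 when s == 0, so no zero-shift branch.
    let a_lo = lo.unbounded_shl(s);
    let a_hi = hi.unbounded_shl(s) | lo.unbounded_shr(128u32.wrapping_sub(s));
    // Arm 2 (s >= 128): low half becomes zero, lo moves into hi.
    let b_hi = lo.unbounded_shl(s.wrapping_sub(128));
    // Branchless select: all-ones mask when s >= 128.
    let mask = ((s >= 128) as u128).wrapping_neg();
    (a_lo & !mask, (a_hi & !mask) | (b_hi & mask))
}

fn main() {
    // 1 << 128 moves the single set bit into the high half.
    assert_eq!(wrapping_shl_u256(1, 0, 128), (0, 1));
    assert_eq!(wrapping_shl_u256(u128::MAX, 0, 1), (u128::MAX - 1, 1));
    println!("ok");
}
```

The out-of-range arm computes garbage (its shift amount wraps), but the mask discards it, so the hot path stays free of branches.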

@DaniPopes DaniPopes requested a review from prestwich as a code owner April 23, 2026 23:26
@codspeed-hq

codspeed-hq Bot commented Apr 23, 2026

Merging this PR will degrade performance by 14.14%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 5 improved benchmarks
❌ 1 regressed benchmark
✅ 380 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- |
| most_significant_bits/4096/4096 | 29 µs | 26.1 µs | +11.11% |
| pow/128 | 64.9 µs | 75.6 µs | -14.14% |
| to/f32/4096 | 29.1 µs | 26.2 µs | +11.34% |
| to/f64/4096 | 29.2 µs | 26.3 µs | +11.06% |
| wrapping_shl/128 | 281.9 µs | 249.1 µs | +13.16% |
| wrapping_shr/128 | 292.9 µs | 249 µs | +17.62% |

Comparing DaniPopes:dani/better-shifts (441e078) with main (bff85c8)

Open in CodSpeed

@DaniPopes
Contributor Author

Only the shr benchmarks are impacted by this change; the rest is noise: #573 (comment)
