perf[gpu]: reduce register pressure in dyn dispatch#7489

Merged
0ax1 merged 2 commits into develop from ad/cap-values-per-tile
Apr 16, 2026

Conversation

@0ax1
Contributor

@0ax1 0ax1 commented Apr 16, 2026

We decrease the number of values per tile that each GPU thread handles in the output stage, and limit the register count to 32 via the launch bounds. For now, this brings the dynamic dispatch kernel within a reasonably close range of the standalone kernel.

| Type | Dynamic dispatch | Standalone | Ratio |
|---|---|---|---|
| u8 bw6 | 172 µs | 79 µs | 2.17× |
| u16 bw6 | 140 µs | 88 µs | 1.59× |
| u32 bw6 | 184 µs | 148 µs | 1.24× |
| u64 bw8 | 303 µs | 276 µs | 1.10× |
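
For readers unfamiliar with the two knobs mentioned in the description, here is a minimal sketch of how launch bounds and a per-thread value count interact. It is not the kernel from this PR: `write_output_tile`, `implied_register_budget`, and all three constants are hypothetical names and values chosen for illustration. The 32-register figure falls out of the arithmetic only under the assumption of a device with 65,536 registers and 2,048 resident threads per SM.

```cuda
#include <cstddef>
#include <cstdint>

// Hypothetical tuning constants; the PR does not state the actual values.
constexpr int kThreadsPerBlock = 512;  // upper bound on threads per block
constexpr int kMinBlocksPerSm = 4;     // resident blocks per SM to aim for
constexpr int kValuesPerThread = 4;    // output values each thread holds at once

// Per-thread register budget implied by the launch bounds, given the number
// of 32-bit registers available per SM.
constexpr int implied_register_budget(int regs_per_sm) {
    return regs_per_sm / (kThreadsPerBlock * kMinBlocksPerSm);
}

// __launch_bounds__ promises at most kThreadsPerBlock threads per block and
// asks the compiler to keep register use low enough that kMinBlocksPerSm
// blocks can be resident on one SM.
__global__ void __launch_bounds__(kThreadsPerBlock, kMinBlocksPerSm)
write_output_tile(const uint32_t* __restrict__ in,
                  uint32_t* __restrict__ out,
                  size_t n) {
    // Each block covers blockDim.x * kValuesPerThread consecutive values.
    const size_t tile_base =
        static_cast<size_t>(blockIdx.x) * blockDim.x * kValuesPerThread;

    // Fewer values per thread means fewer values live in registers at once.
    uint32_t vals[kValuesPerThread];
    #pragma unroll
    for (int i = 0; i < kValuesPerThread; ++i) {
        const size_t idx =
            tile_base + static_cast<size_t>(i) * blockDim.x + threadIdx.x;
        vals[i] = idx < n ? in[idx] : 0u;
    }

    // Strided writes keep accesses from neighbouring threads contiguous.
    #pragma unroll
    for (int i = 0; i < kValuesPerThread; ++i) {
        const size_t idx =
            tile_base + static_cast<size_t>(i) * blockDim.x + threadIdx.x;
        if (idx < n) out[idx] = vals[i];
    }
}
```

A launch would use `kThreadsPerBlock` threads per block and enough blocks to cover `n` values at `kThreadsPerBlock * kValuesPerThread` values per block. Compiling with `nvcc -Xptxas -v` prints the register count the compiler actually settled on, which is the quickest way to check whether a bound like this took effect.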

We decrease the number of values per tile in the output stage each
GPU thread uses, as well as limit the register count to 32 in the
launch bounds.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 changed the title perf: reduce register pressure in dyn dispatch perf[gpu]: reduce register pressure in dyn dispatch Apr 16, 2026
@0ax1 0ax1 added the changelog/performance A performance improvement label Apr 16, 2026
@0ax1 0ax1 requested review from a10y and robert3005 April 16, 2026 16:20
@0ax1
Contributor Author

0ax1 commented Apr 16, 2026

Keep in mind that running dyn dispatch with only bp is an odd scenario. We can spend more time fine-tuning bp in the context of dyn dispatch, but we should prioritize optimizing end-to-end perf at this point.

@codspeed-hq

codspeed-hq bot commented Apr 16, 2026

Merging this PR will improve performance by 20.02%

⚡ 9 improved benchmarks
✅ 1154 untouched benchmarks
⏩ 1457 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | take_map[(0.1, 0.5)] | 1,154.5 µs | 980.3 µs | +17.77% |
| Simulation | take_map[(0.1, 1.0)] | 2 ms | 1.6 ms | +20.02% |
| Simulation | patched_take_10k_contiguous_patches | 258.1 µs | 227.7 µs | +13.32% |
| Simulation | patched_take_10k_dispersed | 316 µs | 285.8 µs | +10.58% |
| Simulation | patched_take_10k_contiguous_not_patches | 258.4 µs | 228.1 µs | +13.28% |
| Simulation | patched_take_10k_first_chunk_only | 302 µs | 271.8 µs | +11.14% |
| Simulation | take_10k_first_chunk_only | 270.6 µs | 225.7 µs | +19.89% |
| Simulation | patched_take_10k_random | 270.3 µs | 240 µs | +12.64% |
| Simulation | take_10k_dispersed | 284.4 µs | 239.5 µs | +18.76% |

Comparing ad/cap-values-per-tile (ac24aaf) with develop (1169d84)

Footnotes

  1. 1457 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 force-pushed the ad/cap-values-per-tile branch from 41176af to ac24aaf Compare April 16, 2026 17:14
@0ax1 0ax1 enabled auto-merge (squash) April 16, 2026 17:14
@0ax1 0ax1 merged commit 91b4c75 into develop Apr 16, 2026
60 checks passed
@0ax1 0ax1 deleted the ad/cap-values-per-tile branch April 16, 2026 17:18