
GPU kernel for sorted patches with chunk_offsets#7440

Open
a10y wants to merge 8 commits into develop from aduffy/patches-kernel

Conversation

@a10y
Contributor

@a10y a10y commented Apr 15, 2026

Summary

This branch is a proof of concept for data-parallel patching without needing G-ALP-style transposed patches.

Old Method

  • D2H copy the patches, transpose them on the CPU, then H2D copy them back to the GPU
  • Invoke the kernel with transposed patches

New Method

  • Extract the chunk_offsets from the Patches. They are like an ends buffer except the final end is implicit and not stored.
  • Update the patches cursor in the C++ code to use the chunk_offsets and seek to each chunk. Each thread receives an n_chunk_patches / n_threads share of that chunk's patches, which it applies in a straight-line loop (to shared memory, not global memory)
  • The regular shared -> global memory flush then proceeds as before

TODO

  • I'm fairly sure this doesn't correctly handle a non-zero offset_within_chunk, so I need to add tests for that
  • Benchmark build_chunk_offsets to understand its overhead

Follow up

This is strictly better than what we have, so I think it's worth taking as-is. But it raises the question of whether we still need the Patched array at all.

@a10y a10y force-pushed the aduffy/patches-kernel branch 3 times, most recently from 667ac15 to b12a04c on April 15, 2026 19:44
@a10y
Contributor Author

a10y commented Apr 15, 2026

Initial results seem very positive:

bitunpack_cuda_patched_u8/bitunpack_patched/1%
                        time:   [82.419 µs 82.774 µs 82.964 µs]
                        thrpt:  [1122.6 GiB/s 1125.1 GiB/s 1130.0 GiB/s]
bitunpack_cuda_patched_u8/bitunpack_patched/10%
                        time:   [92.184 µs 94.058 µs 95.728 µs]
                        thrpt:  [972.88 GiB/s 990.16 GiB/s 1010.3 GiB/s]


bitunpack_cuda_patched_u16/bitunpack_patched/1%
                        time:   [100.92 µs 102.04 µs 104.02 µs]
                        thrpt:  [1790.6 GiB/s 1825.5 GiB/s 1845.7 GiB/s]
bitunpack_cuda_patched_u16/bitunpack_patched/10%
                        time:   [111.74 µs 112.58 µs 113.22 µs]
                        thrpt:  [1645.1 GiB/s 1654.6 GiB/s 1666.9 GiB/s]


bitunpack_cuda_patched_u32/bitunpack_patched/1%
                        time:   [150.23 µs 152.20 µs 154.80 µs]
                        thrpt:  [2406.5 GiB/s 2447.6 GiB/s 2479.8 GiB/s]
bitunpack_cuda_patched_u32/bitunpack_patched/10%
                        time:   [169.65 µs 169.93 µs 170.44 µs]
                        thrpt:  [2185.8 GiB/s 2192.3 GiB/s 2195.9 GiB/s]


bitunpack_cuda_patched_u64/bitunpack_patched/1%
                        time:   [271.95 µs 272.41 µs 272.90 µs]
                        thrpt:  [2730.2 GiB/s 2735.1 GiB/s 2739.7 GiB/s]
bitunpack_cuda_patched_u64/bitunpack_patched/10%
                        time:   [301.21 µs 301.47 µs 301.87 µs]
                        thrpt:  [2468.1 GiB/s 2471.4 GiB/s 2473.5 GiB/s]

For comparison, the same benchmarks on develop (the "regression" criterion reports below means develop runs slower than this branch):

bitunpack_cuda_patched_u8/bitunpack_patched/1%
                        time:   [102.93 µs 103.53 µs 104.76 µs]
                        thrpt:  [889.02 GiB/s 899.57 GiB/s 904.80 GiB/s]
                 change:
                        time:   [+24.496% +25.074% +26.299%] (p = 0.00 < 0.05)
                        thrpt:  [-20.823% -20.048% -19.676%]
                        Performance has regressed.
bitunpack_cuda_patched_u8/bitunpack_patched/10%
                        time:   [152.62 µs 154.01 µs 155.01 µs]
                        thrpt:  [600.81 GiB/s 604.70 GiB/s 610.21 GiB/s]
                 change:
                        time:   [+57.297% +63.743% +66.597%] (p = 0.00 < 0.05)
                        thrpt:  [-39.975% -38.929% -36.426%]
                        Performance has regressed.


bitunpack_cuda_patched_u16/bitunpack_patched/1%
                        time:   [110.13 µs 110.89 µs 112.07 µs]
                        thrpt:  [1662.0 GiB/s 1679.7 GiB/s 1691.3 GiB/s]
                 change:
                        time:   [+6.5498% +8.6809% +9.4975%] (p = 0.00 < 0.05)
                        thrpt:  [-8.6737% -7.9875% -6.1472%]
                        Performance has regressed.
bitunpack_cuda_patched_u16/bitunpack_patched/10%
                        time:   [159.44 µs 160.40 µs 161.07 µs]
                        thrpt:  [1156.4 GiB/s 1161.2 GiB/s 1168.3 GiB/s]
                 change:
                        time:   [+40.653% +42.484% +44.092%] (p = 0.00 < 0.05)
                        thrpt:  [-30.600% -29.817% -28.903%]
                        Performance has regressed.


bitunpack_cuda_patched_u32/bitunpack_patched/1%
                        time:   [155.37 µs 157.02 µs 159.86 µs]
                        thrpt:  [2330.4 GiB/s 2372.5 GiB/s 2397.8 GiB/s]
                 change:
                        time:   [+1.6495% +3.1642% +6.8374%] (p = 0.00 < 0.05)
                        thrpt:  [-6.3998% -3.0672% -1.6227%]
                        Performance has regressed.
bitunpack_cuda_patched_u32/bitunpack_patched/10%
                        time:   [205.53 µs 205.76 µs 205.97 µs]
                        thrpt:  [1808.7 GiB/s 1810.5 GiB/s 1812.6 GiB/s]
                 change:
                        time:   [+20.586% +21.090% +21.402%] (p = 0.00 < 0.05)
                        thrpt:  [-17.629% -17.417% -17.072%]
                        Performance has regressed.


bitunpack_cuda_patched_u64/bitunpack_patched/1%
                        time:   [279.55 µs 281.54 µs 282.60 µs]
                        thrpt:  [2636.4 GiB/s 2646.4 GiB/s 2665.2 GiB/s]
                 change:
                        time:   [+2.5941% +3.3502% +4.0234%] (p = 0.00 < 0.05)
                        thrpt:  [-3.8678% -3.2416% -2.5285%]
                        Performance has regressed.
bitunpack_cuda_patched_u64/bitunpack_patched/10%
                        time:   [519.03 µs 519.75 µs 522.95 µs]
                        thrpt:  [1424.7 GiB/s 1433.5 GiB/s 1435.5 GiB/s]
                 change:
                        time:   [+72.310% +72.405% +72.922%] (p = 0.00 < 0.05)
                        thrpt:  [-42.171% -41.997% -41.965%]
                        Performance has regressed.

#include "patches.h"

/// Load a chunk offset value, dispatching on the runtime type.
__device__ inline uint32_t load_chunk_offset(const GPUPatches &patches, uint32_t idx) {
Contributor Author


This exists because the chunk_offsets array can be any PType. We only index it once when we set up the cursor, so instead of doing an expensive cast of the whole array upfront, I just do runtime dispatch on the PType.

@a10y a10y requested review from 0ax1 and robert3005 April 15, 2026 22:17
@a10y
Contributor Author

a10y commented Apr 16, 2026

@claude take a first pass at this PR.

Focus on approach, CUDA kernels, and finding opportunities to simplify/remove

@claude
Contributor

claude bot commented Apr 16, 2026

Claude encountered an error (View job)


I'll analyze this and get back to you.


@codspeed-hq

codspeed-hq bot commented Apr 16, 2026

Merging this PR will not alter performance

✅ 1163 untouched benchmarks
⏩ 1457 skipped benchmarks [1]


Comparing aduffy/patches-kernel (1a1c672) with develop (ce52b71)

Open in CodSpeed

Footnotes

  1. 1457 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in CodSpeed to remove them from the performance reports.

@a10y a10y force-pushed the aduffy/patches-kernel branch 3 times, most recently from 40aa26e to fc5ac8f on April 16, 2026 19:24
a10y added 7 commits April 16, 2026 17:11
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/patches-kernel branch from ef280f2 to 82f8911 on April 16, 2026 21:11
@a10y a10y added the changelog/feature A new feature label Apr 16, 2026
CI clippy was failing on deprecation warnings in three CUDA bitpacked
tests using `ArrayRef::to_canonical`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/patches-kernel branch from 7e3281c to 1a1c672 on April 16, 2026 21:18
@a10y a10y marked this pull request as ready for review April 16, 2026 21:29
@robert3005
Contributor

I think the Patched array is the abstraction we want; how the offsets are laid out was always independent in my head. The question is whether it's better for each thread to do a forward pass or to know its starting point up front.

@a10y
Contributor Author

a10y commented Apr 17, 2026

I think for G-ALP the things that really matter are

  1. Allowing threads to seek in constant time to their patch range
  2. Ensuring that no data needs to be moved around for patch application

I actually think this satisfies both of those. There is no forward pass required in this implementation: each thread can find its patch range in constant time. So we might not be losing much with this layout, if anything.


Labels

changelog/feature A new feature
