
GPU kernel for sorted patches with chunk_offsets#7440

Open
a10y wants to merge 8 commits into develop from aduffy/patches-kernel

Conversation

@a10y
Contributor

@a10y a10y commented Apr 15, 2026

Summary

This branch is a proof of concept for data-parallel patching without needing G-ALP-style transposed patches.

Old Method

  • D2H copy the patches, transpose them on the CPU, then H2D copy them back to the GPU
  • Invoke the kernel with transposed patches

New Method

  • Extract the chunk_offsets from the Patches. They are like an ends buffer except the final end is implicit and not stored.
  • Update the patches cursor in the C++ code to use the chunk_offsets and seek to each chunk. Each thread receives an n_chunk_patches / n_threads share of that chunk's patches, which it applies in a straight-line loop (to shared memory, not global memory)
  • The regular shared -> global memory flush then proceeds as before

TODO

  • I'm fairly sure this doesn't correctly handle a non-zero offset_within_chunk, so I need to add tests for that
  • Benchmark build_chunk_offsets to understand its overhead

Follow up

This is strictly better than what we have, so I think it's worth taking as-is. But it raises the question of whether we still need the Patched array at all.

@a10y a10y force-pushed the aduffy/patches-kernel branch 3 times, most recently from 667ac15 to b12a04c on April 15, 2026 19:44
@a10y
Contributor Author

a10y commented Apr 15, 2026

Initial results seem very positive:

bitunpack_cuda_patched_u8/bitunpack_patched/1%
                        time:   [82.419 µs 82.774 µs 82.964 µs]
                        thrpt:  [1122.6 GiB/s 1125.1 GiB/s 1130.0 GiB/s]
bitunpack_cuda_patched_u8/bitunpack_patched/10%
                        time:   [92.184 µs 94.058 µs 95.728 µs]
                        thrpt:  [972.88 GiB/s 990.16 GiB/s 1010.3 GiB/s]


bitunpack_cuda_patched_u16/bitunpack_patched/1%
                        time:   [100.92 µs 102.04 µs 104.02 µs]
                        thrpt:  [1790.6 GiB/s 1825.5 GiB/s 1845.7 GiB/s]
bitunpack_cuda_patched_u16/bitunpack_patched/10%
                        time:   [111.74 µs 112.58 µs 113.22 µs]
                        thrpt:  [1645.1 GiB/s 1654.6 GiB/s 1666.9 GiB/s]


bitunpack_cuda_patched_u32/bitunpack_patched/1%
                        time:   [150.23 µs 152.20 µs 154.80 µs]
                        thrpt:  [2406.5 GiB/s 2447.6 GiB/s 2479.8 GiB/s]
bitunpack_cuda_patched_u32/bitunpack_patched/10%
                        time:   [169.65 µs 169.93 µs 170.44 µs]
                        thrpt:  [2185.8 GiB/s 2192.3 GiB/s 2195.9 GiB/s]


bitunpack_cuda_patched_u64/bitunpack_patched/1%
                        time:   [271.95 µs 272.41 µs 272.90 µs]
                        thrpt:  [2730.2 GiB/s 2735.1 GiB/s 2739.7 GiB/s]
bitunpack_cuda_patched_u64/bitunpack_patched/10%
                        time:   [301.21 µs 301.47 µs 301.87 µs]
                        thrpt:  [2468.1 GiB/s 2471.4 GiB/s 2473.5 GiB/s]

For comparison, the same benchmarks on develop (the "regression" criterion reports below means develop runs slower than this branch):

bitunpack_cuda_patched_u8/bitunpack_patched/1%
                        time:   [102.93 µs 103.53 µs 104.76 µs]
                        thrpt:  [889.02 GiB/s 899.57 GiB/s 904.80 GiB/s]
                 change:
                        time:   [+24.496% +25.074% +26.299%] (p = 0.00 < 0.05)
                        thrpt:  [-20.823% -20.048% -19.676%]
                        Performance has regressed.
bitunpack_cuda_patched_u8/bitunpack_patched/10%
                        time:   [152.62 µs 154.01 µs 155.01 µs]
                        thrpt:  [600.81 GiB/s 604.70 GiB/s 610.21 GiB/s]
                 change:
                        time:   [+57.297% +63.743% +66.597%] (p = 0.00 < 0.05)
                        thrpt:  [-39.975% -38.929% -36.426%]
                        Performance has regressed.


bitunpack_cuda_patched_u16/bitunpack_patched/1%
                        time:   [110.13 µs 110.89 µs 112.07 µs]
                        thrpt:  [1662.0 GiB/s 1679.7 GiB/s 1691.3 GiB/s]
                 change:
                        time:   [+6.5498% +8.6809% +9.4975%] (p = 0.00 < 0.05)
                        thrpt:  [-8.6737% -7.9875% -6.1472%]
                        Performance has regressed.
bitunpack_cuda_patched_u16/bitunpack_patched/10%
                        time:   [159.44 µs 160.40 µs 161.07 µs]
                        thrpt:  [1156.4 GiB/s 1161.2 GiB/s 1168.3 GiB/s]
                 change:
                        time:   [+40.653% +42.484% +44.092%] (p = 0.00 < 0.05)
                        thrpt:  [-30.600% -29.817% -28.903%]
                        Performance has regressed.


bitunpack_cuda_patched_u32/bitunpack_patched/1%
                        time:   [155.37 µs 157.02 µs 159.86 µs]
                        thrpt:  [2330.4 GiB/s 2372.5 GiB/s 2397.8 GiB/s]
                 change:
                        time:   [+1.6495% +3.1642% +6.8374%] (p = 0.00 < 0.05)
                        thrpt:  [-6.3998% -3.0672% -1.6227%]
                        Performance has regressed.
bitunpack_cuda_patched_u32/bitunpack_patched/10%
                        time:   [205.53 µs 205.76 µs 205.97 µs]
                        thrpt:  [1808.7 GiB/s 1810.5 GiB/s 1812.6 GiB/s]
                 change:
                        time:   [+20.586% +21.090% +21.402%] (p = 0.00 < 0.05)
                        thrpt:  [-17.629% -17.417% -17.072%]
                        Performance has regressed.


bitunpack_cuda_patched_u64/bitunpack_patched/1%
                        time:   [279.55 µs 281.54 µs 282.60 µs]
                        thrpt:  [2636.4 GiB/s 2646.4 GiB/s 2665.2 GiB/s]
                 change:
                        time:   [+2.5941% +3.3502% +4.0234%] (p = 0.00 < 0.05)
                        thrpt:  [-3.8678% -3.2416% -2.5285%]
                        Performance has regressed.
bitunpack_cuda_patched_u64/bitunpack_patched/10%
                        time:   [519.03 µs 519.75 µs 522.95 µs]
                        thrpt:  [1424.7 GiB/s 1433.5 GiB/s 1435.5 GiB/s]
                 change:
                        time:   [+72.310% +72.405% +72.922%] (p = 0.00 < 0.05)
                        thrpt:  [-42.171% -41.997% -41.965%]
                        Performance has regressed.

#include "patches.h"

/// Load a chunk offset value, dispatching on the runtime type.
__device__ inline uint32_t load_chunk_offset(const GPUPatches &patches, uint32_t idx) {
Contributor Author


This exists because the chunk_offsets array can be any PType. We only index it once when we set up the cursor, so instead of doing an expensive cast of the whole array upfront, I just do runtime dispatch on the PType.

@a10y a10y requested review from 0ax1 and robert3005 April 15, 2026 22:17
@a10y
Contributor Author

a10y commented Apr 16, 2026

@claude take a first pass at this PR.

Focus on approach, CUDA kernels, and finding opportunities to simplify/remove

@claude
Contributor

claude bot commented Apr 16, 2026

Claude encountered an error (View job)


I'll analyze this and get back to you.


@codspeed-hq

codspeed-hq bot commented Apr 16, 2026

Merging this PR will not alter performance

✅ 1163 untouched benchmarks
⏩ 1457 skipped benchmarks [1]


Comparing aduffy/patches-kernel (1a1c672) with develop (ce52b71)

Open in CodSpeed

Footnotes

  1. 1457 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in CodSpeed to remove them from the performance reports.

@a10y a10y force-pushed the aduffy/patches-kernel branch 3 times, most recently from 40aa26e to fc5ac8f on April 16, 2026 19:24
a10y added 7 commits April 16, 2026 17:11
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/patches-kernel branch from ef280f2 to 82f8911 on April 16, 2026 21:11
@a10y a10y added the changelog/feature A new feature label Apr 16, 2026
CI clippy was failing on deprecation warnings in three CUDA bitpacked
tests using `ArrayRef::to_canonical`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/patches-kernel branch from 7e3281c to 1a1c672 on April 16, 2026 21:18
@a10y a10y marked this pull request as ready for review April 16, 2026 21:29
@robert3005
Contributor

I think the Patched array is the abstraction we want; how the offsets are laid out was always independent in my head. The question is whether it's better for each thread to do a forward pass or to know its starting point up front.

@a10y
Contributor Author

a10y commented Apr 17, 2026

I think for G-ALP the things that really matter are

  1. Allowing threads to seek in constant time to their patch range
  2. Ensuring that no data needs to be moved around for patch application

I actually think this satisfies both of those. There is no forward pass required in this implementation: each thread can find its patch range in constant time. So we might not be losing much with this layout, if anything.


Labels

changelog/feature A new feature
