Skip to content

keep DropOnSync alive until after dispatching for cuda slices#6673

Merged
onursatici merged 2 commits intodevelopfrom
os/sync-on-drop
Feb 26, 2026
Merged

keep DropOnSync alive until after dispatching for cuda slices#6673
onursatici merged 2 commits intodevelopfrom
os/sync-on-drop

Conversation

@onursatici
Copy link
Contributor

Summary

cuda device pointers come with SyncOnDrop that is used to synchronise reads and writes to the underlying cuda buffer. We should keep those alive until after dispatching the read or write work on them. For our single stream case this is fine, but if we were to have multiple streams accessing the same buffer these would be a problem

Signed-off-by: Onur Satici <onur@spiraldb.com>
@onursatici onursatici requested a review from 0ax1 February 25, 2026 16:23
@onursatici onursatici added the changelog/fix A bug fix label Feb 25, 2026
@a10y
Copy link
Contributor

a10y commented Feb 25, 2026

I thought the ones returned from device_ptr don't sync and just record events?

@onursatici
Copy link
Contributor Author

yes they only record events, but those events are internal to the cuda view and they are used for synchronising other read or write operations on the same slice

@robert3005
Copy link
Contributor

So is there a practical thing to change here or is this something that is useful when combined with another change?

@0ax1
Copy link
Contributor

0ax1 commented Feb 25, 2026

@onursatici I think this lgtm, but would love to review this tomorrow morning with a fresh mind.

@onursatici
Copy link
Contributor Author

I believe we don't use the same cuda slice from multiple cuda streams yet so this won't fix anything yet, but it does future proof us for that world

Signed-off-by: Onur Satici <onur@spiraldb.com>
@codspeed-hq
Copy link

codspeed-hq bot commented Feb 26, 2026

Merging this PR will improve performance by 42.85%

⚡ 2 improved benchmarks
✅ 952 untouched benchmarks
⏩ 1466 skipped benchmarks1

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_bool_canonical_into[(1000, 10)] 68.4 µs 47.9 µs +42.85%
Simulation map_each[BufferMut<i32>, 128] 858.1 ns 770.6 ns +11.36%

Comparing os/sync-on-drop (ff2f511) with develop (ea7804c)

Open in CodSpeed

Footnotes

  1. 1466 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@onursatici onursatici enabled auto-merge (squash) February 26, 2026 11:00
@0ax1
Copy link
Contributor

0ax1 commented Feb 26, 2026

Unrelated to the changes in your PR I think there's sth odd about adding a sync/record event on drop for each device_ptr call. The drop also doesn't enforce that record event is set right after the launch. The other thing here being that launch kernel can record events as part of the launch.

@onursatici onursatici merged commit 6c7cd94 into develop Feb 26, 2026
51 checks passed
@onursatici onursatici deleted the os/sync-on-drop branch February 26, 2026 11:02
@0ax1
Copy link
Contributor

0ax1 commented Feb 26, 2026

The other thing to note is that cudarc sets events for each cudaview / cudaslice as part of the launch. So as long we enable recording events for launches we can check the events on each cudaview/cudaslice without recording additional events I think. Maybe it's worth considering going the opposite route relying on launch events, and provide a device ptr retrieval wrapper that bypasses recording events. Or I guess we'd need one event per split, once we start parallelizing splits across streams.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/fix A bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants