chore: drop async from CUDA traits #6584

Open
0ax1 wants to merge 1 commit into develop from ad/remove-async-from-cuda-traits
Conversation

@0ax1
Contributor

@0ax1 0ax1 commented Feb 18, 2026

Summary

Removes async from CUDA Vortex traits.

Execution model

For a given execution context, all work (with the exception of host buffer copies) is enqueued onto a single CUDA stream and executes in FIFO order. Kernel launches are fire-and-forget: the call is a plain synchronous function that enqueues work and returns immediately, without waiting for the kernel to finish. The returned Canonical may reference device buffers with in-flight writes; correctness is guaranteed by CUDA stream ordering, not by any CPU-side synchronization.

The only operations that require waiting for GPU work to complete are device-to-host copies, where the CPU must block until data lands in host memory.

To insert an explicit sync point, use await_stream_callback, which completes when all preceding stream work has finished.
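
To make the ordering argument concrete, here is a minimal CPU-only sketch of the model described above. It does not call CUDA or Vortex: `MockStream`, `launch`, `synchronize` and `run_demo` are hypothetical names invented for illustration, a single worker thread stands in for the stream, and `synchronize` plays roughly the role this PR assigns to `await_stream_callback`.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// A unit of work enqueued on the mock stream (stand-in for a kernel launch).
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical CPU-only stand-in for a CUDA stream:
/// one worker thread drains jobs strictly in FIFO order.
struct MockStream {
    tx: Option<Sender<Job>>,
    worker: Option<JoinHandle<()>>,
}

impl MockStream {
    fn new() -> Self {
        let (tx, rx) = channel::<Job>();
        // The loop ends once every sender has been dropped.
        let worker = thread::spawn(move || {
            for job in rx {
                job();
            }
        });
        Self { tx: Some(tx), worker: Some(worker) }
    }

    /// Fire-and-forget: enqueue the job and return without waiting for it to run.
    fn launch(&self, job: impl FnOnce() + Send + 'static) {
        self.tx
            .as_ref()
            .expect("stream is open")
            .send(Box::new(job))
            .expect("worker is alive");
    }

    /// Explicit sync point: blocks until all work enqueued before this call is done.
    fn synchronize(&self) {
        let (done_tx, done_rx) = channel::<()>();
        // The marker job runs only after every earlier job, because the queue is FIFO.
        self.launch(move || {
            let _ = done_tx.send(());
        });
        done_rx.recv().expect("worker is alive");
    }
}

impl Drop for MockStream {
    fn drop(&mut self) {
        // Closing the channel lets the worker loop finish; then join the thread.
        drop(self.tx.take());
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

/// Enqueues a write and a dependent update with no CPU-side wait in between,
/// then synchronizes once before reading the result on the "host".
fn run_demo() -> Vec<i32> {
    let stream = MockStream::new();
    let buf = Arc::new(Mutex::new(vec![0i32; 4]));

    let b = Arc::clone(&buf);
    stream.launch(move || {
        for x in b.lock().unwrap().iter_mut() {
            *x = 1;
        }
    });

    // Depends on the first job; stream order alone makes this correct.
    let b = Arc::clone(&buf);
    stream.launch(move || {
        for x in b.lock().unwrap().iter_mut() {
            *x *= 10;
        }
    });

    // Analogue of a device-to-host copy: the only place the CPU has to wait.
    stream.synchronize();
    let out = buf.lock().unwrap().clone();
    out
}
```

Calling `run_demo()` returns `[10, 10, 10, 10]`: the second job always observes the first job's writes even though neither `launch` call waits, which is the property the PR relies on when it hands back a Canonical whose buffers still have in-flight writes.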

@0ax1 0ax1 changed the title chore: drop async from CUDA chore: drop async from CUDA traits Feb 18, 2026
@0ax1 0ax1 changed the title chore: drop async from CUDA traits chore: drop async from CUDA traits Feb 18, 2026
@0ax1 0ax1 added the changelog/chore A trivial change label Feb 18, 2026
@0ax1 0ax1 requested a review from robert3005 February 18, 2026 16:29
@0ax1 0ax1 force-pushed the ad/remove-async-from-cuda-traits branch from 86da863 to 0677373 Compare February 18, 2026 16:46
T: DeviceRepr + Debug + Send + Sync + 'static,
D: AsRef<[T]> + Send + 'static,
{
let host_slice: &[T] = data.as_ref();
Contributor

looks sus

Contributor Author

@0ax1 0ax1 Feb 19, 2026

Which part exactly? Any T is DeviceRepr.

@0ax1 0ax1 enabled auto-merge (squash) February 18, 2026 17:17
@0ax1
Contributor Author

0ax1 commented Feb 18, 2026

Turns out, memory that is already allocated can be registered as pinned with cuMemHostRegister_v2, such that the copy to device will be async. See: https://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/html/group__CUDA__MEM_gf0a9fe11544326dabd743b7aa6b54223.html.

@0ax1 0ax1 force-pushed the ad/remove-async-from-cuda-traits branch from d987ef9 to 35fa29a Compare February 18, 2026 18:32
@robert3005
Contributor

FWIW when I read the docs they strongly encourage using their allocator, as that's faster than cuMemHostRegister/cuMemHostUnregister, but in places where we can't do that we should definitely fall back to that method

@0ax1
Contributor Author

0ax1 commented Feb 18, 2026

FWIW when I read the docs they strongly encourage using their allocator, as that's faster than cuMemHostRegister/cuMemHostUnregister, but in places where we can't do that we should definitely fall back to that method

That was my read too. Maybe we can think about this as a stopgap in the future before we wire in a "CUDA allocator".

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 force-pushed the ad/remove-async-from-cuda-traits branch from 395c051 to 51fbb2e Compare February 18, 2026 21:08
@0ax1 0ax1 disabled auto-merge February 19, 2026 10:15
@0ax1 0ax1 enabled auto-merge (squash) February 19, 2026 10:15