Remove async from CUDA traits
Force-pushed from 86da863 to 0677373
T: DeviceRepr + Debug + Send + Sync + 'static,
D: AsRef<[T]> + Send + 'static,
{
    let host_slice: &[T] = data.as_ref();
Which part exactly? Any T is DeviceRepr.
Turns out, memory that is already allocated can be registered as pinned with cuMemHostRegister.
Force-pushed from d987ef9 to 35fa29a
FWIW, when I read the docs they strongly encourage using their allocator as that's faster than cuMemHostRegister/cuMemHostUnregister, but in places where we can't do that we should definitely fall back to that method.

That was my read too. Maybe we can think about this as a stopgap in the future before we wire in a "CUDA allocator".
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Force-pushed from 395c051 to 51fbb2e
Summary
Removes async from CUDA Vortex traits.
Execution model
For a given execution context, all work (with the exception of host buffer copies) is enqueued onto a single CUDA stream and executes in FIFO order. Kernel launches are synchronous, fire-and-forget calls: they enqueue work and return immediately. The returned `Canonical` may reference device buffers with in-flight writes; correctness is guaranteed by CUDA stream ordering, not by any CPU-side synchronization. The only operations that require waiting for GPU work to complete are device-to-host copies, where the CPU must block until data lands in host memory.
To insert an explicit sync point, use `await_stream_callback`, which completes when all preceding stream work has finished.