Support for CUDA streams #96
When writing CUDA applications, an important aspect of keeping GPUs busy is the use of streams to enqueue operations asynchronously from the host.
Libraries such as Numba and CuPy offer support for CUDA streams, but today we don't know to what extent they are functional with Dask.
I believe CUDA streams will be beneficial for achieving higher performance; particularly in the case of many small operations, streams may help Dask keep dispatching work asynchronously while the GPU does the work.
We should check what the correct way of using them with Dask is, and whether/how they provide performance improvements.
cc @mrocklin @jakirkham @jrhemstad
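To make the motivation concrete, here is a minimal sketch (not part of the original report) of what asynchronous dispatch on streams looks like with CuPy; the shapes and stream setup are arbitrary:

```python
import cupy

# Two non-blocking streams; work enqueued on different streams may overlap.
s1 = cupy.cuda.Stream(non_blocking=True)
s2 = cupy.cuda.Stream(non_blocking=True)

with s1:
    a = cupy.random.random((1000, 1000))
    a = a @ a  # enqueued on s1; the host does not wait for the GPU here

with s2:
    b = cupy.random.random((1000, 1000))
    b = b + 1  # enqueued on s2, potentially overlapping with s1's work

# The host blocks only when we explicitly synchronize (or fetch results).
s1.synchronize()
s2.synchronize()
```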
There is no correct way today. Every Python GPU library has its own Python wrapping of the CUDA streams API, so there is no consistent way to do this. Also cc'ing @kkraus14
FWIW, I recently stumbled across this bug (cupy/cupy#2159) on the CuPy issue tracker. Not sure if it is directly related, but figured it was worth being aware of.
Possibly related: numba/numba#4797
One potentially interesting idea to consider would be to create streams and store them in thread-local storage. This way we could grab the stream for the current thread and apply it to operations within that thread relatively easily. Ideally this would give us the benefits of compiling with […] cc @kkraus14
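A minimal sketch of that thread-local-stream idea, assuming CuPy streams; the helper names here are hypothetical:

```python
import threading
import cupy

_local = threading.local()

def thread_stream():
    # Lazily create one CUDA stream per host thread.
    if not hasattr(_local, "stream"):
        _local.stream = cupy.cuda.Stream(non_blocking=True)
    return _local.stream

def run_on_thread_stream(fn, *args, **kwargs):
    # CuPy operations inside fn are enqueued on this thread's stream.
    with thread_stream():
        return fn(*args, **kwargs)
```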
I've never really used […]
Just wanna add that CuPy is already doing exactly what John said (thread-local streams), see https://github.com/cupy/cupy/blob/096957541c36d4f2a3f2a453e88456fffde1c8bf/cupy/cuda/stream.pyx#L8-L46.
For @pentschev's question: I think users who write a multi-threaded GPU program are responsible for taking care of synchronization, as they would in CUDA C/C++. Not sure if there's an easy way out. By the way, a discussion on per-thread default streams was recently brought up in Numba: numba/numba#5137.
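To illustrate the kind of synchronization users would own, a common CUDA pattern is recording an event on the producing stream and having the consuming stream wait on it. A sketch with CuPy (names are illustrative):

```python
import cupy

producer = cupy.cuda.Stream(non_blocking=True)
consumer = cupy.cuda.Stream(non_blocking=True)

with producer:
    x = cupy.arange(1_000_000) * 2  # work enqueued on the producer stream

# Mark the point in the producer's work that the consumer depends on.
done = cupy.cuda.Event()
done.record(producer)

# The consumer stream waits for the event on the GPU; the host is not blocked.
consumer.wait_event(done)

with consumer:
    y = x.sum()  # safe: runs only after the producer's work has completed
```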
It seems like […]
Thanks for the info, Leo! So maybe this is less a question of how dask-cuda (specifically) should handle this and more a question of how we should work together across libraries to achieve a shared stream solution.
I'd also note that […]
Updating this issue a bit. It is possible to enable per-thread default streams (PTDS) […]
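For later readers, the knobs that eventually landed look roughly like the sketch below; the environment variables and the `ptds` flag are assumptions based on newer CuPy/Numba releases and may differ by version:

```python
import os

# Must be set before the libraries initialize CUDA (i.e., before import).
os.environ["CUPY_CUDA_PER_THREAD_DEFAULT_STREAM"] = "1"   # CuPy PTDS
os.environ["NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM"] = "1"  # Numba PTDS

import cupy

# Newer CuPy also exposes the per-thread default stream as an object.
s = cupy.cuda.Stream(ptds=True)
with s:
    a = cupy.arange(10) + 1
```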
Asking about PTDS support with CuPy in issue cupy/cupy#3755.
We would also need to make a change to Distributed to support PTDS. I have submitted a draft PR (dask/distributed#4034) to make these changes.
Also, PR rapidsai/rmm#466 will switch us to the new RMM pool, which does support PTDS.
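A sketch of wiring the RMM pool up for CuPy, assuming the `rmm.reinitialize` convenience API of that era; exact names may differ across RMM versions:

```python
import rmm
import cupy

# Use RMM's pool allocator (backed by pool_memory_resource).
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)

# Route CuPy allocations through RMM so both libraries share one pool.
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)

x = cupy.arange(10)  # now allocated from the RMM pool
```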
I began with some testing and changes to enable PTDS support in rapidsai/rmm#633 and cupy/cupy#4322.
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
There is active work here, though it's not really tracked in this issue as a few libraries are involved.
This is ongoing work. See issue #517 for more up-to-date status.