Support for CUDA streams #96

Open
pentschev opened this issue Jul 24, 2019 · 21 comments

@pentschev (Member)

When writing CUDA applications, an important aspect of keeping GPUs busy is the use of streams to enqueue operations asynchronously from the host.

Libraries such as Numba and CuPy offer support for CUDA streams, but today we don't know to what extent they work with Dask.

I believe CUDA streams will be beneficial for achieving higher performance: particularly in the case of many small operations, streams may help Dask keep dispatching work asynchronously while the GPUs execute it.

We should figure out the correct way of using them with Dask and whether they provide performance improvements.
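
To make that concrete, here is a minimal sketch (not from any existing Dask code) of what enqueueing work on a non-default stream looks like with CuPy today; Numba's CUDA support exposes a similar API.

```python
import cupy as cp

# A non-blocking stream, i.e. one that does not implicitly synchronize with
# the legacy default stream.
stream = cp.cuda.Stream(non_blocking=True)

with stream:
    # Work launched inside the block is enqueued on `stream`; the host
    # returns immediately and can keep dispatching more operations.
    x = cp.random.random((4096, 4096))
    y = x @ x

# Block the host only when the result is actually needed.
stream.synchronize()
```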

cc @mrocklin @jakirkham @jrhemstad

@mrocklin (Contributor)

There is no correct way today. Every Python GPU library has its own Python wrapping of the CUDA streams API, so there is no consistent way to do this. Also cc'ing @kkraus14

@jakirkham (Member)

FWIW I recently stumbled across this bug ( cupy/cupy#2159 ) on the CuPy issue tracker. Not sure whether it is directly related, but it seemed worth being aware of.

@leofang (Member) commented Jan 9, 2020

Possibly related: numba/numba#4797

@jakirkham (Member)

One potentially interesting idea to consider would be to create streams and store them in thread local storage. This way we could grab the stream for the current thread and apply it to operations within that thread relatively easily. Ideally this would give us the benefits of compiling with --default-stream per-thread without having to necessarily do that. The question then is where should this functionality live? dask-cuda seems like a reasonable place, but we could also imagine Numba, CuPy, or others being reasonable candidates. Thoughts?
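
As a rough sketch of that idea, assuming CuPy as the array library (the helper names below are made up for illustration):

```python
import threading

import cupy as cp

_local = threading.local()  # per-thread storage for the stream


def get_thread_stream():
    """Return a CUDA stream unique to the calling thread, creating it lazily."""
    stream = getattr(_local, "stream", None)
    if stream is None:
        stream = cp.cuda.Stream(non_blocking=True)
        _local.stream = stream
    return stream


def run_on_thread_stream(func, *args, **kwargs):
    """Run `func` with this thread's stream set as CuPy's current stream."""
    with get_thread_stream():
        return func(*args, **kwargs)
```

A worker thread would then wrap each task in something like `run_on_thread_stream`, giving behavior close to `--default-stream per-thread` without recompiling anything.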

cc @kkraus14

@pentschev (Member, Author)

I've never really used --default-stream per-thread. What's the behavior if you happen to access the same data from multiple threads? Do you need explicit synchronization? If so, how would that be handled automatically without requiring user intervention?

@leofang (Member) commented Feb 7, 2020

> One potentially interesting idea to consider would be to create streams and store them in thread local storage. [...] The question then is where should this functionality live?

Just want to add that CuPy already does exactly what John said (thread-local streams); see https://github.com/cupy/cupy/blob/096957541c36d4f2a3f2a453e88456fffde1c8bf/cupy/cuda/stream.pyx#L8-L46.
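
A small illustration of that (a hand-written snippet, not taken from the linked code): the "current" stream CuPy reports is tracked per thread, so setting a stream in one thread does not affect another.

```python
import threading

import cupy as cp


def report():
    # Each thread activates its own stream; get_current_stream() reflects
    # the stream set in *this* thread only.
    with cp.cuda.Stream(non_blocking=True):
        print(threading.current_thread().name, cp.cuda.get_current_stream())


threads = [threading.Thread(target=report) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```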

@leofang (Member) commented Feb 7, 2020

For @pentschev's question: I think users who write a multi-threaded GPU program are responsible for taking care of synchronization, as they would in CUDA C/C++. Not sure there's an easy way out.

By the way, a discussion on per-thread default streams was recently brought up in Numba: numba/numba#5137.
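
For reference, this is roughly what that explicit synchronization looks like in CuPy when data produced on one stream is consumed on another (a hand-written sketch, not code from any of the linked projects):

```python
import cupy as cp

producer = cp.cuda.Stream(non_blocking=True)
consumer = cp.cuda.Stream(non_blocking=True)

with producer:
    data = cp.arange(1_000_000) ** 2  # enqueued on `producer`
    done = cp.cuda.Event()
    done.record(producer)             # point the consumer must wait for

# Order the consumer stream after the event without blocking the host.
consumer.wait_event(done)
with consumer:
    total = data.sum()                # safe: runs after the producer's work
```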

@pentschev (Member, Author)

It seems like --default-stream per-thread could then be helpful if we assign multiple threads per worker (currently we only use one in dask-cuda). My original idea was to have the same thread working on multiple streams, but I think that may be too complex a use case for Dask. However, multiple threads per worker on the same GPU may limit the problem size due to a potentially larger memory footprint, so we really need to test this out and see whether it's something we can work with in Dask.

@jakirkham (Member)

Thanks for the info, Leo! So maybe this is less a question of how dask-cuda specifically should handle this and more a question of how we should work together across libraries to achieve a shared stream solution.

@kkraus14 commented Feb 8, 2020

I'd also note that --default-stream per-thread is currently incompatible with RMM pool mode; this is planned to be addressed in the future, but likely won't be for a while.

@jakirkham (Member)

Updating this issue a bit: it is possible to enable --default-stream per-thread with RMM's default CNMeM pool ( rapidsai/rmm#354 ), but it comes with some tradeoffs. There is also a new RMM pool, for which --default-stream per-thread support is being worked on in PR ( rapidsai/rmm#425 ).

@jakirkham (Member)

Asking about PTDS (per-thread default stream) support with CuPy in issue ( cupy/cupy#3755 ).

@jakirkham (Member)

We would also need to make changes to Distributed to support PTDS. I have submitted a draft PR ( dask/distributed#4034 ) with those changes.

@jakirkham (Member)

> I'd also note that --default-stream-per-thread is currently incompatible with RMM pool mode [...]

Also PR ( rapidsai/rmm#466 ) will switch us to the new RMM pool, which does support PTDS.

@pentschev (Member, Author)

I've begun some testing and changes to enable PTDS support in rapidsai/rmm#633 and cupy/cupy#4322.
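
For anyone following along, the setup being tested looks roughly like the sketch below. The environment variable controlling CuPy's PTDS mode is taken from that PR and should be treated as an assumption until it is merged, and it has to be set before CUDA is initialized; `rmm.reinitialize` and `rmm.rmm_cupy_allocator` are the RMM APIs as they exist today.

```python
import os

# Assumed switch for CuPy's per-thread default stream (see cupy/cupy#4322);
# must be set before `cupy` initializes CUDA.
os.environ["CUPY_CUDA_PER_THREAD_DEFAULT_STREAM"] = "1"

import cupy as cp
import rmm

# Use RMM's pool allocator (the newer pool resource is the one that supports
# PTDS) and route CuPy allocations through it.
rmm.reinitialize(pool_allocator=True)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

# Work launched from different worker threads now lands on each thread's
# default stream without creating streams explicitly.
x = cp.ones((1024, 1024))
print(float((x @ x).sum()))
```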

@pentschev added the "feature request" label Jan 8, 2021
@pentschev added this to Issue-Needs prioritizing in v0.19 Release via automation Jan 13, 2021
@pentschev moved this from Issue-Needs prioritizing to Issue-P2 in v0.19 Release Jan 13, 2021
@pentschev moved this from Issue-P2 to Issue-P1 in v0.19 Release Jan 13, 2021
@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@jakirkham (Member)

There is active work here, though it's not really tracked in this issue since a few libraries are involved.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jakirkham (Member)

This is ongoing work. See issue ( #517 ) for more up-to-date status.

@Nicholas-7 added this to Issue-Needs prioritizing in v21.06 Release via automation Apr 9, 2021
@Nicholas-7 removed this from Issue-P1 in v0.19 Release Apr 9, 2021
@Nicholas-7 moved this from Issue-Needs prioritizing to Issue-P1 in v21.06 Release Apr 9, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
