
UserWarning: A CUDA context for device 0 already exists #867

Closed
randerzander opened this issue Mar 2, 2022 · 14 comments

@randerzander
Contributor

With the latest dask_cudf and dask-cuda nightlies, if I've already imported dask_cudf by the time I start my LocalCUDACluster, I get a warning:

/home/rgelhausen/conda/envs/dsql-3-02/lib/python3.9/site-packages/distributed-2022.2.1+5.gbe4fc7f7-py3.9.egg/distributed/comm/ucx.py:83: UserWarning: A CUDA context for device 0 already exists on process ID 1180570. This is often the result of a CUDA-enabled library calling a CUDA runtime function before Dask-CUDA can spawn worker processes. Please make sure any such function calls don't happen at import time or in the global scope of a program.
  warnings.warn(

nvidia-smi shows I do indeed have two processes using GPU 0:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1180570      C   ...envs/dsql-3-02/bin/python      307MiB |
|    0   N/A  N/A   1181113      C   ...envs/dsql-3-02/bin/python    28303MiB |
|    1   N/A  N/A   1181221      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    2   N/A  N/A   1181213      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    3   N/A  N/A   1181210      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    4   N/A  N/A   1181501      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    5   N/A  N/A   1181127      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    6   N/A  N/A   1181276      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    7   N/A  N/A   1181180      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    8   N/A  N/A   1181150      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    9   N/A  N/A   1181311      C   ...envs/dsql-3-02/bin/python    27995MiB |
|   10   N/A  N/A   1181246      C   ...envs/dsql-3-02/bin/python    27995MiB |
|   11   N/A  N/A   1181168      C   ...envs/dsql-3-02/bin/python    28303MiB |
|   12   N/A  N/A   1181270      C   ...envs/dsql-3-02/bin/python    27995MiB |
|   13   N/A  N/A   1181494      C   ...envs/dsql-3-02/bin/python    27995MiB |
|   14   N/A  N/A   1181184      C   ...envs/dsql-3-02/bin/python    27995MiB |
|   15   N/A  N/A   1181177      C   ...envs/dsql-3-02/bin/python    28303MiB |
+-----------------------------------------------------------------------------+

I'm not sure whether this is new dask_cudf behavior, or whether dask-cuda has just started detecting and warning about it.

@pentschev
Member

This must be new dask_cudf behavior; this warning was introduced specifically to catch this kind of change.

@quasiben @shwina any ideas if anything may have changed in cudf regarding CUDA context creation? It may be relevant to mention that we have

# Required by RAPIDS libraries (e.g., cuDF) to ensure no context
# initialization happens before we can set CUDA_VISIBLE_DEVICES
os.environ["RAPIDS_NO_INITIALIZE"] = "True"
to tell cuDF not to create a CUDA context at import time, which would be my first suspect for any behavior change.
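
For what it's worth, a quick way to check whether an import creates a context in the current process could look roughly like the sketch below. This is purely illustrative (it assumes cuda-python's driver bindings and their usual (err, *outputs) return convention), not something Dask-CUDA ships:

from cuda import cuda

def has_primary_context(device_index=0):
    # cuInit initializes the driver API but does not create a context.
    (err,) = cuda.cuInit(0)
    err, dev = cuda.cuDeviceGet(device_index)
    # 'active' reports whether this process already holds an active
    # primary context on the device.
    err, flags, active = cuda.cuDevicePrimaryCtxGetState(dev)
    return bool(active)

print(has_primary_context())  # False before anything touches CUDA
import cudf                   # noqa: E402
print(has_primary_context())  # True if the import created a context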

@quasiben
Member

quasiben commented Mar 2, 2022

There has been some warning cleanup by @bdice recently, but that should be moving things in the direction of fewer warnings, not more.

@bdice
Contributor

bdice commented Mar 2, 2022

My warning cleanup has been limited to the cuDF API thus far, and I'm particularly focusing on FutureWarnings that indicate API changes. I hope that this helps the theme of improving debugging and the user experience -- complementary work to address things like this issue on the dask side will also be needed. Relevant issues: rapidsai/cudf#10363, rapidsai/cudf#9999

@pentschev
Member

pentschev commented Mar 2, 2022

The warning from the description of this issue is raised by Dask-CUDA, so I don't think warnings in cuDF are relevant. The relevant part here is CUDA context management.

@github-actions

github-actions bot commented Apr 1, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@charlesbluca
Member

charlesbluca commented Apr 20, 2022

Did some digging into this - it looks like the issue here stems from the fact that, as of 22.02, cuDF's validate_setup() call, which runs on import, creates a CUDA context.

In particular, the CUDA context is created on the call to rmm._cuda.gpu.getDeviceCount, which leads me to believe that this might've been caused by the switch to cuda-python in RMM with rapidsai/rmm#930?

EDIT:

Hmm, that doesn't explain the warning though - in my reproducer using stable 22.02 packages:

import dask_cudf
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster()

I end up with two CUDA contexts on device 0, but don't get any warnings about a CUDA context already being created on one of the worker processes.
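
One way to confirm that programmatically, rather than reading nvidia-smi by eye, is to list the compute processes on device 0 with pynvml (a rough sketch; it assumes pynvml is installed and only approximates the nvidia-smi output):

import os
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Each entry is a process currently holding a CUDA context on device 0;
# if the parent (client) PID shows up here, it has created a context.
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    print(proc.pid, proc.usedGpuMemory, proc.pid == os.getpid())

pynvml.nvmlShutdown()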

@pentschev
Member

In particular, the CUDA context is created on the call to rmm._cuda.gpu.getDeviceCount, which leads me to believe that this might've been caused by the switch to cuda-python in RMM with rapidsai/rmm#930?

validate_setup() respects RAPIDS_NO_INITIALIZE in https://github.com/rapidsai/cudf/blob/5f6b70a1dfe000c0ac16536c507b0a7bbe6a9efc/python/cudf/cudf/utils/gpu_utils.py#L7-L14, which we do set in https://github.com/rapidsai/dask-cuda/blob/branch-22.06/dask_cuda/local_cuda_cluster.py#L217 .

But in @charlesbluca's example from #867 (comment), the import dask_cudf at the top runs with no RAPIDS_NO_INITIALIZE set and does cause the context to be created. We avoid that inside LocalCUDACluster, but we can't if the user imports something before the cluster is created. So in this case the warning is legitimate, and I would argue the behavior here is correct and that we should close this issue.

For now, users simply have to be mindful not to import anything that creates a CUDA context before the cluster is created, and that includes importing dask_cudf (or just cudf). I don't know whether @randerzander's original warning came from a case where cuDF was imported at the top, since there's no reproducer, but if it did, we don't currently have a better solution for this problem. I would advise moving such imports to after LocalCUDACluster is created, or setting RAPIDS_NO_INITIALIZE (though that doesn't guarantee a non-RAPIDS library won't create a CUDA context).
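
To make the first suggestion concrete, the import-order workaround could look roughly like this (illustrative only; the point is just that dask_cudf is imported only after the cluster exists):

from dask_cuda import LocalCUDACluster
from distributed import Client

if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Importing dask_cudf only after the cluster is up means this (client)
    # process has not created a CUDA context at cluster startup time.
    import dask_cudf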

@charlesbluca
Member

validate_setup() respects RAPIDS_NO_INITIALIZE

Right - this only occurs because that variable hasn't been set, either through Dask-CUDA's cluster initialization or manually. It might be worth adding documentation to Dask-CUDA encouraging use of this environment variable, as it seems that in the past, issues with context management have created larger problems (xref rapidsai/cudf#4827)

So in this case the warning is legitimate, and I would argue the behavior here is correct and that we should close this issue.

I would argue the warning here is still partially unexplained - we understand why cuDF is now creating a CUDA context when it once wasn't, but we still don't know why this process ends up being used for the worker assigned to device 0 (unless I'm missing some configuration option where something like this would be trivially possible)

@pentschev
Member

we still don't know why this process ends up being used for the worker assigned to device 0 (unless I'm missing some configuration option where something like this would be trivially possible)

This is the parent Python process; it's not used by any worker, but only to spawn the workers, including another process that will then use device 0 and thus trigger the warning.

@pentschev
Member

Actually, to be more precise, this process is also the same one that will eventually be used by the client (besides spawning the LocalCUDACluster workers).

@charlesbluca
Member

Thanks for the clarification @pentschev 🙂 I now understand that the warnings are the result of dask-cuda attempting to initialize a CUDA context on the parent process for UCX purposes - the following reproducer does give the warnings that @randerzander encountered:

import dask_cudf
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster(protocol="ucx")

I think the best solution here is to add documentation for RAPIDS_NO_INITIALIZE, which should resolve the warnings when used in this case, along with a general warning about using libraries that create a CUDA context before starting a UCX-enabled cluster.
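
For reference, applying that workaround to the reproducer above would look roughly like this (a sketch; the variable has to be set before dask_cudf is imported):

import os

# Prevent cuDF from creating a CUDA context at import time; this must
# happen before the dask_cudf import below.
os.environ["RAPIDS_NO_INITIALIZE"] = "True"

import dask_cudf  # noqa: E402
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster(protocol="ucx")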

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@charlesbluca
Member

charlesbluca commented May 23, 2022

I've added documentation for RAPIDS_NO_INITIALIZE in #898, which can be used in this case to resolve the warnings.

Is there anything else we wanted to discuss here or can we close this issue?

@pentschev
Member

I think we're good to close, I'll tentatively do that and if there's more left, please feel free to reopen.
