
UserWarning: A CUDA context for device 0 already exists #867

Closed
randerzander opened this issue Mar 2, 2022 · 14 comments

@randerzander
Contributor

With the latest dask_cudf and dask-cuda nightlies, if I've already imported dask_cudf by the time I start my LocalCUDACluster, I get a warning:

/home/rgelhausen/conda/envs/dsql-3-02/lib/python3.9/site-packages/distributed-2022.2.1+5.gbe4fc7f7-py3.9.egg/distributed/comm/ucx.py:83: UserWarning: A CUDA context for device 0 already exists on process ID 1180570. This is often the result of a CUDA-enabled library calling a CUDA runtime function before Dask-CUDA can spawn worker processes. Please make sure any such function calls don't happen at import time or in the global scope of a program.
  warnings.warn(

nvidia-smi shows I do indeed have two processes using GPU 0:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1180570      C   ...envs/dsql-3-02/bin/python      307MiB |
|    0   N/A  N/A   1181113      C   ...envs/dsql-3-02/bin/python    28303MiB |
|    1   N/A  N/A   1181221      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    2   N/A  N/A   1181213      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    3   N/A  N/A   1181210      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    4   N/A  N/A   1181501      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    5   N/A  N/A   1181127      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    6   N/A  N/A   1181276      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    7   N/A  N/A   1181180      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    8   N/A  N/A   1181150      C   ...envs/dsql-3-02/bin/python    27995MiB |
|    9   N/A  N/A   1181311      C   ...envs/dsql-3-02/bin/python    27995MiB |
|   10   N/A  N/A   1181246      C   ...envs/dsql-3-02/bin/python    27995MiB |
|   11   N/A  N/A   1181168      C   ...envs/dsql-3-02/bin/python    28303MiB |
|   12   N/A  N/A   1181270      C   ...envs/dsql-3-02/bin/python    27995MiB |
|   13   N/A  N/A   1181494      C   ...envs/dsql-3-02/bin/python    27995MiB |
|   14   N/A  N/A   1181184      C   ...envs/dsql-3-02/bin/python    27995MiB |
|   15   N/A  N/A   1181177      C   ...envs/dsql-3-02/bin/python    28303MiB |
+-----------------------------------------------------------------------------+

I'm not sure whether this is new dask_cudf behavior, or whether dask-cuda has just started detecting and warning about it.

@pentschev
Member

This must be new dask_cudf behavior; this warning was introduced specifically to catch this kind of change.

@quasiben @shwina any ideas if anything may have changed in cudf regarding CUDA context creation? It may be relevant to mention that we have

# Required by RAPIDS libraries (e.g., cuDF) to ensure no context
# initialization happens before we can set CUDA_VISIBLE_DEVICES
os.environ["RAPIDS_NO_INITIALIZE"] = "True"
to tell cuDF not to create a CUDA context at import time, which would be my first suspect for any behavior change.
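
For what it's worth, a quick way to check whether an import creates a context in the current process could look roughly like the sketch below. This is purely illustrative (it assumes cuda-python's driver bindings and their usual (err, *outputs) return convention), not something Dask-CUDA ships:

from cuda import cuda

def has_primary_context(device_index=0):
    # cuInit initializes the driver API but does not create a context.
    (err,) = cuda.cuInit(0)
    err, dev = cuda.cuDeviceGet(device_index)
    # 'active' reports whether this process already holds an active
    # primary context on the device.
    err, flags, active = cuda.cuDevicePrimaryCtxGetState(dev)
    return bool(active)

print(has_primary_context())  # False before anything touches CUDA
import cudf                   # noqa: E402
print(has_primary_context())  # True if the import created a context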

@quasiben
Member

quasiben commented Mar 2, 2022

There has been some warning cleanup by @bdice recently, but that should be moving things in the direction of fewer warnings, not more.

@bdice
Contributor

bdice commented Mar 2, 2022

My warning cleanup has been limited to the cuDF API thus far, and I'm particularly focusing on FutureWarnings that indicate API changes. I hope that this helps the theme of improving debugging and the user experience -- complementary work to address things like this issue on the dask side will also be needed. Relevant issues: rapidsai/cudf#10363, rapidsai/cudf#9999

@pentschev
Member

pentschev commented Mar 2, 2022

The warning from the description of this issue is raised by Dask-CUDA, so I don't think warnings in cuDF are relevant. The relevant part here is CUDA context management.

@github-actions

github-actions bot commented Apr 1, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@charlesbluca
Member

charlesbluca commented Apr 20, 2022

Did some digging into this - it looks like the issue here stems from the fact that, as of 22.02, cuDF's validate_setup() call, which runs on import, creates a CUDA context.

In particular, the CUDA context is created on the call to rmm._cuda.gpu.getDeviceCount, which leads me to believe that this might've been caused by the switch to cuda-python in RMM with rapidsai/rmm#930?

EDIT:

Hmm, that doesn't explain the warning though - in my reproducer using stable 22.02 packages:

import dask_cudf
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster()

I end up with two CUDA contexts on device 0, but don't get any warnings about a CUDA context already being created on one of the worker processes.
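
One way to confirm that programmatically, rather than reading nvidia-smi by eye, is to list the compute processes on device 0 with pynvml (a rough sketch; it assumes pynvml is installed and only approximates the nvidia-smi output):

import os
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Each entry is a process currently holding a CUDA context on device 0;
# if the parent (client) PID shows up here, it has created a context.
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    print(proc.pid, proc.usedGpuMemory, proc.pid == os.getpid())

pynvml.nvmlShutdown()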

@pentschev
Member

In particular, the CUDA context is created on the call to rmm._cuda.gpu.getDeviceCount, which leads me to believe that this might've been caused by the switch to cuda-python in RMM with rapidsai/rmm#930?

validate_setup() respects RAPIDS_NO_INITIALIZE in https://github.com/rapidsai/cudf/blob/5f6b70a1dfe000c0ac16536c507b0a7bbe6a9efc/python/cudf/cudf/utils/gpu_utils.py#L7-L14, which we do set in https://github.com/rapidsai/dask-cuda/blob/branch-22.06/dask_cuda/local_cuda_cluster.py#L217 .

But in @charlesbluca's example from #867 (comment), the import dask_cudf at the top runs with no RAPIDS_NO_INITIALIZE set and does cause the context to be created. We avoid that inside LocalCUDACluster, but we can't if the user imports something before the cluster is created. So in this case the warning is legitimate, and I would argue the behavior here is correct and that we should close this issue.

For now, users simply have to be mindful not to import anything that creates a CUDA context before the cluster is created, and that includes importing dask_cudf (or just cudf). I don't know whether @randerzander's original warning came from a case where cuDF was imported at the top, since there's no reproducer, but if it did, we don't currently have a better solution for this problem. I would advise moving such imports to after LocalCUDACluster is created, or setting RAPIDS_NO_INITIALIZE (though that doesn't guarantee a non-RAPIDS library won't create a CUDA context).
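
To make the first suggestion concrete, the import-order workaround could look roughly like this (illustrative only; the point is just that dask_cudf is imported only after the cluster exists):

from dask_cuda import LocalCUDACluster
from distributed import Client

if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Importing dask_cudf only after the cluster is up means this (client)
    # process has not created a CUDA context at cluster startup time.
    import dask_cudf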

@charlesbluca
Member

validate_setup() respects RAPIDS_NO_INITIALIZE

Right - this only occurs because that variable hasn't been set, either through Dask-CUDA's cluster initialization or manually. It might be worth adding documentation to Dask-CUDA encouraging use of this environment variable, as it seems that in the past, issues with context management have created larger problems (xref rapidsai/cudf#4827)

So in this case the warning is legitimate, and I would argue the behavior here is correct and that we should close this issue.

I would argue the warning here is still partially unexplained - we understand why cuDF is now creating a CUDA context when it once wasn't, but we still don't know why this process ends up being used for the worker assigned to device 0 (unless I'm missing some configuration option where something like this would be trivially possible)

@pentschev
Member

we still don't know why this process ends up being used for the worker assigned to device 0 (unless I'm missing some configuration option where something like this would be trivially possible)

This is the parent Python process; it's not used by any worker, but only to spawn the workers, including another process that will then use device 0 and thus trigger the warning.

@pentschev
Member

Actually, to be more precise, this process is also the same one that will eventually be used by the client (besides spawning the LocalCUDACluster workers).

@charlesbluca
Member

Thanks for the clarification @pentschev 🙂 I now understand that the warnings are the result of dask-cuda attempting to initialize a CUDA context on the parent process for UCX purposes - the following reproducer does give the warnings that @randerzander encountered:

import dask_cudf
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster(protocol="ucx")

I think the best solution here is to add documentation for RAPIDS_NO_INITIALIZE, which should resolve the warnings when used in this case, along with a general warning about using libraries that create a CUDA context before starting a UCX-enabled cluster.
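
For reference, applying that workaround to the reproducer above would look roughly like this (a sketch; the variable has to be set before dask_cudf is imported):

import os

# Prevent cuDF from creating a CUDA context at import time; this must
# happen before the dask_cudf import below.
os.environ["RAPIDS_NO_INITIALIZE"] = "True"

import dask_cudf  # noqa: E402
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster(protocol="ucx")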

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@charlesbluca
Member

charlesbluca commented May 23, 2022

I've added documentation for RAPIDS_NO_INITIALIZE in #898, which can be used in this case to resolve the warnings.

Is there anything else we wanted to discuss here or can we close this issue?

@pentschev
Member

I think we're good to close, I'll tentatively do that and if there's more left, please feel free to reopen.
