
Ray is not finding GPU but TF, PyTorch and nvcc does #5940

Closed
sytelus opened this issue Oct 17, 2019 · 16 comments
Labels
bug Something that is supposed to be working; but isn't

Comments

@sytelus
Contributor

sytelus commented Oct 17, 2019

I have two NVIDIA Titan X GPUs, but Ray isn't seeing any:

ray.init(num_gpus=2)
print(ray.get_gpu_ids())
# prints []

Ray also prints the following, indicating no GPUs:

2019-10-16 18:20:17,954 INFO multi_gpu_optimizer.py:93 -- LocalMultiGPUOptimizer devices ['/cpu:0']

But TensorFlow sees all devices:

import tensorflow
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

That prints:

[name: "/device:CPU:0"
device_type: "CPU"
...
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
...
, name: "/device:GPU:0"
device_type: "GPU"
...
, name: "/device:GPU:1"
device_type: "GPU"
...
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
...
, name: "/device:XLA_GPU:1"
device_type: "XLA_GPU"
...
]

Similarly,

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Why doesn't Ray see my GPUs?

@sytelus sytelus changed the title Ray is not finding GPU but TF, PyTorch and nvcc do Ray is not finding GPU but TF, PyTorch and nvcc does Oct 17, 2019
@richardliaw richardliaw added the bug Something that is supposed to be working; but isn't label Oct 17, 2019
@sytelus
Contributor Author

sytelus commented Oct 17, 2019

Just wondering, what does Ray use to find GPUs? Maybe I can also look into this...

@ericl
Contributor

ericl commented Oct 17, 2019

Ray does not allocate GPUs unless you specify them as resources; please see: https://ray.readthedocs.io/en/latest/rllib-training.html#specifying-resources
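
For illustration, a minimal sketch of specifying GPU resources in the trainer config as the linked docs describe (the import path and environment name here are assumptions and may differ by Ray version):

import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init(num_gpus=2)

# ray.init(num_gpus=...) only declares what the machine offers; the trainer
# must still request GPUs in its config before Ray allocates any to it.
trainer = PPOTrainer(env="CartPole-v0", config={"num_gpus": 1})
print(trainer.train())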

@sytelus
Contributor Author

sytelus commented Oct 17, 2019

Thanks, that was helpful, although it's confusing. This is what happens:

Even if I explicitly init Ray with num_gpus=1, ray.get_gpu_ids() is [].

However, if I start PPOTrainer with an explicit num_gpus=1 in the config, then Ray gets the GPU. If I don't set this in the config, it doesn't.

I believe the confusing part is ray.get_gpu_ids(), which I thought returned the GPUs detected in the system. Instead, it actually returns the GPUs allocated to the current process. I think there should be a method, maybe detected_gpus(), so one can test that Ray indeed sees the GPUs and things are good to go. It would also be great if Ray just allocated GPUs to itself automatically (which should be the right behavior perhaps 99% of the time) so we don't have to worry about this additional config.
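
A minimal sketch of that distinction, assuming a single-GPU request (the task name is illustrative):

import ray

ray.init(num_gpus=2)

# In the driver process nothing has been allocated yet, so this is empty.
print(ray.get_gpu_ids())  # []

@ray.remote(num_gpus=1)
def allocated_gpus():
    # Inside a task that requested a GPU, the allocated IDs show up.
    return ray.get_gpu_ids()

print(ray.get(allocated_gpus.remote()))  # e.g. [0]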

@richardliaw
Contributor

Ray does the following to check:

import os

def _autodetect_num_gpus():
    """Attempt to detect the number of GPUs on this machine.

    TODO(rkn): This currently assumes Nvidia GPUs and Linux.

    Returns:
        The number of GPUs if any were detected, otherwise 0.
    """
    proc_gpus_path = "/proc/driver/nvidia/gpus"
    if os.path.isdir(proc_gpus_path):
        return len(os.listdir(proc_gpus_path))
    return 0

@che85

che85 commented Oct 18, 2019

I am having a similar issue. The number returned by the code in #5940 (comment) is correct, but when starting training it's telling me the following:

Initializing Ray automatically.For cluster usage or custom Ray initialization, call `ray.init(...)` before `tune.run`.
2019-10-17 20:39:19,979 WARNING worker.py:1426 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
2019-10-17 20:39:19,981 WARNING resource_spec.py:163 -- Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
2019-10-17 20:39:19,981 INFO resource_spec.py:205 -- Starting Ray with 52.49 GiB memory available for workers and up to 18.63 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/20 CPUs, 0/2 GPUs, 0.0/52.49 GiB heap, 0.0/12.84 GiB objects
Memory usage on this node: 12.3/125.4 GiB

2019-10-17 20:39:22,108 WARNING logger.py:325 -- Could not instantiate tf2_compat_logger: cannot import name 'resnet'.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 1/20 CPUs, 0/2 GPUs, 0.0/52.49 GiB heap, 0.0/12.84 GiB objects
Memory usage on this node: 13.7/125.4 GiB
Result logdir: /home/herzc/ray_results/main
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - main_0_lr=0.01:      RUNNING

(pid=27890) Warning: There's no GPU available on this machine, training will be performed on CPU.

At first, it detects GPUs but then somehow the process doesn't.

What could be a quick fix for this?

@ericl
Contributor

ericl commented Oct 18, 2019 via email

@che85

che85 commented Oct 18, 2019

@ericl Why wouldn't it automatically allocate all found GPUs unless otherwise defined?

Is this the resource? #5940 (comment)

@ericl
Contributor

ericl commented Oct 18, 2019 via email

@richardliaw
Contributor

Closing this now - thanks!

@Wormh0-le

(Quoting @sytelus's earlier comment above about ray.get_gpu_ids() returning allocated rather than detected GPUs.)

I also explicitly set num_gpus=1, but Ray still can't get a GPU, and torch.cuda.is_available() is True. Why?

@SamShowalter

I am having the same issue as @Wormh0-le. This is preventing me from training a torch policy without ray.tune, which I do not wish to use. I just want to call .train() on my agent.

@oroojlooy

I do have the same issue. I want to directly call A2CTrainer(config) where config["num_gpus"] = 2. But it returns:

self.device = self.devices[0]
IndexError: list index out of range

I checked the part of the Ray code that results in this error; there, gpu_ids = [] is generated by calling gpu_ids = ray.get_gpu_ids().
Not sure how to fix that.
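
A rough illustration of that failure mode (not the actual RLlib source; the names are illustrative):

# The torch policy builds its device list from ray.get_gpu_ids(), so when
# that call returns [] in the current process there is nothing at index 0.
gpu_ids = []  # what ray.get_gpu_ids() returned here
devices = ["cuda:{}".format(i) for i in gpu_ids]
device = devices[0]  # IndexError: list index out of range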

@danielclymer

I have the same problem as @SamShowalter and @oroojlooy. I want to train by calling agent.train() instead of using ray.tune to train. However, on line 152 of policies.torch_policy.py, the line

gpu_ids = ray.get_gpu_ids()

returns an empty list no matter what I set in the config.

@dejangrubisic

I have the same problem. Are there any updates on this topic?
torch.cuda.device_count() == 2 while ray.get_gpu_ids() == []
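
A quick diagnostic sketch along those lines, assuming PyTorch with CUDA and a fresh ray.init (the actor name is illustrative):

import ray
import torch

ray.init(num_gpus=2)

@ray.remote(num_gpus=1)
class GpuProbe:
    def check(self):
        # Inside a GPU-allocated actor the two views should agree;
        # in the driver process ray.get_gpu_ids() stays empty.
        return ray.get_gpu_ids(), torch.cuda.device_count()

probe = GpuProbe.remote()
print(ray.get(probe.check.remote()))  # e.g. ([0], 1)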

@richardliaw
Contributor

Hi folks, thanks for commenting - please open a new issue if you are continuing to see this issue.

And please also provide your CUDA driver version and a reproducible script if possible (also, using better-exceptions would be great!)

@rani-copilot

Hi All,

What is the solution to this problem? Within Ray, I too cannot see GPU utilization; it says RuntimeError: No GPU found.
Please share if anyone knows the solution to this problem.

-Rani
