
Ray is not finding GPU but TF, PyTorch and nvcc does #5940

Closed
sytelus opened this issue Oct 17, 2019 · 16 comments
Labels
bug Something that is supposed to be working; but isn't

Comments

@sytelus
Contributor

sytelus commented Oct 17, 2019

I have two NVIDIA Titan X GPUs, but Ray isn't seeing any:

ray.init(num_gpus=2)
print(ray.get_gpu_ids())
# prints []

Ray also prints the following, indicating no GPUs:

2019-10-16 18:20:17,954 INFO multi_gpu_optimizer.py:93 -- LocalMultiGPUOptimizer devices ['/cpu:0']

But TensorFlow sees all devices:

import tensorflow
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

That prints:

[name: "/device:CPU:0"
device_type: "CPU"
...
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
...
, name: "/device:GPU:0"
device_type: "GPU"
...
, name: "/device:GPU:1"
device_type: "GPU"
...
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
...
, name: "/device:XLA_GPU:1"
device_type: "XLA_GPU"
...
]

Similarly,

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Why doesn't Ray see my GPUs?

@sytelus sytelus changed the title Ray is not finding GPU but TF, PyTorch and nvcc do Ray is not finding GPU but TF, PyTorch and nvcc does Oct 17, 2019
@richardliaw richardliaw added the bug Something that is supposed to be working; but isn't label Oct 17, 2019
@sytelus
Contributor Author

sytelus commented Oct 17, 2019

Just wondering, what does Ray use to find GPUs? Maybe I can also look into this...

@ericl
Contributor

ericl commented Oct 17, 2019

Ray does not allocate GPUs unless you specify them as resources; please see: https://ray.readthedocs.io/en/latest/rllib-training.html#specifying-resources
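
For illustration, a minimal sketch of specifying GPU resources in the trainer config as the linked docs describe (the import path and environment name here are assumptions and may differ by Ray version):

import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init(num_gpus=2)

# ray.init(num_gpus=...) only declares what the machine offers; the trainer
# must still request GPUs in its config before Ray allocates any to it.
trainer = PPOTrainer(env="CartPole-v0", config={"num_gpus": 1})
print(trainer.train())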

@sytelus
Contributor Author

sytelus commented Oct 17, 2019

Thanks, that was helpful, although it's confusing. This is what happens:

Even if I explicitly init Ray with num_gpus=1, ray.get_gpu_ids() is [].

However, if I start PPOTrainer with an explicit num_gpus=1 in the config, then Ray gets the GPU. If I don't set this in the config, it doesn't.

I believe the confusing part is ray.get_gpu_ids(), which I thought returned the GPUs detected in the system. Instead, it actually returns the GPUs allocated to the current process. I think there should be a method, maybe detected_gpus(), so one can test that Ray indeed sees the GPUs and things are good to go. It would also be great if Ray just allocated GPUs to itself automatically (which should be the right behavior perhaps 99% of the time) so we don't have to worry about this additional config.
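
A minimal sketch of that distinction, assuming a single-GPU request (the task name is illustrative):

import ray

ray.init(num_gpus=2)

# In the driver process nothing has been allocated yet, so this is empty.
print(ray.get_gpu_ids())  # []

@ray.remote(num_gpus=1)
def allocated_gpus():
    # Inside a task that requested a GPU, the allocated IDs show up.
    return ray.get_gpu_ids()

print(ray.get(allocated_gpus.remote()))  # e.g. [0]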

@richardliaw
Contributor

Ray does the following to check:

import os

def _autodetect_num_gpus():
    """Attempt to detect the number of GPUs on this machine.

    TODO(rkn): This currently assumes Nvidia GPUs and Linux.

    Returns:
        The number of GPUs if any were detected, otherwise 0.
    """
    proc_gpus_path = "/proc/driver/nvidia/gpus"
    if os.path.isdir(proc_gpus_path):
        return len(os.listdir(proc_gpus_path))
    return 0

@che85

che85 commented Oct 18, 2019

I am having a similar issue. The number returned by the code in #5940 (comment) is correct, but when starting training it's telling me the following:

Initializing Ray automatically.For cluster usage or custom Ray initialization, call `ray.init(...)` before `tune.run`.
2019-10-17 20:39:19,979 WARNING worker.py:1426 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
2019-10-17 20:39:19,981 WARNING resource_spec.py:163 -- Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
2019-10-17 20:39:19,981 INFO resource_spec.py:205 -- Starting Ray with 52.49 GiB memory available for workers and up to 18.63 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/20 CPUs, 0/2 GPUs, 0.0/52.49 GiB heap, 0.0/12.84 GiB objects
Memory usage on this node: 12.3/125.4 GiB

2019-10-17 20:39:22,108 WARNING logger.py:325 -- Could not instantiate tf2_compat_logger: cannot import name 'resnet'.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 1/20 CPUs, 0/2 GPUs, 0.0/52.49 GiB heap, 0.0/12.84 GiB objects
Memory usage on this node: 13.7/125.4 GiB
Result logdir: /home/herzc/ray_results/main
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - main_0_lr=0.01:      RUNNING

(pid=27890) Warning: There's no GPU available on this machine, training will be performed on CPU.

At first, it detects GPUs but then somehow the process doesn't.

What could be a quick fix for this?

@ericl
Contributor

ericl commented Oct 18, 2019 via email

@che85

che85 commented Oct 18, 2019

@ericl Why wouldn't it automatically allocate all found GPUs unless otherwise defined?

Is this the resource? #5940 (comment)

@ericl
Contributor

ericl commented Oct 18, 2019 via email

@richardliaw
Contributor

Closing this now - thanks!

@Wormh0-le

(Quoting @sytelus's earlier comment above about ray.get_gpu_ids() returning allocated rather than detected GPUs.)

I also explicitly set num_gpus=1, but Ray still can't get a GPU, and torch.cuda.is_available() is True. Why?

@SamShowalter

I am having the same issue as @Wormh0-le. This is preventing me from training a torch policy without ray.tune, which I do not wish to use. I just want to call .train() on my agent.

@oroojlooy

I do have the same issue. I want to directly call A2CTrainer(config) where config["num_gpus"] = 2. But it returns:

self.device = self.devices[0]
IndexError: list index out of range

I checked the part of the Ray code that results in this error; there, gpu_ids = [] is generated by calling gpu_ids = ray.get_gpu_ids().
Not sure how to fix that.
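
A rough illustration of that failure mode (not the actual RLlib source; the names are illustrative):

# The torch policy builds its device list from ray.get_gpu_ids(), so when
# that call returns [] in the current process there is nothing at index 0.
gpu_ids = []  # what ray.get_gpu_ids() returned here
devices = ["cuda:{}".format(i) for i in gpu_ids]
device = devices[0]  # IndexError: list index out of range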

@danielclymer

I have the same problem as @SamShowalter and @oroojlooy. I want to train by calling agent.train() instead of using ray.tune to train. However, on line 152 of policies.torch_policy.py, the line

gpu_ids = ray.get_gpu_ids()

returns an empty list no matter what I set in the config.

@dejangrubisic

I have the same problem. Are there any updates on this topic?
torch.cuda.device_count() == 2 while ray.get_gpu_ids() == []
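
A quick diagnostic sketch along those lines, assuming PyTorch with CUDA and a fresh ray.init (the actor name is illustrative):

import ray
import torch

ray.init(num_gpus=2)

@ray.remote(num_gpus=1)
class GpuProbe:
    def check(self):
        # Inside a GPU-allocated actor the two views should agree;
        # in the driver process ray.get_gpu_ids() stays empty.
        return ray.get_gpu_ids(), torch.cuda.device_count()

probe = GpuProbe.remote()
print(ray.get(probe.check.remote()))  # e.g. ([0], 1)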

@richardliaw
Contributor

Hi folks, thanks for commenting - please open a new issue if you are continuing to see this issue.

And please also provide your CUDA driver version and a reproducible script if possible (also, using better-exceptions would be great!)

@rani-copilot

Hi All,

What is the solution to this problem? Within Ray, I too cannot see GPU utilization; it says RuntimeError: No GPU found.
Please share if anyone knows the solution to this problem.

-Rani
