Ray is not finding GPUs, but TF, PyTorch, and nvcc do #5940
Comments
Just wondering, what does Ray use to find GPUs? Maybe I can also look into this... |
Ray does not allocate GPUs unless you specify them as resources, please see: https://ray.readthedocs.io/en/latest/rllib-training.html#specifying-resources |
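To make the "specify them as resources" point concrete, here is a minimal sketch of an RLlib trainer config that requests GPUs explicitly, with a small helper computing the total ask. Key names follow the Ray docs of this era (`num_gpus`, `num_workers`, `num_gpus_per_worker`); check your version's documentation, as exact keys may differ.

```python
# Sketch only: an RLlib-style config that requests resources explicitly.
# Key names are per the Ray 0.7.x-era docs and may vary by version.
config = {
    "num_gpus": 1,             # GPUs reserved for the trainer (driver) process
    "num_workers": 2,          # number of rollout workers
    "num_gpus_per_worker": 0,  # GPUs reserved for each rollout worker
}

def requested_gpus(config):
    """Total GPUs this trainer configuration asks Ray to allocate."""
    return config["num_gpus"] + config["num_workers"] * config["num_gpus_per_worker"]

print(requested_gpus(config))  # 1
```

Without these keys, the defaults request zero GPUs, which is why training falls back to CPU even on a machine with working CUDA.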
Thanks, that was helpful, although it's confusing. This is what happens: even if I explicitly init Ray with num_gpus=1, … However, if I start PPOTrainer with explicit … I believe the confusing part is … |
Ray does the following to check:
|
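The snippet referenced above did not survive the page capture. As a paraphrased sketch (based on the `_autodetect_num_gpus` helper in Ray of this era, not the exact source), Ray's Linux autodetection counted the per-device directories the NVIDIA kernel driver exposes under `/proc`:

```python
import os

# Paraphrased sketch of Ray's Linux GPU autodetection around this release:
# count the per-device entries the NVIDIA kernel driver creates under /proc.
def autodetect_num_gpus():
    proc_gpus_path = "/proc/driver/nvidia/gpus"
    if os.path.isdir(proc_gpus_path):
        return len(os.listdir(proc_gpus_path))
    return 0

print(autodetect_num_gpus())
```

Note this relies on the kernel driver being loaded; nvcc merely proves the CUDA toolkit is installed, so nvcc can "see" CUDA on a machine where this check returns 0 (e.g., in a container started without GPU passthrough).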
I am having a similar issue. The number returned by the code in #5940 (comment) is correct, but when starting training it's telling me the following:
At first, it detects GPUs but then somehow the process doesn't. What could be a quick fix for this? |
Ray is detecting the GPUs but not allocating them since your code didn't
request any for the trials (that's why it says 0/2 GPUs requested).
The Tune API provides a way to request GPUs per trial.
On Thu, Oct 17, 2019, 5:42 PM Christian Herz wrote:
I am having a similar issue. The number returned by the code in #5940 (comment) is correct, but when starting training it's telling me the following:
Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `tune.run`.
2019-10-17 20:39:19,979 WARNING worker.py:1426 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
2019-10-17 20:39:19,981 WARNING resource_spec.py:163 -- Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
2019-10-17 20:39:19,981 INFO resource_spec.py:205 -- Starting Ray with 52.49 GiB memory available for workers and up to 18.63 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/20 CPUs, 0/2 GPUs, 0.0/52.49 GiB heap, 0.0/12.84 GiB objects
Memory usage on this node: 12.3/125.4 GiB
2019-10-17 20:39:22,108 WARNING logger.py:325 -- Could not instantiate tf2_compat_logger: cannot import name 'resnet'.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 1/20 CPUs, 0/2 GPUs, 0.0/52.49 GiB heap, 0.0/12.84 GiB objects
Memory usage on this node: 13.7/125.4 GiB
Result logdir: /home/herzc/ray_results/main
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
- main_0_lr=0.01: RUNNING
(pid=27890) Warning: There's no GPU available on this machine, training will be performed on CPU.
At first, it detects GPUs but then somehow the process doesn't.
What could be a quick fix for this?
|
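The status lines in the log above ("Resources requested: 1/20 CPUs, 0/2 GPUs") can be read as a simple capacity calculation. A toy illustration (not Ray code) of how many trials fit once each one actually requests resources, using the cluster from the log:

```python
def max_concurrent_trials(total_cpus, total_gpus, cpus_per_trial, gpus_per_trial):
    """How many trials can run at once given cluster capacity."""
    fits_cpu = total_cpus // cpus_per_trial
    if gpus_per_trial == 0:
        return fits_cpu  # GPUs are not a constraint if none are requested
    return min(fits_cpu, total_gpus // gpus_per_trial)

# The cluster in the log has 20 CPUs and 2 GPUs.
print(max_concurrent_trials(20, 2, 1, 0))  # 20 trials, but none gets a GPU
print(max_concurrent_trials(20, 2, 1, 1))  # 2 trials, each pinned to one GPU
```

With the default request of 0 GPUs per trial, the scheduler happily runs every trial on CPU, which is exactly the "There's no GPU available" warning seen above.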
@ericl Why wouldn't it automatically allocate all found GPUs unless otherwise configured? Is this the resource? #5940 (comment) |
How would it know how many GPUs to give to each trial?
Please see
https://ray.readthedocs.io/en/latest/tune-usage.html#trial-parallelism
|
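For Tune specifically, the per-trial request mentioned in the linked docs is passed to `tune.run` as a plain dict. A hedged sketch (parameter name `resources_per_trial` per the Tune docs of this era; the `tune.run` call itself is left commented since it needs a Ray installation):

```python
# Sketch only: tells Tune what each trial should reserve.
# With the 2-GPU machine from this thread, this yields 2 concurrent trials.
resources_per_trial = {"cpu": 1, "gpu": 1}

# from ray import tune
# tune.run(my_trainable, resources_per_trial=resources_per_trial)
```

This answers the "how would it know how many GPUs to give each trial?" question: it doesn't guess; the user declares the split, and Tune divides the cluster accordingly.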
Closing this now - thanks! |
I am having the same issue as @Wormh0-le. This is preventing me from training a torch policy without ray.tune, which I do not wish to use. I just want to call .train() on my agent. |
I do have the same issue. I want to directly call
I checked the part of the ray code that results in this error, and there |
I have the same problem as @SamShowalter and @oroojlooy. I want to train by calling agent.train() instead of using ray.tune. However, on line 152 of policies.torch_policy.py, the line gpu_ids = ray.get_gpu_ids() returns an empty list no matter what I set in the config. |
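The empty list is consistent with Ray's resource model: `ray.get_gpu_ids()` returns GPU IDs only inside a task or actor that declared `num_gpus` in its `@ray.remote` decorator, not in the driver process. A toy model of that semantics (illustration only, not Ray source):

```python
def toy_get_gpu_ids(num_gpus_declared, cluster_gpu_ids):
    """Mimics the semantics of ray.get_gpu_ids(): a process only 'sees'
    the GPUs explicitly reserved for it, regardless of what the machine has."""
    return cluster_gpu_ids[:num_gpus_declared]

# In the driver, or in a task declared with num_gpus=0: nothing allocated.
print(toy_get_gpu_ids(0, [0, 1]))  # []
# In a task declared with @ray.remote(num_gpus=1): one GPU allocated.
print(toy_get_gpu_ids(1, [0, 1]))  # [0]
```

So even on a 2-GPU machine, a policy constructed in the driver sees no allocated GPUs unless the enclosing process requested them.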
I have the same problem. Are there any updates on this topic? |
Hi folks, thanks for commenting. Please open a new issue if you are still seeing this. Please also provide your CUDA driver version and a reproducible script if possible (using `better-exceptions` would also be great!). |
Hi all, what is the solution to this problem? Within Ray I too cannot see GPU utilization; it says RuntimeError: No GPU found. -Rani |
I have two NVIDIA TitanX but Ray isn't seeing any:
Ray also prints the below, indicating no GPUs:
2019-10-16 18:20:17,954 INFO multi_gpu_optimizer.py:93 -- LocalMultiGPUOptimizer devices ['/cpu:0']
But TensorFlow sees all devices:
That prints:
Similarly,
Why doesn't Ray see my GPUs?
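One way to cross-check a setup like this (a sketch under stated assumptions, not from the thread): compare what the NVIDIA kernel driver exposes under `/proc` with what `nvidia-smi` reports, since Ray's autodetection reads the former, while a working nvcc only proves the CUDA toolkit is installed. If `/proc/driver/nvidia/gpus` is empty or missing, Ray will see zero GPUs even when TF or PyTorch were built with CUDA support.

```python
import os
import shutil
import subprocess

def gpus_from_proc():
    """GPUs visible via the NVIDIA kernel driver (what Ray autodetection reads)."""
    path = "/proc/driver/nvidia/gpus"
    return len(os.listdir(path)) if os.path.isdir(path) else 0

def gpus_from_nvidia_smi():
    """GPUs reported by nvidia-smi, or None if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if out.returncode != 0:
        return None
    return len([line for line in out.stdout.splitlines() if line.startswith("GPU ")])

print(gpus_from_proc(), gpus_from_nvidia_smi())
```

A mismatch between the two (or a 0/None result on a machine with TitanX cards) points at a driver or container-visibility problem rather than a Ray bug.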