[Core] [Bug] Failed to register worker to Raylet for single node, multi-GPU #21226
Comments
I get the same message at one point, and 0% GPU utilization when using APEX_DDPG-torch on a 4 GPU/128 CPU node. After a few sampling iterations showing 'RUNNING' (and seen on the CPUs via htop), the run crashes with a Dequeue timeout.
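(For context, a rough sketch of the kind of run described in this comment; the environment name, worker count, and GPU count below are placeholders, not the exact config that crashed.)

```python
from ray import tune

# Sketch only: APEX-DDPG with the torch framework on one multi-GPU node.
# "Pendulum-v1", num_workers, and num_gpus are placeholder values.
tune.run(
    "APEX_DDPG",
    config={
        "env": "Pendulum-v1",
        "framework": "torch",
        "num_gpus": 1,       # learner GPU
        "num_workers": 32,   # CPU rollout workers
    },
    stop={"training_iteration": 10},
)
```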
Btw the default timeout is 30, so you should experiment with values like 60. @iycheng maybe you can take a look at this? I think it could be related to our recent GCS changes, or there's a failure during worker initialization (which could also be related to recent changes).
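(A minimal sketch of bumping that timeout on a single-node setup where the driver starts Ray itself; the value 60 and the placement before `ray.init()` are assumptions, not a confirmed fix.)

```python
import os

# Sketch: raise the worker registration timeout from the default 30s to 60s.
# Set it before ray.init() so the Ray processes launched by the driver
# inherit the environment variable.
os.environ["RAY_worker_register_timeout_seconds"] = "60"

import ray
ray.init()
```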
Any progress on this? I'm having the same issue.
I'm experiencing a similar error in issue #25834.
Hi, I'm a bot from the Ray team :) To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity within 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
I am also having this issue. `ray.init` hangs forever. It happens 4 out of 5 times; sometimes it works.
Looks like a P1. I'm putting this into the Core team backlog; let's discuss how to fix it.
Search before asking
Ray Component
Ray Tune
What happened + What you expected to happen
I am trying to run the official tutorial for PyTorch Lightning. It works fine on a single GPU, but fails when the requested resources per trial are more than one GPU.
This is on a single node/machine that has 4 GPUs attached. Based on PyTorch Lightning's trainer, I would expect Ray to be able to distribute trials across all the available GPUs when they are requested as resources.
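(For illustration, a minimal sketch of the resource request involved, using the `tune.run` API; the trainable body, CPU/GPU numbers, and search space are placeholders rather than the tutorial's exact code.)

```python
from ray import tune

def train_mnist(config):
    # Placeholder trainable: the real tutorial builds a
    # pl.Trainer(gpus=<GPUs per trial>, ...) and fits a LightningModule here.
    tune.report(loss=0.0)

# One GPU per trial works; the failure described above appears when
# resources_per_trial requests more than one GPU on the same 4-GPU node.
tune.run(
    train_mnist,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    resources_per_trial={"cpu": 8, "gpu": 2},
    num_samples=4,
)
```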
Versions / Dependencies
System
requirements.txt
Reproduction script
tutorial.py
Anything else
Based on this discussion post, I tried setting the environment variable RAY_worker_register_timeout_seconds, but it does not fix the issue.
cc @ericl @rkooo567 @iycheng (from the request on #8890)
Are you willing to submit a PR?