[Ray Cluster] Cannot connect to head after submitting jobs #29696
Comments
Can you try increasing the ulimit on open files? Example: https://discuss.ray.io/t/setting-ulimits-on-ec2-instances/590
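For reference, a minimal sketch of checking and raising the per-process limit from Python (assuming the hard limit is already large enough; on a cluster the system-wide approach in the linked thread is the real fix, since the daemons launched by ray start inherit whatever limit is in place when they start):

```python
import resource

# Current per-process limits on open file descriptors: (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit up to the hard limit for this process and its children.
# Raising the hard limit itself requires root or a system-wide setting.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```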
Thank you. That seems to have fixed the issue, and I haven’t seen the failure again so far. It’s unclear to me which parameter the number of open files depends on. The number of tasks? Or the number of workers? I currently set it to 65535, but do I need to increase it further as I scale up? And which directory contains these files so I can inspect them? It would be great if you could add this information to the documentation or point me to an existing page.
AFAIK there's no easy answer here because Linux uses file descriptors (FDs) for everything. So it could be TCP connections, Unix domain sockets, actual files on disk, shared-memory queues, etc. If you want to see the list of open files, you can use lsof. cc @jjyao do we have docs that guide users on this? I'm happy to create a PR; where should it go?
Yeah, do you mind checking what those open files actually are and sharing the result? Currently we only have a guide on how to increase the limit: https://docs.ray.io/en/releases-2.0.1/ray-core/troubleshooting.html#crashes
Previously the soft limit was 1024. I grouped the lines of the lsof output by the owning process and counted them; gcs_server accounted for by far the most open files.
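For anyone who wants to reproduce that kind of per-process count without lsof, the open descriptors can also be counted straight from /proc (a rough sketch, Linux only; the process-name filters are just the names Ray components typically run under):

```python
import os

def open_fd_count(pid: int) -> int:
    """Number of open file descriptors for one process (Linux /proc)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

# Report fd counts for Ray-related processes owned by the current user.
for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue
    pid = int(entry)
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        if any(name in cmdline for name in ("gcs_server", "raylet", "ray::")):
            print(f"{pid}\t{open_fd_count(pid)}\t{cmdline.strip()[:60]}")
    except (FileNotFoundError, PermissionError, ProcessLookupError):
        continue  # process exited, or is owned by another user
```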
Yeah, 1024 is probably too low. So from the log, gcs_server opens 14220 fds. Could you paste what those 14220 fds are? Are they sockets, files, etc.?
The vast majority of the fds are of type IPv6, i.e. network sockets.
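A sketch of how those fds can be tallied by kind, again via /proc rather than lsof (Linux only; pass whatever pid the gcs_server process has):

```python
import os
from collections import Counter

def fd_kinds(pid: int) -> Counter:
    """Tally what each open descriptor of a process points at (Linux /proc)."""
    kinds = Counter()
    for fd in os.listdir(f"/proc/{pid}/fd"):
        try:
            target = os.readlink(f"/proc/{pid}/fd/{fd}")
        except FileNotFoundError:
            continue  # descriptor was closed while we were iterating
        if target.startswith("/"):
            kinds["file"] += 1
        else:
            # e.g. "socket:[12345]", "pipe:[67890]", "anon_inode:[eventpoll]"
            kinds[target.split(":", 1)[0]] += 1
    return kinds

# Usage (hypothetical pid): print(fd_kinds(12345))
```

The rows lsof reports with TYPE IPv6 correspond to the socket entries here, i.e. network connections rather than files on disk.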
Hi, I'm a bot from the Ray team :) To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
Hi again! The issue will be closed because there has been no further activity in the 14 days since the last message. Please feel free to reopen it, or open a new issue, if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for opening the issue!
What happened + What you expected to happen
I set up a few computers connected to a local network and started the head node followed by the worker nodes. After all of them were started, I verified the cluster state by running ray status and visiting the dashboard. All nodes were working fine until I submitted jobs interactively from the ipython console. Within seconds, errors appeared: ray status would fail with "Ray cluster is not found at 192.168.1.9:6379". However, the dashboard still worked. If I killed the jobs and submitted new ones, the new jobs would hang and never execute. This issue doesn't happen every time the cluster is started, but it happens over 50% of the time.
Expected behavior: no errors after jobs are submitted, ray status continues to work, and both existing and new jobs keep running.
Versions / Dependencies
Ray 2.0.0, Python 3.10.6, Ubuntu 22.04
Reproduction script
This is the script I use to test the cluster and it often triggers the issue.
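The sketch below is only illustrative of that kind of workload, submitting a large batch of small tasks from an interactive driver to the running cluster; it is not the author's actual script.

```python
import ray

# Attach the interactive driver to the already-running cluster
# (head started with `ray start --head`, workers with `ray start --address=...`).
ray.init(address="auto")

@ray.remote
def square(x):
    return x * x

# Fan a large batch of small tasks out across the cluster. Every connection
# among the driver, raylets, workers, and the GCS consumes a file descriptor,
# which is where a low ulimit starts to bite.
futures = [square.remote(i) for i in range(10_000)]
print(sum(ray.get(futures)))
```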
Issue Severity
High: It blocks me from completing my task.