[Ray Cluster] Cannot connect to head after submitting jobs #29696

Closed

zzb3886 opened this issue Oct 26, 2022 · 9 comments
Assignees: clarng
Labels: bug (Something that is supposed to be working, but isn't), core (Issues that should be addressed in Ray Core), stale (The issue is stale; it will be closed within 7 days unless there is further conversation)

Comments


zzb3886 commented Oct 26, 2022

What happened + What you expected to happen

I set up a few computers connected to a local network and started the head node followed by the worker nodes. After all of them were started, I verified the cluster state with ray status and by visiting the dashboard. All nodes were working fine until I submitted jobs interactively from the IPython console. Within seconds, errors appeared:

2022-10-25 10:37:55,912 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 192.168.1.9:6379...
2022-10-25 10:37:55,922 INFO worker.py:1509 -- Connected to Ray cluster. View the dashboard at 192.168.1.9:8265 
(pid=gcs_server) E1025 10:38:12.624350177    1647 tcp_server_posix.cc:213]    Failed accept4: Too many open files
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,636 E 2165 2165] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.1.9:6379 within 5 seconds.
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,639 E 2266 2266] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.1.9:6379 within 5 seconds.
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,643 E 2021 2021] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.1.9:6379 within 5 seconds.
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,686 E 2178 2178] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.1.9:6379 within 5 seconds.

ray status would then fail with Ray cluster is not found at 192.168.1.9:6379. However, the dashboard still worked. If I killed the jobs and submitted new ones, the new jobs would hang and never execute. This issue doesn't happen every time the cluster is started, but it happens over 50% of the time.

Expected behavior: no errors after jobs are submitted, ray status continues to work, and both existing and new jobs run.

Versions / Dependencies

Ray 2.0.0, Python 3.10.6, Ubuntu 22.04

Reproduction script

This is the script I use to test the cluster; it often triggers the issue.

import ray

@ray.remote
def square(x):
    # Busy loop to keep each task CPU-bound for a while.
    for i in range(10000000):
        y = x * i
    return x * x

# Submit 1000 tasks and block until all results are available.
futures = [square.remote(i) for i in range(1000)]
print(ray.get(futures))
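
(Not shown in the original report: to match the "Connecting to existing Ray cluster" line in the log above, the script presumably attaches to the running cluster before submitting tasks, e.g. via RAY_ADDRESS or something like the following; the exact call is an assumption.)

import ray

# Assumption: attach to the already-running cluster instead of starting a
# local one. Setting the RAY_ADDRESS environment variable has the same effect.
ray.init(address="auto")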

Issue Severity

High: It blocks me from completing my task.

zzb3886 added the bug and triage labels Oct 26, 2022
hora-anyscale added the core label and removed the triage label Oct 28, 2022
cadedaniel (Member) commented

Can you try increasing the ulimit on open files?

Example: https://discuss.ray.io/t/setting-ulimits-on-ec2-instances/590
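
(Added as a hedged aside, not part of the original comment: the limits a process actually inherited can be checked from Python with the standard resource module; this is a minimal sketch. Raising the limit inside Python only affects that process; the raylet and GCS processes inherit the limit of the shell that ran ray start, so the lasting fix is raising ulimit before starting Ray, as the linked thread describes.)

import resource

# Inspect the current open-file limits of this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft}, hard={hard}")

# Raise the soft limit up to the hard limit -- for this process only.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))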


zzb3886 commented Oct 30, 2022

Thank you. That seems to have fixed the issue, and I haven't seen the failure again so far. It's still unclear to me which parameter the number of open files depends on. The number of tasks? Or the number of workers? I currently set the limit to 65535; do I need to increase it further as I scale up? Also, where can I inspect these open files? It would be great if you could add this information to the documentation or point me to an existing page.

cadedaniel (Member) commented

> It's still unclear to me which parameter the number of open files depends on. The number of tasks? Or the number of workers?

AFAIK there's no easy answer here, because Linux uses file descriptors (FDs) for everything: TCP connections, Unix domain sockets, actual files on disk, shared-memory queues, and so on.

If you want to see the list of open files, you can use lsof, which lists all open file descriptors along with the process that opened them.
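
(A hedged sketch, not from the original thread: a rough per-process tally of open descriptors can be produced by piping lsof through a short script, assuming lsof is installed.)

import subprocess
from collections import Counter

# Count open file descriptors per owning command, similar to
# `lsof | awk '{print $1}' | sort | uniq -c | sort -n`.
out = subprocess.run(["lsof", "-n"], capture_output=True, text=True).stdout
counts = Counter(line.split()[0] for line in out.splitlines()[1:] if line.strip())
for cmd, n in counts.most_common(10):
    print(f"{n:8d} {cmd}")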

cc @jjyao, do we have docs that guide users on this? I'm happy to create a PR; where should it go?


jjyao commented Oct 31, 2022

Yeah, do you mind running lsof and seeing how those fds are used? Also, do you know what the limit was before your change?

Currently we only have a guide on how to increase the limit: https://docs.ray.io/en/releases-2.0.1/ray-core/troubleshooting.html#crashes

cadedaniel assigned clarng and unassigned cadedaniel Nov 1, 2022

zzb3886 commented Nov 4, 2022

Previously the soft limit was 1024. I grouped the lsof output by process name; the first column below is the count of open fds per process:

   5568 python3
  12814 raylet
  14220 gcs_serve
  17220 ipython
 141686 ray::squa


jjyao commented Nov 8, 2022

Yeah, 1024 is probably too low.

From the lsof output, gcs_server has 14220 fds open. Could you paste what those fds are? Are they sockets, files, etc.?


zzb3886 commented Nov 21, 2022

The vast majority of the fds are of type IPv6. As an example:

COMMAND      PID    TID TASKCMD               USER   FD      TYPE             DEVICE    SIZE/OFF       NODE NAME
gcs_serve   1720                                test   20u     IPv6              66756         0t0        TCP machine4:redis->machine3:47824 (ESTABLISHED)
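
(A hedged sketch, not part of the original comment: the same lsof output can be grouped by remote host to see which nodes hold most of the GCS connections; the PID below is taken from the line above and is otherwise an assumption.)

import subprocess
from collections import Counter

GCS_PID = "1720"  # from the lsof line above

# List the GCS process's descriptors and count established TCP peers by host.
out = subprocess.run(["lsof", "-p", GCS_PID], capture_output=True, text=True).stdout
peers = Counter(
    tok.split("->")[-1].split(":")[0]   # peer host, e.g. "machine3"
    for line in out.splitlines()
    if "ESTABLISHED" in line
    for tok in line.split()
    if "->" in tok
)
print(peers.most_common())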


stale bot commented Mar 23, 2023

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale bot added the stale label Mar 23, 2023

stale bot commented Apr 6, 2023

Hi again! The issue will be closed because there has been no further activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale bot closed this as completed Apr 6, 2023