[Ray Cluster] Cannot connect to head after submitting jobs #29696

Closed

zzb3886 opened this issue Oct 26, 2022 · 9 comments
Assignees: clarng
Labels: bug (Something that is supposed to be working, but isn't), core (Issues that should be addressed in Ray Core), stale (The issue is stale; it will be closed within 7 days unless there is further conversation)

Comments


zzb3886 commented Oct 26, 2022

What happened + What you expected to happen

I set up a few computers connected to a local network and started the head node followed by the worker nodes. After all of them were started, I verified the cluster state with ray status and by visiting the dashboard. All nodes were working fine until I submitted jobs interactively from the IPython console. Within seconds, errors appeared:

2022-10-25 10:37:55,912 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 192.168.1.9:6379...
2022-10-25 10:37:55,922 INFO worker.py:1509 -- Connected to Ray cluster. View the dashboard at 192.168.1.9:8265 
(pid=gcs_server) E1025 10:38:12.624350177    1647 tcp_server_posix.cc:213]    Failed accept4: Too many open files
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,636 E 2165 2165] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.1.9:6379 within 5 seconds.
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,639 E 2266 2266] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.1.9:6379 within 5 seconds.
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,643 E 2021 2021] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.1.9:6379 within 5 seconds.
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,686 E 2178 2178] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.1.9:6379 within 5 seconds.

ray status would then fail with Ray cluster is not found at 192.168.1.9:6379. However, the dashboard still worked. If I killed the jobs and submitted new ones, the new jobs would hang and never execute. This issue doesn't happen every time the cluster is started, but it happens over 50% of the time.

Expected behavior: no errors after jobs are submitted, ray status continues to work, and both existing and new jobs run.

Versions / Dependencies

Ray 2.0.0, Python 3.10.6, Ubuntu 22.04

Reproduction script

This is the script I use to test the cluster; it often triggers the issue.

import ray

@ray.remote
def square(x):
    # Busy loop to keep each task CPU-bound for a while.
    for i in range(10000000):
        y = x * i
    return x * x

# Submit 1000 tasks and block until all results are available.
futures = [square.remote(i) for i in range(1000)]
print(ray.get(futures))
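
(Not shown in the original report: to match the "Connecting to existing Ray cluster" line in the log above, the script presumably attaches to the running cluster before submitting tasks, e.g. via RAY_ADDRESS or something like the following; the exact call is an assumption.)

import ray

# Assumption: attach to the already-running cluster instead of starting a
# local one. Setting the RAY_ADDRESS environment variable has the same effect.
ray.init(address="auto")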

Issue Severity

High: It blocks me from completing my task.

zzb3886 added the bug and triage labels Oct 26, 2022
hora-anyscale added the core label and removed the triage label Oct 28, 2022
cadedaniel (Member) commented

Can you try increasing the ulimit on open files?

Example: https://discuss.ray.io/t/setting-ulimits-on-ec2-instances/590
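
(Added as a hedged aside, not part of the original comment: the limits a process actually inherited can be checked from Python with the standard resource module; this is a minimal sketch. Raising the limit inside Python only affects that process; the raylet and GCS processes inherit the limit of the shell that ran ray start, so the lasting fix is raising ulimit before starting Ray, as the linked thread describes.)

import resource

# Inspect the current open-file limits of this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft}, hard={hard}")

# Raise the soft limit up to the hard limit -- for this process only.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))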


zzb3886 commented Oct 30, 2022

Thank you. That seems to have fixed the issue, and I haven't seen the failure again so far. It's still unclear to me which parameter the number of open files depends on. The number of tasks? Or the number of workers? I currently set the limit to 65535; do I need to increase it further as I scale up? Also, where can I inspect these open files? It would be great if you could add this information to the documentation or point me to an existing page.

cadedaniel (Member) commented

> It's still unclear to me which parameter the number of open files depends on. The number of tasks? Or the number of workers?

AFAIK there's no easy answer here, because Linux uses file descriptors (FDs) for everything: TCP connections, Unix domain sockets, actual files on disk, shared-memory queues, and so on.

If you want to see the list of open files, you can use lsof, which lists all open file descriptors along with the process that opened them.
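
(A hedged sketch, not from the original thread: a rough per-process tally of open descriptors can be produced by piping lsof through a short script, assuming lsof is installed.)

import subprocess
from collections import Counter

# Count open file descriptors per owning command, similar to
# `lsof | awk '{print $1}' | sort | uniq -c | sort -n`.
out = subprocess.run(["lsof", "-n"], capture_output=True, text=True).stdout
counts = Counter(line.split()[0] for line in out.splitlines()[1:] if line.strip())
for cmd, n in counts.most_common(10):
    print(f"{n:8d} {cmd}")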

cc @jjyao, do we have docs that guide users on this? I'm happy to create a PR; where should it go?


jjyao commented Oct 31, 2022

Yeah, do you mind running lsof and seeing how those fds are used? Also, do you know what the limit was before your change?

Currently we only have a guide on how to increase the limit: https://docs.ray.io/en/releases-2.0.1/ray-core/troubleshooting.html#crashes

cadedaniel assigned clarng and unassigned cadedaniel Nov 1, 2022

zzb3886 commented Nov 4, 2022

Previously the soft limit was 1024. I grouped the lsof output by process name; the first column below is the count of open fds per process:

   5568 python3
  12814 raylet
  14220 gcs_serve
  17220 ipython
 141686 ray::squa


jjyao commented Nov 8, 2022

Yeah, 1024 is probably too low.

From the lsof output, gcs_server has 14220 fds open. Could you paste what those fds are? Are they sockets, files, etc.?


zzb3886 commented Nov 21, 2022

The vast majority of the fds are of type IPv6. As an example:

COMMAND      PID    TID TASKCMD               USER   FD      TYPE             DEVICE    SIZE/OFF       NODE NAME
gcs_serve   1720                                test   20u     IPv6              66756         0t0        TCP machine4:redis->machine3:47824 (ESTABLISHED)
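
(A hedged sketch, not part of the original comment: the same lsof output can be grouped by remote host to see which nodes hold most of the GCS connections; the PID below is taken from the line above and is otherwise an assumption.)

import subprocess
from collections import Counter

GCS_PID = "1720"  # from the lsof line above

# List the GCS process's descriptors and count established TCP peers by host.
out = subprocess.run(["lsof", "-p", GCS_PID], capture_output=True, text=True).stdout
peers = Counter(
    tok.split("->")[-1].split(":")[0]   # peer host, e.g. "machine3"
    for line in out.splitlines()
    if "ESTABLISHED" in line
    for tok in line.split()
    if "->" in tok
)
print(peers.most_common())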


stale bot commented Mar 23, 2023

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale bot added the stale label Mar 23, 2023

stale bot commented Apr 6, 2023

Hi again! The issue will be closed because there has been no further activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale bot closed this as completed Apr 6, 2023