[Bug] Worker nodes are lost after head was killed due to OOM #104
Comments
I can confirm that workers can rejoin the new head after 10 minutes. This is still slow, and we need to improve it.
I can confirm that all the workers rejoined the new head right after the 10-minute delay as well (see the attached picture). When the head Pod is deleted and recreated, the Raylet tries to get the GCS address from Redis in `ReconnectGcsServer`, but the `redis_client` always uses the previous head IP, so it always fails to get the GCS address. The Raylet does not exit until `ping_gcs_rpc_server_max_retries` is exhausted (10 minutes). So immediately after the 10-minute retry window, the client exits, gets restarted, and connects to the new head IP.
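For context, here is an illustrative Python sketch of the reconnect loop described above. The real logic is C++ inside `ReconnectGcsServer`; the function shape and the Redis key name here are simplified assumptions, not the actual implementation:

```python
import time

def reconnect_gcs_server(redis_client, max_retries=600, interval_s=1.0):
    """Simplified sketch of the raylet's reconnect loop (not the real code)."""
    # redis_client was created with the *old* head IP, so every lookup
    # below fails until the retry budget is exhausted and the raylet exits.
    for _ in range(max_retries):  # ping_gcs_rpc_server_max_retries
        address = redis_client.get("GcsServerAddress")  # hypothetical key name
        if address is not None:
            return address  # would succeed if Redis were reachable
        time.sleep(interval_s)  # 1 s between attempts -> 600 s total
    # Only after exiting does the restarted raylet resolve the new head IP.
    raise SystemExit("GCS address lookup failed after max retries")
```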
Here's the Ray core issue @chenk008 filed last December: ray-project/ray#20842. From the code perspective, these two configuration options play the role: the client retries 600 times with a 1-second interval, which results in a 600-second total timeout. https://github.com/ray-project/ray/blob/master/src/ray/gcs/gcs_client/gcs_client.cc#L294-L295 This issue exists in all stable Ray versions (including 1.9.1). The timeout has been reduced to 60s by a recent commit on master.
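For reference, a minimal sketch of the arithmetic behind that 10-minute window, assuming the interval option is `ping_gcs_rpc_server_interval_milliseconds` (the retry-count option is named in the comment above; verify both names and defaults against your Ray version):

```python
# Assumed defaults in the stable releases discussed here (verify per version).
ping_gcs_rpc_server_max_retries = 600             # retry attempts
ping_gcs_rpc_server_interval_milliseconds = 1000  # 1 s between attempts

timeout_s = (ping_gcs_rpc_server_max_retries
             * ping_gcs_rpc_server_interval_milliseconds) / 1000
print(timeout_s)  # 600.0 seconds, i.e. the 10-minute delay observed above
```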
Please try the Ray nightly build wheels; I think it has been reduced to 60s.
For versions <= 1.9.1, you can set this head param.
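The exact parameter isn't quoted in this thread, but based on the discussion above it is presumably `ping_gcs_rpc_server_max_retries`. A minimal sketch of one way to override it, assuming a head started via `ray.init` (`_system_config` is a private, underscore-prefixed parameter, so treat this as illustrative rather than a stable API):

```python
import ray

# Illustrative only: shrink the raylet's GCS reconnect budget from the
# default 600 retries (~10 min) to 60 (~1 min). System config set on the
# head is propagated to the rest of the cluster.
ray.init(
    _system_config={"ping_gcs_rpc_server_max_retries": 60},
)
```

On a KubeRay-managed cluster the equivalent override would go into the head group's `rayStartParams` (e.g. via `--system-config`), per the best-practice doc linked below.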
Sure, will do.
Please check https://github.com/ray-project/kuberay/blob/master/docs/best-practice/worker-head-reconnection.md for more details. We can close the issue.
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
The head of the Ray cluster was killed due to OOM. After it came back, the rest of the worker nodes were lost. I checked that those worker pods were still live, but they didn't reconnect to the head node, so the cluster stayed in a state with only the head node. The logs showed that the GCS client couldn't reconnect to the GCS server (the last attempted GCS server address was `:0`).
Reproduction script
Create a Ray cluster with a small amount of RAM (e.g. 0.5 GB), then submit a large job, or submit a small job multiple times in quick succession. The head node is expected to run into an error and get killed and relaunched due to OOM. From the dashboard you will then see the Ray cluster left with only the head node; all the worker nodes are lost from the cluster but still running.
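No exact script is attached, so here is a minimal repro sketch of a driver that triggers the OOM, assuming it runs on the head pod with a ~0.5 GB memory limit; the allocation size and loop are illustrative:

```python
# oom_job.py -- illustrative repro sketch, not the original job.
# Run on a head pod limited to ~0.5 GB; the growing allocations should
# push the pod past its limit so the kernel OOM-kills the head container.
import ray

ray.init(address="auto")

hog = []
while True:
    hog.append(bytearray(64 * 1024 * 1024))  # hold 64 MB more each iteration
```

Once the head pod is OOM-killed and recreated, watch the dashboard: the workers stay alive but do not rejoin until the reconnect timeout discussed above elapses.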
Anything else
No response
Are you willing to submit a PR?