
[Bug] Worker nodes are lost after head was killed due to OOM #104

Closed
nostalgicimp opened this issue Nov 24, 2021 · 9 comments
@nostalgicimp (Contributor)

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

The head of the Ray cluster was killed due to OOM. After it came back, the rest of the worker nodes were lost. I checked that the worker pods are still alive:

k get pods -l ray.io/cluster=rc5a9d9f-f445-461f-aaf8-d579c2b706f6
NAME                                                      READY   STATUS    RESTARTS     AGE
rc5a9d9f-f445-461f-aaf8-d579c2b706f6-head-p5k25           1/1     Running   2 (8h ago)   4d23h
rc5a9d9f-f445-461f-aaf8-d579c2b706f6-worker-small-47xp2   1/1     Running   0            4d23h
rc5a9d9f-f445-461f-aaf8-d579c2b706f6-worker-small-rn89r   1/1     Running   0            4d23h

but they did not reconnect to the head node, so the cluster was left with only the head node.

(screenshot: Ray dashboard showing the cluster with only the head node)

The logs show that the GCS client couldn't reconnect to the GCS server (the last attempted GCS server address was :0).

[2021-11-19 02:34:55,929 I 17 35] object_store.cc:35: Object store current usage 8e-09 / 0.14151 GB.
[2021-11-19 02:34:56,986 I 17 17] worker_pool.cc:556: [Eagerly] Start install runtime environment for job 09000000. The runtime environment was {}.
[2021-11-19 02:34:56,988 I 17 17] worker_pool.cc:562: [Eagerly] Create runtime env successful for job 09000000. The result context was {"command_prefix": [], "env_vars": {}, "py_executable": "/home/ray/anaconda3/bin/python", "resources_dir": null}.
[2021-11-19 02:34:57,418 I 17 17] node_manager.cc:1181: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-11-19 02:35:02,271 I 17 17] worker_pool.cc:556: [Eagerly] Start install runtime environment for job 0a000000. The runtime environment was {}.
[2021-11-19 02:35:02,272 I 17 17] worker_pool.cc:562: [Eagerly] Create runtime env successful for job 0a000000. The result context was {"command_prefix": [], "env_vars": {}, "py_executable": "/home/ray/anaconda3/bin/python", "resources_dir": null}.
[2021-11-19 02:45:04,317 C 17 17] service_based_gcs_client.cc:251: Couldn't reconnect to GCS server. The last attempted GCS server address was :0
*** StackTrace Information ***
    ray::SpdLogMessage::Flush()
    ray::RayLog::~RayLog()
    ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
    std::function<>::operator()()
    std::_Function_handler<>::_M_invoke()
    ray::rpc::ClientCallImpl<>::OnReplyReceived()
    std::_Function_handler<>::_M_invoke()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main
    __libc_start_main

Reproduction script

Create a Ray cluster with a small amount of RAM, e.g. 0.5 GB, then submit a large job (or submit a small job multiple times in quick succession); the head node is then expected to run into an error and be killed and relaunched due to OOM. From the Dashboard you will see that the Ray cluster is left with only the head node, with all of the remaining worker nodes lost even though their pods are still running. A manifest sketch is given below.
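
To make the reproduction concrete, here is a minimal sketch of such a cluster. It assumes the ray.io/v1alpha1 RayCluster CRD; the metadata name, image tag, and resource figures are illustrative placeholders rather than values from this issue. The head is capped at roughly 0.5 Gi so that a large (or rapidly repeated) job submission gets it OOM-killed:

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: oom-repro                      # placeholder name
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:1.9.1  # placeholder tag
          resources:
            requests:
              cpu: "1"
              memory: 512Mi            # small limit so the head is OOM-killed
            limits:
              cpu: "1"
              memory: 512Mi
  workerGroupSpecs:
  - groupName: small
    replicas: 2
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:1.9.1  # placeholder tag
          resources:
            limits:
              cpu: "1"
              memory: 1Gi

Submitting a memory-hungry job against this cluster should OOM-kill the head pod; Kubernetes restarts it with a new IP, after which the behavior described above can be observed.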

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
nostalgicimp added the bug label on Nov 24, 2021
Jeffwan changed the title to "[Bug] Worker nodes are lost after head was killed due to OOM" on Nov 25, 2021
@Jeffwan (Collaborator) commented Nov 27, 2021

@chenk008 mentioned this is related to #62, since we use sleep infinity here. Even when the connection is broken, the Ray worker doesn't exit. We expect the worker pod to get restarted and reconnect to the new head. A sketch of the entrypoint pattern is shown below.
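
For context, the worker entrypoint pattern being referred to looks roughly like the sketch below; this is illustrative only, the operator generates the actual command, and RAY_HEAD_IP is a placeholder for however the head address is injected. Because sleep infinity keeps PID 1 alive, the container stays Running even after the raylet loses the GCS connection and exits, so Kubernetes never restarts the pod on its own:

# Worker container entrypoint sketch (not the exact generated command);
# sleep infinity keeps the container alive even if the raylet later dies.
command: ["/bin/bash", "-c", "--"]
args: ["ray start --address=$RAY_HEAD_IP:6379 && sleep infinity"]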

@Jeffwan (Collaborator) commented Jan 5, 2022

I can confirm that workers can join back to the new head after 10 minutes. This is still slow, and we need to improve it.

@nostalgicimp (Contributor, Author)

I can confirm that all the workers rejoined the new head right after the 10-minute delay as well (refer to the pic below).

(screenshot: workers rejoining the new head after the ~10-minute delay)

When the head Pod is deleted and recreated, the Raylet tries to get the GCS address from Redis in ReconnectGcsServer, but the redis_client always uses the previous head IP, so it always fails to get the GCS address. The Raylet does not exit until ping_gcs_rpc_server_max_retries is exhausted (10 minutes). So, right after the 10-minute retry window, the client exits, gets restarted, and connects to the new head IP.

@Jeffwan (Collaborator) commented Jan 10, 2022

Here's the Ray core issue @chenk008 filed last December: ray-project/ray#20842

From the code perspective, these two configuration options play a role:

/// The interval at which the gcs rpc client will check if gcs rpc server is ready.
RAY_CONFIG(int64_t, ping_gcs_rpc_server_interval_milliseconds, 1000)

/// Maximum number of times to retry ping gcs rpc server when gcs server restarts.
RAY_CONFIG(int32_t, ping_gcs_rpc_server_max_retries, 600)

It retries 600 times at a 1 s interval, resulting in a total timeout of 600 s. https://github.com/ray-project/ray/blob/master/src/ray/gcs/gcs_client/gcs_client.cc#L294-L295

This issue exists in all stable Ray versions (including 1.9.1). The timeout has been reduced to 60 s by a recent commit on master:

ray-project/ray@7baf623

Jeffwan self-assigned this on Jan 11, 2022
@chenk008 (Contributor)

Please try the Ray nightly wheels; I think it has been reduced to 60 s.
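
If you are running through KubeRay, the simplest way to try this is probably to point the head and worker containers at a nightly image (the tag name here is an assumption; any image built from a recent master commit should do):

# In both the head and worker pod templates
containers:
- name: ray-head
  image: rayproject/ray:nightly   # assumed nightly tag; any recent master build works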

@nostalgicimp (Contributor, Author)

Please try the Ray nightly wheels; I think it has been reduced to 60 s.

Thanks. I tried the nightly version and confirmed that the workers come back to the new head in about 1 minute. Of the 4 lost workers, 2 restarted at 60 s and the other 2 restarted at 80 s.

(screenshots: workers rejoining the new head after about 60–80 seconds)

@Jeffwan (Collaborator) commented Jan 12, 2022

For versions <= 1.9.1, you can set the head param --system-config='{"ping_gcs_rpc_server_max_retries": 20}' to work around the problem (see the manifest sketch below). @nostalgicimp, if you have time, can you help create a doc and document the best practice? You can put it under docs/.
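
For reference, one way to wire this workaround into a KubeRay manifest is through the head group's rayStartParams, assuming these entries are forwarded to ray start as --key=value flags (a sketch; the exact quoting of the embedded JSON depends on how the manifest is authored):

headGroupSpec:
  rayStartParams:
    # forwarded to `ray start` as --system-config='{...}'
    system-config: '{"ping_gcs_rpc_server_max_retries": 20}'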

@nostalgicimp (Contributor, Author)

For versions <= 1.9.1, you can set the head param --system-config='{"ping_gcs_rpc_server_max_retries": 20}' to work around the problem. @nostalgicimp, if you have time, can you help create a doc and document the best practice? You can put it under docs/.

Sure, will do.

@Jeffwan (Collaborator) commented Feb 7, 2022

Please check https://github.com/ray-project/kuberay/blob/master/docs/best-practice/worker-head-reconnection.md for more details. We can close the issue.

Jeffwan closed this as completed on Feb 7, 2022