
[Bug] Worker nodes are lost after head was killed due to OOM #104

Closed
nostalgicimp opened this issue Nov 24, 2021 · 9 comments
@nostalgicimp (Contributor)

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

The head of the Ray cluster was killed due to OOM. After it came back, the rest of the worker nodes were lost. I checked that the worker pods are still alive:

k get pods -l ray.io/cluster=rc5a9d9f-f445-461f-aaf8-d579c2b706f6
NAME                                                      READY   STATUS    RESTARTS     AGE
rc5a9d9f-f445-461f-aaf8-d579c2b706f6-head-p5k25           1/1     Running   2 (8h ago)   4d23h
rc5a9d9f-f445-461f-aaf8-d579c2b706f6-worker-small-47xp2   1/1     Running   0            4d23h
rc5a9d9f-f445-461f-aaf8-d579c2b706f6-worker-small-rn89r   1/1     Running   0            4d23h

but they did not reconnect to the head node, so the cluster was left with only the head node.

(screenshot: Ray dashboard showing the cluster with only the head node)

The logs show that the GCS client couldn't reconnect to the GCS server (the last attempted GCS server address was :0).

[2021-11-19 02:34:55,929 I 17 35] object_store.cc:35: Object store current usage 8e-09 / 0.14151 GB.
[2021-11-19 02:34:56,986 I 17 17] worker_pool.cc:556: [Eagerly] Start install runtime environment for job 09000000. The runtime environment was {}.
[2021-11-19 02:34:56,988 I 17 17] worker_pool.cc:562: [Eagerly] Create runtime env successful for job 09000000. The result context was {"command_prefix": [], "env_vars": {}, "py_executable": "/home/ray/anaconda3/bin/python", "resources_dir": null}.
[2021-11-19 02:34:57,418 I 17 17] node_manager.cc:1181: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-11-19 02:35:02,271 I 17 17] worker_pool.cc:556: [Eagerly] Start install runtime environment for job 0a000000. The runtime environment was {}.
[2021-11-19 02:35:02,272 I 17 17] worker_pool.cc:562: [Eagerly] Create runtime env successful for job 0a000000. The result context was {"command_prefix": [], "env_vars": {}, "py_executable": "/home/ray/anaconda3/bin/python", "resources_dir": null}.
[2021-11-19 02:45:04,317 C 17 17] service_based_gcs_client.cc:251: Couldn't reconnect to GCS server. The last attempted GCS server address was :0
*** StackTrace Information ***
    ray::SpdLogMessage::Flush()
    ray::RayLog::~RayLog()
    ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
    std::function<>::operator()()
    std::_Function_handler<>::_M_invoke()
    ray::rpc::ClientCallImpl<>::OnReplyReceived()
    std::_Function_handler<>::_M_invoke()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main
    __libc_start_main

Reproduction script

Create a Ray cluster with a small amount of RAM, e.g. 0.5 GB, then submit a large job (or submit a small job multiple times in quick succession); the head node is then expected to run into an error and be killed and relaunched due to OOM. From the Dashboard you will see that the Ray cluster is left with only the head node, with all of the remaining worker nodes lost even though their pods are still running. A manifest sketch is given below.
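
To make the reproduction concrete, here is a minimal sketch of such a cluster. It assumes the ray.io/v1alpha1 RayCluster CRD; the metadata name, image tag, and resource figures are illustrative placeholders rather than values from this issue. The head is capped at roughly 0.5 Gi so that a large (or rapidly repeated) job submission gets it OOM-killed:

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: oom-repro                      # placeholder name
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:1.9.1  # placeholder tag
          resources:
            requests:
              cpu: "1"
              memory: 512Mi            # small limit so the head is OOM-killed
            limits:
              cpu: "1"
              memory: 512Mi
  workerGroupSpecs:
  - groupName: small
    replicas: 2
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:1.9.1  # placeholder tag
          resources:
            limits:
              cpu: "1"
              memory: 1Gi

Submitting a memory-hungry job against this cluster should OOM-kill the head pod; Kubernetes restarts it with a new IP, after which the behavior described above can be observed.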

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
nostalgicimp added the bug label on Nov 24, 2021
Jeffwan changed the title to "[Bug] Worker nodes are lost after head was killed due to OOM" on Nov 25, 2021
@Jeffwan (Collaborator) commented Nov 27, 2021

@chenk008 mentioned this is related to #62, since we use sleep infinity here. Even when the connection is broken, the Ray worker doesn't exit. We expect the worker pod to get restarted and reconnect to the new head. A sketch of the entrypoint pattern is shown below.
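
For context, the worker entrypoint pattern being referred to looks roughly like the sketch below; this is illustrative only, the operator generates the actual command, and RAY_HEAD_IP is a placeholder for however the head address is injected. Because sleep infinity keeps PID 1 alive, the container stays Running even after the raylet loses the GCS connection and exits, so Kubernetes never restarts the pod on its own:

# Worker container entrypoint sketch (not the exact generated command);
# sleep infinity keeps the container alive even if the raylet later dies.
command: ["/bin/bash", "-c", "--"]
args: ["ray start --address=$RAY_HEAD_IP:6379 && sleep infinity"]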

@Jeffwan (Collaborator) commented Jan 5, 2022

I can confirm that workers can join back to the new head after 10 minutes. This is still slow, and we need to improve it.

@nostalgicimp (Contributor, Author)

I can confirm that all the workers rejoined the new head right after the 10-minute delay as well (refer to the pic below).

(screenshot: workers rejoining the new head after the ~10-minute delay)

When the head Pod is deleted and recreated, the Raylet tries to get the GCS address from Redis in ReconnectGcsServer, but the redis_client always uses the previous head IP, so it always fails to get the GCS address. The Raylet does not exit until ping_gcs_rpc_server_max_retries is exhausted (10 minutes). So, right after the 10-minute retry window, the client exits, gets restarted, and connects to the new head IP.

@Jeffwan (Collaborator) commented Jan 10, 2022

Here's the Ray core issue @chenk008 filed last December: ray-project/ray#20842

From the code perspective, these two configuration options play a role:

/// The interval at which the gcs rpc client will check if gcs rpc server is ready.
RAY_CONFIG(int64_t, ping_gcs_rpc_server_interval_milliseconds, 1000)

/// Maximum number of times to retry ping gcs rpc server when gcs server restarts.
RAY_CONFIG(int32_t, ping_gcs_rpc_server_max_retries, 600)

It retries 600 times at a 1 s interval, resulting in a total timeout of 600 s. https://github.com/ray-project/ray/blob/master/src/ray/gcs/gcs_client/gcs_client.cc#L294-L295

This issue exists in all stable Ray versions (including 1.9.1). The timeout has been reduced to 60 s by a recent commit on master:

ray-project/ray@7baf623

Jeffwan self-assigned this on Jan 11, 2022
@chenk008 (Contributor)

Please try the Ray nightly wheels; I think it has been reduced to 60 s.
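
If you are running through KubeRay, the simplest way to try this is probably to point the head and worker containers at a nightly image (the tag name here is an assumption; any image built from a recent master commit should do):

# In both the head and worker pod templates
containers:
- name: ray-head
  image: rayproject/ray:nightly   # assumed nightly tag; any recent master build works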

@nostalgicimp (Contributor, Author)

Please try the Ray nightly wheels; I think it has been reduced to 60 s.

Thanks. I tried the nightly version and confirmed that the workers come back to the new head in about 1 minute. Of the 4 lost workers, 2 restarted at 60 s and the other 2 restarted at 80 s.

(screenshots: workers rejoining the new head after about 60–80 seconds)

@Jeffwan (Collaborator) commented Jan 12, 2022

For versions <= 1.9.1, you can set the head param --system-config='{"ping_gcs_rpc_server_max_retries": 20}' to work around the problem (see the manifest sketch below). @nostalgicimp, if you have time, can you help create a doc and document the best practice? You can put it under docs/.
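
For reference, one way to wire this workaround into a KubeRay manifest is through the head group's rayStartParams, assuming these entries are forwarded to ray start as --key=value flags (a sketch; the exact quoting of the embedded JSON depends on how the manifest is authored):

headGroupSpec:
  rayStartParams:
    # forwarded to `ray start` as --system-config='{...}'
    system-config: '{"ping_gcs_rpc_server_max_retries": 20}'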

@nostalgicimp (Contributor, Author)

For versions <= 1.9.1, you can set the head param --system-config='{"ping_gcs_rpc_server_max_retries": 20}' to work around the problem. @nostalgicimp, if you have time, can you help create a doc and document the best practice? You can put it under docs/.

Sure, will do.

@Jeffwan (Collaborator) commented Feb 7, 2022

Please check https://github.com/ray-project/kuberay/blob/master/docs/best-practice/worker-head-reconnection.md for more details. We can close the issue.

Jeffwan closed this as completed on Feb 7, 2022