[Core][GCS FT] Raylet on the worker node crashes unexpectedly when head crashes #41343
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
core-gcs: Ray core global control storage
P0: Issues that should be fixed in short order
release-blocker: P0 Issue that blocks the release
What happened + What you expected to happen
In KubeRay, there is a test named `test_detached_actor`.

Step 1: Create a RayCluster. Set `num-cpus: 0` in the head group's `rayStartParams` to prevent the detached actor from being scheduled on the head node, and set the environment variable `RAY_gcs_rpc_server_reconnect_timeout_s` to `20` on the head node and to `120` on the worker node, so that the worker's Raylet outlives the head while the GCS is down.
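For concreteness, here is a sketch of such a RayCluster, applied from Python. The manifest follows the KubeRay CRD schema, but the metadata names, image tag, and API version are illustrative assumptions, not the exact manifest used in KubeRay CI:

```python
import yaml
from kubernetes import client, config

# Minimal RayCluster carrying the two settings the test depends on.
raycluster = yaml.safe_load("""
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-test          # illustrative name
spec:
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"              # keep the detached actor off the head node
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.8.0
          env:
          - name: RAY_gcs_rpc_server_reconnect_timeout_s
            value: "20"          # head crashes 20 s after the GCS dies
  workerGroupSpecs:
  - groupName: small-group
    replicas: 1
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.8.0
          env:
          - name: RAY_gcs_rpc_server_reconnect_timeout_s
            value: "120"         # worker raylet waits up to 120 s for the GCS
""")

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="ray.io", version="v1", namespace="default",
    plural="rayclusters", body=raycluster,
)
```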
Step 2: Create a detached actor `TestCounter`; the actor will be scheduled on the worker node. Then call the actor's `increment` method twice; the expected results are 1 and 2.
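A minimal sketch of this step, assuming the driver runs inside the cluster; the actor class and method names (`TestCounter`, `increment`) come from the test, while the actor name and Ray namespace are placeholders:

```python
import ray

# Connect from inside the cluster; the namespace is an assumption. A
# detached actor needs a named namespace to be retrievable by name later.
ray.init(address="auto", namespace="test-namespace")

@ray.remote(num_cpus=1)
class TestCounter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

# lifetime="detached" keeps the actor alive after this driver exits.
# Because the head node advertises num-cpus: 0, the 1-CPU actor can only
# be placed on the worker node.
counter = TestCounter.options(name="test_counter", lifetime="detached").remote()

assert ray.get(counter.increment.remote()) == 1
assert ray.get(counter.increment.remote()) == 2
```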
Step 3: Kill the GCS process on the head node. The head node will crash after 20 seconds (`RAY_gcs_rpc_server_reconnect_timeout_s`).
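This step presumably amounts to exec'ing into the head pod and killing the GCS server process, which runs as `gcs_server`; a sketch with a placeholder pod name:

```python
import subprocess

# Placeholder; the real test would look the head pod up, e.g. with the
# label selector ray.io/node-type=head.
head_pod = "raycluster-test-head-xxxxx"

# Kill the GCS server process inside the head pod.
subprocess.run(
    ["kubectl", "exec", head_pod, "--", "pkill", "gcs_server"],
    check=True,
)
```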
Step 4: Wait for the new head node to recover.
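This can be approximated by waiting for the replacement head pod to become Ready; `ray.io/node-type=head` is the label KubeRay puts on head pods, and the timeout here is an assumption:

```python
import subprocess

# Block until the new head pod reports Ready.
subprocess.run(
    ["kubectl", "wait", "--for=condition=ready", "pod",
     "-l", "ray.io/node-type=head", "--timeout=300s"],
    check=True,
)
```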
Step 5: Call `increment` again; the expected result should be 3.
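A sketch of this step, reconnecting and looking the detached actor up by name (same assumed namespace and actor name as in Step 2):

```python
import ray

# Reconnect after the new head pod is ready.
ray.init(address="auto", namespace="test-namespace")

# The detached actor survived on the worker node, so it should be
# retrievable by name; this is the call that becomes flaky when the
# worker's Raylet has crashed.
counter = ray.get_actor("test_counter")
assert ray.get(counter.increment.remote()) == 3
```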
This test is very flaky with the Ray 2.8.0 and nightly images. In Step 5, it fails because the Raylet process on the worker node crashes when the dashboard agent process fails unexpectedly. See the section "`test_detached_actor_2.py`'s log (Step 5)" for more details.

`test_detached_actor_2.py`'s log (Step 5)

`dashboard_agent.log` on the worker node
Versions / Dependencies
Ray 2.8.0
I tracked all failed GitHub Actions workflows in KubeRay CI. The failure first happened on Oct. 18, 2023 (error logs). At that time the test used the nightly image; Ray 2.8.0 itself was published on November 1st.
Reproduction script
Run `test_detached_actor.py` in KubeRay CI. It is very flaky in the KubeRay CI (GitHub Actions), but I cannot reproduce it in my environment. It has become very flaky in the past 3 weeks.

Issue Severity
High: It blocks me from completing my task.