Closed
Labels
P1Issue that should be fixed within a few weeksIssue that should be fixed within a few weeksbugSomething that is supposed to be working; but isn'tSomething that is supposed to be working; but isn'tcoreIssues that should be addressed in Ray CoreIssues that should be addressed in Ray Corecore-workerstability
Description
While debugging a release test failure, we discovered that in some cases Serve replica actors are being killed due to fate sharing with the controller.
This should never happen because all actors started by Serve (the controller, replicas, proxies) are detached, so they should not fate share with the controller (relevant code in the raylet).
In multiple runs of the Serve long-running failure test case, the raylet logs contain a number of lines like the following:
[2023-11-01 07:10:17,273 I 825 825] (raylet) node_manager.cc:1104: The leased worker dd7d4d82da8fef21e59667dba16f2bce15203c8832039284cbb26461 is killed because the owner process 2b60b506544d378c192b7e1cbf989be4058f41015c00fa7f30e50f91 died.
[2023-11-01 07:10:17,273 I 825 825] (raylet) node_manager.cc:1104: The leased worker 3844620037d1ea4a19c830bb548edd9726cd4521cc78f2c7871367d6 is killed because the owner process 2b60b506544d378c192b7e1cbf989be4058f41015c00fa7f30e50f91 died.
All of the referenced actors are detached actors.
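For background, a minimal sketch of the expected semantics (not code from this issue): a detached actor is created with lifetime="detached", is owned by the GCS rather than its creator, and should keep running even after the process that created it exits.

```python
import ray

ray.init()

@ray.remote
class Proxy:
    def ready(self):
        return True

# Detached actors are owned by the GCS rather than by the process that
# created them, so they should not be fate-shared with (killed alongside)
# their creator.
proxy = Proxy.options(name="proxy", lifetime="detached").remote()
assert ray.get(proxy.ready.remote())
```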
Activity
edoakes commented on Nov 1, 2023
Here are the full cluster logs for the release test failure run: logs.zip
edoakes commented on Nov 1, 2023
One possible (unsubstantiated) theory: in this test, the controller and the actors it creates are killed repeatedly. It could be that the controller is killed immediately after creating a replica, in which case the raylet may not yet have marked the worker running the replica as being a detached actor.
edoakes commented on Nov 1, 2023
Looks like we only mark the actor as detached after the creation task finishes:
ray/src/ray/raylet/node_manager.cc, line 2186 (commit f5c5974)
This means that if the creation task has already been sent to the actor but the owner dies before it finishes, the actor may be killed by HandleUnexpectedWorkerFailure.
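As a side note (a small sketch, not from the issue): Ray's actor state API can be used to check whether a supposedly detached actor ended up DEAD after its owner exited. This assumes a running cluster containing the actors of interest.

```python
from ray.util.state import list_actors

# List all actors known to the cluster; a detached actor that was
# fate-shared with its dead owner will show up as DEAD instead of ALIVE.
for actor in list_actors(detail=True):
    print(actor.name, actor.class_name, actor.state)
```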
rueian commented on Jun 4, 2025
This issue can be reproduced on Ray 2.46.0 by the following 3 scripts:
And run the above scripts like this:
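The scripts themselves are not reproduced above. Purely as a hypothetical sketch of the same failure mode (an owner that dies immediately after requesting creation of a detached actor), a single-file reproduction could look roughly like this; the actor and method names are illustrative, not taken from the original scripts.

```python
import time
import ray

ray.init(namespace="repro")

@ray.remote
class Replica:
    def __init__(self):
        # A slow constructor widens the window in which the owner can die
        # before the creation task finishes.
        time.sleep(2)

    def ping(self):
        return "alive"

@ray.remote
class Controller:
    def create_replica(self):
        # Request creation of a named, detached replica; its creation task
        # may still be pending on the raylet when this method returns.
        Replica.options(name="replica", lifetime="detached").remote()

controller = Controller.remote()
ray.get(controller.create_replica.remote())

# Kill the controller right away. With the bug, the raylet may not yet
# have marked the replica's leased worker as detached, so the worker can
# be fate-shared with the dead controller and killed.
ray.kill(controller, no_restart=True)
time.sleep(5)

# Expected: the detached replica is still alive and reachable by name.
replica = ray.get_actor("replica", namespace="repro")
print(ray.get(replica.ping.remote()))
```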
I think the issue was caused by #14184, where the first worker.MarkDetachedActor() call was deleted here: https://github.com/ray-project/ray/pull/14184/files#diff-d2f22b8f1bf5f9be47dacae8b467a72ee94629df12ffcc18b13447192ff3dbcfL1982, which made it possible for a leased worker to be killed by HandleUnexpectedWorkerFailure. I think a proper fix now would be to replace worker->IsDetachedActor() with worker->GetAssignedTask().GetTaskSpecification().IsDetachedActor(), and to remove Worker::IsDetachedActor and Worker::MarkDetachedActor to avoid further confusion. I will open a PR soon.
[core] Move dependencies of NodeManger to main.cc for better testabil…
[core] fix detached actor being unexpectedly killed (#53562)