Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][2/2] Kill worker on root detached actor died. #45638

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

rynewang
Copy link
Contributor

@rynewang rynewang commented May 30, 2024

PR #44214 fixes issue that when a job finished, there are workers leaked. However if a worker is assigned to a detached actor, they still leak. This PR fixes this by listening on actor death and kill assigned workers accordingly.

Based on #45637 which adds SubscribeAllActors to GcsClient. This PR uses it in node_manager and kills workers. node_manager listens to all actor changes in NodeManager::HandleActorUpdate, and it kills running workers and invoke worker_pool callback to bookkeep the died actor. worker_pool will then kill workers in periodic loops and worker start callbacks.

There are several kinds of workers to kill:

  1. running workers, kept in NodeManager::leased_workers_, killed in the subscription callback in NodeManager::HandleActorUpdate
  2. idle workers, killed in WorkerPool::TryKillingIdleWorkers -> WorkerPool::KillIdleWorker
  3. pending pop-from-pool workers, fail in WorkerPool::PopWorkerCallbackInternal
  4. pending push-to-pool workers (just started, or returned from lease), fail in WorkerPool::InvokePopWorkerCallbackForProcess

...and we treat them accordingly.

Fixes part of #45596.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang requested a review from a team as a code owner May 30, 2024 05:33
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants