
[Serve] Fix orphaned actors on controller crash during shutdown#62823

Merged
abrarsheikh merged 3 commits into ray-project:master from
vaishdho1:serve-shutdown-checkpoint-state-fix
Apr 24, 2026

Conversation

@vaishdho1
Contributor

Description

When the Serve controller crashes mid-shutdown, replica actors are orphaned and their resources are leaked. This happens because KV checkpoints are deleted at the very start of the shutdown process. If the controller crashes and restarts after checkpoint deletion but before actor teardown completes, the restarted controller has no record of the apps/deployments it needs to clean up.
This PR fixes the issue by:

  • Persisting a SHUTDOWN_IN_PROGRESS_KEY in the KV store at the start of graceful_shutdown(). On restart, the controller checks this key and automatically re-enters the shutdown path.
  • Deferring checkpoint deletion to the very end, after all the resources are released and before the controller kills itself.
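A minimal sketch of this flow, assuming a kv_store with get/put/delete methods and a hypothetical _teardown_complete() check (the real controller code is structured differently):

    SHUTDOWN_IN_PROGRESS_KEY = "shutdown_in_progress"

    class Controller:
        def __init__(self, kv_store):
            self.kv_store = kv_store
            # On restart, re-enter the shutdown path if the flag survived a crash.
            self._shutting_down = (
                self.kv_store.get(SHUTDOWN_IN_PROGRESS_KEY) is not None
            )

        def graceful_shutdown(self):
            # Persist the flag before anything else so a crash at any later
            # point is recoverable.
            self.kv_store.put(SHUTDOWN_IN_PROGRESS_KEY, b"1")
            self._shutting_down = True

        def shutdown(self):  # driven by the control loop on every tick
            if self._shutting_down and self._teardown_complete():
                # Checkpoints are deleted only after all actors are torn down,
                # just before the controller kills itself.
                self.kv_store.delete(SHUTDOWN_IN_PROGRESS_KEY)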

Tests

  • test_shutdown.py - controller crash mid-shutdown recovers and continues shutdown
  • test_application_state.py - app checkpoint survives shutdown lifecycle
  • test_deployment_state.py - deployment checkpoint survives shutdown lifecycle
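A rough sketch of the shape of the crash-recovery test (deploy_test_app, get_serve_controller_handle, and initiate_shutdown are hypothetical helpers; the real test lives in test_shutdown.py):

    import ray
    from ray._private.test_utils import wait_for_condition

    def test_crash_during_shutdown_no_orphaned_actors(ray_shutdown):
        deploy_test_app()  # hypothetical: run an app so replica actors exist
        controller = get_serve_controller_handle()  # hypothetical lookup
        # Kick off graceful shutdown, then crash the controller mid-teardown.
        initiate_shutdown(controller)  # hypothetical wrapper; returns early
        ray.kill(controller, no_restart=False)
        # Without the fix, the restarted controller has no record of what to
        # clean up and replicas are orphaned; with it, teardown resumes.
        wait_for_condition(lambda: _check_deployment_actor_count(0), timeout=15)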

Related issues

Fixes #62729

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
@vaishdho1 vaishdho1 requested a review from a team as a code owner April 21, 2026 17:21

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a mechanism to persist the shutdown state of the Ray Serve controller in the KV store, ensuring that if the controller crashes during a graceful shutdown, it resumes the shutdown process upon recovery rather than re-applying the previous configuration. Changes include adding a SHUTDOWN_IN_PROGRESS_KEY, updating the controller's recovery logic, and moving checkpoint deletion to the final stage of the shutdown process. Feedback was provided regarding the recovery logic: when the controller restarts in a shutdown-in-progress state, it should explicitly trigger shutdown on the state managers to ensure all entities are correctly marked for teardown and to prevent the controller from waiting indefinitely.

Comment thread python/ray/serve/_private/controller.py
@ray-gardener ray-gardener Bot added the serve (Ray Serve related issue) and community-contribution (Contributed by the community) labels Apr 21, 2026
wait_for_condition(lambda: _check_deployment_actor_count(0), timeout=15)


def test_crash_during_shutdown_no_orphaned_actors(ray_shutdown):
Contributor

does this test fail w/o your PR?

Contributor Author

Yes, this fails w/o the changes.

Comment thread python/ray/serve/_private/controller.py
@abrarsheikh abrarsheikh added the go (add ONLY when ready to merge, run all tests) label Apr 22, 2026
@abrarsheikh
Contributor

gcs_failure tests are failing, PTAL

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
@vaishdho1
Contributor Author

While fixing the GCS test failures I found something interesting: there is a test where the GCS is down and takes a while to come back up when graceful_shutdown() is called, so the write to the KV store fails.
To handle this, I wrapped the write in a try/except block inside graceful_shutdown(); if the first write fails, it is retried on the next control-loop tick inside shutdown().
I have summarized the different crash scenarios during shutdown, their recovery patterns, and whether each is supported with these changes:

  • Normal shutdown, no crash: the flag is persisted in graceful_shutdown() and the controller completes teardown.
  • Controller crash mid-shutdown: the flag was persisted before the crash. The restarted controller reads it, re-enters shutdown, the managers recover state from their checkpoints, and teardown resumes.
  • GCS down when graceful_shutdown() is called: kv_store.put fails and is caught by the try/except; _shutting_down is set in memory. The control loop retries persisting via shutdown() on each tick until the GCS is back.
  • GCS down and controller crash simultaneously: the flag may not be persisted. On restart, the controller doesn't know shutdown was in progress and resumes normal operation. This double-failure scenario is not supported.
  • Crash before the first control-loop tick after graceful_shutdown(): the flag is persisted in graceful_shutdown() itself before returning to the caller, so even an immediate ray.kill after the call is safe.
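
A sketch of the retry path described above, assuming shutdown() is invoked on every control-loop tick while _shutting_down is set (names follow the PR, but the surrounding code is elided):

    def shutdown(self):  # invoked each control-loop tick once shutting down
        if not self._shutdown_flag_persisted:
            try:
                # The GCS may be reachable again by now; retry the write.
                self.kv_store.put(SHUTDOWN_IN_PROGRESS_KEY, b"1")
                self._shutdown_flag_persisted = True
            except Exception:
                logger.warning("KV store still unreachable; retrying next tick.")
        # ...continue driving application/deployment teardown...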

Is this good?

Comment on lines +1760 to +1767
try:
    self.kv_store.put(SHUTDOWN_IN_PROGRESS_KEY, b"1")
    self._shutdown_flag_persisted = True
except Exception:
    logger.warning(
        "Failed to persist shutdown flag; will retry in control loop.",
        extra={"log_to_stderr": False},
    )
Contributor

what if we remove this code block completely? wouldn't shutdown() get called on every tick of the controller (because self._shutting_down is set to True) and do the right thing?

Contributor Author

@vaishdho1 vaishdho1 Apr 22, 2026

Yes. The only thing is that there would be a small window, up until the next control-loop tick, during which a crash could not be recovered. If that is okay, we can set the flag directly inside shutdown(). The current placement was chosen to keep that window as small as possible.

Contributor

You make a good point; it may be at least 100ms before the next tick. And by committing to the KV store here, we are able to return a correct response status to the user.

@abrarsheikh
Contributor

@akyang-anyscale PTAL

@abrarsheikh abrarsheikh merged commit 2c7ca74 into ray-project:master Apr 24, 2026
6 checks passed

Labels

community-contribution (Contributed by the community), go (add ONLY when ready to merge, run all tests), serve (Ray Serve related issue)


Development

Successfully merging this pull request may close these issues.

[Serve] Shutdown deletes KV checkpoints before teardown completes; can orphan replica actors
