
[Serve] Fix orphaned actors on controller crash during shutdown#62823

Merged
abrarsheikh merged 3 commits into ray-project:master from
vaishdho1:serve-shutdown-checkpoint-state-fix
Apr 24, 2026

Conversation

@vaishdho1
Contributor

Description

When the Serve controller crashes mid-shutdown, replica actors are orphaned and their resources are leaked. This happens because KV checkpoints are deleted at the very start of the shutdown process. If the controller crashes and restarts after checkpoint deletion but before actor teardown completes, the restarted controller has no record of the apps/deployments it needs to clean up.
This PR fixes the issue by:

  • Persisting a SHUTDOWN_IN_PROGRESS_KEY in the KV store at the start of graceful_shutdown(). On restart, the controller checks this key and automatically re-enters the shutdown path.
  • Deferring checkpoint deletion to the very end, after all the resources are released and before the controller kills itself.
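A minimal sketch of this flow, assuming a kv_store with get/put/delete methods and a hypothetical _teardown_complete() check (the real controller code is structured differently):

    SHUTDOWN_IN_PROGRESS_KEY = "shutdown_in_progress"

    class Controller:
        def __init__(self, kv_store):
            self.kv_store = kv_store
            # On restart, re-enter the shutdown path if the flag survived a crash.
            self._shutting_down = (
                self.kv_store.get(SHUTDOWN_IN_PROGRESS_KEY) is not None
            )

        def graceful_shutdown(self):
            # Persist the flag before anything else so a crash at any later
            # point is recoverable.
            self.kv_store.put(SHUTDOWN_IN_PROGRESS_KEY, b"1")
            self._shutting_down = True

        def shutdown(self):  # driven by the control loop on every tick
            if self._shutting_down and self._teardown_complete():
                # Checkpoints are deleted only after all actors are torn down,
                # just before the controller kills itself.
                self.kv_store.delete(SHUTDOWN_IN_PROGRESS_KEY)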

Tests

  • test_shutdown.py - controller crash mid-shutdown recovers and continues shutdown
  • test_application_state.py - app checkpoint survives shutdown lifecycle
  • test_deployment_state.py - deployment checkpoint survives shutdown lifecycle
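A rough sketch of the shape of the crash-recovery test (deploy_test_app, get_serve_controller_handle, and initiate_shutdown are hypothetical helpers; the real test lives in test_shutdown.py):

    import ray
    from ray._private.test_utils import wait_for_condition

    def test_crash_during_shutdown_no_orphaned_actors(ray_shutdown):
        deploy_test_app()  # hypothetical: run an app so replica actors exist
        controller = get_serve_controller_handle()  # hypothetical lookup
        # Kick off graceful shutdown, then crash the controller mid-teardown.
        initiate_shutdown(controller)  # hypothetical wrapper; returns early
        ray.kill(controller, no_restart=False)
        # Without the fix, the restarted controller has no record of what to
        # clean up and replicas are orphaned; with it, teardown resumes.
        wait_for_condition(lambda: _check_deployment_actor_count(0), timeout=15)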

Related issues

Fixes #62729

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
@vaishdho1 vaishdho1 requested a review from a team as a code owner April 21, 2026 17:21

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a mechanism to persist the shutdown state of the Ray Serve controller in the KV store, ensuring that if the controller crashes during a graceful shutdown, it resumes the shutdown process upon recovery rather than re-applying the previous configuration. Changes include adding a SHUTDOWN_IN_PROGRESS_KEY, updating the controller's recovery logic, and moving checkpoint deletion to the final stage of the shutdown process. Feedback was provided regarding the recovery logic: when the controller restarts in a shutdown-in-progress state, it should explicitly trigger shutdown on the state managers to ensure all entities are correctly marked for teardown and to prevent the controller from waiting indefinitely.

Comment thread python/ray/serve/_private/controller.py
@ray-gardener ray-gardener Bot added the serve (Ray Serve related issue) and community-contribution (Contributed by the community) labels Apr 21, 2026
wait_for_condition(lambda: _check_deployment_actor_count(0), timeout=15)


def test_crash_during_shutdown_no_orphaned_actors(ray_shutdown):
Contributor

does this test fail w/o your PR?

Contributor Author

Yes, this fails w/o the changes.

Comment thread python/ray/serve/_private/controller.py
@abrarsheikh abrarsheikh added the go (add ONLY when ready to merge, run all tests) label Apr 22, 2026
@abrarsheikh
Contributor

gcs_failure tests are failing, PTAL

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
@vaishdho1
Contributor Author

While fixing the GCS test failures I found something interesting: there is a test where the GCS is down and takes a while to come back up when graceful_shutdown() is called, so the write to the KV store fails.
To handle this, I wrapped the write in a try/except block inside graceful_shutdown(); if the first write fails, it is retried on the next control-loop tick inside shutdown().
I have summarized the different crash scenarios during shutdown, their recovery patterns, and whether each is supported with these changes:

  • Normal shutdown, no crash: the flag is persisted in graceful_shutdown() and the controller completes teardown.
  • Controller crash mid-shutdown: the flag was persisted before the crash. The restarted controller reads it, re-enters shutdown, the managers recover state from their checkpoints, and teardown resumes.
  • GCS down when graceful_shutdown() is called: kv_store.put fails and is caught by the try/except; _shutting_down is set in memory. The control loop retries persisting via shutdown() on each tick until the GCS is back.
  • GCS down and controller crash simultaneously: the flag may not be persisted. On restart, the controller doesn't know shutdown was in progress and resumes normal operation. This double-failure scenario is not supported.
  • Crash before the first control-loop tick after graceful_shutdown(): the flag is persisted in graceful_shutdown() itself before returning to the caller, so even an immediate ray.kill after the call is safe.
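
A sketch of the retry path described above, assuming shutdown() is invoked on every control-loop tick while _shutting_down is set (names follow the PR, but the surrounding code is elided):

    def shutdown(self):  # invoked each control-loop tick once shutting down
        if not self._shutdown_flag_persisted:
            try:
                # The GCS may be reachable again by now; retry the write.
                self.kv_store.put(SHUTDOWN_IN_PROGRESS_KEY, b"1")
                self._shutdown_flag_persisted = True
            except Exception:
                logger.warning("KV store still unreachable; retrying next tick.")
        # ...continue driving application/deployment teardown...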

Is this good?

Comment on lines +1760 to +1767
try:
    self.kv_store.put(SHUTDOWN_IN_PROGRESS_KEY, b"1")
    self._shutdown_flag_persisted = True
except Exception:
    logger.warning(
        "Failed to persist shutdown flag; will retry in control loop.",
        extra={"log_to_stderr": False},
    )
Contributor

what if we remove this code block completely? wouldn't shutdown() get called on every tick of the controller (because self._shutting_down is set to True) and do the right thing?

Contributor Author

@vaishdho1 vaishdho1 Apr 22, 2026

Yes. The only thing is that there would be a small window, up until the next control-loop tick, during which a crash could not be recovered. If that is okay, we can set the flag directly inside shutdown(). The current placement was chosen to keep that window as small as possible.

Contributor

You make a good point; it may be at least 100ms before the next tick. And by committing to the KV store here, we are able to return a correct response status to the user.

@abrarsheikh
Contributor

@akyang-anyscale PTAL

@abrarsheikh abrarsheikh merged commit 2c7ca74 into ray-project:master Apr 24, 2026
6 checks passed

Labels

community-contribution (Contributed by the community), go (add ONLY when ready to merge, run all tests), serve (Ray Serve related issue)


Development

Successfully merging this pull request may close these issues.

[Serve] Shutdown deletes KV checkpoints before teardown completes; can orphan replica actors
