[Serve] Fix orphaned actors on controller crash during shutdown#62823
Conversation
Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Code Review
This pull request introduces a mechanism to persist the shutdown state of the Ray Serve controller in the KV store, ensuring that if the controller crashes during a graceful shutdown, it resumes the shutdown process upon recovery rather than re-applying the previous configuration. Changes include adding a SHUTDOWN_IN_PROGRESS_KEY, updating the controller's recovery logic, and moving checkpoint deletion to the final stage of the shutdown process. Feedback was provided regarding the recovery logic: when the controller restarts in a shutdown-in-progress state, it should explicitly trigger shutdown on the state managers to ensure all entities are correctly marked for teardown and to prevent the controller from waiting indefinitely.
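To make the recovery path concrete, here is a minimal sketch of the behavior described above (the class shape, manager attributes, and method names are assumptions for illustration, not the controller's actual code):

```python
SHUTDOWN_IN_PROGRESS_KEY = "SHUTDOWN_IN_PROGRESS"  # key value assumed for illustration

class ServeController:
    def __init__(self, kv_store, application_state_manager, deployment_state_manager):
        self.kv_store = kv_store
        self.application_state_manager = application_state_manager
        self.deployment_state_manager = deployment_state_manager
        self._shutting_down = False

    def _recover_from_checkpoint(self):
        # If the previous controller crashed mid-shutdown, resume the
        # shutdown instead of re-applying the old configuration.
        if self.kv_store.get(SHUTDOWN_IN_PROGRESS_KEY) is not None:
            self._shutting_down = True
            # Per the review feedback: explicitly trigger shutdown on the
            # state managers so every entity is marked for teardown and the
            # controller does not wait indefinitely.
            self.application_state_manager.shutdown()
            self.deployment_state_manager.shutdown()
            return
        # Otherwise, rebuild apps and deployments from checkpoints as usual.
        ...
```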
```python
wait_for_condition(lambda: _check_deployment_actor_count(0), timeout=15)


def test_crash_during_shutdown_no_orphaned_actors(ray_shutdown):
```
does this test fail w/o your PR?
Yes, this fails w/o the changes.
gcs_failure tests are failing, PTAL
Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
While fixing the GCS test errors, I found something interesting: there was a test where GCS is down and takes a while to start again when
Is this good?
```python
try:
    self.kv_store.put(SHUTDOWN_IN_PROGRESS_KEY, b"1")
    self._shutdown_flag_persisted = True
except Exception:
    logger.warning(
        "Failed to persist shutdown flag; will retry in control loop.",
        extra={"log_to_stderr": False},
    )
```
what if we remove this code block completely? would shutdown get called on every tick of the controller (because self._shutting_down is set to True) and do the right thing?
Yes. But there will be a small window until the next control loop tick during which a crash cannot be recovered from. If that is okay, we can set the flag directly inside shutdown(). This was done to reduce that window as much as possible.
You make a good point; it may be at least 100ms before the next tick. And by committing to the KV store here, we are able to provide the correct response status to the user.
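A sketch of the retry path being discussed, assuming a tick-based control loop (the interval constant and method names are illustrative; the ~100ms tick comes from the comment above):

```python
import asyncio

CONTROL_LOOP_INTERVAL_S = 0.1  # roughly the ~100ms tick mentioned above

async def run_control_loop(self):
    while True:
        if self._shutting_down:
            # If the put() in shutdown() failed, retry here so that a crash
            # after this point is still recoverable on restart.
            if not self._shutdown_flag_persisted:
                try:
                    self.kv_store.put(SHUTDOWN_IN_PROGRESS_KEY, b"1")
                    self._shutdown_flag_persisted = True
                except Exception:
                    pass  # KV store still unavailable; retry next tick.
            # shutdown() is assumed idempotent: each tick keeps driving
            # teardown forward until everything is cleaned up.
            self.shutdown()
        await asyncio.sleep(CONTROL_LOOP_INTERVAL_S)
```

Persisting eagerly in shutdown() shrinks the unrecoverable window to near zero; the control-loop retry is the fallback when the KV store is briefly unavailable.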
@akyang-anyscale PTAL
Description
When the Serve controller crashes mid-shutdown, replica actors are orphaned and their resources are leaked. This happens because KV checkpoints are deleted at the very start of the shutdown process. If the controller crashes and restarts after checkpoint deletion but before actor teardown completes, the restarted controller has no record of the apps/deployments it needs to clean up.
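As a rough sketch of the corrected ordering (the helper methods and CONFIG_CHECKPOINT_KEY are hypothetical placeholders; only SHUTDOWN_IN_PROGRESS_KEY comes from the PR):

```python
def graceful_shutdown(self):
    # 1. Record that a shutdown is underway before touching anything else,
    #    so a crash anywhere below is recoverable on restart.
    self.kv_store.put(SHUTDOWN_IN_PROGRESS_KEY, b"1")
    self._shutting_down = True

    # 2. Mark all apps and deployments for teardown. Checkpoints are still
    #    intact at this point, so a restarted controller knows exactly
    #    which replica actors it must clean up.
    self.application_state_manager.shutdown()
    self.deployment_state_manager.shutdown()

    # 3. Delete checkpoints only at the very end, once teardown completes
    #    (previously this deletion happened at the start, causing the leak).
    self._wait_for_teardown_complete()
    self.kv_store.delete(CONFIG_CHECKPOINT_KEY)
    self.kv_store.delete(SHUTDOWN_IN_PROGRESS_KEY)
```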
This PR fixes the issue by:

- Persisting a SHUTDOWN_IN_PROGRESS_KEY in the KV store at the start of graceful_shutdown(). On restart, the controller checks this key and automatically re-enters the shutdown path.
- Moving checkpoint deletion to the final stage of the shutdown process, after actor teardown completes.

Tests
- test_shutdown.py - controller crash mid-shutdown recovers and continues shutdown
- test_application_state.py - app checkpoint survives shutdown lifecycle
- test_deployment_state.py - deployment checkpoint survives shutdown lifecycle

Related issues