
[serve] safe draining #43228

Merged
merged 1 commit into ray-project:master from the pr43228 branch on Feb 23, 2024

Conversation

@zcin (Contributor) commented Feb 16, 2024

[serve] safe draining

Implement safe draining.

When we receive notification that a node is draining, or will be terminated soon, try to start a new replica first before gracefully terminating replicas running on the draining node.

  1. If a new replacement replica gets started before deadline - graceful_shutdown_timeout_s, then start graceful termination of the old replica after the new replacement replica starts.
  2. If it takes longer for the replacement replica to start, then at the latest start graceful termination of the old replica at deadline - graceful_shutdown_timeout_s.
  3. If there is no deadline, wait indefinitely for the new replacement replica to start before gracefully terminating the old replica.

Signed-off-by: Cindy Zhang cindyzyx9@gmail.com
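
To make the timing rules above concrete, here is a minimal sketch of the decision (the helper name and signature are invented for illustration; the actual PR implements this inside the deployment state machine rather than as a single function):

    import time

    def should_start_graceful_termination(
        replacement_running: bool,
        deadline_timestamp_s: float,  # 0 means the draining node has no deadline
        graceful_shutdown_timeout_s: float,
    ) -> bool:
        # Case 1: the replacement replica is already running, so the old
        # replica can begin graceful termination right away.
        if replacement_running:
            return True
        # Case 3: no deadline, so keep waiting for the replacement replica.
        if deadline_timestamp_s == 0:
            return False
        # Case 2: start graceful termination no later than
        # deadline - graceful_shutdown_timeout_s so shutdown can finish in time.
        return time.time() >= deadline_timestamp_s - graceful_shutdown_timeout_s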

@zcin zcin force-pushed the pr43228 branch 9 times, most recently from 476cee1 to 03e971e Compare February 18, 2024 03:12
@zcin zcin self-assigned this Feb 20, 2024
@zcin zcin force-pushed the pr43228 branch 6 times, most recently from 483d90f to 2ae7e95 Compare February 21, 2024 22:39
@zcin zcin marked this pull request as ready for review February 21, 2024 22:39
@zcin zcin requested review from jjyao and edoakes February 21, 2024 22:40
@edoakes (Contributor) commented Feb 21, 2024

  1. If there is no deadline, wait indefinitely for the new replacement replica to start before gracefully terminating the old replica.

@zcin I think we should have a conservative default value here (controlled by environment variable). Otherwise we could end up in a "stuck" state indefinitely if there's a deadlock scenario (e.g., constrained resources).

Perhaps 5min by default?
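
A minimal sketch of that suggestion, assuming an environment variable with a 5 minute default; the variable name here is made up for illustration and is not an existing Ray Serve setting:

    import os

    # Hypothetical knob: cap how long a replica on a deadline-less draining node
    # waits for its replacement replica before graceful termination starts anyway.
    RAY_SERVE_MAX_DRAIN_WAIT_S = float(
        os.environ.get("RAY_SERVE_MAX_DRAIN_WAIT_S", 5 * 60)
    )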

@edoakes (Contributor) left a comment

Looks great. Only stylistic comments.

Please also add an integration test using the Cluster utility.

python/ray/serve/_private/deployment_state.py (outdated review thread, resolved)
python/ray/serve/_private/deployment_state.py (outdated review thread, resolved)
@@ -61,6 +61,7 @@ class ReplicaState(str, Enum):
  RECOVERING = "RECOVERING"
  RUNNING = "RUNNING"
  STOPPING = "STOPPING"
+ DRAINING = "DRAINING"
Contributor:

I find this state name a little bit confusing because it makes me think of the graceful shutdown procedure :/ don't have a better suggestion off the top of my head though.

Contributor:

Oh, actually, what about MIGRATING or PENDING_MIGRATION? That seems more clear to me -- we are trying to logically "migrate" this replica to another node, but are waiting for that one to start up first.
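
For illustration only, the enum from the diff above with the suggested name swapped in (a sketch of the proposal, not necessarily what was merged):

    from enum import Enum

    class ReplicaState(str, Enum):
        # ...earlier states elided...
        RECOVERING = "RECOVERING"
        RUNNING = "RUNNING"
        STOPPING = "STOPPING"
        # Proposed rename of the new DRAINING state discussed in this thread.
        PENDING_MIGRATION = "PENDING_MIGRATION"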

python/ray/serve/tests/unit/test_deployment_state.py (outdated review thread, resolved)
Comment on lines 1406 to 1408
a long time to start, the replica on the draining node should start
gracefully termination `graceful_shutdown_timeout_s` seconds before
the draining node's deadline, even if the new replica hasn't
Contributor:

Think you missed some words here?

Contributor:

or maybe just meant to say "start graceful termination" or "start gracefully terminating" :)

Comment on lines 1488 to 1490
We should try to start a new replica first. If the draining node has
no deadline (deadline is set to 0), then the replica should wait
indefinitely for the new replica to start before initiating graceful
Contributor:

as per top-level comment, I think we should still have a deadline here.

@@ -53,7 +53,7 @@ def get_alive_node_ids(self) -> Set[str]:
return {node_id for node_id, _ in self.get_alive_nodes()}

@abstractmethod
- def get_draining_node_ids(self) -> Set[str]:
+ def get_draining_node_ids(self) -> Dict[str, int]:
Contributor:

Can we make some kind of typed object for the return value now that we're adding metadata like the deadline?

Contributor Author (@zcin):

I like the idea! Not sure if there are any immediate plans to add more information though - if not, would it be better to change this to a dataclass later, or do it now?

Contributor:

it's fine, up to you
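
A minimal sketch of how the suggested typed object could look if it were introduced now; the class and field names are invented for illustration, and the PR as written keeps the plain Dict[str, int] of node id to deadline timestamp:

    from dataclasses import dataclass
    from typing import Dict

    @dataclass(frozen=True)
    class DrainingNodeInfo:
        # Timestamp by which the draining node will be terminated;
        # 0 means the node has no deadline.
        deadline_timestamp_ms: int

    # get_draining_node_ids() would then return Dict[str, DrainingNodeInfo]
    # (node id -> draining metadata) instead of Dict[str, int].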

@zcin zcin force-pushed the pr43228 branch 5 times, most recently from 015fad3 to 9ff9002 Compare February 22, 2024 21:45
@edoakes (Contributor) commented Feb 22, 2024

Implement safe draining.

When we receive notification that a node is draining, or will be terminated soon, try to start a new replica first before gracefully terminating replicas running on the draining node.

1. If a new replacement replica gets started before deadline - `graceful_shutdown_timeout_s`, then start graceful termination of the old replica after the new replacement replica starts.
2. If it takes longer for the replacement replica to start, then at the latest start graceful termination of the old replica at deadline - `graceful_shutdown_timeout_s`.
3. If there is no deadline, wait indefinitely for the new replacement replica to start before gracefully terminating the old replica.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin (Contributor Author) commented Feb 23, 2024

@edoakes @jjyao Tests are passing!

@edoakes edoakes merged commit 71187b5 into ray-project:master Feb 23, 2024
10 checks passed