[serve] safe draining #43228
Conversation
@zcin I think we should have a conservative default value here (controlled by an environment variable). Otherwise we could end up in a "stuck" state indefinitely if there's a deadlock scenario (e.g., constrained resources). Perhaps 5 minutes by default?
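Reading such a default from an environment variable usually happens once at import time. A minimal sketch of the suggestion; the variable name below is hypothetical, this thread does not name the actual one:

```python
import os

# Hypothetical variable name, for illustration only; the PR does not
# specify it. Falls back to the conservative 5-minute (300 s) default
# the reviewer suggests when the variable is unset.
DRAIN_DEADLINE_DEFAULT_S = float(
    os.environ.get("RAY_SERVE_DRAIN_DEADLINE_DEFAULT_S", 5 * 60)
)
```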
Looks great. Only stylistic comments.
Please also add an integration test using the Cluster utility.
python/ray/serve/_private/common.py (Outdated)
@@ -61,6 +61,7 @@ class ReplicaState(str, Enum):
     RECOVERING = "RECOVERING"
     RUNNING = "RUNNING"
     STOPPING = "STOPPING"
+    DRAINING = "DRAINING"
I find this state name a little bit confusing because it makes me think of the graceful shutdown procedure :/ don't have a better suggestion off the top of my head though.
Oh, actually, what about MIGRATING or PENDING_MIGRATION? That seems clearer to me -- we are trying to logically "migrate" this replica to another node, but are waiting for that one to start up first.
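For context, the suggestion amounts to adding the new state under a migration-oriented name instead of DRAINING. A sketch of the enum with the suggested name (the other states are from the diff above):

```python
from enum import Enum

class ReplicaState(str, Enum):
    RECOVERING = "RECOVERING"
    RUNNING = "RUNNING"
    STOPPING = "STOPPING"
    # Suggested alternative to DRAINING: the replica is waiting to be
    # logically "migrated" to a replacement on another node.
    PENDING_MIGRATION = "PENDING_MIGRATION"
```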
a long time to start, the replica on the draining node should start
gracefully termination `graceful_shutdown_timeout_s` seconds before
the draining node's deadline, even if the new replica hasn't
Think you missed some words here?
or maybe just meant to say "start graceful termination" or "start gracefully terminating" :)
We should try to start a new replica first. If the draining node has
no deadline (deadline is set to 0), then the replica should wait
indefinitely for the new replica to start before initiating graceful
as per top-level comment, I think we should still have a deadline here.
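The timing rule under discussion can be sketched as a small helper (an illustration, not the controller's actual API; the function name is made up):

```python
from typing import Optional

def graceful_stop_start_time(
    deadline_s: float, graceful_shutdown_timeout_s: float
) -> Optional[float]:
    """Latest time at which graceful termination must begin.

    In this PR a deadline of 0 means "no deadline"; the review suggests
    substituting a conservative default there instead of waiting forever.
    """
    if deadline_s == 0:
        return None  # wait indefinitely for the replacement replica
    # Leave enough time for graceful shutdown before the node deadline.
    return deadline_s - graceful_shutdown_timeout_s
```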
@@ -53,7 +53,7 @@ def get_alive_node_ids(self) -> Set[str]:
         return {node_id for node_id, _ in self.get_alive_nodes()}

     @abstractmethod
-    def get_draining_node_ids(self) -> Set[str]:
+    def get_draining_node_ids(self) -> Dict[str, int]:
Can we make some kind of typed object for the return value now that we're adding metadata like the deadline?
I like the idea! Not sure if there are any immediate plans to add more information though - if not, would it be better to change this to a dataclass later, or do it now?
it's fine, up to you
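If the typed-object route is taken later, it could look roughly like the following. This is a sketch under assumptions: the class and field names are hypothetical, and it models only the one piece of metadata (the deadline) currently in the `Dict[str, int]` return value:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class DrainingNodeInfo:
    """Hypothetical typed replacement for the Dict[str, int] return value."""
    node_id: str
    # Drain deadline for the node; 0 means the node has no deadline.
    deadline_timestamp: int

def to_typed(raw: Dict[str, int]) -> List[DrainingNodeInfo]:
    # Convert the current dict shape into typed objects.
    return [DrainingNodeInfo(node_id, deadline) for node_id, deadline in raw.items()]
```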
Implement safe draining. When we receive notification that a node is draining, or will be terminated soon, try to start a new replica first before gracefully terminating replicas running on the draining node.
1. If a new replacement replica gets started before deadline - `graceful_shutdown_timeout_s`, then start graceful termination of the old replica after the new replacement replica starts.
2. If it takes longer for the replacement replica to start, then at the latest start graceful termination of the old replica at deadline - `graceful_shutdown_timeout_s`.
3. If there is no deadline, wait indefinitely for the new replacement replica to start before gracefully terminating the old replica.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
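The three cases in the commit message can be condensed into a single decision function. This is a simplification for illustration; the names are made up and do not reflect the controller's actual API:

```python
def should_start_graceful_termination(
    now_s: float,
    deadline_s: float,               # 0 means the node has no deadline
    graceful_shutdown_timeout_s: float,
    replacement_started: bool,
) -> bool:
    # Cases 1 and 3: the replacement replica is up, so the old replica
    # on the draining node can begin graceful termination immediately.
    if replacement_started:
        return True
    # Case 3: no deadline, keep waiting for the replacement replica.
    if deadline_s == 0:
        return False
    # Case 2: at the latest, start graceful termination
    # graceful_shutdown_timeout_s seconds before the node's deadline.
    return now_s >= deadline_s - graceful_shutdown_timeout_s
```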