
[serve] safe draining #43228

Merged
merged 1 commit into ray-project:master from the pr43228 branch on Feb 23, 2024

Conversation

@zcin (Contributor) commented Feb 16, 2024

[serve] safe draining

Implement safe draining.

When we receive notification that a node is draining, or will be terminated soon, try to start a new replica first before gracefully terminating replicas running on the draining node.

  1. If a new replacement replica gets started before deadline - graceful_shutdown_timeout_s, then start graceful termination of the old replica after the new replacement replica starts.
  2. If it takes longer for the replacement replica to start, then at the latest start graceful termination of the old replica at deadline - graceful_shutdown_timeout_s.
  3. If there is no deadline, wait indefinitely for the new replacement replica to start before gracefully terminating the old replica.

Signed-off-by: Cindy Zhang cindyzyx9@gmail.com
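
To make the timing rules above concrete, here is a minimal sketch of the decision (the helper name and signature are invented for illustration; the actual PR implements this inside the deployment state machine rather than as a single function):

    import time

    def should_start_graceful_termination(
        replacement_running: bool,
        deadline_timestamp_s: float,  # 0 means the draining node has no deadline
        graceful_shutdown_timeout_s: float,
    ) -> bool:
        # Case 1: the replacement replica is already running, so the old
        # replica can begin graceful termination right away.
        if replacement_running:
            return True
        # Case 3: no deadline, so keep waiting for the replacement replica.
        if deadline_timestamp_s == 0:
            return False
        # Case 2: start graceful termination no later than
        # deadline - graceful_shutdown_timeout_s so shutdown can finish in time.
        return time.time() >= deadline_timestamp_s - graceful_shutdown_timeout_s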

@zcin zcin force-pushed the pr43228 branch 9 times, most recently from 476cee1 to 03e971e Compare February 18, 2024 03:12
@zcin zcin self-assigned this Feb 20, 2024
@zcin zcin force-pushed the pr43228 branch 6 times, most recently from 483d90f to 2ae7e95 Compare February 21, 2024 22:39
@zcin zcin marked this pull request as ready for review February 21, 2024 22:39
@zcin zcin requested review from jjyao and edoakes February 21, 2024 22:40
@edoakes (Contributor) commented Feb 21, 2024

  1. If there is no deadline, wait indefinitely for the new replacement replica to start before gracefully terminating the old replica.

@zcin I think we should have a conservative default value here (controlled by environment variable). Otherwise we could end up in a "stuck" state indefinitely if there's a deadlock scenario (e.g., constrained resources).

Perhaps 5min by default?
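
A minimal sketch of that suggestion, assuming an environment variable with a 5 minute default; the variable name here is made up for illustration and is not an existing Ray Serve setting:

    import os

    # Hypothetical knob: cap how long a replica on a deadline-less draining node
    # waits for its replacement replica before graceful termination starts anyway.
    RAY_SERVE_MAX_DRAIN_WAIT_S = float(
        os.environ.get("RAY_SERVE_MAX_DRAIN_WAIT_S", 5 * 60)
    )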

@edoakes (Contributor) left a comment

Looks great. Only stylistic comments.

Please also add an integration test using the Cluster utility.

python/ray/serve/_private/deployment_state.py (outdated review thread, resolved)
python/ray/serve/_private/deployment_state.py (outdated review thread, resolved)
@@ -61,6 +61,7 @@ class ReplicaState(str, Enum):
  RECOVERING = "RECOVERING"
  RUNNING = "RUNNING"
  STOPPING = "STOPPING"
+ DRAINING = "DRAINING"
Contributor:

I find this state name a little bit confusing because it makes me think of the graceful shutdown procedure :/ don't have a better suggestion off the top of my head though.

Contributor:

Oh, actually, what about MIGRATING or PENDING_MIGRATION? That seems more clear to me -- we are trying to logically "migrate" this replica to another node, but are waiting for that one to start up first.
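
For illustration only, the enum from the diff above with the suggested name swapped in (a sketch of the proposal, not necessarily what was merged):

    from enum import Enum

    class ReplicaState(str, Enum):
        # ...earlier states elided...
        RECOVERING = "RECOVERING"
        RUNNING = "RUNNING"
        STOPPING = "STOPPING"
        # Proposed rename of the new DRAINING state discussed in this thread.
        PENDING_MIGRATION = "PENDING_MIGRATION"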

python/ray/serve/tests/unit/test_deployment_state.py (outdated review thread, resolved)
Comment on lines 1406 to 1408
a long time to start, the replica on the draining node should start
gracefully termination `graceful_shutdown_timeout_s` seconds before
the draining node's deadline, even if the new replica hasn't
Contributor:

Think you missed some words here?

Contributor:

or maybe just meant to say "start graceful termination" or "start gracefully terminating" :)

Comment on lines 1488 to 1490
We should try to start a new replica first. If the draining node has
no deadline (deadline is set to 0), then the replica should wait
indefinitely for the new replica to start before initiating graceful
Contributor:

as per top-level comment, I think we should still have a deadline here.

@@ -53,7 +53,7 @@ def get_alive_node_ids(self) -> Set[str]:
return {node_id for node_id, _ in self.get_alive_nodes()}

@abstractmethod
- def get_draining_node_ids(self) -> Set[str]:
+ def get_draining_node_ids(self) -> Dict[str, int]:
Contributor:

Can we make some kind of typed object for the return value now that we're adding metadata like the deadline?

Contributor Author (@zcin):

I like the idea! Not sure if there are any immediate plans to add more information though - if not, would it be better to change this to a dataclass later, or do it now?

Contributor:

it's fine, up to you
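
A minimal sketch of how the suggested typed object could look if it were introduced now; the class and field names are invented for illustration, and the PR as written keeps the plain Dict[str, int] of node id to deadline timestamp:

    from dataclasses import dataclass
    from typing import Dict

    @dataclass(frozen=True)
    class DrainingNodeInfo:
        # Timestamp by which the draining node will be terminated;
        # 0 means the node has no deadline.
        deadline_timestamp_ms: int

    # get_draining_node_ids() would then return Dict[str, DrainingNodeInfo]
    # (node id -> draining metadata) instead of Dict[str, int].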

@zcin zcin force-pushed the pr43228 branch 5 times, most recently from 015fad3 to 9ff9002 Compare February 22, 2024 21:45
@edoakes (Contributor) commented Feb 22, 2024

Implement safe draining.

When we receive notification that a node is draining, or will be terminated soon, try to start a new replica first before gracefully terminating replicas running on the draining node.

1. If a new replacement replica gets started before deadline - `graceful_shutdown_timeout_s`, then start graceful termination of the old replica after the new replacement replica starts.
2. If it takes longer for the replacement replica to start, then at the latest start graceful termination of the old replica at deadline - `graceful_shutdown_timeout_s`.
3. If there is no deadline, wait indefinitely for the new replacement replica to start before gracefully terminating the old replica.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin (Contributor Author) commented Feb 23, 2024

@edoakes @jjyao Tests are passing!

@edoakes edoakes merged commit 71187b5 into ray-project:master Feb 23, 2024
10 checks passed