[serve] Rolling updates for redeployments #14803
Conversation
It's going to take some time to review this, ETA Tuesday noon.
python/ray/serve/backend_state.py (Outdated)
    # Replicas of the target (new) version that are already serving traffic.
    new_running_replicas = self._replicas[backend_tag].count(
        version=target_version, states=[ReplicaState.RUNNING])

    # Replicas that still need to be brought up to reach the target.
    pending_replicas = (
        target_replicas - new_running_replicas - old_running_replicas)

    # Rolling-update batch size: floor(20% of the target), at least 1.
    rollout_size = max(int(0.2 * target_replicas), 1)

    # Only stop old replicas if doing so keeps the number of replicas in
    # transition within the batch size.
    max_to_stop = max(rollout_size - pending_replicas, 0)

    replicas_to_stop = self._replicas[backend_tag].pop(
        exclude_version=target_version,
        states=[
            ReplicaState.SHOULD_START, ReplicaState.STARTING,
            ReplicaState.RUNNING
        ],
        max_replicas=max_to_stop)
@simon-mo this is the core logic to review, FYI.
I looked at the core logic and had a couple of concerns, but I might be misunderstanding or misremembering something. I think it's close, though! Overall, more comments would be helpful.
@simon-mo @architkulkarni I updated the tests to be easier to read and added two new test cases. I also addressed the issue @architkulkarni pointed out about target_replicas being below the current pending_replicas, and added some more comments explaining the logic. Please take another look and let me know what's still confusing or if you see any issues in the logic.
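For context, here is a rough reading of how the formulas in the (now outdated) snippet above behave during a scale-down, where the target falls below the number of old replicas still running. This is illustrative arithmetic only; the updated code in this PR may handle the case differently.

    target_replicas = 2          # new target after a scale-down
    old_running_replicas = 10    # old-version replicas still RUNNING
    new_running_replicas = 0     # no new-version replicas RUNNING yet

    pending_replicas = target_replicas - new_running_replicas - old_running_replicas  # 2 - 0 - 10 = -8
    rollout_size = max(int(0.2 * target_replicas), 1)                                 # max(0, 1) = 1
    max_to_stop = max(rollout_size - pending_replicas, 0)                             # max(1 - (-8), 0) = 9
    # max_to_stop = 9, so the scale-down is not throttled to the 20% batch size.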
            ReplicaState.SHOULD_START, ReplicaState.STARTING,
            ReplicaState.RUNNING
        ],
        max_replicas=max_to_stop)
In this version, it seems like we are not counting the number of new-version replicas that are starting? Namely, if the following is > 0, we are breaking the invariant on the max number of replicas in transition:

    replicas.count(
        version=target_version,
        states=[ReplicaState.STARTING, ReplicaState.SHOULD_START])
I think this part might be okay--the expression you've written contributes to pending_replicas. Concretely, if target_replicas = current_replicas = 10, and those 10 consist of 8 old RUNNING and 2 new STARTING, then pending_replicas = 10 - 0 - 8 = 2 (no new RUNNING, 8 old RUNNING), so max_to_stop is 2 - 2 = 0 and nothing gets stopped (until the new-version replicas move from STARTING to RUNNING, as desired).
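To make the arithmetic explicit, the same scenario plugged into the formulas from the diff above (a minimal sketch; the values are taken from the example in this comment):

    target_replicas = 10
    old_running_replicas = 8   # old version, RUNNING
    new_running_replicas = 0   # the 2 new-version replicas are STARTING, not RUNNING

    pending_replicas = target_replicas - new_running_replicas - old_running_replicas  # 10 - 0 - 8 = 2
    rollout_size = max(int(0.2 * target_replicas), 1)                                 # 2
    max_to_stop = max(rollout_size - pending_replicas, 0)                             # 2 - 2 = 0
    # Nothing is stopped until the 2 STARTING replicas reach RUNNING.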
Logic and new tests look good! The comments are helpful. I think the core logic would still be hard to understand for someone new to the code, but I can't really think of a way to make it much clearer. One idea might be to move some of the algebra into the algorithm itself--specifically, actually pop all the replicas we can stop for "free" (namely, old SHOULD_START and old STARTING) at the beginning, so we never have to account for them in the rest of the function (see the sketch below). In any case, that can be done in a future PR (if at all), and the tests you wrote will make it much easier.
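A rough sketch of that idea (hypothetical restructuring, not this PR's code; it assumes pop without a max_replicas limit removes every matching replica):

    # 1. Old replicas that never reached RUNNING can be stopped "for free",
    #    so drop them up front and never account for them again.
    free_to_stop = self._replicas[backend_tag].pop(
        exclude_version=target_version,
        states=[ReplicaState.SHOULD_START, ReplicaState.STARTING])

    # 2. The remaining bookkeeping then only reasons about RUNNING replicas,
    #    which should make pending_replicas / max_to_stop easier to follow.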
@architkulkarni I totally agree--I'm not very happy with the readability/understandability, but I couldn't find a better way to express it. I tried splitting the "rolling update" and "regular scaling" logic into separate codepaths, but it ended up being both hard to make correct and somewhat redundant. Hopefully we can find a way to clean this up in the near future.
This reverts commit 63594c5.
Why are these changes needed?
Performs a rolling update when re-deploying a backend. Currently, the batch size of the update is hardcoded to floor(20%) of the target number of replicas. We can make this configurable in the future.
This follows the same protocol as k8s.
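For example, the hardcoded batch size described above works out as follows (a minimal sketch of the floor(20%) rule with the minimum of one replica per batch applied in the diff; rollout_batch_size is a hypothetical helper name, the PR computes this inline):

    def rollout_batch_size(target_replicas: int) -> int:
        # floor(20% of the target), but always move at least one replica per batch.
        return max(int(0.2 * target_replicas), 1)

    rollout_batch_size(10)  # 2
    rollout_batch_size(3)   # 1 (floor(0.6) == 0, clamped to 1)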
Related issue number
Closes #14805
Checks
I've run scripts/format.sh to lint the changes in this PR.