
[serve] Rolling updates for redeployments #14803

Merged: 30 commits, Mar 25, 2021

Conversation

@edoakes (Contributor) commented Mar 19, 2021

Why are these changes needed?

Performs a rolling update when re-deploying a backend. Currently, the batch size of the update is hardcoded to floor(20%) of the target number of replicas (minimum 1). We can make this configurable in the future.

This follows the same protocol as k8s:

  • Old replicas are turned down before new ones are started.
  • At most batch_size old replicas will be stopped and new replicas will be started at once.
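The batch sizing described above can be sketched as follows (a hypothetical standalone helper for illustration, not the PR's actual code; the real logic lives inline in `backend_state.py`):

```python
def rollout_batch_size(target_replicas: int) -> int:
    """Batch size for a rolling update: floor of 20% of the target
    replica count, but always at least 1 so the rollout makes progress
    even for very small deployments."""
    return max(int(0.2 * target_replicas), 1)

# e.g. 10 target replicas -> batches of 2; 3 target replicas -> batches of 1.
```

With 10 target replicas, at most 2 old replicas are stopped (and 2 new ones started) per batch; a 3-replica deployment still rolls over one replica at a time.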

Related issue number

Closes #14805

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@edoakes edoakes added this to the [serve] v2 API milestone Mar 19, 2021
@edoakes edoakes changed the title [WIP][serve] Incremental rollout when re-deploying [serve] Rolling updates for redeployments Mar 22, 2021
@simon-mo (Contributor) commented:

It's going to take some time to review this; ETA Tuesday noon.

Comment on lines 638 to 652
new_running_replicas = self._replicas[backend_tag].count(
    version=target_version, states=[ReplicaState.RUNNING])

pending_replicas = (
    target_replicas - new_running_replicas - old_running_replicas)
rollout_size = max(int(0.2 * target_replicas), 1)

max_to_stop = max(rollout_size - pending_replicas, 0)
replicas_to_stop = self._replicas[backend_tag].pop(
    exclude_version=target_version,
    states=[
        ReplicaState.SHOULD_START, ReplicaState.STARTING,
        ReplicaState.RUNNING
    ],
    max_replicas=max_to_stop)
@edoakes (author) replied:
@simon-mo this is the core logic to review, FYI.

@architkulkarni (Contributor) left a comment:

I looked at the core logic and had a couple of concerns, but I might be misunderstanding or misremembering something. I think it's close, though! Overall, more comments would be helpful.

@edoakes (author) commented Mar 24, 2021

@simon-mo @architkulkarni I updated the tests to be easier to read and added two new test cases. I also addressed the issue @architkulkarni pointed out about target_replicas being below the current pending_replicas, and added some more comments explaining the logic.

Please take another look and let me know what's still confusing / if you see any issues in the logic.

    states=[
        ReplicaState.SHOULD_START, ReplicaState.STARTING,
        ReplicaState.RUNNING
    ],
    max_replicas=max_to_stop)

A reviewer commented:
In this version, it seems like we are not counting the number of new-version replicas that are starting. Namely, if the following is > 0, we are breaking the invariant on the max number of replicas in transition:

replicas.count(
    version=target_version,
    states=[ReplicaState.STARTING, ReplicaState.SHOULD_START])

Another reviewer replied:
I think this part might be okay: the expression you've written contributes to pending_replicas.

Concretely, if target_replicas = current_replicas = 10, consisting of 8 old RUNNING replicas and 2 new STARTING replicas, then

pending_replicas = 10 - 0 - 8 = 2

so max_to_stop = 2 - 2 = 0 and nothing gets stopped (until the new-version replicas move from STARTING to RUNNING, as desired).
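The arithmetic above can be checked with a small standalone script (variable names mirror the snippet under review; this is an illustration, not the PR's code):

```python
# Scenario from the discussion: 10 target replicas, currently
# 8 old-version RUNNING replicas and 2 new-version STARTING replicas.
target_replicas = 10
old_running_replicas = 8
new_running_replicas = 0  # the 2 new replicas are still STARTING, not RUNNING

# Same accounting as the snippet under review.
pending_replicas = target_replicas - new_running_replicas - old_running_replicas
rollout_size = max(int(0.2 * target_replicas), 1)
max_to_stop = max(rollout_size - pending_replicas, 0)

# pending_replicas == 2, rollout_size == 2, max_to_stop == 0:
# the 2 STARTING new-version replicas fill the whole batch budget,
# so no old replicas are stopped until they reach RUNNING.
```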

@architkulkarni (Contributor) left a comment:

Logic and new tests look good! The comments are helpful. I think the core logic would still be hard to understand for someone new to the code, but I can't really think of a way to make it much clearer. One idea might be to move some of the algebra into the algorithm itself: specifically, actually pop all the replicas we can stop for "free" (namely, old SHOULD_START and old STARTING) at the beginning, so we never have to account for them in the rest of the function. In any case, that can be done in a future PR (if at all), and the tests you wrote will make it much easier.
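The suggested refactor might look roughly like this (a hypothetical sketch with a simplified replica representation; the real code tracks replica objects in a ReplicaStateContainer):

```python
from enum import Enum

class ReplicaState(Enum):
    SHOULD_START = 1
    STARTING = 2
    RUNNING = 3

def split_free_and_running(old_replica_states):
    """Partition old-version replicas into those we can stop for "free"
    (not yet RUNNING, so stopping them costs no serving capacity) and
    those whose shutdown must be rate-limited by the rollout batch size."""
    free = [s for s in old_replica_states
            if s in (ReplicaState.SHOULD_START, ReplicaState.STARTING)]
    running = [s for s in old_replica_states if s is ReplicaState.RUNNING]
    return free, running
```

Stopping the "free" group up front would let the rest of the function reason only about RUNNING old replicas, removing them from the batch-size algebra entirely.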

@edoakes (author) commented Mar 25, 2021

@architkulkarni I totally agree; I'm not very happy with the readability, but I couldn't find a better way to express it. I tried splitting the "rolling update" and "regular scaling" logic into separate codepaths, but that ended up being both hard to make correct and somewhat redundant. Hopefully we can find a way to clean this up in the near future.

@edoakes edoakes merged commit 63594c5 into ray-project:master Mar 25, 2021
edoakes added a commit to edoakes/ray that referenced this pull request Mar 26, 2021
Linked issue: Incremental rollout for re-deployments
3 participants