
[serve] Rolling updates for redeployments #14803

Merged: 30 commits, Mar 25, 2021

Conversation

@edoakes (Contributor) commented Mar 19, 2021

Why are these changes needed?

Performs a rolling update when re-deploying a backend. Currently, the batch size of the update is hardcoded to floor(20%) of the target number of replicas (minimum 1). We can make this configurable in the future.

This follows the same protocol as k8s:

  • Old replicas are turned down before new ones are started.
  • At most batch_size old replicas will be stopped and new replicas will be started at once.
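The batch sizing described above can be sketched as follows (a hypothetical standalone helper for illustration, not the PR's actual code; the real logic lives inline in `backend_state.py`):

```python
def rollout_batch_size(target_replicas: int) -> int:
    """Batch size for a rolling update: floor of 20% of the target
    replica count, but always at least 1 so the rollout makes progress
    even for very small deployments."""
    return max(int(0.2 * target_replicas), 1)

# e.g. 10 target replicas -> batches of 2; 3 target replicas -> batches of 1.
```

With 10 target replicas, at most 2 old replicas are stopped (and 2 new ones started) per batch; a 3-replica deployment still rolls over one replica at a time.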

Related issue number

Closes #14805

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@edoakes edoakes added this to the [serve] v2 API milestone Mar 19, 2021
@edoakes edoakes changed the title [WIP][serve] Incremental rollout when re-deploying [serve] Rolling updates for redeployments Mar 22, 2021
@simon-mo (Contributor) commented:

It's going to take some time to review this; ETA Tuesday noon.

Comment on lines 638 to 652
new_running_replicas = self._replicas[backend_tag].count(
    version=target_version, states=[ReplicaState.RUNNING])

pending_replicas = (
    target_replicas - new_running_replicas - old_running_replicas)
rollout_size = max(int(0.2 * target_replicas), 1)

max_to_stop = max(rollout_size - pending_replicas, 0)
replicas_to_stop = self._replicas[backend_tag].pop(
    exclude_version=target_version,
    states=[
        ReplicaState.SHOULD_START, ReplicaState.STARTING,
        ReplicaState.RUNNING
    ],
    max_replicas=max_to_stop)
@edoakes (author) replied:
@simon-mo this is the core logic to review, FYI.

@architkulkarni (Contributor) left a comment:

I looked at the core logic and had a couple of concerns, but I might be misunderstanding or misremembering something. I think it's close, though! Overall, more comments would be helpful.

@edoakes (author) commented Mar 24, 2021

@simon-mo @architkulkarni I updated the tests to be easier to read and added two new test cases. I also addressed the issue @architkulkarni pointed out about target_replicas being below the current pending_replicas, and added some more comments explaining the logic.

Please take another look and let me know what's still confusing / if you see any issues in the logic.

    states=[
        ReplicaState.SHOULD_START, ReplicaState.STARTING,
        ReplicaState.RUNNING
    ],
    max_replicas=max_to_stop)

A reviewer commented:
In this version, it seems like we are not counting the number of new-version replicas that are starting. Namely, if the following is > 0, we are breaking the invariant on the max number of replicas in transition:

replicas.count(
    version=target_version,
    states=[ReplicaState.STARTING, ReplicaState.SHOULD_START])

Another reviewer replied:
I think this part might be okay: the expression you've written contributes to pending_replicas.

Concretely, if target_replicas = current_replicas = 10, consisting of 8 old RUNNING replicas and 2 new STARTING replicas, then

pending_replicas = 10 - 0 - 8 = 2

so max_to_stop = 2 - 2 = 0 and nothing gets stopped (until the new-version replicas move from STARTING to RUNNING, as desired).
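The arithmetic above can be checked with a small standalone script (variable names mirror the snippet under review; this is an illustration, not the PR's code):

```python
# Scenario from the discussion: 10 target replicas, currently
# 8 old-version RUNNING replicas and 2 new-version STARTING replicas.
target_replicas = 10
old_running_replicas = 8
new_running_replicas = 0  # the 2 new replicas are still STARTING, not RUNNING

# Same accounting as the snippet under review.
pending_replicas = target_replicas - new_running_replicas - old_running_replicas
rollout_size = max(int(0.2 * target_replicas), 1)
max_to_stop = max(rollout_size - pending_replicas, 0)

# pending_replicas == 2, rollout_size == 2, max_to_stop == 0:
# the 2 STARTING new-version replicas fill the whole batch budget,
# so no old replicas are stopped until they reach RUNNING.
```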

@architkulkarni (Contributor) left a comment:

Logic and new tests look good! The comments are helpful. I think the core logic would still be hard to understand for someone new to the code, but I can't really think of a way to make it much clearer. One idea might be to move some of the algebra into the algorithm itself: specifically, actually pop all the replicas we can stop for "free" (namely, old SHOULD_START and old STARTING) at the beginning, so we never have to account for them in the rest of the function. In any case, that can be done in a future PR (if at all), and the tests you wrote will make it much easier.
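The suggested refactor might look roughly like this (a hypothetical sketch with a simplified replica representation; the real code tracks replica objects in a ReplicaStateContainer):

```python
from enum import Enum

class ReplicaState(Enum):
    SHOULD_START = 1
    STARTING = 2
    RUNNING = 3

def split_free_and_running(old_replica_states):
    """Partition old-version replicas into those we can stop for "free"
    (not yet RUNNING, so stopping them costs no serving capacity) and
    those whose shutdown must be rate-limited by the rollout batch size."""
    free = [s for s in old_replica_states
            if s in (ReplicaState.SHOULD_START, ReplicaState.STARTING)]
    running = [s for s in old_replica_states if s is ReplicaState.RUNNING]
    return free, running
```

Stopping the "free" group up front would let the rest of the function reason only about RUNNING old replicas, removing them from the batch-size algebra entirely.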

@edoakes (author) commented Mar 25, 2021

@architkulkarni I totally agree; I'm not very happy with the readability, but I couldn't find a better way to express it. I tried splitting the "rolling update" and "regular scaling" logic into separate codepaths, but that ended up being both hard to make correct and somewhat redundant. Hopefully we can find a way to clean this up in the near future.

@edoakes edoakes merged commit 63594c5 into ray-project:master Mar 25, 2021
edoakes added a commit to edoakes/ray that referenced this pull request Mar 26, 2021
Linked issue: Incremental rollout for re-deployments
3 participants