
[serve] Change stopping behavior #43187

Merged · 1 commit · Feb 21, 2024

Conversation

@zcin (Contributor) commented Feb 15, 2024

[serve] Change stopping behavior

Change from stop-fully-then-start to stop-then-immediately-start.

More concretely, previously we made sure that the total number of replicas that are running (or moving towards running), plus the number of stopping replicas, never exceeds the target. This PR changes the calculation so that stopping replicas are no longer counted. The reasoning is that replicas that take a long time to gracefully shut down should not block new replicas from starting, since waiting for them to fully stop needlessly reduces availability.
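The accounting change can be sketched roughly like this (a minimal illustration with made-up names, not Ray Serve's actual controller code):

```python
# Hypothetical sketch of the accounting change; names are illustrative,
# not Ray Serve internals.

def replicas_to_start(target: int, running_or_starting: int, stopping: int,
                      stop_fully_then_start: bool = False) -> int:
    """How many new replicas the controller should launch this tick."""
    if stop_fully_then_start:
        # Old behavior: stopping replicas still count against the target,
        # so a slow graceful shutdown blocks its replacement from starting.
        occupied = running_or_starting + stopping
    else:
        # New behavior: stopping replicas are ignored, so the replacement
        # starts immediately while the old replica drains.
        occupied = running_or_starting
    return max(0, target - occupied)

# target=3, with 2 healthy replicas and 1 replica slowly shutting down:
print(replicas_to_start(3, 2, 1, stop_fully_then_start=True))   # 0 (wait)
print(replicas_to_start(3, 2, 1, stop_fully_then_start=False))  # 1 (start now)
```

Under the old accounting the replacement waits for the drain to finish; under the new accounting it starts right away.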

Rollout plan:

  • New behavior is enabled by default.
  • Old behavior can be switched back on through the feature flag RAY_SERVE_STOP_FULLY_THEN_START_REPLICAS.
  • Old behavior will be removed in the next release.

Signed-off-by: Cindy Zhang cindyzyx9@gmail.com


Stack created with Sapling. Best reviewed with ReviewStack.

@edoakes (Contributor) left a comment:

Looks good

python/ray/serve/_private/constants.py (outdated; resolved)
@@ -206,6 +206,7 @@ def __call__(self):
@pytest.mark.skipif(sys.platform == "win32", reason="Failing on Windows.")
@pytest.mark.parametrize("smoothing_factor", [1, 0.2])
@pytest.mark.parametrize("use_upscale_downscale_config", [True, False])
@mock.patch("ray.serve._private.router.HANDLE_METRIC_PUSH_INTERVAL_S", 1)
Contributor:

why's this needed?

@zcin (Contributor, Author) commented Feb 20, 2024:

In this test (with RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=0), when the deployment scales back down to 0, the last metric report pushed from the handle is often a non-zero number, because the push interval is set to 10 seconds. So for the next 10 seconds the number of replicas keeps oscillating between 0 and 1 because of the outdated metric from the handle. I'm not sure why we never ran into this before, but this seems like a totally reasonable scenario, so I set the handle push interval lower to avoid making the test wait longer.
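The oscillation can be illustrated with a toy model (hypothetical names and deliberately simplified logic, not the real autoscaler):

```python
# Toy model of the stale-metric oscillation described above; not Ray Serve code.

def desired_replicas(reports: dict) -> int:
    """Scale to 0 only when every reporter shows zero ongoing requests."""
    return 1 if sum(reports.values()) > 0 else 0

# Traffic has stopped and the replica reports 0, but the handle's last
# push (from up to 10 s ago) still claims 1 ongoing request:
reports = {"replica": 0, "handle": 1}
print(desired_replicas(reports))  # 1 -> deployment scales back up

# Once the handle pushes a fresh report (sooner, with a lower push interval):
reports["handle"] = 0
print(desired_replicas(reports))  # 0 -> deployment can stay scaled down
```

Lowering the handle push interval shortens the window in which the stale report keeps resurrecting a replica.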

Comment on lines +169 to +172
# We wait for this to be satisfied at the end because there may be
# more than 3 worker nodes after the deployment finishes deploying,
# since replicas are being started and stopped at the same time, and
# there is a strict max replicas per node requirement. However nodes
Contributor:

This seems like a little bit of a nasty interaction. This may cause excessive resource fragmentation in some cases since we'll start a new node for the replacement replica, then the old replica will get removed.

I suppose the defragmentation work will address it...

Contributor Author:

Yup, this is definitely a case where this behavior change will have negative side effects... and agreed, we should aim to address this with the defragmentation work.

@zcin force-pushed the pr43187 branch 2 times, most recently from 5ae7a4b to 8aba50a on February 20, 2024 at 22:10
@zcin (Contributor, Author) commented Feb 21, 2024:

@edoakes Tests are passing!

@edoakes edoakes merged commit e2d9f42 into ray-project:master Feb 21, 2024
10 checks passed