Allow configuring Serve control loop interval, add related docs #45063

Conversation

@JoshKarpel (Contributor) commented Apr 30, 2024

Why are these changes needed?

In our experiments, adjusting this value upward helps the Serve Controller keep up with a large number of autoscaling metrics pushes from a large number of DeploymentHandles (because the loop body is blocking, increasing the interval lets more other code run while the control loop isn't running), at the cost of control loop responsiveness (since the loop doesn't run as often).
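
For intuition, here's a minimal sketch of that dynamic (illustrative only, not the actual controller code; all names are placeholders):

```python
import asyncio

# The interval this PR makes configurable (via RAY_SERVE_CONTROL_LOOP_INTERVAL_S).
CONTROL_LOOP_INTERVAL_S = 0.1

def run_control_loop_body() -> None:
    """Placeholder for the controller's blocking reconciliation work."""

async def control_loop() -> None:
    while True:
        run_control_loop_body()  # synchronous: blocks the event loop while it runs
        # Other coroutines (e.g. handling metrics pushes from DeploymentHandles)
        # can only make progress during this sleep, so a larger interval gives
        # them a bigger share of the event loop.
        await asyncio.sleep(CONTROL_LOOP_INTERVAL_S)
```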

Related issue number

Closes #44784 ... for now!

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [x] This PR is not tested :(

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
@@ -56,3 +56,13 @@ proper backpressure. You can increase the value in the deployment decorator; e.g
By default, Serve lets client HTTP requests run to completion no matter how long they take. However, slow requests could bottleneck the replica processing, blocking other requests that are waiting. It's recommended that you set an end-to-end timeout, so slow requests can be terminated and retried.

You can set an end-to-end timeout for HTTP requests by setting the `request_timeout_s` in the `http_options` field of the Serve config. HTTP Proxies will wait for that many seconds before terminating an HTTP request. This config is global to your Ray cluster, and it cannot be updated during runtime. Use [client-side retries](serve-best-practices-http-requests) to retry requests that time out due to transient failures.

### Give the Serve Controller more time to process requests
@JoshKarpel (author) commented on the added docs section:
Took the liberty of adding a section here in case others run into the same issue. Please feel free to reword as desired, not sure what level of detail you want here :)
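
For illustration, one way to apply the new setting (hypothetical values; the variable has to be visible to the Serve controller process before it starts, so on a real cluster you'd export it before `ray start --head`):

```python
import os

# Hypothetical example: raise the interval from the default 0.1 s to 0.5 s.
# Set the variable before importing/starting Serve so the controller, which
# inherits a local driver's environment here, picks it up.
os.environ["RAY_SERVE_CONTROL_LOOP_INTERVAL_S"] = "0.5"

import ray
from ray import serve

ray.init()
serve.start()
```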

- # How often to call the control loop on the controller.
- CONTROL_LOOP_PERIOD_S = 0.1
+ # How long to sleep between control loop cycles on the controller.
+ CONTROL_LOOP_INTERVAL_S = float(os.getenv("RAY_SERVE_CONTROL_LOOP_INTERVAL_S", 0.1))
@JoshKarpel (author) commented:
I thought INTERVAL made more sense than PERIOD as the name, since it's the time between cycles, not a target for when the next cycle starts.

A reviewer (Contributor) replied:
way better :)
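
As an aside on the pattern in that diff: `os.getenv` returns the default unchanged when the variable is unset, so `float()` sees either the float default or a string from the environment. A standalone sketch (assuming the variable isn't already set):

```python
import os

# Unset: os.getenv returns the float default; float(0.1) is a no-op.
assert float(os.getenv("RAY_SERVE_CONTROL_LOOP_INTERVAL_S", 0.1)) == 0.1

# Set: environment values are always strings, which float() parses.
os.environ["RAY_SERVE_CONTROL_LOOP_INTERVAL_S"] = "0.5"
assert float(os.getenv("RAY_SERVE_CONTROL_LOOP_INTERVAL_S", 0.1)) == 0.5
```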

@@ -130,7 +130,7 @@ def replica_queue_length_autoscaling_policy(

# Only actually scale the replicas if we've made this decision for
# 'scale_up_consecutive_periods' in a row.
-    if decision_counter > int(config.upscale_delay_s / CONTROL_LOOP_PERIOD_S):
+    if decision_counter > int(config.upscale_delay_s / CONTROL_LOOP_INTERVAL_S):
@JoshKarpel (author) commented:
Seems like the interval is used in a few other places to count control loop cycles. Am I breaking some assumption by allowing it to be configured to a larger value (e.g., does this check still make sense if the loop interval is large)?

A reviewer (Contributor) replied:
I don't believe so -- but @zcin should confirm

A reviewer (Contributor) replied:
I don't think this breaks any assumptions. If the upscale delay is less than the control loop interval, then the sleep between cycles already inherently "covers" the required delay, so this code still makes sense.
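
To make the arithmetic concrete (hypothetical values): the check requires the decision to hold for `int(upscale_delay_s / CONTROL_LOOP_INTERVAL_S)` consecutive cycles, and once the interval exceeds the delay the threshold is 0, so a single cycle's sleep already covers it:

```python
upscale_delay_s = 30.0  # hypothetical config value

for control_loop_interval_s in (0.1, 1.0, 60.0):
    required_consecutive_cycles = int(upscale_delay_s / control_loop_interval_s)
    # 0.1 s -> 300 cycles; 1.0 s -> 30 cycles;
    # 60 s  -> 0 cycles (interval > delay, so one cycle suffices).
    print(control_loop_interval_s, required_consecutive_cycles)
```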

@edoakes (Contributor) left a review:
LGTM pending @zcin chiming in on the autoscaling question

@edoakes merged commit 23d05bb into ray-project:master on Apr 30, 2024 (5 checks passed).
@JoshKarpel deleted the allow-configuring-serve-control-loop-interval branch on Apr 30, 2024 at 22:53.
@JoshKarpel (author) commented:
Thanks for the quick reviews! Much appreciated!

harborn pushed a commit to harborn/ray that referenced this pull request May 8, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024