
[serve] Add replica queue length caching to replica scheduler #42943

Merged
edoakes merged 39 commits into ray-project:master on Feb 7, 2024

Conversation

@edoakes (Contributor) commented Feb 2, 2024

Why are these changes needed?

Adds caching logic to avoid actively probing replicas for every request. This is integrated into the existing PowerOfTwoChoicesReplicaScheduler so it can reuse much of the same policy and mechanism (e.g., locality-aware and model multiplexing-aware candidate selection).

The benefits of this change are:

  • Enables strict enforcement of max_concurrent_queries.
  • Reduces proxy-side overhead for scheduling requests.
  • Reduces latency for scheduling requests (in the "happy path," there's no extra RTT).

The changes are as follows:

  • All calls to replicas are now streaming calls, and the first message returned is a system message. The replica uses this message to return its current queue length and reject requests if it's at capacity (max_concurrent_queries). If the replica rejects, the request scheduling procedure will be retried.
  • The replica scheduler maintains a local cache of replica queue lengths. Entries in this cache have a timeout (currently set to 10 seconds). The cache is updated by (1) actively probing replicas and (2) the system response messages mentioned above.
  • When scheduling a request, we first attempt to choose the best replica based on the queue lengths in the cache. If none of the candidates have entries in the cache that are below max_concurrent_queries, we fall back to active probing (as before this PR); see the sketch below.
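To make the cache-first flow concrete, here is a minimal sketch of the idea (illustrative only, not the actual Ray Serve implementation: the ReplicaQueueLengthCache name, the 10-second timeout, and the probe fallback come from this description, while the method signatures and the choose_replica/probe_fn helpers are assumptions):

import time
from typing import Dict, Optional, Tuple

class ReplicaQueueLengthCache:
    """Maps replica_id -> (queue_len, timestamp); entries expire after a timeout."""

    def __init__(self, staleness_timeout_s: float = 10.0):
        self._entries: Dict[str, Tuple[int, float]] = {}
        self._staleness_timeout_s = staleness_timeout_s

    def update(self, replica_id: str, queue_len: int):
        # Updated both by active probes and by the system response message
        # each replica returns as the first message of a streaming call.
        self._entries[replica_id] = (queue_len, time.time())

    def get(self, replica_id: str) -> Optional[int]:
        entry = self._entries.get(replica_id)
        if entry is None:
            return None
        queue_len, timestamp = entry
        if time.time() - timestamp > self._staleness_timeout_s:
            return None  # Stale entry; treat as a cache miss.
        return queue_len

def choose_replica(candidates, cache, max_concurrent_queries, probe_fn):
    # 1) Prefer the candidate with the smallest cached queue length that is
    #    below max_concurrent_queries (no extra RTT in this happy path).
    cached = [(cache.get(r), r) for r in candidates]
    eligible = [(q, r) for q, r in cached if q is not None and q < max_concurrent_queries]
    if eligible:
        return min(eligible, key=lambda e: e[0])[1]

    # 2) Otherwise fall back to actively probing the candidates (the pre-PR
    #    behavior) and refresh the cache with the responses.
    probed = probe_fn(candidates)  # hypothetical helper: {replica_id: queue_len}
    for replica_id, queue_len in probed.items():
        cache.update(replica_id, queue_len)
    eligible = [(q, r) for r, q in probed.items() if q < max_concurrent_queries]
    return min(eligible, key=lambda e: e[0])[1] if eligible else None

In the real scheduler the candidates still come from the existing power-of-two, locality-aware, and multiplexing-aware selection; only the ranking step consults the cache first, which is where the latency and throughput improvements in the benchmarks come from.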

Two feature flags are introduced to control this behavior, both currently off by default (a sketch of how they might be consumed follows the list):

  • RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE
  • RAY_SERVE_ENABLE_STRICT_MAX_CONCURRENT_QUERIES (implicitly set by the above)
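For reference, a sketch of how these flags might be consumed (the flag names come from this PR; the exact parsing in constants.py may differ, and the implicit coupling is modeled per the note above):

import os

RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE = (
    os.environ.get("RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE", "0") == "1"
)
# Strict enforcement is implicitly enabled whenever the cache is enabled.
RAY_SERVE_ENABLE_STRICT_MAX_CONCURRENT_QUERIES = (
    RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE
    or os.environ.get("RAY_SERVE_ENABLE_STRICT_MAX_CONCURRENT_QUERIES", "0") == "1"
)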

TODOs before merging:

  • Get all existing test cases to pass.
  • Add tests for ReplicaQueueLengthCache.
  • Add general test cases for the caching logic.
  • Run a subset of tests with the feature flag turned on.
  • Add a test case for the emplace_front logic (avoid tail latencies).
  • Add testing for the with_rejection logic.
  • Maybe add a separate feature flag for strict enforcement (decoupled from the caching logic).

TODOs for subsequent PRs (should file follow-up issues):

  • Add a replica-side timestamp to the queue length information to avoid overwriting the cache with stale information.
  • De-duplicate active probes to the same replica.

Related issue number

Closes #42946
Closes #42947

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@edoakes (Contributor, Author) commented Feb 2, 2024

Benchmark results

HTTP no-op latency

Baseline

(ray) eoakes@Edwards-MacBook-Pro-2 serve % RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0 python ...
Latency (ms) for noop HTTP requests (num_replicas=1,num_requests=1000):
count    1000.000000
mean        3.838556
std         1.031378
min         3.371375
50%         3.737083
90%         4.060138
95%         4.188520
99%         5.597760
max        31.916375
dtype: float64

With caching

(ray) eoakes@Edwards-MacBook-Pro-2 serve % RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=1 python _private/benchmarks/http_noop_latency.py --num-requests 1000
...
Latency (ms) for noop HTTP requests (num_replicas=1,num_requests=1000):
count    1000.000000
mean        3.265671
std         0.964305
min         2.864250
50%         3.183500
90%         3.451258
95%         3.539712
99%         3.917807
max        30.811750
dtype: float64

HTTP throughput (using ab)

from ray import serve

@serve.deployment(num_replicas=8, max_concurrent_queries=10)
class A:
    def __call__(self, *args):
        return b"hi"

app = A.bind()
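The exact ab invocation isn't included here; a typical command consistent with the results below would be something like the following (the total request count and the concurrency level of 100 are guesses):

ab -n 10000 -c 100 http://127.0.0.1:8000/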

Baseline

RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0 serve run noop:app
...
Requests per second:    757.95 [#/sec] (mean)
Time per request:       131.934 [ms] (mean)
Time per request:       1.319 [ms] (mean, across all concurrent requests)
Transfer rate:          141.38 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.8      0       4
Processing:    27  121  20.4    117     188
Waiting:       27  118  19.9    114     186
Total:         30  121  20.4    118     188
WARNING: The median and mean for the initial connection time are not within a normal deviation
        These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
  50%    118
  66%    123
  75%    128
  80%    131
  90%    152
  95%    159
  98%    175
  99%    180
 100%    188 (longest request)

With caching

RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=1 serve run noop:app
...
Requests per second:    990.00 [#/sec] (mean)
Time per request:       101.010 [ms] (mean)
Time per request:       1.010 [ms] (mean, across all concurrent requests)
Transfer rate:          184.66 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.7      1       3
Processing:    13   97  24.5     95     165
Waiting:       11   92  24.4     90     161
Total:         14   98  24.6     96     166

Percentage of the requests served within a certain time (ms)
  50%     96
  66%    102
  75%    106
  80%    114
  90%    140
  95%    143
  98%    155
  99%    159
 100%    166 (longest request)

HTTP streaming throughput

Baseline

(ray) eoakes@Edwards-MacBook-Pro-2 serve % RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0 python _private/benchmarks/streaming/streaming_http_throughput.py
...
HTTP streaming throughput (num_replicas=1, tokens_per_request=1000, batch_size=10, use_intermediate_deployment=False): 228498.48 +- 6799.28 tokens/s

With caching

(ray) eoakes@Edwards-MacBook-Pro-2 serve % RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=1 python _private/benchmarks/streaming/streaming_http_throughput.py
...
HTTP streaming throughput (num_replicas=1, tokens_per_request=1000, batch_size=10, use_intermediate_deployment=False): 210692.81 +- 5081.87 tokens/s

Handle throughput

Baseline

(ray) eoakes@Edwards-MacBook-Pro-2 serve % RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0 python _private/benchmarks/handle_throughput.py
...
DeploymentHandle throughput (num_replicas=1, batch_size=100): 1830.25 +- 5.44 requests/s

With caching

(ray) eoakes@Edwards-MacBook-Pro-2 serve % RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=1 python _private/benchmarks/handle_throughput.py
...
DeploymentHandle throughput (num_replicas=1, batch_size=100): 1840.15 +- 30.62 requests/s

Handle streaming throughput

Baseline

(ray) eoakes@Edwards-MacBook-Pro-2 serve % RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0 python _private/benchmarks/streaming/streaming_handle_throughput.py
...
DeploymentHandle streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10): 11267.96 +- 172.1 tokens/s
(ServeReplica:default:CallerDeployment pid=48752) Individual request quantiles:
(ServeReplica:default:CallerDeployment pid=48752)       P50=676.8099579999998
(ServeReplica:default:CallerDeployment pid=48752)       P75=843.8243122500003
(ServeReplica:default:CallerDeployment pid=48752)       P99=915.2339992500001

With caching

(ray) eoakes@Edwards-MacBook-Pro-2 serve % RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=1 python _private/benchmarks/streaming/streaming_handle_throughput.py
...
DeploymentHandle streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10): 11671.64 +- 209.44 tokens/s
(ServeReplica:default:CallerDeployment pid=48848) Individual request quantiles:
(ServeReplica:default:CallerDeployment pid=48848)       P50=661.6651875000007
(ServeReplica:default:CallerDeployment pid=48848)       P75=806.0847917499996
(ServeReplica:default:CallerDeployment pid=48848)       P99=911.5834779999999

@edoakes edoakes changed the title [WIP][serve] Add queue length cache to replica scheduler [serve] Add replica queue length caching to replica scheduler Feb 2, 2024
@edoakes edoakes marked this pull request as ready for review February 2, 2024 20:26
@edoakes edoakes requested a review from a team February 2, 2024 20:26
@edoakes (Contributor, Author) commented Feb 2, 2024

@ray-project/ray-serve the changes in this PR are ready for an initial review, but please note the TODOs in the description (most notably, I still have a lot of tests to write).

@GeneDer (Contributor) left a comment


The approach LGTM, thanks for making this more efficient Ed!

@shrekris-anyscale (Contributor) left a comment


Nice work so far. I haven't looked at the tests, but I left some comments on the implementation.

python/ray/serve/_private/proxy_state.py (review thread; outdated, resolved)
python/ray/serve/_private/replica.py (review thread; outdated, resolved)
chosen_replica_id = t.replica_id
queue_len = t.result()
result.append((t.replica, queue_len))
self._replica_queue_len_cache.update(r.replica_id, queue_len)
Contributor:

[Nit] We calculate the timestamp for all the responses upon update rather than when we actually receive each response. This means if there's one really slow replica, then all the replicas' timestamps could actually be off by queue_len_response_deadline_s.

Since queue_len_response_deadline_s is pretty low, this shouldn't be a major concern, but if it becomes larger, then the timestamps may be pretty inaccurate.

Contributor Author (edoakes):

yep you're right -- I have a follow-up item to generate all of these timestamps on the replica where possible
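A minimal sketch of that follow-up idea (illustrative only; the explicit timestamp parameter on the cache update is an assumed extension, not the current API):

import time

def record_queue_len_response(cache, replica_id, queue_len, replica_timestamp=None):
    # Stamp each entry when its response actually arrives (or, better, with a
    # timestamp generated on the replica itself), instead of stamping every
    # entry once after all probes complete -- a single stamp can be off by up
    # to queue_len_response_deadline_s if one replica is slow.
    cache.update(replica_id, queue_len, timestamp=replica_timestamp or time.time())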

@edoakes (Contributor, Author) commented Feb 6, 2024

@GeneDer @shrekris-anyscale addressed comments and finished all of the TODOs from the description, PTAL.

@GeneDer (Contributor) left a comment


Some minor non-blocking comments, LGTM!

    self._pending_requests_to_schedule.append(pending_request)
else:
    index = 0
    for pr in self._pending_requests_to_fulfill:
Contributor:

Non-blocking nitpick: we could just use enumerate here so we don't need to track and increment index separately.

Contributor:

Also, since we can now assume the queue is sorted, we could use Python's bisect.insort 😄

Contributor Author (edoakes):

Oh nice I hadn't heard of bisect.insort. Looks interesting. I think in this case it might not be ideal because in the common case we should be inserting at or near the front of the queue when going through this slower is_retry path.
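For context, a small standard-library illustration of the two suggestions above (not the PR's code; the list contents are invented):

import bisect

pending = [1.0, 2.0, 5.0, 9.0]  # e.g., request creation timestamps, kept sorted

# Option 1: bisect.insort does a binary search (O(log n)), but the list insert
# itself is still O(n).
bisect.insort(pending, 1.5)
assert pending == [1.0, 1.5, 2.0, 5.0, 9.0]

# Option 2: a linear scan from the front (with enumerate instead of a manual
# counter), which is cheap when the item belongs at or near the head -- the
# common case on the is_retry path discussed above.
def insert_sorted(items, value):
    for index, existing in enumerate(items):
        if existing > value:
            items.insert(index, value)
            return
    items.append(value)

insert_sorted(pending, 0.5)
assert pending == [0.5, 1.0, 1.5, 2.0, 5.0, 9.0]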

@shrekris-anyscale (Contributor) left a comment


I'm approving the code changes. I haven't had a chance to review the unit tests.

python/ray/serve/_private/constants.py (review thread; outdated, resolved)
python/ray/serve/_private/replica_scheduler/common.py (review thread; outdated, resolved)
@edoakes edoakes merged commit d8b0fe9 into ray-project:master Feb 7, 2024
9 checks passed
ratnopamc pushed a commit to ratnopamc/ray that referenced this pull request Feb 11, 2024
tterrysun pushed a commit to tterrysun/ray that referenced this pull request Feb 14, 2024