[serve] collect request metrics at handles #42578
Conversation
python/ray/serve/_private/router.py (outdated)
# 1. Only requests sent to RUNNING replicas should be considered
# for the autoscaling algorithm.
Shouldn't this pruning happen when we send the metrics instead of only when the set is updated? Or is it happening in both places?
Yeah, it can. The reason I didn't add it to the controller side is that whenever the list of running replicas changes on the controller, handles should be updated as well. By pruning on the handle side, we don't have to send a map of (replica_id -> request count) to the controller for every handle (and each map would also have to be stored in the controller). Theoretically handles will be slightly more delayed and we can run into out-of-sync issues; I just wasn't sure how much that would practically impact autoscaling metrics.
It's very hard to reason about how this will behave due to ordering issues. For example, we could have a query that's mid-scheduling, then `update_running_replicas` happens, then we finish scheduling. In this case, we'll end up with very strange behavior: the replica will be reported as having 1 ongoing request (because in the codepath where we increment, the `defaultdict` will return 0 and we'll increment to 1) until a new `update_running_replicas` occurs.
I see. I've added the filter-by-running-replicas logic to the controller. Each handle now sends a map of {replica_id -> num requests} to the controller, which is stored and used to filter by running replicas during autoscaling. The handle side still prunes its in-memory map so that memory won't grow indefinitely over time as replicas die and new replicas keep getting added.
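A minimal sketch of the handle-side pruning described above. The method names (`update_running_replicas`, etc.) come from the discussion; the surrounding class is hypothetical, not the actual Serve implementation:

```python
import threading
from collections import defaultdict
from typing import List


class HandleMetricsTracker:
    """Hypothetical sketch: per-replica request counts tracked on the handle.

    When the running-replica set is updated, entries for replicas that are no
    longer running are pruned so the map cannot grow without bound as replicas
    die and new ones are added.
    """

    def __init__(self):
        self.num_queries_sent_to_replicas = defaultdict(int)
        self._queries_lock = threading.Lock()

    def inc(self, replica_id: str):
        # Called when a request is assigned to a replica.
        with self._queries_lock:
            self.num_queries_sent_to_replicas[replica_id] += 1

    def dec(self, replica_id: str):
        # Called when a request to a replica finishes.
        with self._queries_lock:
            self.num_queries_sent_to_replicas[replica_id] -= 1

    def update_running_replicas(self, running_replica_ids: List[str]):
        # Prune counters for replicas that are no longer RUNNING.
        running = set(running_replica_ids)
        with self._queries_lock:
            for replica_id in list(self.num_queries_sent_to_replicas):
                if replica_id not in running:
                    del self.num_queries_sent_to_replicas[replica_id]
```

Note that, as the review thread points out, pruning only here leaves a window where a request scheduled after the update re-creates an entry for a dead replica, which is why the filtering was also added on the controller side.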
else:
    return (self.deployment_id, self.handle_id), self.num_queued_queries

def process_finished_request(self, replica_tag, *args):
Suggested change:
- def process_finished_request(self, replica_tag, *args):
+ def process_finished_request(self, replica_tag: str, *args):
what are the mysterious `*args`?
That's the return value of the request sent to a replica. Since we don't care about the result, I accept it as a generic `*args`.
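As a hedged illustration of that pattern (the setup around the callback is hypothetical, not Serve's actual wiring): a done-callback can accept and discard the request's result via `*args`, since only the replica tag matters for the counter:

```python
from collections import defaultdict

# Hypothetical per-replica counter of in-flight requests.
num_queries_sent_to_replicas = defaultdict(int)


def process_finished_request(replica_tag: str, *args):
    # *args carries the request's return value, which we ignore;
    # we only care that a request finished on this replica.
    num_queries_sent_to_replicas[replica_tag] -= 1


# Simulate two in-flight requests, one of which finishes and
# returns some result payload that the callback discards.
num_queries_sent_to_replicas["replica-1"] = 2
process_finished_request("replica-1", {"status": 200, "body": "..."})
```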
self.autoscaling_config = None
# Track queries sent to replicas for the autoscaling algorithm.
self.num_queries_sent_to_replicas = defaultdict(int)
self._queries_lock = threading.Lock()
This very much needs a comment, especially given the mixture of the threading and asyncio models.
Added a comment!
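A hedged sketch of why a plain `threading.Lock` (rather than an `asyncio.Lock`) fits here, assuming the counters are written from the event-loop thread and read from a background metrics-pusher thread (the class and method names are hypothetical):

```python
import threading
from collections import defaultdict


class ReplicaQueryCounters:
    """Sketch of a counter map shared across threading and asyncio.

    Writes happen on the asyncio event-loop thread (as requests are
    scheduled and finished); reads happen on a separate metrics-pusher
    thread. A threading.Lock is appropriate because it is held only for
    brief, non-awaiting critical sections, so taking it on the event
    loop cannot block on an await; an asyncio.Lock could not be
    acquired from the pusher thread at all.
    """

    def __init__(self):
        self._counts = defaultdict(int)
        self._lock = threading.Lock()

    def inc(self, replica_id: str):
        # Event-loop thread: a request was sent to this replica.
        with self._lock:
            self._counts[replica_id] += 1

    def snapshot(self) -> dict:
        # Metrics-pusher thread: copy under the lock so the pusher
        # never iterates a dict being mutated by the event loop.
        with self._lock:
            return dict(self._counts)
```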
@@ -260,7 +260,7 @@ def record_autoscaling_metrics(self, data: Dict[str, float], send_timestamp: float):
      )
      self.deployment_state_manager.record_autoscaling_metrics(data, send_timestamp)

- def record_handle_metrics(self, data: Dict[str, float], send_timestamp: float):
+ def record_handle_metrics(self, data, send_timestamp: float):
please replace the type hint
I'm passing a big mix of ids and request metrics here, so the type hints would be very messy. I will add a follow-up PR that modifies the metrics pusher to use kwargs instead of args for the callback function, so we can avoid shoving everything into this `data` parameter.
ok sounds good
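The follow-up idea of passing kwargs rather than packing everything into one `data` argument could look roughly like this sketch (the pusher and callback names here are hypothetical; the real metrics-pusher interface may differ):

```python
from typing import Callable


def push_metrics(callback: Callable[..., None], **metrics):
    # Instead of shoving ids and request metrics into one opaque
    # `data` argument, forward them as named keyword arguments so
    # each can carry its own type hint on the callback side.
    callback(**metrics)


received = {}


def record_handle_metrics(*, deployment_id: str, handle_id: str,
                          queued_requests: int, send_timestamp: float):
    # Each field now has a clear name and type, instead of being
    # unpacked from a mixed-type `data` blob.
    received.update(
        deployment_id=deployment_id,
        handle_id=handle_id,
        queued_requests=queued_requests,
        send_timestamp=send_timestamp,
    )


push_metrics(record_handle_metrics, deployment_id="d1", handle_id="h1",
             queued_requests=3, send_timestamp=0.0)
```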
Collect request metrics for autoscaling at handles instead of replicas. This will allow queued metrics to be taken into account instead of just ongoing requests.

(3x) DeploymentHandle streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10)

| Collect on handles | Original |
| --- | --- |
| 12140.58 +- 431.94 tokens/s | 12365.42 +- 353.48 tokens/s |
| 12119.03 +- 255.58 tokens/s | 12395.43 +- 642.92 tokens/s |
| 12168.44 +- 653.06 tokens/s | 12365.76 +- 680.51 tokens/s |

(5x) HTTP streaming throughput (num_replicas=1, tokens_per_request=1000, batch_size=10, use_intermediate_deployment=False)

| Collect on handles | Original |
| --- | --- |
| 141118.93 +- 103940.74 tokens/s | 143454.65 +- 103615.47 tokens/s |
| 225063.66 +- 5347.35 tokens/s | 228244.37 +- 6209.89 tokens/s |
| 225684.1 +- 3262.97 tokens/s | 221354.73 +- 2928.82 tokens/s |
| 220755.65 +- 6837.1 tokens/s | 188224.32 +- 78546.09 tokens/s |
| 221404.26 +- 3427.73 tokens/s | 223172.25 +- 4064.79 tokens/s |

(4x) DeploymentHandle throughput (num_replicas=1, batch_size=100)

| Collect on handles | Original |
| --- | --- |
| 1766.19 +- 15.25 requests/s | 1819.54 +- 5.26 requests/s |
| 1760.92 +- 51.87 requests/s | 1762.08 +- 21.04 requests/s |
| 1796.22 +- 10.52 requests/s | 1750.08 +- 31.73 requests/s |
| 1788.1 +- 29.98 requests/s | 1779.63 +- 24.86 requests/s |

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@edoakes conflicts fixed and tests are passing!
Signed-off-by: tterrysun <terry@anyscale.com>