[serve] collect request metrics at handles #42578
Conversation
python/ray/serve/_private/router.py (outdated)
# 1. Only requests sent to RUNNING replicas should be considered
# for the autoscaling algorithm.
Shouldn't this pruning happen when we send the metrics instead of only when the set is updated? Or is it happening in both places?
Yeah, it can. The reason I didn't add it to the controller side is that whenever the list of running replicas changes on the controller, handles should be updated as well. By pruning on the handle side, we don't have to send a map of (replica_id -> request count) to the controller for every handle (and each map would also have to be stored in the controller). Theoretically handles will be slightly more delayed and we can run into out-of-sync issues; I just wasn't sure how much that would practically impact autoscaling metrics.
It's very hard to reason about how this will behave due to ordering issues. For example, we could have a query that's mid-scheduling, then `update_running_replicas` happens, then we finish scheduling. In this case, we'll end up with very strange behavior: the replica will be reported as having 1 ongoing request (because in the codepath where we increment, the `defaultdict` will return 0 and we'll increment to 1) until a new `update_running_replicas` occurs.
I see. I've added the filter-by-running-replicas logic to the controller. Each handle now sends a map of {replica_id -> num requests} to the controller, which is stored and used to filter by running replicas during autoscaling. The handle side still prunes its in-memory map so that memory won't grow indefinitely over time as replicas die and new replicas keep getting added.
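A minimal sketch of the handle-side pruning described above. The method names (`update_running_replicas`, etc.) come from the discussion; the surrounding class is hypothetical, not the actual Serve implementation:

```python
import threading
from collections import defaultdict
from typing import List


class HandleMetricsTracker:
    """Hypothetical sketch: per-replica request counts tracked on the handle.

    When the running-replica set is updated, entries for replicas that are no
    longer running are pruned so the map cannot grow without bound as replicas
    die and new ones are added.
    """

    def __init__(self):
        self.num_queries_sent_to_replicas = defaultdict(int)
        self._queries_lock = threading.Lock()

    def inc(self, replica_id: str):
        # Called when a request is assigned to a replica.
        with self._queries_lock:
            self.num_queries_sent_to_replicas[replica_id] += 1

    def dec(self, replica_id: str):
        # Called when a request to a replica finishes.
        with self._queries_lock:
            self.num_queries_sent_to_replicas[replica_id] -= 1

    def update_running_replicas(self, running_replica_ids: List[str]):
        # Prune counters for replicas that are no longer RUNNING.
        running = set(running_replica_ids)
        with self._queries_lock:
            for replica_id in list(self.num_queries_sent_to_replicas):
                if replica_id not in running:
                    del self.num_queries_sent_to_replicas[replica_id]
```

Note that, as the review thread points out, pruning only here leaves a window where a request scheduled after the update re-creates an entry for a dead replica, which is why the filtering was also added on the controller side.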
else:
    return (self.deployment_id, self.handle_id), self.num_queued_queries

def process_finished_request(self, replica_tag, *args):
Suggested change:
- def process_finished_request(self, replica_tag, *args):
+ def process_finished_request(self, replica_tag: str, *args):
what are the mysterious `*args`?
That's the return value of the request sent to a replica. Since we don't care about the result, I accept it as a generic `*args`.
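As a hedged illustration of that pattern (the setup around the callback is hypothetical, not Serve's actual wiring): a done-callback can accept and discard the request's result via `*args`, since only the replica tag matters for the counter:

```python
from collections import defaultdict

# Hypothetical per-replica counter of in-flight requests.
num_queries_sent_to_replicas = defaultdict(int)


def process_finished_request(replica_tag: str, *args):
    # *args carries the request's return value, which we ignore;
    # we only care that a request finished on this replica.
    num_queries_sent_to_replicas[replica_tag] -= 1


# Simulate two in-flight requests, one of which finishes and
# returns some result payload that the callback discards.
num_queries_sent_to_replicas["replica-1"] = 2
process_finished_request("replica-1", {"status": 200, "body": "..."})
```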
self.autoscaling_config = None
# Track queries sent to replicas for the autoscaling algorithm.
self.num_queries_sent_to_replicas = defaultdict(int)
self._queries_lock = threading.Lock()
This very much needs a comment, especially given the mixture of the threading and asyncio models.
Added a comment!
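A hedged sketch of why a plain `threading.Lock` (rather than an `asyncio.Lock`) fits here, assuming the counters are written from the event-loop thread and read from a background metrics-pusher thread (the class and method names are hypothetical):

```python
import threading
from collections import defaultdict


class ReplicaQueryCounters:
    """Sketch of a counter map shared across threading and asyncio.

    Writes happen on the asyncio event-loop thread (as requests are
    scheduled and finished); reads happen on a separate metrics-pusher
    thread. A threading.Lock is appropriate because it is held only for
    brief, non-awaiting critical sections, so taking it on the event
    loop cannot block on an await; an asyncio.Lock could not be
    acquired from the pusher thread at all.
    """

    def __init__(self):
        self._counts = defaultdict(int)
        self._lock = threading.Lock()

    def inc(self, replica_id: str):
        # Event-loop thread: a request was sent to this replica.
        with self._lock:
            self._counts[replica_id] += 1

    def snapshot(self) -> dict:
        # Metrics-pusher thread: copy under the lock so the pusher
        # never iterates a dict being mutated by the event loop.
        with self._lock:
            return dict(self._counts)
```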
@@ -260,7 +260,7 @@ def record_autoscaling_metrics(self, data: Dict[str, float], send_timestamp: float):
      )
      self.deployment_state_manager.record_autoscaling_metrics(data, send_timestamp)

- def record_handle_metrics(self, data: Dict[str, float], send_timestamp: float):
+ def record_handle_metrics(self, data, send_timestamp: float):
please replace the type hint
I'm passing a big mix of ids and request metrics here, so the type hints would be very messy. I will add a follow-up PR that modifies the metrics pusher to use kwargs instead of args for the callback function, so we can avoid shoving everything into this `data` parameter.
ok sounds good
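The follow-up idea of passing kwargs rather than packing everything into one `data` argument could look roughly like this sketch (the pusher and callback names here are hypothetical; the real metrics-pusher interface may differ):

```python
from typing import Callable


def push_metrics(callback: Callable[..., None], **metrics):
    # Instead of shoving ids and request metrics into one opaque
    # `data` argument, forward them as named keyword arguments so
    # each can carry its own type hint on the callback side.
    callback(**metrics)


received = {}


def record_handle_metrics(*, deployment_id: str, handle_id: str,
                          queued_requests: int, send_timestamp: float):
    # Each field now has a clear name and type, instead of being
    # unpacked from a mixed-type `data` blob.
    received.update(
        deployment_id=deployment_id,
        handle_id=handle_id,
        queued_requests=queued_requests,
        send_timestamp=send_timestamp,
    )


push_metrics(record_handle_metrics, deployment_id="d1", handle_id="h1",
             queued_requests=3, send_timestamp=0.0)
```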
Collect request metrics for autoscaling at handles instead of replicas. This will allow queued metrics to be taken into account instead of just ongoing requests.

(3x) DeploymentHandle streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10)

| Collect on handles | Original |
| --- | --- |
| 12140.58 +- 431.94 tokens/s | 12365.42 +- 353.48 tokens/s |
| 12119.03 +- 255.58 tokens/s | 12395.43 +- 642.92 tokens/s |
| 12168.44 +- 653.06 tokens/s | 12365.76 +- 680.51 tokens/s |

(5x) HTTP streaming throughput (num_replicas=1, tokens_per_request=1000, batch_size=10, use_intermediate_deployment=False)

| Collect on handles | Original |
| --- | --- |
| 141118.93 +- 103940.74 tokens/s | 143454.65 +- 103615.47 tokens/s |
| 225063.66 +- 5347.35 tokens/s | 228244.37 +- 6209.89 tokens/s |
| 225684.1 +- 3262.97 tokens/s | 221354.73 +- 2928.82 tokens/s |
| 220755.65 +- 6837.1 tokens/s | 188224.32 +- 78546.09 tokens/s |
| 221404.26 +- 3427.73 tokens/s | 223172.25 +- 4064.79 tokens/s |

(4x) DeploymentHandle throughput (num_replicas=1, batch_size=100)

| Collect on handles | Original |
| --- | --- |
| 1766.19 +- 15.25 requests/s | 1819.54 +- 5.26 requests/s |
| 1760.92 +- 51.87 requests/s | 1762.08 +- 21.04 requests/s |
| 1796.22 +- 10.52 requests/s | 1750.08 +- 31.73 requests/s |
| 1788.1 +- 29.98 requests/s | 1779.63 +- 24.86 requests/s |

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@edoakes conflicts fixed and tests are passing!
Signed-off-by: tterrysun <terry@anyscale.com>