
[serve] collect request metrics at handles #42578

Merged

merged 1 commit into ray-project:master from autoscaling-metrics-handle-3 on Feb 1, 2024

Conversation

@zcin (Contributor) commented Jan 22, 2024

[serve] collect request metrics at handles

Collect request metrics for autoscaling at handles instead of replicas. This allows queued requests to be taken into account for autoscaling, instead of only ongoing requests.
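
The gist: each handle tracks both the requests it has queued and the requests it has sent to each replica, and periodically reports both to the controller. Below is a minimal sketch of that bookkeeping; the class and method names are hypothetical, and only the counter names `num_queued_queries` and `num_queries_sent_to_replicas` come from this PR:

```python
from collections import defaultdict
import threading


class HandleMetricsTracker:
    """Hypothetical tracker; illustrates the idea, not the actual Serve code."""

    def __init__(self):
        self._lock = threading.Lock()
        # Requests submitted to the handle but not yet assigned to a replica.
        self.num_queued_queries = 0
        # replica_id -> number of requests currently running on that replica.
        self.num_queries_sent_to_replicas = defaultdict(int)

    def on_request_submitted(self):
        with self._lock:
            self.num_queued_queries += 1

    def on_request_scheduled(self, replica_id: str):
        with self._lock:
            self.num_queued_queries -= 1
            self.num_queries_sent_to_replicas[replica_id] += 1

    def on_request_finished(self, replica_id: str):
        with self._lock:
            self.num_queries_sent_to_replicas[replica_id] -= 1

    def snapshot(self):
        # What a periodic metrics push to the controller could report:
        # total pending load = queued + running, not just running.
        with self._lock:
            return self.num_queued_queries, dict(self.num_queries_sent_to_replicas)
```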

(3x) DeploymentHandle streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10)

| Collect on handles | Original |
| --- | --- |
| 12140.58 +- 431.94 tokens/s | 12365.42 +- 353.48 tokens/s |
| 12119.03 +- 255.58 tokens/s | 12395.43 +- 642.92 tokens/s |
| 12168.44 +- 653.06 tokens/s | 12365.76 +- 680.51 tokens/s |

(5x) HTTP streaming throughput (num_replicas=1, tokens_per_request=1000, batch_size=10, use_intermediate_deployment=False)

| Collect on handles | Original |
| --- | --- |
| 141118.93 +- 103940.74 tokens/s | 143454.65 +- 103615.47 tokens/s |
| 225063.66 +- 5347.35 tokens/s | 228244.37 +- 6209.89 tokens/s |
| 225684.1 +- 3262.97 tokens/s | 221354.73 +- 2928.82 tokens/s |
| 220755.65 +- 6837.1 tokens/s | 188224.32 +- 78546.09 tokens/s |
| 221404.26 +- 3427.73 tokens/s | 223172.25 +- 4064.79 tokens/s |

(4x) DeploymentHandle throughput (num_replicas=1, batch_size=100)

| Collect on handles | Original |
| --- | --- |
| 1766.19 +- 15.25 requests/s | 1819.54 +- 5.26 requests/s |
| 1760.92 +- 51.87 requests/s | 1762.08 +- 21.04 requests/s |
| 1796.22 +- 10.52 requests/s | 1750.08 +- 31.73 requests/s |
| 1788.1 +- 29.98 requests/s | 1779.63 +- 24.86 requests/s |

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Streaming HTTP throughput

@zcin zcin changed the base branch from master to autoscaling-metrics-handle-2 January 22, 2024 21:17
@zcin zcin force-pushed the autoscaling-metrics-handle-3 branch 3 times, most recently from b09ae1d to a35cc01 Compare January 24, 2024 22:20
@zcin zcin changed the base branch from autoscaling-metrics-handle-2 to master January 24, 2024 22:20
@zcin zcin force-pushed the autoscaling-metrics-handle-3 branch 19 times, most recently from 6fd962c to 80c366e Compare January 26, 2024 19:28
@zcin zcin marked this pull request as ready for review January 29, 2024 17:31
@zcin zcin requested a review from edoakes January 29, 2024 17:31
@zcin zcin force-pushed the autoscaling-metrics-handle-3 branch 2 times, most recently from 891293b to fe4b473 Compare January 29, 2024 19:02
Resolved (outdated) review threads on: python/ray/serve/tests/test_autoscaling_policy.py, python/ray/serve/tests/BUILD, .buildkite/serve.rayci.yml
@zcin zcin force-pushed the autoscaling-metrics-handle-3 branch 4 times, most recently from 6866c14 to 71a5aa4 Compare January 29, 2024 21:58
Resolved (outdated) review thread on: python/ray/serve/_private/deployment_state.py
Comment on lines 1024 to 1025:

```python
# 1. Only requests sent to RUNNING replicas should be considered
# for the autoscaling algorithm.
```
Reviewer (Contributor):
Shouldn't this pruning happen when we send the metrics instead of only when the set is updated? Or is it happening in both places?

@zcin (author) commented Jan 29, 2024:

Yeah, it can. The reason I didn't add it to the controller side is that whenever the list of running replicas changes on the controller, handles should be updated as well. By pruning on the handle side, we don't have to send a map of (replica_id -> request count) to the controller for every handle (each map would also have to be stored on the controller). I did consider that handles could theoretically be slightly more delayed and we could run into out-of-sync issues; I just wasn't sure how much that would practically impact autoscaling metrics.

Reviewer (Contributor):

It's very hard to reason about how this will behave due to ordering issues.

For example, we could have a query that's mid-scheduling, then update_running_replicas happens, then we finish scheduling. In this case, we'll end up with very strange behavior: the replica will be reported as having 1 ongoing request (in the scheduling codepath, the defaultdict entry starts at 0 and we increment it to 1) until a new update_running_replicas occurs.

@zcin (author):

I see. I've added the filter-by-running-replicas logic to the controller. Each handle now sends a map of {replica_id -> num requests} to the controller, which is stored and used to filter by running replicas during autoscaling. The handle side still prunes its in-memory map so that memory won't grow indefinitely over time as replicas die and new replicas keep getting added.
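
A rough sketch of the resulting split, assuming both sides key their maps by replica ID (the function names here are illustrative, not the PR's actual APIs):

```python
# Controller side: each handle reports {replica_id -> num requests}; filter by
# the current set of RUNNING replicas when computing autoscaling metrics.
def requests_on_running_replicas(handle_metrics, running_replica_ids):
    """handle_metrics: {handle_id: {replica_id: num_requests}}."""
    total = 0
    for replica_map in handle_metrics.values():
        for replica_id, num_requests in replica_map.items():
            if replica_id in running_replica_ids:
                total += num_requests
    return total


# Handle side: prune entries for replicas that are no longer running, so the
# in-memory map doesn't grow without bound as replicas die and get replaced.
def prune_dead_replicas(num_queries_sent_to_replicas, running_replica_ids):
    for replica_id in list(num_queries_sent_to_replicas):
        if replica_id not in running_replica_ids:
            del num_queries_sent_to_replicas[replica_id]
```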

@zcin zcin force-pushed the autoscaling-metrics-handle-3 branch 2 times, most recently from c64a22c to 13fbde4 Compare January 30, 2024 00:53
@zcin zcin self-assigned this Jan 30, 2024
@zcin zcin force-pushed the autoscaling-metrics-handle-3 branch 5 times, most recently from f44e17b to 3d3e730 Compare January 30, 2024 19:59
```python
        else:
            return (self.deployment_id, self.handle_id), self.num_queued_queries

    def process_finished_request(self, replica_tag, *args):
```
Reviewer (Contributor):

Suggested change:

```diff
-    def process_finished_request(self, replica_tag, *args):
+    def process_finished_request(self, replica_tag: str, *args):
```

what are the mysterious *args?

@zcin (author):

That's the return value of the request sent to a replica. Since we don't care about the result, I put it as a generic *args.
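
For illustration, a hedged sketch of the callback shape under discussion; the attribute names come from the snippets in this PR, but the class wrapper and the decrement-on-finish body are assumed:

```python
import threading
from collections import defaultdict


class _Sketch:
    """Illustrative container; only shows the callback's shape."""

    def __init__(self):
        self._queries_lock = threading.Lock()
        self.num_queries_sent_to_replicas = defaultdict(int)

    def process_finished_request(self, replica_tag: str, *args):
        # *args carries the return value of the request sent to the replica;
        # the callback only needs to know that the request finished, so the
        # result itself is ignored.
        with self._queries_lock:
            self.num_queries_sent_to_replicas[replica_tag] -= 1
```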

```python
        self.autoscaling_config = None
        # Track queries sent to replicas for the autoscaling algorithm.
        self.num_queries_sent_to_replicas = defaultdict(int)
        self._queries_lock = threading.Lock()
```
Reviewer (Contributor):

This very much needs a comment, especially given the mixture of threading and asyncio models.

@zcin (author):

Added a comment!
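
To make the concern concrete: a sketch of why the lock is needed, assuming the handle routes requests on the asyncio event loop while a separate metrics-pusher thread reads the counters (this structure is assumed for illustration, not taken from the PR):

```python
import threading
from collections import defaultdict

num_queries_sent_to_replicas = defaultdict(int)
_queries_lock = threading.Lock()


async def on_request_routed(replica_tag: str):
    # Runs on the asyncio event loop thread.
    with _queries_lock:
        num_queries_sent_to_replicas[replica_tag] += 1


def metrics_push_snapshot():
    # Runs on the metrics pusher thread; copy under the lock so the event
    # loop can't mutate the dict mid-read.
    with _queries_lock:
        return dict(num_queries_sent_to_replicas)
```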


@zcin zcin force-pushed the autoscaling-metrics-handle-3 branch 2 times, most recently from 8bf7fac to aa03e3b Compare January 31, 2024 17:27
@zcin zcin requested a review from edoakes January 31, 2024 18:55
```diff
@@ -260,7 +260,7 @@ def record_autoscaling_metrics(self, data: Dict[str, float], send_timestamp: flo
         )
         self.deployment_state_manager.record_autoscaling_metrics(data, send_timestamp)

-    def record_handle_metrics(self, data: Dict[str, float], send_timestamp: float):
+    def record_handle_metrics(self, data, send_timestamp: float):
```
Reviewer (Contributor):

please replace the type hint

@zcin (author) commented Jan 31, 2024:

I'm passing a big mix of IDs and request metrics here, so the type hints would be very messy. I will add a follow-up PR that modifies the metrics pusher to use kwargs instead of args for the callback function, so we can avoid shoving everything into this data parameter.

Reviewer (Contributor):

ok sounds good
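
A sketch of what that follow-up could look like, with the callback taking typed keyword arguments instead of one untyped `data` blob (this signature is hypothetical, not the follow-up PR's actual code):

```python
from typing import Dict


def record_handle_metrics(
    *,
    deployment_id: str,
    handle_id: str,
    queued_requests: float,
    requests_sent_to_replicas: Dict[str, int],  # replica_id -> num requests
    send_timestamp: float,
) -> None:
    """Each field gets its own type hint instead of being packed into `data`."""
    ...
```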

@zcin zcin force-pushed the autoscaling-metrics-handle-3 branch 5 times, most recently from 859e000 to 5625e47 Compare January 31, 2024 20:55
@zcin zcin force-pushed the autoscaling-metrics-handle-3 branch from 5625e47 to baa5f51 Compare January 31, 2024 22:30
@zcin (author) commented Feb 1, 2024:

@edoakes conflicts fixed and tests are passing!

@edoakes edoakes merged commit 23d9b93 into ray-project:master Feb 1, 2024
10 checks passed
tterrysun pushed a commit to tterrysun/ray that referenced this pull request Feb 14, 2024