
[serve] prevent in memory metric store in handles from growing in memory #44877

Merged: 1 commit merged into ray-project:master from the pr44877 branch on Apr 23, 2024

Conversation

Contributor

@zcin zcin commented Apr 19, 2024

[serve] prevent in memory metric store in handles from growing in memory

There are two potential sources of memory leaks for the `InMemoryMetricsStore` in the handles, which is used to record and report autoscaling metrics:

  1. Old replica ID keys are never removed. We remove old replica keys from `num_queries_sent_to_replicas` when we get an updated list of running replicas from the long poll update, but we don't do any such cleanup for the in-memory metrics store. This means there is leftover, uncleaned data for replicas that are no longer running.
  2. Data points recorded more than `look_back_period_s` ago are only deleted during window-average queries. This should mostly be solved once (1) is solved, because it should only be a problem for replicas that are no longer running.

This PR addresses (1) and (2) by periodically

  • pruning keys that haven't had new data points recorded in the past `look_back_period_s`, and
  • compacting data points that are more than `look_back_period_s` old

(a toy sketch of these two operations is shown below).
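The following minimal sketch illustrates the idea only; `SimpleMetricsStore`, `DataPoint`, and the method names are simplified assumptions for illustration, not the actual Serve internals (the real snippet under review appears later in this conversation).

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List


@dataclass
class DataPoint:
    timestamp: float
    value: float


class SimpleMetricsStore:
    """Toy stand-in for the handle-side in-memory metrics store."""

    def __init__(self) -> None:
        # Maps a key (e.g. a replica ID) to its recorded datapoints.
        self.data: DefaultDict[str, List[DataPoint]] = defaultdict(list)

    def add_metrics_point(self, key: str, value: float) -> None:
        self.data[key].append(DataPoint(time.time(), value))

    def prune_keys(self, start_timestamp_s: float) -> None:
        # (1) Drop keys (e.g. stopped replicas) whose newest datapoint is
        # older than start_timestamp_s.
        for key, points in list(self.data.items()):
            if not points or points[-1].timestamp < start_timestamp_s:
                del self.data[key]

    def compact_datapoints(self, start_timestamp_s: float) -> None:
        # (2) For keys that are still active, drop datapoints that have
        # fallen out of the look-back window.
        for key, points in self.data.items():
            self.data[key] = [p for p in points if p.timestamp >= start_timestamp_s]
```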

Main benchmark results, picked from the full microbenchmark run posted in the comments below:

| metric | master | current changes | % change |
|---|---|---|---|
| http_p50_latency | 11.082594282925129 | 11.626139283180237 | 4.9044924534731305 |
| http_1mb_p50_latency | 11.81719359010458 | 12.776304967701435 | 8.116236484439 |
| http_10mb_p50_latency | 17.57313683629036 | 18.03796272724867 | 2.6450934473940757 |
| http_avg_rps | 204.2 | 195.04 | -4.48579823702252 |
| grpc_p50_latency | 7.965719327330589 | 8.844093419611454 | 11.026927465 |
| grpc_1mb_p50_latency | 17.652496695518494 | 19.921275787055492 | 12.852454418603475 |
| grpc_10mb_p50_latency | 142.39510521292686 | 153.88561598956585 | 8.069456291673038 |
| grpc_avg_rps | 203.35 | 211.01 | 3.766904352102296 |
| handle_p50_latency | 4.890996962785721 | 4.082906059920788 | -16.522007864929765 |
| handle_1mb_p50_latency | 11.582874692976475 | 10.905216448009014 | -5.8505186573275525 |
| handle_10mb_p50_latency | 65.54202642291784 | 67.52330902963877 | 3.0229193615962657 |
| handle_avg_rps | 394.57 | 404.85 | 2.6053678688192194 |

There is no performance degradation in latencies or throughput. All benchmarks were run with autoscaling turned on (instead of `num_replicas=1`, I just set `autoscaling_config={"min_replicas": 1, "max_replicas": 1}`).
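For reference, a minimal sketch of how a deployment can be pinned to a single replica through the autoscaler; this is not the actual microbenchmark code, and the `Echo` deployment below is a made-up example.

```python
from ray import serve


# Pin to one replica via the autoscaler (rather than num_replicas=1) so that
# handle-side autoscaling metrics are recorded during the benchmark run.
@serve.deployment(autoscaling_config={"min_replicas": 1, "max_replicas": 1})
class Echo:
    async def __call__(self, request) -> str:
        return "ok"


app = Echo.bind()
# serve.run(app)  # then drive load against the deployed application
```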

Closes #44870.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>

@zcin zcin force-pushed the pr44877 branch 2 times, most recently from 74e4301 to e5acb10 Compare April 20, 2024 01:04
@zcin zcin marked this pull request as ready for review April 20, 2024 01:04
@zcin zcin requested a review from edoakes April 20, 2024 01:04
@zcin zcin force-pushed the pr44877 branch 4 times, most recently from 96a4a86 to 2f0a091 Compare April 22, 2024 18:52
Comment on lines 139 to 149
```python
def prune_data(self, start_timestamp_s: float):
    """Prune keys that haven't had new data recorded after start_timestamp_s."""
    for key, datapoints in list(self.data.items()):
        if len(datapoints) == 0 or datapoints[-1].timestamp < start_timestamp_s:
            del self.data[key]
```
Contributor
there's probably a better data structure we could be using for this, e.g., numpy arrays instead of lists

but for now this is very unlikely to be a bottleneck
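As an illustration of that suggestion (not part of this PR), per-key timeseries could be kept in numpy arrays so that pruning and window queries become vectorized mask operations; the `NumpyTimeseries` class below is a hypothetical example.

```python
import numpy as np


class NumpyTimeseries:
    """Hypothetical numpy-backed timeseries for one key."""

    def __init__(self) -> None:
        self.timestamps = np.empty(0, dtype=np.float64)
        self.values = np.empty(0, dtype=np.float64)

    def append(self, timestamp_s: float, value: float) -> None:
        # np.append copies; a real implementation would batch or preallocate.
        self.timestamps = np.append(self.timestamps, timestamp_s)
        self.values = np.append(self.values, value)

    def compact(self, start_timestamp_s: float) -> None:
        # Keep only datapoints inside the look-back window.
        keep = self.timestamps >= start_timestamp_s
        self.timestamps = self.timestamps[keep]
        self.values = self.values[keep]

    def window_average(self, start_timestamp_s: float) -> float:
        keep = self.timestamps >= start_timestamp_s
        return float(self.values[keep].mean()) if keep.any() else 0.0
```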

Contributor
nit: name it "prune_keys" to make it clear that this does not prune datapoints that are outside of the window if a key is still active

Comment on lines -234 to +238
```diff
-    def process_finished_request(self, replica_id: ReplicaID, *args):
+    def dec_num_running_requests_for_replica(self, replica_id: ReplicaID, *args):
```
Contributor
nice!

Contributor

@edoakes edoakes left a comment

[non blocking] It's a little worrisome to me that we now have two separate GC paths, one to prune outdated datapoints and one to prune keys with no datapoints. Is it possible to combine them into the same path w/o any performance degradation?

@zcin zcin force-pushed the pr44877 branch 3 times, most recently from 505f57e to 56834d0 Compare April 22, 2024 22:43
Contributor Author

zcin commented Apr 22, 2024

Results from running `python workloads/microbenchmarks.py` on a Linux devbox:

| metric | master | current changes | % change |
|---|---|---|---|
| http_p50_latency | 11.689163744449615 | 12.697989121079445 | 8.630432413172851 |
| http_p90_latency | 17.667977139353752 | 20.538362674415115 | 16.246260182598093 |
| http_p95_latency | 20.711670164018855 | 23.694146424531922 | 14.399979513454909 |
| http_p99_latency | 29.65034229680895 | 29.68140149489045 | 0.10475156667868468 |
| http_1mb_p50_latency | 12.98057846724987 | 12.922706082463264 | -0.4458382569976993 |
| http_1mb_p90_latency | 21.109187789261345 | 20.95024958252907 | -0.7529337856055696 |
| http_1mb_p95_latency | 24.55082470551133 | 23.634034581482396 | -3.7342538795575697 |
| http_1mb_p99_latency | 32.392665147781365 | 29.604270085692402 | -8.60810633940673 |
| http_10mb_p50_latency | 17.51419436186552 | 18.64500530064106 | 6.456539852256671 |
| http_10mb_p90_latency | 26.162455044686794 | 28.389364294707775 | 8.51185122427276 |
| http_10mb_p95_latency | 29.105186089873314 | 31.291070766746994 | 7.510292736572555 |
| http_10mb_p99_latency | 35.415677595883594 | 38.92970275133845 | 9.922230475305938 |
| http_avg_rps | 189.59 | 193.57 | 2.0992668389683056 |
| http_throughput_std | 10.87 | 12.29 | 13.063477460901574 |
| grpc_p50_latency | 11.873401701450348 | 8.261814713478088 | -30.417458103275496 |
| grpc_p90_latency | 18.791097030043602 | 13.284517452120783 | -29.304194263479044 |
| grpc_p95_latency | 21.959066577255722 | 15.328398626297712 | -30.195581982641183 |
| grpc_p99_latency | 29.742918573319905 | 21.216103378683307 | -28.66838764870019 |
| grpc_1mb_p50_latency | 21.94363623857498 | 19.259278662502766 | -12.232966072201613 |
| grpc_1mb_p90_latency | 36.9620030745864 | 30.142956227064136 | -18.448802230122553 |
| grpc_1mb_p95_latency | 42.1659248881042 | 32.7460279688239 | -22.340069485675684 |
| grpc_1mb_p99_latency | 53.054883759468794 | 41.461732257157564 | -21.851242865539554 |
| grpc_10mb_p50_latency | 143.66386830806732 | 141.6999576613307 | -1.3670177963782004 |
| grpc_10mb_p90_latency | 179.3738016858697 | 175.6876302883029 | -2.0550221732057894 |
| grpc_10mb_p95_latency | 194.27295429632062 | 190.28351670131084 | -2.0535218653877862 |
| grpc_10mb_p99_latency | 216.17326522246 | 214.93708996102214 | -0.5718446544098477 |
| grpc_avg_rps | 187.63 | 205.5 | 9.52406331610085 |
| grpc_throughput_std | 17.96 | 12.56 | -30.066815144766146 |
| handle_p50_latency | 4.922645166516304 | 4.540378227829933 | -7.765478228789268 |
| handle_p90_latency | 9.407674893736841 | 8.274185843765737 | -12.048556766409135 |
| handle_p95_latency | 11.222601961344473 | 9.853483550250528 | -12.199652235816483 |
| handle_p99_latency | 15.578116625547406 | 13.329956158995623 | -14.43152930865148 |
| handle_1mb_p50_latency | 11.975042521953583 | 9.734918363392353 | -18.70660713274679 |
| handle_1mb_p90_latency | 17.783071100711826 | 15.391851216554642 | -13.446608128679571 |
| handle_1mb_p95_latency | 20.919007528573268 | 18.27068757265806 | -12.659873812358779 |
| handle_1mb_p99_latency | 28.804545234888774 | 23.77186611294746 | -17.47182286997404 |
| handle_10mb_p50_latency | 68.21819301694632 | 68.3914739638567 | 0.25400987514772044 |
| handle_10mb_p90_latency | 88.46968635916711 | 90.04715215414762 | 1.783057971491342 |
| handle_10mb_p95_latency | 98.49190469831224 | 95.55206522345537 | -2.984853916534369 |
| handle_10mb_p99_latency | 116.04135598987341 | 117.39470710977912 | 1.1662662060100581 |
| handle_avg_rps | 348.2 | 377.13 | 8.308443423319932 |
| handle_throughput_std | 35.27 | 23.24 | -34.1083073433513 |

Contributor Author

zcin commented Apr 22, 2024

Comparing results on master against the basic fix plus combining the GC into a single path:

| metric | master | basic fix + combining GC | % change |
|---|---|---|---|
| http_p50_latency | 11.082594282925129 | 11.626139283180237 | 4.9044924534731305 |
| http_p90_latency | 17.237211205065254 | 17.71053411066532 | 2.7459366829651666 |
| http_p95_latency | 19.190100394189354 | 20.124413166195147 | 4.868722689375282 |
| http_p99_latency | 22.697068899869915 | 25.29432920739053 | 11.44315294181224 |
| http_1mb_p50_latency | 11.81719359010458 | 12.776304967701435 | 8.116236484439 |
| http_1mb_p90_latency | 18.78219097852707 | 19.593118876218796 | 4.3175362161891995 |
| http_1mb_p95_latency | 21.54928250238299 | 22.788667865097523 | 5.751399669930901 |
| http_1mb_p99_latency | 32.356874477118254 | 26.531913559883822 | -18.002236035973308 |
| http_10mb_p50_latency | 17.57313683629036 | 18.03796272724867 | 2.6450934473940757 |
| http_10mb_p90_latency | 25.324736163020134 | 26.30617953836918 | 3.875433761802327 |
| http_10mb_p95_latency | 28.40709909796714 | 30.150835309177623 | 6.138381836163154 |
| http_10mb_p99_latency | 35.19434265792368 | 37.684448473155484 | 7.075301389870359 |
| http_avg_rps | 204.2 | 195.04 | -4.48579823702252 |
| http_throughput_std | 10.59 | 10.05 | -5.099150141643049 |
| grpc_p50_latency | 7.965719327330589 | 9.226929396390915 | 11.026927465 |
| grpc_p90_latency | 12.888458557426931 | 16.000223346054558 | 24.143808778703658 |
| grpc_p95_latency | 15.0979402475059 | 19.054750353097894 | 26.207615348362758 |
| grpc_p99_latency | 21.112139746546735 | 25.18679138273 | 19.300041043209504 |
| grpc_1mb_p50_latency | 17.652496695518494 | 19.921275787055492 | 12.852454418603475 |
| grpc_1mb_p90_latency | 24.58920944482088 | 28.302832320332527 | 15.102652583626796 |
| grpc_1mb_p95_latency | 27.62148445472119 | 32.11428858339785 | 16.26561431208209 |
| grpc_1mb_p99_latency | 34.32560721412301 | 40.012291837483644 | 16.566887186837278 |
| grpc_10mb_p50_latency | 142.39510521292686 | 153.88561598956585 | 8.069456291673038 |
| grpc_10mb_p90_latency | 171.29450254142284 | 196.54369726777077 | 14.740224789316937 |
| grpc_10mb_p95_latency | 180.62871610745782 | 208.2955297082662 | 15.316951920506995 |
| grpc_10mb_p99_latency | 203.08018665760756 | 228.67932429537174 | 12.605433380324914 |
| grpc_avg_rps | 203.35 | 211.01 | 3.766904352102296 |
| grpc_throughput_std | 11.9 | 18.37 | 54.36974789915967 |
| handle_p50_latency | 4.890996962785721 | 4.082906059920788 | -16.522007864929765 |
| handle_p90_latency | 8.330053836107256 | 7.066779211163522 | -15.165263632126525 |
| handle_p95_latency | 10.292354132980105 | 8.812354505062096 | -14.379602652571 |
| handle_p99_latency | 14.295727964490647 | 11.079099290072916 | -22.500628736134033 |
| handle_1mb_p50_latency | 11.582874692976475 | 10.905216448009014 | -5.8505186573275525 |
| handle_1mb_p90_latency | 17.04296506941319 | 16.72532092779875 | -1.8637845018206867 |
| handle_1mb_p95_latency | 19.410823285579664 | 18.967161420732737 | -2.2856416666083623 |
| handle_1mb_p99_latency | 22.920265067368735 | 26.537567153573033 | 15.782112796567095 |
| handle_10mb_p50_latency | 65.54202642291784 | 67.52330902963877 | 3.0229193615962657 |
| handle_10mb_p90_latency | 83.52929335087538 | 84.07336305826902 | 0.6513519815236624 |
| handle_10mb_p95_latency | 90.69845974445342 | 93.42181514948605 | 3.0026479090227154 |
| handle_10mb_p99_latency | 106.92971816286443 | 104.65163111686707 | -2.1304526797009005 |
| handle_avg_rps | 394.57 | 404.85 | 2.6053678688192194 |
| handle_throughput_std | 22.09 | 23.86 | 8.01267541874151 |

Contributor Author

zcin commented Apr 23, 2024

> [non blocking] It's a little worrisome to me that we now have two separate GC paths, one to prune outdated datapoints and one to prune keys with no datapoints. Is it possible to combine them into the same path w/o any performance degradation?

@edoakes I've combined them. It doesn't seem like there's any noticeable performance degradation.
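Continuing the toy store sketched in the PR description above, a combined single-pass GC might look roughly like this; the function name and store layout are illustrative assumptions, not the exact merged implementation.

```python
from typing import Dict, List


def prune_keys_and_compact_data(
    data: Dict[str, List["DataPoint"]], start_timestamp_s: float
) -> None:
    """One walk over the store: compact stale datapoints for live keys and
    drop keys (e.g. stopped replicas) with no recent datapoints at all."""
    for key in list(data.keys()):
        points = data[key]
        if points and points[-1].timestamp >= start_timestamp_s:
            # Key is still active: keep only datapoints inside the window.
            data[key] = [p for p in points if p.timestamp >= start_timestamp_s]
        else:
            # No datapoints newer than the cutoff: the key is stale.
            del data[key]
```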

@zcin zcin self-assigned this Apr 23, 2024
@zcin zcin force-pushed the pr44877 branch 4 times, most recently from d9fb461 to 6ff3878 Compare April 23, 2024 17:58
@edoakes edoakes merged commit 9835610 into ray-project:master Apr 23, 2024
6 checks passed
Successfully merging this pull request may close these issues.

[serve] InMemoryMetricsStore leaks memory with handle-side autoscaling metrics enabled