
[Serve] Memory leak after upgrading from 2.8.1 to 2.9.0 #42144

Closed
morhidi opened this issue Jan 2, 2024 · 17 comments · Fixed by #42248
Assignees: shrekris-anyscale, rickyyx
Labels: bug (Something that is supposed to be working; but isn't), P0 (Issues that should be fixed in short order), release-blocker (P0 Issue that blocks the release), serve (Ray Serve Related Issue)

Comments

morhidi commented Jan 2, 2024

What happened + What you expected to happen

After upgrading Ray from 2.8.1 to 2.9.0, we noticed a memory leak in ray_node_mem_used on the head node: [screenshot]

Versions / Dependencies

2.9.0

Reproduction script

Load is 200 QPS

import logging

from fastapi import FastAPI, Request
from fastapi.encoders import jsonable_encoder
from ray import serve
from starlette.responses import JSONResponse

logger = logging.getLogger("ray.serve")

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class ModelServer:
    def __init__(self):
        logger.info("Initialized")

    @app.post("/inference")
    def inference(self, request: Request) -> JSONResponse:
        # Trivial handler: always returns a static payload.
        response = {"result": "OK"}
        return JSONResponse(content=jsonable_encoder(response))


esp_model_app = ModelServer.bind()
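
For context (not part of the original report), here is a minimal load-generation sketch that drives roughly 200 QPS against the /inference route; the URL, port, and batch-per-second approach are assumptions, not details from the issue.

import asyncio
import time

import aiohttp

# Assumed default Serve HTTP address and route; adjust to match the deployment.
URL = "http://127.0.0.1:8000/inference"

async def send(session: aiohttp.ClientSession) -> None:
    # Fire one POST and discard the body.
    async with session.post(URL) as resp:
        await resp.read()

async def main(qps: int = 200) -> None:
    async with aiohttp.ClientSession() as session:
        while True:
            # Send one batch of `qps` requests, then sleep out the rest of the second.
            start = time.monotonic()
            await asyncio.gather(*(send(session) for _ in range(qps)))
            await asyncio.sleep(max(0.0, 1.0 - (time.monotonic() - start)))

if __name__ == "__main__":
    asyncio.run(main())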

Issue Severity

High: It blocks me from completing my task.

morhidi added the bug and triage labels on Jan 2, 2024
anyscalesam added the serve label on Jan 3, 2024
shrekris-anyscale (Contributor) commented:

Thanks @morhidi for all your help narrowing down this issue. With the repro script from the issue description, we observe a memory leak:

2.8.1 head node memory:

[screenshot]

2.9.0 head node memory:

[screenshot]

shrekris-anyscale (Contributor) commented:

The process growing in memory seems to be the GCS server. Here are some top outputs over time (courtesy of @morhidi):

[top output screenshots]
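
A minimal monitoring sketch (not from the original thread) of how the gcs_server resident memory could be tracked over time with psutil instead of eyeballing top; the process-name match and polling interval are assumptions.

import time

import psutil

def gcs_rss_mib() -> float:
    # Sum resident memory across any process whose name contains "gcs_server".
    total = 0
    for proc in psutil.process_iter(["name", "memory_info"]):
        try:
            name = proc.info["name"] or ""
            mem = proc.info["memory_info"]
            if "gcs_server" in name and mem is not None:
                total += mem.rss
        except psutil.Error:
            continue
    return total / (1024 * 1024)

if __name__ == "__main__":
    # Print the GCS server's RSS once a minute.
    while True:
        print(f"gcs_server RSS: {gcs_rss_mib():.1f} MiB")
        time.sleep(60)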

shrekris-anyscale (Contributor) commented Jan 4, 2024

Here are the keys from the internal kv list in the Ray 2.9.0 cluster:

2.9.0 KV keys
>>> import ray
>>> import ray.experimental.internal_kv as kv
>>> ray.init(address="auto")
RayContext(dashboard_url='session-44rt3qcdjplsk1v8pl23eg875t.i.anyscaleuserdata-staging.com', python_version='3.9.18', ray_version='2.9.0', ray_commit='34ab695d5248aff4ddecbf5fb7d6e8035f74437b', protocol_version=None)
>>> import json
>>> print(json.dumps([str(key) for key in kv._internal_kv_list("")], indent=4))
[
    "b'CLUSTER_METADATA'",
    "b'ActorClass:01000000:](\\xc1\\xe2\\xe9\\x1e\\xb6\\x9b\\x96\\xbc\\x0f\\xdc&\\x90\"\\x95Y@\\xcf\\xf5\\x07\\xbai\\x9f\\x9f9f\\xfe'",
    "b'extra_usage_tag_serve_http_proxy_used'",
    "b'extra_usage_tag_serve_num_apps'",
    "b'extra_usage_tag_serve_num_deployments'",
    "b'extra_usage_tag_serve_num_gpu_deployments'",
    "b'temp_dir'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:8a1c4ff64b9a2ad2ccaddd3b4c5a88972f5ef16bb895cfdef755013e'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:51b003aa8458aaed48d101050c3429d5eb6b0cd4360dcf848302dbb8'",
    "b'dashboard'",
    "b'extra_usage_tag_gcs_storage'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:239fec11187df96c8e7608901552ecabfefdd5c3a8c3a8d7e56ee0a1'",
    "b'DashboardMetricsAddress'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:b0986c77e97fca702edc1ce3599d387de773ebfee7b54f996bb25f9f'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-application-state-checkpoint'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-app-config-checkpoint'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-deployment-state-checkpoint'",
    "b'extra_usage_tag_dashboard_metrics_prometheus_enabled'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-endpoint-state-checkpoint'",
    "b'extra_usage_tag_num_drivers'",
    "b'extra_usage_tag_serve_get_deployment_handle_api_used'",
    "b'RemoteFunction:01000000:\\xe6\\x9bI\\xc8\\x99\\xcd\\x91\\xcb\\xbb\\xf6]\\xecj\\x05\\x0c\\xa8\\tU{Y\\xe1l\\x04\\xa7$3\\x04O'",
    "b'RemoteFunction:01000000:\\x1e>\\xdc\\xfa\\xe9&l\\x88?+\\xc8\\x17\\x86\\xb5\\x8cUt\\x7f~TQ\\xc7B%l\\xad\\xae\\x9d'",
    "b'library_usage_serve'",
    "b'extra_usage_tag_pg_num_created'",
    "b'extra_usage_tag_serve_rest_api_version'",
    "b'extra_usage_tag_serve_api_version'",
    "b'ActorClass:01000000:\\x1d\\x19\\xaa+\\xce\\x87\\xc7\\xd3\\x06\\xc6\\x9et\\x1a\\xb3\\xcb\\xece\\x99\\x02i\\xfe\\xb2UIo\\x07>\\x8f'",
    "b'head_node_id'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:ca9e904f57a85065adc08c45d3487dfd6cf3632abc544c53a6e43e19'",
    "b'webui:url'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-logging-config-checkpoint'",
    "b'ActorClass:01000000:F\\xb49\\xa2\\xb5\\x97\\x7f\\xa76\\xdd\\xc7\\xb0\\n\\xf9\\xc7\\xc9+5\\x9es\\x00\\x96\\xe2\\xa2\\xbc\\x7f|8'",
    "b'ray_cluster_id'",
    "b'ray_client_server'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:171f40ad49706c7557fb3e6d5f99888a1399eb91eec1c8d94a1a9d09'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:6b833e51c96ef6b7a1cdb4db86afce4c2ee8e2468bb822bd18bfc38a'",
    "b'extra_usage_tag_actor_num_created'",
    "b'hardware_usage_Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:d5c9da700adc3c0046e412223f7ba63f4eb94cacfdfa304a64d8cc33'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:c600bdaa5fde305e6ad5a0861e72a6d50cfaa0ac829eee898bd9874e'",
    "b'session_dir'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:550c9f52029ca67360b183880513790f8b02334d95d0c7aa1e363b95'",
    "b'extra_usage_tag_dashboard_metrics_grafana_enabled'",
    "b'extra_usage_tag_serve_fastapi_used'",
    "b'__autoscaler_v2_enabled'",
    "b'extra_usage_tag_num_actor_tasks'",
    "b'extra_usage_tag_num_actor_creation_tasks'",
    "b'session_name'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:bc01916ca0020fdb0d6c0a2b77177e3caef8ab023bb63ce0c20b971e'",
    "b'extra_usage_tag_num_normal_tasks'",
    "b'extra_usage_tag_dashboard_used'",
    "b'dashboard_rpc'"
]
>>> len(kv._internal_kv_list(""))
53

And the 2.8.1 cluster:

2.8.1 KV keys
>>> import ray
>>> import ray.experimental.internal_kv as kv
>>> import json
>>> ray.init(address="auto")
RayContext(dashboard_url='session-g9ruxv1ijt26a5h2gh142v7slg.i.anyscaleuserdata-staging.com', python_version='3.9.15', ray_version='2.8.1', ray_commit='523c184201976c46d1be3a60d461c4bd9b5e473a', protocol_version=None)
>>> print(json.dumps([str(key) for key in kv._internal_kv_list("")], indent=4))
[
    "b'extra_usage_tag_num_normal_tasks'",
    "b'extra_usage_tag_num_actor_creation_tasks'",
    "b'CLUSTER_METADATA'",
    "b'dashboard'",
    "b'extra_usage_tag_serve_get_deployment_handle_api_used'",
    "b'extra_usage_tag_serve_num_gpu_deployments'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:e81beb9ea4d005441b87b21372c6c37bdf6f6d4c06b832120b093675'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:114c6a567dcfc43f652a9148854d3b48723cf0870eb4b8bdc2e78500'",
    "b'head_node_id'",
    "b'dashboard_rpc'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:b608ef87ab5c62f617638555cfe0fd3653cd2dda06def1c9f8511089'",
    "b'extra_usage_tag_actor_num_created'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:479628f252c0a1b04885e8c5c4383ee6b3de28666a910c621de4e532'",
    "b'ActorClass:01000000:\\xab\\xd6\\xe3S\\xc9\\x871\\x14\\xb3\\xe7:&\\xa0\\xbe\\x9a\\xb7\\x99\\xb0\\xd2F\\xb4\\xf2\\xc2\\xe6\\xb9\\x87\\xfbP'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:c860039280cc6f264ca88e3c2de94c31fe69a250f90547f1a477058d'",
    "b'extra_usage_tag_dashboard_metrics_prometheus_enabled'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-endpoint-state-checkpoint'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:2243b1dc05e3a8219394a6b44ff496bbb278620c0a0cac5e7a77dd58'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:8fb32b2efaa3f04351bd14d9adc480143d681cfca856038d6e44c7c6'",
    "b'DashboardMetricsAddress'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:f9f3025727cd75a00780cea6e863a2e99191cbd939378d0bb58ab4be'",
    "b'extra_usage_tag_serve_fastapi_used'",
    "b'extra_usage_tag_num_drivers'",
    "b'extra_usage_tag_core_state_api_get_log'",
    "b'extra_usage_tag_serve_api_version'",
    "b'extra_usage_tag_serve_rest_api_version'",
    "b'extra_usage_tag_dashboard_used'",
    "b'extra_usage_tag_serve_http_proxy_used'",
    "b'RemoteFunction:01000000:\\xe0\\xae\\xb4\\xf0\\xbb;H\\xc87\\xd7\\xd1\\x9e;\\x92\\x0f\\xa6\\xb6\\x1f)\\xe8ctn\\xb0;\\xb8\\x92O'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-app-config-checkpoint'",
    "b'ray_client_server'",
    "b'__autoscaler_v2_enabled'",
    "b'library_usage_serve'",
    "b'temp_dir'",
    "b'ActorClass:01000000:w|\\x81_\\xfa1\\x7fe\\x17\\xc10\\xfa\\xd8C~:\\x06\\xe7\\xe3+\\xe1#\\xe9\\x1dH\\xfc\\xd3-'",
    "b'extra_usage_tag_gcs_storage'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:ce50e1d00f14860b813ead9425f25c00d4cdc3d8de214935fda0aaf4'",
    "b'session_dir'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:4c6c065351a39adbbeb67c304acdc6cfafb285bc0625440625e3e191'",
    "b'webui:url'",
    "b'session_name'",
    "b'ActorClass:01000000:R\\xa3JE\\xb7R\\x0fFJ8\\xfe\\xac\\xb6\\x0c\\xf9\\x98\\xe2g\\x04\\xbc\\x00\\xf8p\\x11\\xf2\\xa6\\xbc`'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-application-state-checkpoint'",
    "b'extra_usage_tag_num_actor_tasks'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:0f1422470667b3f48f546f6150d094b6ef7c1a1ab815953965873c95'",
    "b'RemoteFunction:01000000:H\\xc0\\xeeu\\x12\\xedy\\xbeN\\x919\\xed\\x84\\xd9\\x9fC\\xc0\\xf40\\xd4\\xc0~A\\xcf\\xc0\\x19\\xd1\\xe2'",
    "b'extra_usage_tag_pg_num_created'",
    "b'extra_usage_tag_dashboard_metrics_grafana_enabled'",
    "b'extra_usage_tag_serve_num_deployments'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-deployment-state-checkpoint'",
    "b'ray_cluster_id'",
    "b'extra_usage_tag_serve_num_apps'"
]
>>> len(kv._internal_kv_list(""))
52

shrekris-anyscale self-assigned this on Jan 4, 2024
shrekris-anyscale added the P1 and P0 labels and removed the triage and P1 labels on Jan 4, 2024
fishbone (Contributor) commented Jan 4, 2024

I'll take a look.

fishbone (Contributor) commented Jan 5, 2024

[attachment: prof.pdf]

It seems the event data takes up some space. Let me increase the interval and reduce the observability events.

fishbone (Contributor) commented Jan 5, 2024

This logic doesn't exist in 2.8.1, so it's very likely the cause.

fishbone (Contributor) commented Jan 5, 2024

Removed the logic and ran it again. Waiting for the result.

fishbone (Contributor) commented Jan 5, 2024

The memory looks stable after removing it. I'll sync with @rickyyx to see how to fix it.

mbalassi commented Jan 5, 2024

Great, @iycheng and @shrekris-anyscale. Thanks for the fast repro and investigation.

fishbone (Contributor) commented Jan 5, 2024

The memory is pretty stable. I'll dig deeper.

fishbone (Contributor) commented Jan 5, 2024

It looks like this field is never garbage-collected, so it keeps growing: https://github.com/ray-project/ray/blob/master/src/ray/protobuf/gcs.proto#L204

I'm not sure what the correct logic for this should be; I'll sync offline with @rickyyx.

rickyyx self-assigned this on Jan 5, 2024
fishbone (Contributor) commented Jan 5, 2024

@morhidi, as @rickyyx mentioned, setting RAY_enable_timeline=0 can mitigate this.

morhidi (Author) commented Jan 8, 2024

RAY_enable_timeline=0

Thank you folks, we're on KubeRay. Where should this property go?

shrekris-anyscale (Contributor) commented Jan 8, 2024

Thank you folks, we're on KubeRay. Where should this property go?

Set the RAY_enable_timeline=0 env var directly on the Ray containers in both the headGroupSpec and workerGroupSpec. You can follow the syntax from this example.
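
For illustration, a hedged sketch of where the env var might go in a KubeRay RayCluster manifest; the apiVersion, names, image, and replica count are placeholders, and other fields required by the CRD (for example rayStartParams) are omitted for brevity.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
spec:
  headGroupSpec:
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            env:
              # Disables the timeline events that were accumulating in the GCS.
              - name: RAY_enable_timeline
                value: "0"
  workerGroupSpecs:
    - groupName: worker-group
      replicas: 1
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
              env:
                - name: RAY_enable_timeline
                  value: "0"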

morhidi (Author) commented Jan 8, 2024

I've applied the suggested workaround and can confirm things are looking better with it:
[screenshot]

Looking forward to the final fix. Thanks @shrekris-anyscale and @rickyyx for the extra effort on the investigation.

morhidi (Author) commented Jan 9, 2024

Thanks folks, for the fix!

shrekris-anyscale (Contributor) commented:

I retried the repro with the fix. There's no longer a memory leak:

[screenshot]
