
[Serve] Memory leak after upgrading from 2.8.1 to 2.9.0 #42144

Closed
morhidi opened this issue Jan 2, 2024 · 17 comments · Fixed by #42248
Assignees: shrekris-anyscale, rickyyx
Labels: bug (Something that is supposed to be working; but isn't), P0 (Issues that should be fixed in short order), release-blocker (P0 Issue that blocks the release), serve (Ray Serve Related Issue)

Comments

morhidi commented Jan 2, 2024

What happened + What you expected to happen

After upgrading Ray from 2.8.1 to 2.9.0, we noticed a memory leak in ray_node_mem_used on the head node: [screenshot]

Versions / Dependencies

2.9.0

Reproduction script

Load is 200 QPS

import logging

from fastapi import FastAPI, Request
from fastapi.encoders import jsonable_encoder
from ray import serve
from starlette.responses import JSONResponse

logger = logging.getLogger("ray.serve")

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class ModelServer:
    def __init__(self):
        logger.info("Initialized")

    @app.post("/inference")
    def inference(self, request: Request) -> JSONResponse:
        # Trivial handler: always returns a static payload.
        response = {"result": "OK"}
        return JSONResponse(content=jsonable_encoder(response))


esp_model_app = ModelServer.bind()
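
For context (not part of the original report), here is a minimal load-generation sketch that drives roughly 200 QPS against the /inference route; the URL, port, and batch-per-second approach are assumptions, not details from the issue.

import asyncio
import time

import aiohttp

# Assumed default Serve HTTP address and route; adjust to match the deployment.
URL = "http://127.0.0.1:8000/inference"

async def send(session: aiohttp.ClientSession) -> None:
    # Fire one POST and discard the body.
    async with session.post(URL) as resp:
        await resp.read()

async def main(qps: int = 200) -> None:
    async with aiohttp.ClientSession() as session:
        while True:
            # Send one batch of `qps` requests, then sleep out the rest of the second.
            start = time.monotonic()
            await asyncio.gather(*(send(session) for _ in range(qps)))
            await asyncio.sleep(max(0.0, 1.0 - (time.monotonic() - start)))

if __name__ == "__main__":
    asyncio.run(main())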

Issue Severity

High: It blocks me from completing my task.

morhidi added the bug and triage labels on Jan 2, 2024
anyscalesam added the serve label on Jan 3, 2024
shrekris-anyscale (Contributor) commented:

Thanks @morhidi for all your help narrowing down this issue. With the repro script from the issue description, we observe a memory leak:

2.8.1 head node memory:

[screenshot]

2.9.0 head node memory:

[screenshot]

shrekris-anyscale (Contributor) commented:

The process growing in memory seems to be the GCS server. Here are some top outputs over time (courtesy of @morhidi):

[top output screenshots]
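
A minimal monitoring sketch (not from the original thread) of how the gcs_server resident memory could be tracked over time with psutil instead of eyeballing top; the process-name match and polling interval are assumptions.

import time

import psutil

def gcs_rss_mib() -> float:
    # Sum resident memory across any process whose name contains "gcs_server".
    total = 0
    for proc in psutil.process_iter(["name", "memory_info"]):
        try:
            name = proc.info["name"] or ""
            mem = proc.info["memory_info"]
            if "gcs_server" in name and mem is not None:
                total += mem.rss
        except psutil.Error:
            continue
    return total / (1024 * 1024)

if __name__ == "__main__":
    # Print the GCS server's RSS once a minute.
    while True:
        print(f"gcs_server RSS: {gcs_rss_mib():.1f} MiB")
        time.sleep(60)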

shrekris-anyscale (Contributor) commented Jan 4, 2024

Here are the keys from the internal kv list in the Ray 2.9.0 cluster:

2.9.0 KV keys
>>> import ray
>>> import ray.experimental.internal_kv as kv
>>> ray.init(address="auto")
RayContext(dashboard_url='session-44rt3qcdjplsk1v8pl23eg875t.i.anyscaleuserdata-staging.com', python_version='3.9.18', ray_version='2.9.0', ray_commit='34ab695d5248aff4ddecbf5fb7d6e8035f74437b', protocol_version=None)
>>> import json
>>> print(json.dumps([str(key) for key in kv._internal_kv_list("")], indent=4))
[
    "b'CLUSTER_METADATA'",
    "b'ActorClass:01000000:](\\xc1\\xe2\\xe9\\x1e\\xb6\\x9b\\x96\\xbc\\x0f\\xdc&\\x90\"\\x95Y@\\xcf\\xf5\\x07\\xbai\\x9f\\x9f9f\\xfe'",
    "b'extra_usage_tag_serve_http_proxy_used'",
    "b'extra_usage_tag_serve_num_apps'",
    "b'extra_usage_tag_serve_num_deployments'",
    "b'extra_usage_tag_serve_num_gpu_deployments'",
    "b'temp_dir'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:8a1c4ff64b9a2ad2ccaddd3b4c5a88972f5ef16bb895cfdef755013e'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:51b003aa8458aaed48d101050c3429d5eb6b0cd4360dcf848302dbb8'",
    "b'dashboard'",
    "b'extra_usage_tag_gcs_storage'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:239fec11187df96c8e7608901552ecabfefdd5c3a8c3a8d7e56ee0a1'",
    "b'DashboardMetricsAddress'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:b0986c77e97fca702edc1ce3599d387de773ebfee7b54f996bb25f9f'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-application-state-checkpoint'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-app-config-checkpoint'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-deployment-state-checkpoint'",
    "b'extra_usage_tag_dashboard_metrics_prometheus_enabled'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-endpoint-state-checkpoint'",
    "b'extra_usage_tag_num_drivers'",
    "b'extra_usage_tag_serve_get_deployment_handle_api_used'",
    "b'RemoteFunction:01000000:\\xe6\\x9bI\\xc8\\x99\\xcd\\x91\\xcb\\xbb\\xf6]\\xecj\\x05\\x0c\\xa8\\tU{Y\\xe1l\\x04\\xa7$3\\x04O'",
    "b'RemoteFunction:01000000:\\x1e>\\xdc\\xfa\\xe9&l\\x88?+\\xc8\\x17\\x86\\xb5\\x8cUt\\x7f~TQ\\xc7B%l\\xad\\xae\\x9d'",
    "b'library_usage_serve'",
    "b'extra_usage_tag_pg_num_created'",
    "b'extra_usage_tag_serve_rest_api_version'",
    "b'extra_usage_tag_serve_api_version'",
    "b'ActorClass:01000000:\\x1d\\x19\\xaa+\\xce\\x87\\xc7\\xd3\\x06\\xc6\\x9et\\x1a\\xb3\\xcb\\xece\\x99\\x02i\\xfe\\xb2UIo\\x07>\\x8f'",
    "b'head_node_id'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:ca9e904f57a85065adc08c45d3487dfd6cf3632abc544c53a6e43e19'",
    "b'webui:url'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-logging-config-checkpoint'",
    "b'ActorClass:01000000:F\\xb49\\xa2\\xb5\\x97\\x7f\\xa76\\xdd\\xc7\\xb0\\n\\xf9\\xc7\\xc9+5\\x9es\\x00\\x96\\xe2\\xa2\\xbc\\x7f|8'",
    "b'ray_cluster_id'",
    "b'ray_client_server'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:171f40ad49706c7557fb3e6d5f99888a1399eb91eec1c8d94a1a9d09'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:6b833e51c96ef6b7a1cdb4db86afce4c2ee8e2468bb822bd18bfc38a'",
    "b'extra_usage_tag_actor_num_created'",
    "b'hardware_usage_Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:d5c9da700adc3c0046e412223f7ba63f4eb94cacfdfa304a64d8cc33'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:c600bdaa5fde305e6ad5a0861e72a6d50cfaa0ac829eee898bd9874e'",
    "b'session_dir'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:550c9f52029ca67360b183880513790f8b02334d95d0c7aa1e363b95'",
    "b'extra_usage_tag_dashboard_metrics_grafana_enabled'",
    "b'extra_usage_tag_serve_fastapi_used'",
    "b'__autoscaler_v2_enabled'",
    "b'extra_usage_tag_num_actor_tasks'",
    "b'extra_usage_tag_num_actor_creation_tasks'",
    "b'session_name'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:bc01916ca0020fdb0d6c0a2b77177e3caef8ab023bb63ce0c20b971e'",
    "b'extra_usage_tag_num_normal_tasks'",
    "b'extra_usage_tag_dashboard_used'",
    "b'dashboard_rpc'"
]
>>> len(kv._internal_kv_list(""))
53

And the 2.8.1 cluster:

2.8.1 KV keys
>>> import ray
>>> import ray.experimental.internal_kv as kv
>>> import json
>>> ray.init(address="auto")
RayContext(dashboard_url='session-g9ruxv1ijt26a5h2gh142v7slg.i.anyscaleuserdata-staging.com', python_version='3.9.15', ray_version='2.8.1', ray_commit='523c184201976c46d1be3a60d461c4bd9b5e473a', protocol_version=None)
>>> print(json.dumps([str(key) for key in kv._internal_kv_list("")], indent=4))
[
    "b'extra_usage_tag_num_normal_tasks'",
    "b'extra_usage_tag_num_actor_creation_tasks'",
    "b'CLUSTER_METADATA'",
    "b'dashboard'",
    "b'extra_usage_tag_serve_get_deployment_handle_api_used'",
    "b'extra_usage_tag_serve_num_gpu_deployments'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:e81beb9ea4d005441b87b21372c6c37bdf6f6d4c06b832120b093675'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:114c6a567dcfc43f652a9148854d3b48723cf0870eb4b8bdc2e78500'",
    "b'head_node_id'",
    "b'dashboard_rpc'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:b608ef87ab5c62f617638555cfe0fd3653cd2dda06def1c9f8511089'",
    "b'extra_usage_tag_actor_num_created'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:479628f252c0a1b04885e8c5c4383ee6b3de28666a910c621de4e532'",
    "b'ActorClass:01000000:\\xab\\xd6\\xe3S\\xc9\\x871\\x14\\xb3\\xe7:&\\xa0\\xbe\\x9a\\xb7\\x99\\xb0\\xd2F\\xb4\\xf2\\xc2\\xe6\\xb9\\x87\\xfbP'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:c860039280cc6f264ca88e3c2de94c31fe69a250f90547f1a477058d'",
    "b'extra_usage_tag_dashboard_metrics_prometheus_enabled'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-endpoint-state-checkpoint'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:2243b1dc05e3a8219394a6b44ff496bbb278620c0a0cac5e7a77dd58'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:8fb32b2efaa3f04351bd14d9adc480143d681cfca856038d6e44c7c6'",
    "b'DashboardMetricsAddress'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:f9f3025727cd75a00780cea6e863a2e99191cbd939378d0bb58ab4be'",
    "b'extra_usage_tag_serve_fastapi_used'",
    "b'extra_usage_tag_num_drivers'",
    "b'extra_usage_tag_core_state_api_get_log'",
    "b'extra_usage_tag_serve_api_version'",
    "b'extra_usage_tag_serve_rest_api_version'",
    "b'extra_usage_tag_dashboard_used'",
    "b'extra_usage_tag_serve_http_proxy_used'",
    "b'RemoteFunction:01000000:\\xe0\\xae\\xb4\\xf0\\xbb;H\\xc87\\xd7\\xd1\\x9e;\\x92\\x0f\\xa6\\xb6\\x1f)\\xe8ctn\\xb0;\\xb8\\x92O'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-app-config-checkpoint'",
    "b'ray_client_server'",
    "b'__autoscaler_v2_enabled'",
    "b'library_usage_serve'",
    "b'temp_dir'",
    "b'ActorClass:01000000:w|\\x81_\\xfa1\\x7fe\\x17\\xc10\\xfa\\xd8C~:\\x06\\xe7\\xe3+\\xe1#\\xe9\\x1dH\\xfc\\xd3-'",
    "b'extra_usage_tag_gcs_storage'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:ce50e1d00f14860b813ead9425f25c00d4cdc3d8de214935fda0aaf4'",
    "b'session_dir'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:4c6c065351a39adbbeb67c304acdc6cfafb285bc0625440625e3e191'",
    "b'webui:url'",
    "b'session_name'",
    "b'ActorClass:01000000:R\\xa3JE\\xb7R\\x0fFJ8\\xfe\\xac\\xb6\\x0c\\xf9\\x98\\xe2g\\x04\\xbc\\x00\\xf8p\\x11\\xf2\\xa6\\xbc`'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-application-state-checkpoint'",
    "b'extra_usage_tag_num_actor_tasks'",
    "b'DASHBOARD_AGENT_PORT_PREFIX:0f1422470667b3f48f546f6150d094b6ef7c1a1ab815953965873c95'",
    "b'RemoteFunction:01000000:H\\xc0\\xeeu\\x12\\xedy\\xbeN\\x919\\xed\\x84\\xd9\\x9fC\\xc0\\xf40\\xd4\\xc0~A\\xcf\\xc0\\x19\\xd1\\xe2'",
    "b'extra_usage_tag_pg_num_created'",
    "b'extra_usage_tag_dashboard_metrics_grafana_enabled'",
    "b'extra_usage_tag_serve_num_deployments'",
    "b'SERVE_CONTROLLER_ACTOR-_ray_internal_dashboard-serve-deployment-state-checkpoint'",
    "b'ray_cluster_id'",
    "b'extra_usage_tag_serve_num_apps'"
]
>>> len(kv._internal_kv_list(""))
52

shrekris-anyscale self-assigned this on Jan 4, 2024
shrekris-anyscale added the P1 and P0 labels and removed the triage and P1 labels on Jan 4, 2024
fishbone (Contributor) commented Jan 4, 2024

I'll take a look.

fishbone (Contributor) commented Jan 5, 2024

[attachment: prof.pdf]

It seems the event data takes up some space. Let me increase the interval and reduce the observability events.

fishbone (Contributor) commented Jan 5, 2024

This logic doesn't exist in 2.8.1, so it's very likely the cause.

fishbone (Contributor) commented Jan 5, 2024

Removed the logic and ran it again. Waiting for the result.

fishbone (Contributor) commented Jan 5, 2024

The memory looks stable after removing it. I'll sync with @rickyyx to see how to fix it.

mbalassi commented Jan 5, 2024

Great, @iycheng and @shrekris-anyscale. Thanks for the fast repro and investigation.

fishbone (Contributor) commented Jan 5, 2024

The memory is pretty stable. I'll dig deeper.

fishbone (Contributor) commented Jan 5, 2024

It looks like this field is never garbage-collected, so it keeps growing: https://github.com/ray-project/ray/blob/master/src/ray/protobuf/gcs.proto#L204

I'm not sure what the correct logic for this should be; I'll sync offline with @rickyyx.

rickyyx self-assigned this on Jan 5, 2024
fishbone (Contributor) commented Jan 5, 2024

@morhidi, as @rickyyx mentioned, setting RAY_enable_timeline=0 can mitigate this.

morhidi (Author) commented Jan 8, 2024

RAY_enable_timeline=0

Thank you folks, we're on KubeRay. Where should this property go?

shrekris-anyscale (Contributor) commented Jan 8, 2024

Thank you folks, we're on KubeRay. Where should this property go?

Set the RAY_enable_timeline=0 env var directly on the Ray containers in both the headGroupSpec and workerGroupSpec. You can follow the syntax from this example.
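
For illustration, a hedged sketch of where the env var might go in a KubeRay RayCluster manifest; the apiVersion, names, image, and replica count are placeholders, and other fields required by the CRD (for example rayStartParams) are omitted for brevity.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
spec:
  headGroupSpec:
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            env:
              # Disables the timeline events that were accumulating in the GCS.
              - name: RAY_enable_timeline
                value: "0"
  workerGroupSpecs:
    - groupName: worker-group
      replicas: 1
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
              env:
                - name: RAY_enable_timeline
                  value: "0"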

morhidi (Author) commented Jan 8, 2024

I've applied the suggested workaround and can confirm things are looking better with it:
[screenshot]

Looking forward to the final fix. Thanks @shrekris-anyscale and @rickyyx for the extra effort on the investigation.

morhidi (Author) commented Jan 9, 2024

Thanks folks, for the fix!

shrekris-anyscale (Contributor) commented:

I retried the repro with the fix. There's no longer a memory leak:

[screenshot]
