[Serve] Memory leak after upgrading from 2.8.1 to 2.9.0 #42144
Comments
Thanks @morhidi for all your help with narrowing down this issue. With the following repro, we observe a memory leak:

Repro:

import logging

from fastapi import FastAPI, Request
from fastapi.encoders import jsonable_encoder
from ray import serve
from starlette.responses import JSONResponse

logger = logging.getLogger("ray.serve")

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class ModelServer:
    def __init__(self):
        logger.info("Initialized")

    @app.post("/inference")
    def inference(self, request: Request) -> JSONResponse:
        response = {
            "result": "OK"
        }
        return JSONResponse(content=jsonable_encoder(response))


esp_model_app = ModelServer.bind()

[Screenshots: 2.8.1 head node memory vs. 2.9.0 head node memory]
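For reference, here is a minimal sketch of how the repro above could be deployed and exercised; the module name app and the default Serve HTTP address are assumptions, not part of the original report.

import requests
from ray import serve

from app import esp_model_app  # hypothetical module name for the repro script above

# Start Serve and deploy the FastAPI-wrapped application locally.
serve.run(esp_model_app)

# Hit the ingress route once; the deployment should return {"result": "OK"}.
resp = requests.post("http://127.0.0.1:8000/inference")
print(resp.json())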
The process growing in memory seems to be the GCS server. Here's some supporting output: [screenshot]
Here are the keys from the internal KV list in the Ray 2.9.0 cluster: [attachment: 2.9.0 KV keys]
And from the 2.8.1 cluster: [attachment: 2.8.1 KV keys]
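As context, here is a rough sketch of one way such keys could be listed from a driver attached to the running cluster. Note that _internal_kv_list is a private, unstable Ray API; it is shown here purely as a debugging illustration, not a supported interface.

import ray
from ray.experimental.internal_kv import _internal_kv_list

ray.init(address="auto")  # attach to the existing cluster

# List every key stored in the GCS internal KV (an empty prefix matches all
# keys in the default namespace) and print them for comparison across versions.
for key in sorted(_internal_kv_list("")):
    print(key.decode(errors="replace"))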
I'll take a look.
It seems the events data takes up some space. Let me increase the reporting interval and reduce the observability events.
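The comment above doesn't spell out which settings are meant, so the following is only an illustrative sketch: it assumes the growth comes from task-event export to the GCS and uses two internal Ray config overrides for that path (RAY_task_events_report_interval_ms and RAY_task_events_max_num_task_in_gcs). These are unstable internals and must be set before the cluster, in particular the head node/GCS, starts.

import os

# Report task events to the GCS less often (value in milliseconds).
os.environ["RAY_task_events_report_interval_ms"] = "10000"

# Cap the number of task events the GCS keeps in memory.
os.environ["RAY_task_events_max_num_task_in_gcs"] = "10000"

import ray  # imported after the env vars so a locally started cluster picks them up

ray.init()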
This logic doesn't exist in 2.8.1, so it's very likely the cause.
I removed the logic and ran it again. Waiting for the results.
The memory looks stable after removing it. I'll sync with @rickyyx to see how to fix it. |
Great, @iycheng and @shrekris-anyscale. Thanks for the fast repro and investigation. |
The memory is pretty stable. I'll dig deeper.
It looks like this field never gets garbage collected: https://github.com/ray-project/ray/blob/master/src/ray/protobuf/gcs.proto#L204. I'm not sure what the correct logic for this is; I'll sync offline with @rickyyx.
Thank you, folks. We're on KubeRay; where should this property go?
Set the
I've applied the suggested workaround and can confirm things look better with it: [screenshot]. Looking forward to the final fix. Thanks @shrekris-anyscale and @rickyyx for the extra effort with the investigation.
Thanks, folks, for the fix!
What happened + What you expected to happen
After upgrading the Ray version from 2.8.1 to 2.9.0, we noticed a memory leak (ray_node_mem_used) on the head node.

Versions / Dependencies
2.9.0
Reproduction script
Load is 200 QPS
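For completeness, a hypothetical load generator along the following lines can approximate the 200 QPS figure; the endpoint URL assumes the default Serve HTTP address for the repro above, and none of this code comes from the original report.

import asyncio

import aiohttp

URL = "http://127.0.0.1:8000/inference"  # default Serve HTTP address for the repro
QPS = 200


async def fire(session: aiohttp.ClientSession) -> None:
    # Issue one POST and drain the response body.
    async with session.post(URL) as resp:
        await resp.read()


async def main() -> None:
    pending: set[asyncio.Task] = set()
    async with aiohttp.ClientSession() as session:
        while True:
            # Fire requests at a fixed rate instead of waiting for each response.
            task = asyncio.create_task(fire(session))
            pending.add(task)
            task.add_done_callback(pending.discard)
            await asyncio.sleep(1 / QPS)


if __name__ == "__main__":
    asyncio.run(main())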
Issue Severity
High: It blocks me from completing my task.