
[Ray Serve] Memory leak in 2.6 #38089

Closed
akshay-anyscale opened this issue Aug 3, 2023 · 8 comments · Fixed by #38152
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P0 (Issues that should be fixed in short order), release-blocker (P0 Issue that blocks the release), serve (Ray Serve Related Issue)

Comments

@akshay-anyscale (Contributor)

What happened + What you expected to happen

We are seeing a memory leak in the HTTPProxy on multiple Serve clusters in Anyscale Workspaces/Services. This regression was likely introduced in Ray 2.6.
(Screenshot of the memory usage over time, dated 2023-08-03, attached in the original issue.)

Versions / Dependencies

Ray 2.6.1

Reproduction script

Run a Serve deployment and send requests to it repeatedly; an illustrative sketch is shown below.
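
A minimal sketch of such a reproduction, assuming a local Ray 2.6 cluster started by the driver; the deployment name and request count are illustrative, not taken from the issue:

```python
import requests
from ray import serve

# Trivial deployment; any Serve app served over HTTP should exhibit the behavior.
@serve.deployment
class Echo:
    def __call__(self, request) -> str:
        return "ok"

serve.run(Echo.bind())

# Send many requests through the HTTP proxy and watch its memory usage
# (e.g. in the Ray dashboard) grow over time.
for _ in range(10_000):
    requests.get("http://127.0.0.1:8000/")
```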

Issue Severity

High: It blocks me from completing my task.

akshay-anyscale added the bug, release-blocker, and P0 labels on Aug 3, 2023
@zhe-thoughts (Collaborator)

When did this behavior start to happen? E.g. is it between 2.6.0 and 2.6.1?

@akshay-anyscale (Contributor, Author)

Between 2.5.1 and 2.6

@edoakes (Contributor) commented Aug 3, 2023

After much profiling and pain, we discovered that the root cause is a bug in the Ray Core streaming object ref generator code that causes the "end of stream" object to never be removed from the in-memory object store.

Verified by:

  1. Adding log statements for the number of objects in the in-memory store on each request and seeing that it increases monotonically by 1 per request (a sketch of this kind of check follows this comment).
  2. Logging all objects deleted from the in-memory store and observing that the end-of-stream object is never among them.
  3. Confirming that the destructor for the stream object is being called, so it's not an issue in the Python layer.

@rkooo567 is taking it from here
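
A hedged sketch of the kind of per-request object-count check described in point 1, using the public Ray state API rather than the internal log statements mentioned above; the endpoint URL and the `limit` value are assumptions:

```python
import requests
from ray.util.state import list_objects

def count_objects() -> int:
    # Each entry returned by the state API corresponds to one object reference
    # currently tracked by the cluster.
    return len(list_objects(limit=10_000))

before = count_objects()
requests.get("http://127.0.0.1:8000/")  # one request through the Serve HTTP proxy
after = count_objects()

# With the leak, the delta stays positive on every request because the
# "end of stream" object is never freed.
print(f"objects before={before}, after={after}, delta={after - before}")
```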

@edoakes (Contributor) commented Aug 3, 2023

Also verified that the leak is not present with RAY_SERVE_ENABLE_EXPERIMENTAL_STREAMING=0, so we can recommend that workaround to any users hitting problems (sketch below).
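
A sketch of how a user might apply that workaround, assuming a single-node setup where the driver starts Ray locally (child processes then inherit the environment); on a multi-node cluster the variable would need to be exported wherever the proxy processes run:

```python
import os

# Disable the experimental streaming code path before Ray/Serve start so the
# leaking path is not exercised. (Flag name taken from the comment above.)
os.environ["RAY_SERVE_ENABLE_EXPERIMENTAL_STREAMING"] = "0"

import ray
from ray import serve

ray.init()
serve.start()
# ...deploy and query the application as usual...
```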

@edoakes (Contributor) commented Aug 3, 2023

For posterity, some flamegraphs taken over time as the leak occurred for the leaking (flamegraph_{i}) and non-leaking (ffs_off_flamegraph_{i}) cases.

flamegraphs.zip

akshay-anyscale added the serve and core labels on Aug 3, 2023
@rkooo567 (Contributor) commented Aug 5, 2023

I am investigating it now. Btw, how did you guys find the memory leak? Do you run release tests with memory usage now?

@edoakes (Contributor) commented Aug 8, 2023

Re-opening until the fix is cherry-picked.

@edoakes (Contributor) commented Aug 8, 2023

NVM it's already cherry-picked!

edoakes closed this as completed on Aug 8, 2023