[Core] Data jobs able to trigger persistent mmap file leak in between job runs #39229
I found a similar problem. The Plasma object store in the raylet process doesn't release the mmapped memory of objects used by a finished job. My environment is:
My test script is:
Run the following shell command:
You will find that the disk usage increases.
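The exact monitoring command from this comment was truncated in the page capture; a plausible way to watch the growth, assuming the default Ray temp directory (the `/tmp/ray` path is an assumption about the setup), is:

```shell
# Report the total size of Ray's temp directory, where plasma
# fallback-allocation files live (default path; adjust if the cluster
# was started with a custom --temp-dir).
du -sh /tmp/ray 2>/dev/null || echo "no /tmp/ray directory found"
```

Running this between jobs should show the directory shrinking back once objects are released; in the leaking case it only grows.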
cc @c21
Do you know if it's disk usage or object store memory itself that's leaked? It would be useful to post graphs of physical disk usage as well as the logical object store memory usage from the Ray dashboard here.
Thanks for the screenshots. I think I can confirm the mmap file leaks from the following observations:
That's 200GB of leaked plasma files, whereas normal object store allocation would create at most 2GB of files in /tmp (maybe an additional 2GB for fallback allocation). I'm going to tag this as P0 until further verification, since this seems like a possibly serious regression in a core component.
@rynewang any update on this?
I think Ray successfully released all object data in the object store; but then Ray did not release the mmap'd files. Further tracking down the release code... |
Update: I found the issue to be that the Plasma client (the core worker's and raylet's client to shared memory) never releases mmapped files. This only happens under high memory pressure, when main memory is not enough and Ray is forced to allocate files and mmap them. After the memory pressure subsides, we never munmap the files even though they are no longer used. Actively working on PRs to tackle this. Preview: #40370 (not yet ready to merge)
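To illustrate the mechanism described above, here is a minimal, self-contained sketch in plain Python (not Ray's actual Plasma code; names are illustrative) of what a fallback allocation does, and the munmap step the Plasma client was missing:

```python
import mmap
import os
import tempfile

# Sketch of a fallback allocation: under memory pressure, the plasma server
# creates a disk-backed file and mmaps it instead of using /dev/shm.
fd, path = tempfile.mkstemp(prefix="plasma-fallback-")
os.ftruncate(fd, 4096)        # size the backing file
region = mmap.mmap(fd, 4096)  # map it into the process
region[:5] = b"hello"         # object data is written through the mapping
data = bytes(region[:5])      # and read back via the same mapping

# The bug: after all objects in the region were released, this munmap never
# happened in the plasma client, so the mapping (and its file) stayed alive.
region.close()                # munmap
os.close(fd)
os.unlink(path)               # only now can the disk space be reclaimed
```

Until both `close()` calls and the `unlink` run, the kernel keeps the file's pages resident, which is why the leaked files showed up as persistent disk usage under /tmp.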
Tested that master-branch Ray leaks on this repro script, and Ray with this PR no longer leaks. Next step: fix the tests, and @rkooo567 will merge it.
The fix will be included in 2.8 |
…maps on release. (#40370)

Plasma memory sharing works this way: the plasma server creates a temp file and mmaps it; then, upon a plasma client's Get, the server sends the fd to the client, who mmaps that fd. The client can later Release an object, and once all clients have released their refs to an fd, the server unmaps it. The missing piece: the plasma client never unmaps. This is normally not a problem, because we don't want to unmap the main memory in /dev/shm anyway; but under memory pressure, when we do fallback allocations (mmaps to disk files in /tmp/ray), we leak mmaps by never unmapping them in the plasma client, even when nobody is using those mmap files. To make things worse, the raylet itself has a plasma client, so even if the core workers exit we are still leaking.

A good place for a plasma client to unmap is at Release, after which it may no longer read or write an object ID. However, a mmap region may be used by more than one object (this is NOT the case today for fallback allocations, but we want to be future proof); also, if a mmap region is unmapped and mapped again, the plasma client fails, because the plasma server did not know the client unmapped it and hence would not resend the fd.

This PR allows the plasma server to ask a plasma client to unmap. The server maintains a per-client ref count table {object ID -> mmap fd}, and if a given fd is no longer referenced after a Release request, the server replies with a boolean "should_unmap" ordering the client to unmap. The client MUST unmap. If, some time later, the same fd needs to be mapped again for a Get request, that is fine: the server knows the client no longer mmaps that fd and will send it, and the client, having dropped its knowledge of that fd, receives it and maps it again. Fixes #39229
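The per-client ref-counting protocol described above can be sketched as follows. This is a simplified model, not Ray's actual C++ implementation; the class and method names are invented for illustration:

```python
from collections import defaultdict


class PlasmaServerModel:
    """Toy model of the server-side bookkeeping described in #40370:
    per client, track which mmap fd each object ID lives in, and tell the
    client to unmap an fd once none of its objects are referenced."""

    def __init__(self):
        # client_id -> {object_id: mmap fd}
        self.refs = defaultdict(dict)

    def on_get(self, client_id, object_id, fd):
        # The client fetched this object; it now references fd's mapping.
        self.refs[client_id][object_id] = fd

    def on_release(self, client_id, object_id):
        # Returns should_unmap: True iff no other object held by this
        # client still lives in the same mmap region.
        fd = self.refs[client_id].pop(object_id)
        return fd not in self.refs[client_id].values()


server = PlasmaServerModel()
server.on_get("client-1", "obj-a", fd=7)
server.on_get("client-1", "obj-b", fd=7)          # two objects share a region
first = server.on_release("client-1", "obj-a")    # region still referenced
second = server.on_release("client-1", "obj-b")   # last ref: client must unmap
```

Here `first` is False (the region still holds obj-b) and `second` is True, which models the "should_unmap" signal that orders the client to munmap the now-unused region.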
We fixed this on master, but the fix wasn't included in Ray 2.8 due to the risk of last-minute changes. The fix is available on master and will ship in Ray 2.9.
What happened + What you expected to happen
When I have a long-running Ray cluster that I'm running multiple jobs on (for iteration in development), eventually I start getting OOMs during the initial SQL read (i.e. SplitBlocks). I'm under the impression that Ray is sometimes "spilling" some data into tmpfs or shmem, but I don't think it should persist between jobs. Is there any way to clean up this junk between runs so that I don't have to restart my cluster?

I'm aware that spilling in Ray happens when object memory is high, but I have 500GiB of object memory and my usage never comes anywhere near that.
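For reference, object spilling (a separate mechanism from the fallback-allocation mmaps that turn out to be the culprit later in this thread) writes objects to a configurable directory. In Ray it is controlled by the `object_spilling_config` system config, which takes a JSON payload along these lines (the directory path here is illustrative, not from the original report):

```json
{
  "type": "filesystem",
  "params": {"directory_path": "/tmp/ray_spill"}
}
```

Pointing spilling at a distinct directory like this makes it easy to tell spill files apart from plasma fallback files when debugging disk growth.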
What's more, the OOMs always land just slightly above Ray's 95% kill threshold; it's as if there is a rounding error happening somewhere.
(This is only tangentially related to Datasets; I'm writing it here because my jobs only use the Datasets library, and I'm wondering whether this phenomenon is specific to it. If this is the wrong place to post, please let me know.)
Versions / Dependencies
Ray 2.6.1 using this Docker image: ray:2.6.1.7474f8-py38-cpu. You can pull down the Docker image I have on top of it here, but I don't think it's relevant.
Python 3.8.13
Running using Ray clusters (i.e. not in KubeRay)
Running on a fleet of AWS m5.4xlarge and g5.2xlarge instances
Reproduction script
Run a script like this repeatedly. Eventually, machines will OOM.
Issue Severity
Low: It annoys or frustrates me.