
[Core] Data jobs able to trigger persistent mmap file leak in between job runs #39229

Closed

plv opened this issue Sep 1, 2023 · 11 comments · Fixed by #40370
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P0 (Issues that should be fixed in short order), ray 2.9 (Issues targeting Ray 2.9 release (~Q4 CY2023)), size:large

Comments

@plv

plv commented Sep 1, 2023

What happened + What you expected to happen

When I have a long-running Ray cluster that I’m running multiple jobs on (for iteration in development), eventually I will start getting OOMs during initial SQL read (i.e. SplitBlocks). I’m under the impression that sometimes Ray is “spilling” some data into tmpfs or shmem, but I don’t think it should persist in between jobs. Is there any way to clean up this junk in between runs so that I don’t have to restart my cluster?

I’m aware that spilling in Ray happens when object memory is high, but I have 500GiB object memory and my usage never reaches anywhere near that.

What's more, the OOMs are always just slightly above Ray's 95% kill threshold -- it's as if there is a rounding error happening somewhere.

Note: this is only tangentially related to Datasets; I'm writing it here because my jobs only use the Datasets library, and I'm wondering whether this phenomenon is specific to the ray library. If this is the wrong place to post this, please let me know.

Versions / Dependencies

  • Ray 2.6.1, using the Docker image ray:2.6.1.7474f8-py38-cpu. You can pull down the Docker image I have built on top of it here, but I don't think it's relevant

  • Python 3.8.13

  • Running using Ray Clusters (i.e. not on KubeRay)

  • Running on a fleet of AWS m5.4xlarge and g5.2xlarge

Reproduction script

Run a script like this repeatedly. Eventually machines will OOM

import ray

ray.init()

@ray.remote
def do_data():
    # We use this Snowflake table: https://app.snowflake.com/marketplace/listing/GZTSZ290BUX1X/cybersyn-inc-cybersyn-llm-training-essentials
    # (connection-factory argument omitted in this snippet)
    ds = ray.data.read_sql("<read huge SQL table here>")
    ds.materialize().show()

ray.get(do_data.remote())

Issue Severity

Low: It annoys or frustrates me.

@plv plv added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 1, 2023
@z4y1b2
Contributor

z4y1b2 commented Sep 5, 2023

I found a similar problem. The plasma object store in the raylet process doesn't release the mmapped memory of objects used by a finished job.

My environment is:

  • MacBook Pro M2 Max, 64 GB RAM, 12 CPU Cores
  • MacOS 13.5.1 (22G90)
  • Python 3.11.4 arm64
  • Ray version 2.6.3

My test script is:

import ray
import pyarrow as pa

schema = pa.schema([('v', pa.binary())])
def map(batch):
    lines = []
    for __ in range(len(batch['id'])):
        lines.append(b'*' * 99)
    return pa.Table.from_arrays([lines], schema=schema)

def test(n):
    ray.init()
    ray.data.DataContext.get_current().execution_options.verbose_progress = True
    ray.data.DataContext.get_current().use_push_based_shuffle = True
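    # Each row maps to ~100 bytes, so n * 100 / 128 MiB targets ~128 MiB blocks;
    # with n = 700,000,000 that's roughly 70 GB of data in ~521 blocks.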
    parallelism = max(1, int(n * 100 / (128 * 1024 * 1024)))
    ds = ray.data.range(n, parallelism=parallelism)\
            .map_batches(map, zero_copy_batch=True)\
            .repartition(1)
    ds.materialize()

if __name__ == '__main__':
    test(700000000)

Run the following shell commands:

ray start --head
python ./test.py

You will see that disk usage increases. Run vmmap PID on the raylet and you will find that a lot of mmapped files still exist in the raylet process; once you run ray stop to kill all the Ray processes, the disk usage decreases.
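For a Linux node (e.g. the AWS instances mentioned in the original report), where vmmap is not available, a rough equivalent is to inspect the raylet's memory maps. The following is a hypothetical diagnostic sketch, not part of Ray, assuming psutil is installed; the "plasma" filter string is an assumption, since mapped-file names can vary by platform and Ray version.

import psutil

def report_plasma_mappings():
    # Hypothetical helper (not part of Ray): list plasma-backed mappings
    # still held by the raylet after a job has finished. Linux only.
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] != "raylet":
            continue
        try:
            # memory_maps() reads /proc/<pid>/smaps; filter on "plasma",
            # since the mmapped files are named like plasmaXXXXXX
            # (exact names are an assumption and may differ).
            maps = [m for m in proc.memory_maps() if "plasma" in (m.path or "")]
        except psutil.AccessDenied:
            continue
        total_rss = sum(m.rss for m in maps)
        print(f"raylet pid={proc.pid}: {len(maps)} plasma mappings, "
              f"{total_rss / 1024 ** 3:.2f} GiB resident")

if __name__ == "__main__":
    report_plasma_mappings()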

@richardliaw richardliaw added the data Ray Data-related issues label Sep 5, 2023
@richardliaw
Contributor

cc @c21

@ericl
Contributor

ericl commented Sep 26, 2023

Do you know if it's disk usage or object store memory itself that's leaked? It would be useful to post graphs of physical disk usage as well as logical object store memory usage from the Ray dashboard here.

@z4y1b2
Contributor

z4y1b2 commented Sep 27, 2023

I ran the test script above using Ray 2.7.0 (Python 3.11.5) on the same machine and the problem still exists.

Here are screenshots of the Ray dashboard for the node.

Before running the script:

After running the script:

There are no visible files under /tmp because the mmapped 'plasmaXXXXXX' files are unlinked right after creation.

Here's the output of vmmap PID after running the script:

leak_vmmap.txt

@ericl
Contributor

ericl commented Sep 27, 2023

Thanks for the screenshots. I think I can confirm the mmap file leak from the following observations:

  • Before: 0/2GB reported object store memory, 381GB disk used.
  • After: 0/2GB reported object store memory, 516GB disk used.

That's roughly 135GB of leaked plasma files, whereas normal object store allocation would create at most 2GB of files in /tmp (plus maybe another 2GB for fallback allocation).

I'm going to tag this as P0 until further verification, since this seems like a possibly serious regression in a core component.

@ericl ericl added P0 Issues that should be fixed in short order core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) data Ray Data-related issues labels Sep 27, 2023
@ericl ericl changed the title [Data] Persistent memory spillover in between job runs [Core] Data jobs able to trigger persistent mmap file leak in between job runs Sep 27, 2023
@anyscalesam
Collaborator

@rynewang any update on this?

@rynewang
Contributor

rynewang commented Oct 5, 2023

I think Ray successfully released all object data in the object store, but then did not release the mmap'd files. I'm tracking the release code down further...

@rynewang
Contributor

Update: I found that the issue is that the plasma client (the core worker's and raylet's client to the shared memory) never releases mmap'd files. This only happens under high memory pressure, when main memory is not enough and Ray is forced to allocate files on disk and mmap them. After the memory pressure subsides, we never munmap those files even though they are no longer used.

Actively working on PRs to tackle this. Preview: #40370 (not yet mature enough to merge).
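To illustrate the behavior being described, here is a deliberately simplified, hypothetical Python sketch of the leaky client-side pattern; the real code is the C++ plasma client, and all names below are made up.

import mmap

class LeakyPlasmaClient:
    # Hypothetical, simplified sketch of the leak described above.
    def __init__(self):
        self.regions = {}   # fd -> mmap'd region received from the server
        self.objects = {}   # object_id -> fd backing that object

    def get(self, object_id, fd, size):
        # On Get, the server sends an fd; the client maps it if unseen.
        if fd not in self.regions:
            self.regions[fd] = mmap.mmap(fd, size)
        self.objects[object_id] = fd
        return self.regions[fd]

    def release(self, object_id):
        # The object reference is dropped, but the mmap'd region is never
        # unmapped -- even when no remaining object lives in it. For
        # fallback-allocation files on disk, this is the leak.
        self.objects.pop(object_id, None)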

@rynewang
Contributor

Tested that master-branch Ray leaks on this repro script, and Ray with this PR no longer leaks. Next step: fix the tests, then @rkooo567 will merge it.

@rkooo567
Contributor

The fix will be included in 2.8

@jjyao jjyao added the release-blocker P0 Issue that blocks the release label Oct 20, 2023
@vitsai vitsai removed the release-blocker P0 Issue that blocks the release label Oct 26, 2023
rkooo567 pushed a commit that referenced this issue Oct 27, 2023
…maps on release. (#40370)

Plasma memory sharing works this way: the plasma server creates a temp file and mmaps it; then, upon a plasma client's Get, the server sends the fd to the client, who mmaps that fd. The client can then Release an object, and once all clients have released their refs to an fd, the server unmaps it. See the missing piece? The plasma client never unmaps.

This is normally not a problem, because we don't want to unmap the main memory in /dev/shm anyway; but under memory pressure, when we do fallback allocations (mmaps to disk files in /tmp/ray), we leak mmaps by never unmapping them in the plasma client, even if nobody is using those mmap files. To make things worse, the raylet itself has a plasma client, so even if the core workers exit we are still leaking.

A good place for a plasma client to unmap is at Release, after which it may no longer read or write an object ID. However, a mmap region may be used by more than one object (this is NOT the case today for fallback allocations, but we want to be future-proof); also, if a mmap region is unmapped and then mapped again, the plasma client fails, because the plasma server does not know the client unmapped it and hence would not re-send the fd.

This PR allows the plasma server to ask a plasma client to unmap. The server maintains a per-client ref count table of {object ID -> mmap fd}, and if, after a Release request, a certain fd is no longer referenced, the server's reply carries a boolean "should_unmap" that orders the client to unmap it. The client MUST unmap.

If, some time later, the same fd needs to be mapped again for a Get request, that is fine: the server knows the client no longer mmaps that fd and will send the fd again, and the client has dropped its knowledge of that fd, so it receives the fd and maps it again.

Fixes #39229
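As a reading aid, here is a minimal hypothetical sketch of the server-side bookkeeping the commit message describes; the actual change lives in the C++ plasma server and client, and all names below are made up.

from collections import defaultdict

class PlasmaServerBookkeeping:
    # Hypothetical sketch of the per-client ref counting described above:
    # track which objects reference which mmap fd for each client, and tell
    # the client to unmap once an fd has no more referenced objects.
    def __init__(self):
        # client_id -> fd -> set of object IDs the client still holds
        self.refs = defaultdict(lambda: defaultdict(set))

    def on_get(self, client_id, object_id, fd):
        self.refs[client_id][fd].add(object_id)

    def on_release(self, client_id, object_id, fd):
        objs = self.refs[client_id][fd]
        objs.discard(object_id)
        should_unmap = len(objs) == 0
        if should_unmap:
            # Forget the fd for this client; if the client Gets another object
            # in the same region later, the server re-sends the fd and the
            # client maps it again.
            del self.refs[client_id][fd]
        # The Release reply carries should_unmap; the client MUST munmap when True.
        return should_unmap

The design point, per the commit message, is that the server, which already tracks which fds it has sent to each client, decides when a client should unmap, so the two sides never disagree about which regions the client has mapped.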
@rkooo567
Contributor

We fixed this in master, but the fix wasn't included in Ray 2.8 due to the risk of last-minute changes. The fix is available on master and will ship in Ray 2.9.

@anyscalesam anyscalesam added ray 2.9 Issues targeting Ray 2.9 release (~Q4 CY2023) and removed ray 2.8 labels Nov 1, 2023