
[Core] Data jobs able to trigger persistent mmap file leak in between job runs #39229

Closed

plv opened this issue Sep 1, 2023 · 11 comments · Fixed by #40370
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P0 (Issues that should be fixed in short order), ray 2.9 (Issues targeting Ray 2.9 release (~Q4 CY2023)), size:large

Comments

@plv

plv commented Sep 1, 2023

What happened + What you expected to happen

When I have a long-running Ray cluster that I’m running multiple jobs on (for iteration in development), eventually I will start getting OOMs during initial SQL read (i.e. SplitBlocks). I’m under the impression that sometimes Ray is “spilling” some data into tmpfs or shmem, but I don’t think it should persist in between jobs. Is there any way to clean up this junk in between runs so that I don’t have to restart my cluster?

I’m aware that spilling in Ray happens when object memory is high, but I have 500GiB object memory and my usage never reaches anywhere near that.

What's more, the OOMs are always just slightly above Ray's 95% kill threshold -- it's as if there is a rounding error happening somewhere.

Note: this is only tangentially related to Datasets; I'm writing it here because my jobs only use the Datasets library, and I'm wondering whether this phenomenon is specific to the ray library. If this is the wrong place to post this, please let me know.

Versions / Dependencies

  • Ray 2.6.1, using the Docker image ray:2.6.1.7474f8-py38-cpu. You can pull down the Docker image I have built on top of it here, but I don't think it's relevant

  • Python 3.8.13

  • Running using Ray Clusters (i.e. not on KubeRay)

  • Running on a fleet of AWS m5.4xlarge and g5.2xlarge

Reproduction script

Run a script like this repeatedly. Eventually machines will OOM

import ray

ray.init()

@ray.remote
def do_data():
    # We use this Snowflake table: https://app.snowflake.com/marketplace/listing/GZTSZ290BUX1X/cybersyn-inc-cybersyn-llm-training-essentials
    # (connection-factory argument omitted in this snippet)
    ds = ray.data.read_sql("<read huge SQL table here>")
    ds.materialize().show()

ray.get(do_data.remote())

Issue Severity

Low: It annoys or frustrates me.

@plv plv added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 1, 2023
@z4y1b2
Contributor

z4y1b2 commented Sep 5, 2023

I found a similar problem. The plasma object store in the raylet process doesn't release the mmapped memory of objects used by a finished job.

My environment is:

  • MacBook Pro M2 Max, 64 GB RAM, 12 CPU Cores
  • MacOS 13.5.1 (22G90)
  • Python 3.11.4 arm64
  • Ray version 2.6.3

My test script is:

import ray
import pyarrow as pa

schema = pa.schema([('v', pa.binary())])
def map(batch):
    lines = []
    for __ in range(len(batch['id'])):
        lines.append(b'*' * 99)
    return pa.Table.from_arrays([lines], schema=schema)

def test(n):
    ray.init()
    ray.data.DataContext.get_current().execution_options.verbose_progress = True
    ray.data.DataContext.get_current().use_push_based_shuffle = True
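    # Each row maps to ~100 bytes, so n * 100 / 128 MiB targets ~128 MiB blocks;
    # with n = 700,000,000 that's roughly 70 GB of data in ~521 blocks.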
    parallelism = max(1, int(n * 100 / (128 * 1024 * 1024)))
    ds = ray.data.range(n, parallelism=parallelism)\
            .map_batches(map, zero_copy_batch=True)\
            .repartition(1)
    ds.materialize()

if __name__ == '__main__':
    test(700000000)

Run the following shell commands:

ray start --head
python ./test.py

You will see that disk usage increases. Run vmmap PID on the raylet and you will find that a lot of mmapped files still exist in the raylet process; once you run ray stop to kill all the Ray processes, the disk usage decreases.
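For a Linux node (e.g. the AWS instances mentioned in the original report), where vmmap is not available, a rough equivalent is to inspect the raylet's memory maps. The following is a hypothetical diagnostic sketch, not part of Ray, assuming psutil is installed; the "plasma" filter string is an assumption, since mapped-file names can vary by platform and Ray version.

import psutil

def report_plasma_mappings():
    # Hypothetical helper (not part of Ray): list plasma-backed mappings
    # still held by the raylet after a job has finished. Linux only.
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] != "raylet":
            continue
        try:
            # memory_maps() reads /proc/<pid>/smaps; filter on "plasma",
            # since the mmapped files are named like plasmaXXXXXX
            # (exact names are an assumption and may differ).
            maps = [m for m in proc.memory_maps() if "plasma" in (m.path or "")]
        except psutil.AccessDenied:
            continue
        total_rss = sum(m.rss for m in maps)
        print(f"raylet pid={proc.pid}: {len(maps)} plasma mappings, "
              f"{total_rss / 1024 ** 3:.2f} GiB resident")

if __name__ == "__main__":
    report_plasma_mappings()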

@richardliaw richardliaw added the data Ray Data-related issues label Sep 5, 2023
@richardliaw
Contributor

cc @c21

@ericl
Contributor

ericl commented Sep 26, 2023

Do you know if it's disk usage or object store memory itself that's leaked? It would be useful to post graphs of physical disk usage as well as logical object store memory usage from the Ray dashboard here.

@z4y1b2
Contributor

z4y1b2 commented Sep 27, 2023

I ran the test script above using Ray 2.7.0 (Python 3.11.5) on the same machine and the problem still exists.

Here are screenshots of the Ray dashboard for the node.

Before running the script:

After running the script:

There are no visible files under /tmp because the mmapped 'plasmaXXXXXX' files are unlinked right after creation.

Here's the output of vmmap PID after running the script:

leak_vmmap.txt

@ericl
Contributor

ericl commented Sep 27, 2023

Thanks for the screenshots. I think I can confirm the mmap file leak from the following observations:

  • Before: 0/2GB reported object store memory, 381GB disk used.
  • After: 0/2GB reported object store memory, 516GB disk used.

That's roughly 135GB of leaked plasma files, whereas normal object store allocation would create at most 2GB of files in /tmp (plus maybe another 2GB for fallback allocation).

I'm going to tag this as P0 until further verification, since this seems like a possibly serious regression in a core component.

@ericl ericl added P0 Issues that should be fixed in short order core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) data Ray Data-related issues labels Sep 27, 2023
@ericl ericl changed the title [Data] Persistent memory spillover in between job runs [Core] Data jobs able to trigger persistent mmap file leak in between job runs Sep 27, 2023
@anyscalesam
Collaborator

@rynewang any update on this?

@rynewang
Contributor

rynewang commented Oct 5, 2023

I think Ray successfully released all object data in the object store, but then did not release the mmap'd files. I'm tracking the release code down further...

@rynewang
Contributor

Update: I found that the issue is that the plasma client (the core worker's and raylet's client to the shared memory) never releases mmap'd files. This only happens under high memory pressure, when main memory is not enough and Ray is forced to allocate files on disk and mmap them. After the memory pressure subsides, we never munmap those files even though they are no longer used.

Actively working on PRs to tackle this. Preview: #40370 (not yet mature enough to merge).
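To illustrate the behavior being described, here is a deliberately simplified, hypothetical Python sketch of the leaky client-side pattern; the real code is the C++ plasma client, and all names below are made up.

import mmap

class LeakyPlasmaClient:
    # Hypothetical, simplified sketch of the leak described above.
    def __init__(self):
        self.regions = {}   # fd -> mmap'd region received from the server
        self.objects = {}   # object_id -> fd backing that object

    def get(self, object_id, fd, size):
        # On Get, the server sends an fd; the client maps it if unseen.
        if fd not in self.regions:
            self.regions[fd] = mmap.mmap(fd, size)
        self.objects[object_id] = fd
        return self.regions[fd]

    def release(self, object_id):
        # The object reference is dropped, but the mmap'd region is never
        # unmapped -- even when no remaining object lives in it. For
        # fallback-allocation files on disk, this is the leak.
        self.objects.pop(object_id, None)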

@rynewang
Contributor

Tested that master-branch Ray leaks on this repro script, and Ray with this PR no longer leaks. Next step: fix the tests, then @rkooo567 will merge it.

@rkooo567
Contributor

The fix will be included in 2.8

@jjyao jjyao added the release-blocker P0 Issue that blocks the release label Oct 20, 2023
@vitsai vitsai removed the release-blocker P0 Issue that blocks the release label Oct 26, 2023
rkooo567 pushed a commit that referenced this issue Oct 27, 2023
…maps on release. (#40370)

Plasma memory sharing works this way: the plasma server creates a temp file and mmaps it; then, upon a plasma client's Get, the server sends the fd to the client, who mmaps that fd. The client can then Release an object, and once all clients have released their refs to an fd, the server unmaps it. See the missing piece? The plasma client never unmaps.

This is normally not a problem, because we don't want to unmap the main memory in /dev/shm anyway; but under memory pressure, when we do fallback allocations (mmaps to disk files in /tmp/ray), we leak mmaps by never unmapping them in the plasma client, even if nobody is using those mmap files. To make things worse, the raylet itself has a plasma client, so even if the core workers exit we are still leaking.

A good place for a plasma client to unmap is at Release, after which it may no longer read or write an object ID. However, a mmap region may be used by more than one object (this is NOT the case today for fallback allocations, but we want to be future-proof); also, if a mmap region is unmapped and then mapped again, the plasma client fails, because the plasma server does not know the client unmapped it and hence would not re-send the fd.

This PR allows the plasma server to ask a plasma client to unmap. The server maintains a per-client ref count table of {object ID -> mmap fd}, and if, after a Release request, a certain fd is no longer referenced, the server's reply carries a boolean "should_unmap" that orders the client to unmap it. The client MUST unmap.

If, some time later, the same fd needs to be mapped again for a Get request, that is fine: the server knows the client no longer mmaps that fd and will send the fd again, and the client has dropped its knowledge of that fd, so it receives the fd and maps it again.

Fixes #39229
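As a reading aid, here is a minimal hypothetical sketch of the server-side bookkeeping the commit message describes; the actual change lives in the C++ plasma server and client, and all names below are made up.

from collections import defaultdict

class PlasmaServerBookkeeping:
    # Hypothetical sketch of the per-client ref counting described above:
    # track which objects reference which mmap fd for each client, and tell
    # the client to unmap once an fd has no more referenced objects.
    def __init__(self):
        # client_id -> fd -> set of object IDs the client still holds
        self.refs = defaultdict(lambda: defaultdict(set))

    def on_get(self, client_id, object_id, fd):
        self.refs[client_id][fd].add(object_id)

    def on_release(self, client_id, object_id, fd):
        objs = self.refs[client_id][fd]
        objs.discard(object_id)
        should_unmap = len(objs) == 0
        if should_unmap:
            # Forget the fd for this client; if the client Gets another object
            # in the same region later, the server re-sends the fd and the
            # client maps it again.
            del self.refs[client_id][fd]
        # The Release reply carries should_unmap; the client MUST munmap when True.
        return should_unmap

The design point, per the commit message, is that the server, which already tracks which fds it has sent to each client, decides when a client should unmap, so the two sides never disagree about which regions the client has mapped.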
@rkooo567
Contributor

We fixed this in master, but the fix wasn't included in Ray 2.8 due to the risk of last-minute changes. The fix is available on master and will ship in Ray 2.9.

@anyscalesam anyscalesam added ray 2.9 Issues targeting Ray 2.9 release (~Q4 CY2023) and removed ray 2.8 labels Nov 1, 2023