[Core] Apparent object store memory leak #25779
Comments
An even simpler demonstration: a loop that sends 10 MB to a single task and receives 10 MB back (a minimal sketch follows below). This results in unbounded object spilling, and in log errors.
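A minimal sketch of that loop, reconstructed from the description above; the function name `echo`, the iteration count, and the sizes are illustrative assumptions, not the original script:

```python
import os
import ray

ray.init()

@ray.remote
def echo(payload: bytes) -> bytes:
    # Touch the argument and hand back a fresh buffer of the same size.
    return os.urandom(len(payload))

data_size = 10 * 1024 * 1024  # 10 MB

for _ in range(1000):
    # Each iteration ships ~10 MB to the task and pulls ~10 MB back.
    # No references are retained across iterations, yet on the affected
    # Ray versions the object store fills up and spills without bound.
    ray.get(echo.remote(os.urandom(data_size)))
```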
Thanks for reporting. I think there is a race condition between Ray GC and the spill worker: by the time the spill worker grabs an object for spilling, the object has already been GCed. cc @rkooo567
I'm looking into this in my spare time. I've reproduced it and, interestingly, the issue only repros when the task takes an argument and returns data. Furthermore, the amount of memory leaked per invocation… I'll continue to root-cause later; I just wanted to repro and see whether it was the arg or the rval that was leaking.
In this test, the size of the task argument is 3 MB and the return value is 2 MB. Task definitions:

```python
import os
import ray

arg_data_size = 3 * 1024 * 1024   # 3 MB task argument
rval_data_size = 2 * 1024 * 1024  # 2 MB return value

@ray.remote
def rval_only_worker():
    return os.urandom(rval_data_size)

@ray.remote
def arg_only_worker(x):
    assert len(x) == arg_data_size

@ray.remote
def arg_and_rval_worker(x):
    assert len(x) == arg_data_size
    return os.urandom(rval_data_size)
```

See here for the full test script.
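For context, a driver loop along the following lines would exercise the three tasks. This is only a hedged sketch of what the linked test script might do; `num_iters` and the loop structure are assumptions:

```python
ray.init()

num_iters = 100  # assumed; the real count is in the linked script

for _ in range(num_iters):
    # The three patterns isolate whether the arg, the rval, or both leak.
    ray.get(rval_only_worker.remote())
    ray.get(arg_only_worker.remote(os.urandom(arg_data_size)))
    ray.get(arg_and_rval_worker.remote(os.urandom(arg_data_size)))
```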
Is there an ETA for a resolution on this?
Right now we have no bandwidth to prioritize this until after the next release (Ray 2.4, scheduled for the end of March). We can try fixing it by Ray 2.5.
@cadedaniel Thanks for the test script; it reproduces the exact problem I have been experiencing (though mine involves passing 300 MB arrays around, so I run into trouble often). As a workaround, if you do `ref = ray.put(large_object)` and then `method.remote(ref)`, the problem disappears (a minimal sketch follows below). Here is the test script modified to demonstrate this. Hopefully this will help with tracking down the underlying cause.
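A minimal sketch of that workaround, with a hypothetical `method` task standing in for the real workload:

```python
import os
import ray

ray.init()

@ray.remote
def method(x: bytes) -> int:
    # Stand-in for the real task that consumes the large object.
    return len(x)

large_object = os.urandom(300 * 1024 * 1024)  # ~300 MB, as in the comment above

# Workaround: put the object into the store explicitly and pass the ObjectRef,
# instead of passing the bytes inline as method.remote(large_object).
ref = ray.put(large_object)
result = ray.get(method.remote(ref))
```

Explicitly putting the object and passing the ref appears to sidestep whatever code path was leaking copies of inline task arguments.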
I think this issue has been addressed. I tested the repro script from @merrysailor and the one from @cadedaniel, and both did NOT repro. I used off-the-shelf Ray 2.5, on a local desktop and on a workspace.
Side-by-side test: running the repro script on Ray 1.13.0 vs Ray 2.5. [Attached outputs: Ray 1.13.0 (in progress), Ray 1.13.0 (finished), Ray 2.5 (finished).]
So we do have a bug in Ray 1.13.0: it spills when it feels memory pressure but does not interact well with GC, while 2.5 correctly GCs all used-up objects and does not spill. Repro procedure: … Same thing for the other repro script.
Per the above, we are now closing this, as we believe GC operates correctly in Ray 2.5+. @merrysailor, please re-open if you still observe this issue.
What happened + What you expected to happen
Hello,
I am a new Ray user, trying to parallelize general Python workflows. A very basic example I put together seems to exhaust the object store memory and result in spilling, even though, from what I can see, the code should remove all object references. The logs also highlight some issues:
raylet.out, on every task submission:

```
[2022-06-14 20:59:38,442 I 191450 191450] (raylet) object_buffer_pool.cc:153: Not enough memory to create requested object 00ffffffffffffffffffffffffffffffffffffff0100000002000000, aborting
```

python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_191395.log:

```
[2022-06-14 21:00:49,960 W 191395 191634] reference_count.cc:1225: Spilled object a8485d936ac2e7ccffffffffffffffffffffffff0100000001000000 already out of scope
[2022-06-14 21:00:59,800 W 191395 191634] reference_count.cc:1415: Object locations requested for ec502c4fdc3aeab0ffffffffffffffffffffffff0100000001000000, but ref already removed. This may be a bug in the distributed reference counting protocol.
```
The fact that spilling takes place is highlighted both by the logs and by `ray memory` invocations, e.g.:

```
--- Aggregate object store stats across all nodes ---
Plasma memory usage 152 MiB, 16 objects, 80.0% full, 5.0% needed
Spilled 314 MiB, 33 objects, avg write throughput 235 MiB/s
Objects consumed by Ray tasks: 343 MiB.
```
Thank you.
Versions / Dependencies
Ray: 1.12.1, 1.13.0
Python: 3.9
OS: Ubuntu 20.04 LTS
Reproduction script
Issue Severity
High: It blocks me from completing my task.