[Core] Returning an object that is >100KB from an Actor with max_task_retries>0 leaks IDLE workers #44931

Closed
jfaust opened this issue Apr 23, 2024 · 4 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), p0.5 (uueeehhh)

Comments

jfaust commented Apr 23, 2024

What happened + What you expected to happen

If you run the repro script below in a loop, you'll end up with one hung IDLE worker per run:
[screenshot attached in the original issue]

If you set max_task_retries to 0, or shrink the returned numpy array to under 100KB, no workers are leaked.
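To see where the repro's return value sits relative to that boundary, here is a minimal sketch using only the standard library. The threshold constant is illustrative (100 KiB, chosen to match the boundary described in this report) and is not read from Ray. How Ray actually measures a return value, including any serialization overhead, is an assumption here and not something this thread confirms.

```python
# Illustrative threshold: 100 KiB, matching the "<100KB" boundary in this report.
# This constant is hypothetical; it is not read from Ray's configuration.
THRESHOLD_BYTES = 100 * 1024


def raw_nbytes(shape, itemsize):
    """Raw buffer size in bytes of a dense array with this shape and element size."""
    total = itemsize
    for dim in shape:
        total *= dim
    return total


def at_or_over_threshold(shape, itemsize, threshold=THRESHOLD_BYTES):
    """True if the raw buffer alone already reaches the threshold."""
    return raw_nbytes(shape, itemsize) >= threshold
```

The repro's array has shape (1024 * 100,) with 1-byte elements, so its raw buffer is exactly 102400 bytes. Any serialization metadata added on top would push the returned object past 100 KiB, which is consistent with the ">100KB" wording in the title.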

This does seem to require the two-level indirection that's in the repro. If I call actor.test.remote() directly in the driver, it does not leak any workers.

If you look at the logs for one of the leaked workers, you'll see:

[2024-04-23 12:42:12,179 I 23271 23525] core_worker.cc:4235: Force exiting worker that owns object. This may cause other workers that depends on the object to lose it. Own objects: 1 # Pins in flight: 0
[2024-04-23 12:42:12,180 I 23271 23525] core_worker.cc:834: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=INTENDED_SYSTEM_EXIT, detail=Worker exits because it was idle (it doesn't have objects it owns while no task or actor has been scheduled) for a long time.
[2024-04-23 12:42:12,180 W 23271 23271] reference_count.cc:54: This worker is still managing 1 objects, waiting for them to go out of scope before shutting down.

And the worker never exits.
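The log pattern above can be checked for mechanically. This is a rough sketch, not an official Ray tool: it scans log lines for a worker that received the exit signal and then reported that it is still managing objects. The function name `find_stuck_workers` is hypothetical, and the regular expressions are derived only from the three log lines quoted here, so other Ray versions may format these messages differently.

```python
import re
from typing import Iterable, List, Tuple

# Message fragments copied from the worker log excerpt in this issue.
EXIT_SIGNAL = "Exit signal received"
STILL_MANAGING = re.compile(r"This worker is still managing (\d+) objects")
# Assumed prefix layout, based on the excerpt: "[<date> <time> <level> <pid> <tid>]".
PREFIX = re.compile(r"\[\S+ \S+ \w (\d+) \d+\]")


def find_stuck_workers(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (pid, owned_object_count) for each worker that logged an exit
    signal and afterwards logged that it still manages objects."""
    exiting = set()
    stuck = {}
    for line in lines:
        prefix = PREFIX.search(line)
        if not prefix:
            continue  # not a log line in the expected format
        pid = int(prefix.group(1))
        if EXIT_SIGNAL in line:
            exiting.add(pid)
        managing = STILL_MANAGING.search(line)
        if managing and pid in exiting:
            stuck[pid] = int(managing.group(1))
    return sorted(stuck.items())
```

Feeding it the lines of a leaked worker's log file should report that worker's pid together with the number of objects it is still holding.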

Versions / Dependencies

Ray 2.9.3, 2.10.0, 2.11.0
Python 3.8.10
Ubuntu 20.04

Reproduction script

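import ray
import numpy as np

ray.init(namespace="test")


# max_task_retries > 0 is needed to trigger the leak; with 0 no worker is leaked.
@ray.remote(max_task_retries=1)
class Actor:

    def test(self):
        # 100 * 1024 bytes of uint8 data; smaller return values do not leak workers.
        return np.zeros((1024 * 100), dtype=np.uint8)


# The actor has to be called from inside a task; calling it from the driver does not leak.
@ray.remote
def test(actor):
    return ray.get(actor.test.remote())


a = Actor.remote()
ray.get(test.remote(actor=a))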
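Running the script "in a loop" can be sketched as a small driver that launches it several times as separate processes. This is only an illustration: the filename `repro.py` is hypothetical, and it assumes the script connects to an already running, long-lived cluster, so that workers left behind by one run are still visible during the next.

```python
import subprocess
import sys
from typing import List, Sequence


def run_repeatedly(cmd: Sequence[str], times: int) -> List[int]:
    """Run `cmd` sequentially `times` times and return the exit codes."""
    codes = []
    for _ in range(times):
        # Each iteration is a fresh driver process, like invoking the script by hand.
        completed = subprocess.run(cmd, check=False)
        codes.append(completed.returncode)
    return codes


# Example usage ("repro.py" is a hypothetical name for the script above):
#   run_repeatedly([sys.executable, "repro.py"], times=5)
```

After the loop finishes, the number of IDLE workers left on the cluster can be compared with the number of runs.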

Issue Severity

High: It blocks me from completing my task.

@jfaust jfaust added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 23, 2024

jfaust commented Apr 23, 2024

Changing the Actor to ray.put() its return value seems to work around this issue:

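import ray
import numpy as np

ray.init(namespace="test")


@ray.remote(max_task_retries=1)
class Actor:

    def test(self):
        # Workaround: put the array in the object store and return the ObjectRef instead.
        return ray.put(np.zeros((1024 * 100), dtype=np.uint8))


@ray.remote
def test(actor):
    # Inner get resolves the actor call to an ObjectRef; outer get fetches the array.
    return ray.get(ray.get(actor.test.remote()))


a = Actor.remote()
ray.get(test.remote(actor=a))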

@anyscalesam anyscalesam added the core Issues that should be addressed in Ray Core label May 3, 2024
@jjyao jjyao self-assigned this May 6, 2024
@jjyao jjyao added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 6, 2024
anyscalesam commented

@jfaust PR in progress to fix this; TY for supporting this.

@anyscalesam anyscalesam added p0.5 uueeehhh and removed P1 Issue that should be fixed within a few weeks labels May 10, 2024

jfaust commented May 10, 2024

@anyscalesam excellent! Any chance it will also fix #44438? (Not sure why I keep running into these).

jjyao commented Jun 24, 2024

This should have been fixed in master by #44214 (I tested locally). Feel free to reopen it if it's not the case for you.

@jjyao jjyao closed this as completed Jun 24, 2024