What happened + What you expected to happen

If you run the repro script below in a loop, you'll end up with one hung IDLE worker per run.
If you change `max_task_retries` to 0, or shrink the returned numpy array to under 100 KB, no workers are leaked.
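For context, Ray keeps small task return values inline with their owner and promotes larger ones into the shared-memory object store; the report above says the leak only occurs in the latter regime, with ~100 KB as the cutoff. A quick way to pick array sizes on either side of that threshold (the 100 KB figure is taken from the report, not verified against the Ray source):

```python
import numpy as np

THRESHOLD = 100 * 1024  # ~100 KB, the cutoff described in the report above

# np.zeros defaults to float64, i.e. 8 bytes per element.
small = np.zeros(12_000)   # 96,000 bytes -- below the threshold
large = np.zeros(13_000)   # 104,000 bytes -- above the threshold

print(small.nbytes < THRESHOLD)  # True
print(large.nbytes > THRESHOLD)  # True
```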
The two-level indirection in the repro does seem to be required: if I call `actor.test.remote()` directly in the driver, no workers are leaked.
If you look at the logs for one of the leaked workers, you'll see:

```
[2024-04-23 12:42:12,179 I 23271 23525] core_worker.cc:4235: Force exiting worker that owns object. This may cause other workers that depends on the object to lose it. Own objects: 1 # Pins in flight: 0
[2024-04-23 12:42:12,180 I 23271 23525] core_worker.cc:834: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=INTENDED_SYSTEM_EXIT, detail=Worker exits because it was idle (it doesn't have objects it owns while no task or actor has been scheduled) for a long time.
[2024-04-23 12:42:12,180 W 23271 23271] reference_count.cc:54: This worker is still managing 1 objects, waiting for them to go out of scope before shutting down.
```
And the worker never exits.
Versions / Dependencies
Ray 2.9.3, 2.10.0, 2.11.0
Python 3.8.10
Ubuntu 20.04
Reproduction script
Issue Severity
High: It blocks me from completing my task.