[Core] Task waiting to start due to insufficient resources even though resources are present and detected #33000
Labels
bug
Something that is supposed to be working; but isn't
core
Issues that should be addressed in Ray Core
core-correctness
Leak, crash, hang
core-scheduler
needs-repro-script
Issue needs a runnable script to be reproduced
P1.5
Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared
stability
What happened + What you expected to happen
Our OSS library launches Ray tasks on a cluster on behalf of our users. One user (@dcavadia) reported (run-house/runhouse#9 (comment)) that the task is failing due to a Ray error, and shared the following stack traces and Ray status:
Status:
We start tasks with
{'CPU': 0.0001, 'GPU': 0.0001}
(if a GPU is present, otherwise just CPU) because we don't want Ray to manage contention; we would rather leave that to the OS. We also want each task to launch a new worker so that we can collect the logs for just that task. We know this is slower, but we see no way to collect logs for a single task, only for a worker.
This looks like a bug to me: the resources appear to be abundantly available, yet the task is being held. I may be misreading it, because I don't recognize the exception.
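For readers who want to see what such a request looks like, here is a minimal sketch of the fractional resource request described above. It is not the runhouse implementation: the helper names `fractional_task_options` and `run_with_ray` are hypothetical, and the sketch assumes the request is passed through Ray's `num_cpus` / `num_gpus` task options rather than as the literal dict shown in the report.

```python
from typing import Any, Callable, Dict


def fractional_task_options(has_gpu: bool, fraction: float = 0.0001) -> Dict[str, float]:
    """Build task options that request a tiny CPU share, plus a tiny GPU share if one exists.

    Hypothetical helper: mirrors the {'CPU': 0.0001, 'GPU': 0.0001} request from the report.
    """
    options = {"num_cpus": fraction}
    if has_gpu:
        options["num_gpus"] = fraction
    return options


def run_with_ray(fn: Callable[..., Any], *args: Any, has_gpu: bool = False) -> Any:
    """Run fn as a Ray task with the fractional request (hypothetical wrapper)."""
    import ray  # imported lazily so the helper above works without Ray installed

    ray.init(ignore_reinit_error=True)
    remote_fn = ray.remote(fn).options(**fractional_task_options(has_gpu))
    return ray.get(remote_fn.remote(*args))


if __name__ == "__main__":
    print(fractional_task_options(has_gpu=False))
    print(fractional_task_options(has_gpu=True))
```

With a request this small, the scheduler should in principle be able to place the task on any node that reports a CPU (and a GPU, when requested), which is why the held task looks suspicious.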
Versions / Dependencies
Reproduction script
Copied from user shared script here. Requires runhouse latest, langchain latest, transformers, and torch.
Issue Severity
None