[core] Better error message for task/actors when unschedulable (integrate with autoscaler) #15933

Closed
richardliaw opened this issue May 20, 2021 — with Slack · 4 comments · Fixed by #18724
Labels
enhancement Request for new feature and/or capability P1 Issue that should be fixed within a few weeks size:medium usability

Comments

Contributor

richardliaw commented May 20, 2021

The problem

This error message is neither actionable nor usable:

2021-05-20 01:48:33,485	WARNING worker.py:1115 -- The actor or task with ID fffffffffffffffffc714cea33cb71acd82c19fb01000000 cannot be scheduled right now. It requires {CPU_group_7e13d05021cb9c6d4284227951fdd56a: 1.000000} for placement, but this node only has remaining {3.000000/4.000000 CPU, 11.200000 GiB/11.200000 GiB memory, 0.000000/1.000000 GPU, 5.000000 GiB/5.000000 GiB object_store_memory, 0.000000/1.000000 CPU_group_7e13d05021cb9c6d4284227951fdd56a, 0.000000/1.000000 CPU_group_1_7e13d05021cb9c6d4284227951fdd56a, 0.000000/1.000000 GPU_group_1_7e13d05021cb9c6d4284227951fdd56a, 0.000000/1.000000 GPU_group_7e13d05021cb9c6d4284227951fdd56a, 1.000000/1.000000 accelerator_type:T4, 1.000000/1.000000 node:172.31.57.235}

A few improvements are needed:

  • We should show the specific task/actor in the error message. This includes
    • Knowing whether it is a task or actor
    • Providing the class or function name
    • The call site where the task or actor was launched (nice to have)
  • The autoscaler should drive printing of this message. There are two cases:
    • If the autoscaler is actively scaling up to handle the task, we should suppress the message (unless it is taking longer than a certain time).
    • If the autoscaler is not able to scale up, we should print this accordingly from the autoscaler event logs.
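
The two cases above can be sketched roughly as follows. This is only an illustration of the proposed behavior, not Ray's actual implementation: `PendingWork`, `AutoscalerState`, `unschedulable_message`, and the 30-second grace period are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class AutoscalerState(Enum):
    """Hypothetical summary of what the autoscaler is doing about a request."""
    SCALING_UP = "scaling_up"      # nodes are being launched for this demand
    CANNOT_SCALE = "cannot_scale"  # no node type can satisfy it, or cluster is at max size


@dataclass
class PendingWork:
    """Hypothetical description of the task or actor that is waiting."""
    kind: str                        # "task" or "actor"
    name: str                        # function or class name
    resources: Dict[str, float]      # requested resources
    call_site: Optional[str] = None  # e.g. "train.py:12" (nice to have)


def unschedulable_message(
    work: PendingWork,
    state: AutoscalerState,
    waited_s: float,
    grace_s: float = 30.0,  # assumed threshold, not a real Ray setting
) -> Optional[str]:
    """Return the message to print, or None if it should be suppressed."""
    # Case 1: the autoscaler is working on it and we have not waited long yet.
    if state is AutoscalerState.SCALING_UP and waited_s < grace_s:
        return None

    where = f" (launched at {work.call_site})" if work.call_site else ""
    demand = ", ".join(f"{k}: {v:g}" for k, v in sorted(work.resources.items()))
    if state is AutoscalerState.SCALING_UP:
        reason = f"the autoscaler is still adding nodes after {waited_s:.0f}s"
    else:
        # Case 2: scaling up is not possible, so say so explicitly.
        reason = "the autoscaler cannot add nodes that satisfy this request"
    return (
        f"{work.kind.capitalize()} {work.name}{where} is pending: "
        f"it requires {{{demand}}}, and {reason}."
    )
```

With this shape, the message names the task or actor and its call site, and the decision to print at all is driven by the autoscaler state rather than by the individual node.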
@rkooo567 rkooo567 added this to the Core Bugs milestone May 20, 2021
@rkooo567 rkooo567 added fix-error-msg This issue has a bad error message that should be improved. triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 20, 2021
@ericl ericl added usability P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 1, 2021
@ericl ericl added size:large and removed fix-error-msg This issue has a bad error message that should be improved. labels Jul 2, 2021
@ericl ericl changed the title [core] Better error message for task placement [core] Better error message for task/actors when unschedulable (integrate with autoscaler) Jul 2, 2021
@ericl ericl added the enhancement Request for new feature and/or capability label Jul 2, 2021
@DLWCMD

DLWCMD commented Jul 13, 2021

I am commenting here as the related issues seem to have been closed. If this should be on another issue, please let me know.

I have received the same blocking message as reported in 13905: "The actor or task cannot be scheduled right now." My use case is associated with Population Based Training replays: [https://docs.ray.io/en/master/tune/tutorials/tune-advanced-tutorial.html#replaying-a-pbt-run].

In my case, the pbt policy text file specifies 6 workers. As reported by another user, if I manually change that value to 1, the replay runs as expected, albeit slowly since I am using only one worker. Further, the replay will run if local mode is activated.

FYI, I have experienced the same issue with loading and running checkpoints not associated with PBT. A better error message would be useful, particularly if it described how to resolve the problem without giving up the benefits of multiple workers.

I have code and sample PBT policy files that I would be happy to share if you feel they would be useful.

Thanks.

David Wilt

@rkooo567
Contributor

rkooo567 commented Aug 9, 2021

@DLWCMD thanks a lot! After handling the first half of the TODO, I will ping you if the error message is useful for your case.

@DLWCMD

DLWCMD commented Aug 9, 2021 via email

@rkooo567
Contributor

#15962 => This will also be done as part of this work.

4 participants