Integrate with autoscaler to improve error message: The actor or task with ID [] is pending and cannot currently be scheduled #8326

Closed

J1810Z opened this issue May 5, 2020 · 50 comments

Assignee: ericl
Labels: enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks)
Milestone: Core Backlog

Comments

@J1810Z

J1810Z commented May 5, 2020

After setting up a new conda environment with ray, I am running into the issue that ray complains about insufficient resources. Sometimes the actors still start after about 30s, but most of the time my Python program gets stuck at this point.

I am initializing ray from within my Python script, which runs on a node scheduled by Slurm. Access to CPUs and GPUs is limited via cgroups. psutil.Process().cpu_affinity() reports the correct number of available cores, which is more than ray needs. Interestingly, I didn't run into this issue in my previous conda environment.

The error message does not help much:
2020-05-05 17:17:32,657 WARNING worker.py:1072 -- The actor or task with ID ffffffffffffffff45b95b1c0100 is pending and cannot currently be scheduled. It requires {object_store_memory: 0.048828 GiB} for execution and {CPU: 1.000000}, {object_store_memory: 0.048828 GiB} for placement, but this node only has remaining {node:192.168.7.50: 1.000000}, {CPU: 28.000000}, {memory: 30.029297 GiB}, {GPU: 1.000000}, {object_store_memory: 10.351562 GiB}. In total there are 0 pending tasks and 3 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
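
A minimal workaround sketch, assuming the affinity-derived core count mentioned above is the number Ray should actually use; it simply passes that count to ray.init explicitly instead of relying on Ray's own detection (illustrative only, not the fix discussed later in this thread):

import psutil
import ray

# Sketch: use the cgroup/affinity-limited core count rather than letting
# Ray detect the host's full CPU count.
available_cpus = len(psutil.Process().cpu_affinity())
ray.init(num_cpus=available_cpus)
print(ray.cluster_resources())  # should now report the affinity-limited CPU count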

@rkooo567
Contributor

rkooo567 commented May 5, 2020

@edoakes @wuisawesome I also see this warning when there are plenty of resources. Could this be a scheduler issue?

@rkooo567
Contributor

rkooo567 commented May 7, 2020

It turns out this error occurs as a warning of a resource deadlock. Here is a related issue: #5790.

@GoingMyWay

GoingMyWay commented May 10, 2020

@J1810Z @rkooo567 Same issue in Ray 0.8.4: the task got stuck and stayed pending, although there are enough CPUs and memory. There are 2 rollout workers, each sampling a sample_batch of 512.

Ray version: 0.8.4
OS: Linux
num_workers: 2
batch_sample_size/rollout_fragment_length: 512
CPUs in machine: 41

WARNING worker.py:1072 -- The actor or task with ID ffffffffffffffff45b95b1c0100 is pending and cannot currently be scheduled. It requires {CPU: 20.000000} for execution and {CPU: 20.000000} for placement, but this node only has remaining {node:100.102.34.3: 1.000000}, {CPU: 56.000000}, {memory: 148.095703 GiB}, {GPU: 8.000000}, {object_store_memory: 46.533203 GiB}. In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

I do not know why it requires {CPU: 20.000000} for execution and {CPU: 20.000000} for placement; it looks confusing.

@ericl
Contributor

ericl commented May 10, 2020

Is it possible to reproduce this with a dummy program that can be shared?

@GoingMyWay

GoingMyWay commented May 11, 2020

> Is it possible to reproduce this with a dummy program that can be shared?

Hi, I will send you the code later. I run my code with Docker, specifying CPU=41 and Memory=120GB on a machine with CPU=128 or more and Memory=256GB or more. But it returned

The actor or task with ID ffffffffffffffff45b95b1c0100 is pending and cannot currently be scheduled. It requires {CPU: 20.000000} for execution and {CPU: 20.000000} for placement, but this node only has remaining {node:9.146.140.151: 1.000000}, {CPU: 180.000000}, {memory: 271.191406 GiB}, {object_store_memory: 82.958984 GiB}. In total there are 0 pending tasks and 4 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

As you can see, Ray detected 180 CPUs and 271 GB of memory, but I only assigned 41 CPUs and 120 GB of memory to my task. So, do you think Ray detects the whole machine's resources even when running in Docker with specified resource limits?
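
A sketch of one way to sidestep the over-detection described above: pass the container's intended limits to ray.init explicitly. The numbers below are the ones from this report and are only illustrative, not recommendations:

import ray

# Sketch: give Ray the container's actual allotment instead of letting it
# detect the host machine's resources.
ray.init(
    num_cpus=41,                        # CPUs assigned to the container
    object_store_memory=20 * 1024**3,   # cap the object store at 20 GiB
)
print(ray.cluster_resources())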

@pitoupitou

pitoupitou commented May 15, 2020

Hi, I have got the same error and it's difficult to understand since I have lots of resources.

Running my code with Docker as well, on a machine that should have enough capacity (CPU=16 and 4 Tesla V100 GPUs, with Memory=244GB).
I tried different versions of Ray, but they all trigger the same error.

Interestingly, there is no error when I set the number of Ray workers to 1, but with any number >1 the error keeps appearing.

Notes :

  • Every worker is defined with:
    @ray.remote(num_cpus=1, num_gpus=1 if torch.cuda.is_available() else 0)

  • Error shows:
    WARNING worker.py:1090 -- The actor or task with ID ffffffffffffffffcd8f56890100 is pending and cannot currently be scheduled. It requires {CPU: 1.000000}, {GPU: 1.000000} for execution and {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {node:172.31.69.55: 1.000000}, {CPU: 10.000000}, {memory: 210.498047 GiB}, {object_store_memory: 12.841797 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

  • "debug_state.txt" shows :
    NodeManager: InitialConfigResources: {GPU,4.000000},{CPU,16.000000}
    ClusterResources: de21c14dccdf508d03d9119b8b7acb99afa89a00:
    - total: {GPU,4.000000},{CPU,16.000000}
    - avail: {GPU,0.000000},{CPU,10.000000}

How can I send you my code?

@ericl
Contributor

ericl commented May 15, 2020

Can you trim down your code to a minimal script that can be pasted here?

@GoingMyWay

@pitoupitou Hi, what task are you working on?

@ericl ericl added the P0 (Issues that should be fixed in short order) label and removed the P1 (Issue that should be fixed within a few weeks) label May 15, 2020
@ericl
Contributor

ericl commented May 15, 2020

Moving to P0 since this has come up anecdotally several times now. Unfortunately it seems difficult to reproduce :/

@GoingMyWay

GoingMyWay commented May 17, 2020

I think the main reason for this issue is that, when running a task with Docker on a machine that has many CPUs and lots of memory but few of them left for the current task, Ray detects the whole machine's resources even though many of them are not accessible; as a consequence, the task stays pending (in offline training, Ray can still read the data until OOM). Another thing: my task only took 40GB of memory, but if I assigned a tight memory limit to the Docker container (using Ray's default memory-related settings), say 45GB, OOM occurred. When I assigned an 80GB memory limit to the container, only about 45% was used, no OOM occurred, and the task finished successfully. My concern is that memory management is not very efficient in the current version of Ray (0.8.4).
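
A sketch of the kind of guard this comment suggests: read the container's cgroup memory limit and size Ray's object store well below it. The cgroup v1 path and the 30% fraction are assumptions for illustration, not Ray defaults:

import ray

def cgroup_memory_limit_bytes():
    # cgroup v1 location of the container memory limit (e.g. from --memory=80g)
    with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
        return int(f.read())

limit = cgroup_memory_limit_bytes()
ray.init(object_store_memory=int(limit * 0.3))  # leave headroom for worker heaps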

@rkooo567
Contributor

@ericl What's our plan to resolve this issue? We will have a new release in about 3 weeks. Will anyone handle this issue?

@ericl ericl added the needs-repro-script (Issue needs a runnable script to be reproduced) label May 18, 2020
@ericl
Contributor

ericl commented May 18, 2020

I'm not sure we can do a lot without a reproduction case that can be run locally or in the public cloud. The reports above are not very clear.

@ericl ericl added the P1 (Issue that should be fixed within a few weeks) label and removed the P0 (Issues that should be fixed in short order) label May 18, 2020
@GoingMyWay

> I'm not sure we can do a lot without a reproduction case that can be run locally or in the public cloud. The reports above are not very clear.

Hi, I sent you the code before. I think you can use it to reproduce this issue; it occurs very often in the cloud.

@ericl
Contributor

ericl commented May 19, 2020 via email

@rkooo567
Contributor

rkooo567 commented May 19, 2020

The problem seems to be that Ray doesn't detect cgroup CPU limits. I found a related issue: benoitc/gunicorn#2028

(We detect the CPU count from multiprocessing.cpu_count() — the code path behind the if num_cpus is None: check — and memory by reading the cgroup memory info.)

To verify this, we need to be able to reproduce the issue. As Eric said, it would be very helpful if you could provide a small reproducible script + environment.
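
A minimal sketch of the mismatch described here, assuming a cgroup v1 layout (the paths and fallback are illustrative, not Ray's actual detection code): it compares what multiprocessing.cpu_count() reports with the container's CPU quota, which is roughly what a Docker --cpus limit sets.

import multiprocessing

def cgroup_cpu_limit():
    # cgroup v1 CPU quota files; returns None if no quota is set
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
    except FileNotFoundError:
        return None
    if quota > 0 and period > 0:
        return quota / period  # e.g. 41.0 for --cpus=41
    return None

print("multiprocessing.cpu_count():", multiprocessing.cpu_count())
print("cgroup CPU quota:", cgroup_cpu_limit())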

@ericl
Contributor

ericl commented May 19, 2020 via email

@J1810Z
Author

J1810Z commented May 19, 2020

Since the problem occurs in my case while running ray on a slurm cluster, I am not able to reproduce it on a single machine. My minimal setup would still require at least two VMs (one control and one compute node).

Interestingly, the issue only occurs if ray is installed in a user-specific conda environment. If I install ray in the conda environment of the root user, I don't experience any issues.

@J1810Z
Author

J1810Z commented May 20, 2020

After further experimentation, I think that I figured out what is happening.

On initialization, ray starts a worker process for each identified core. If an actor is instantiated immediately after initialization, these worker processes are not yet ready, and ray attempts to create a new worker process. As a result, the number of worker processes exceeds the number of cores. If the worker to be started has specific CPU requirements (num_cpus=1), this results in the resource error.

Adding a time delay between the initialization of ray and the actor instantiation resolves the issue. In that case, existing idle worker processes are used for the new actor.

Since this problem seems to be unrelated to slurm or docker, I am trying to build a minimal example that can be shared.
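
A minimal sketch of the sleep workaround described above, with assumed values (a 12-core machine and a 10-second delay after ray.init); this is not an official way to wait for worker startup:

import time
import ray

ray.init(num_cpus=12)
time.sleep(10)  # give the pre-started worker processes time to come up

@ray.remote(num_cpus=1)
class DelayedActor:
    def ping(self):
        return "ok"

actors = [DelayedActor.remote() for _ in range(11)]
print(ray.get([a.ping.remote() for a in actors]))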

@ericl ericl self-assigned this May 22, 2020
@ericl
Contributor

ericl commented May 23, 2020

@J1810Z any hints at how to create a reproduction? I've tried a couple simple things with starting actors quickly while workers are tied up executing tasks, but no luck so far.

@J1810Z
Author

J1810Z commented May 23, 2020

Sorry, it took me a little bit longer to reproduce this issue. It seems to be more specific to my setup than I thought: I am running ray within a conda environment on an NFS share.

If I set the NFS lookupcache mount option to none, the described issue occurs: ray creates additional worker processes instead of using the ones started upon initialization. The issue does not occur if lookupcache is set to its default value.

Interestingly, I have not yet been able to reliably reproduce the error message from above. On some systems, ray just creates the additional processes and starts without any issues. On other systems, the warning message occurs.

It would be interesting to know whether these additional processes are also created in the docker case or if both issues are completely unrelated.

With my NFS setup, the problem can be reproduced with the following minimal example:

import ray
import time

ray.init()

@ray.remote(num_cpus=1)
class TestActor:
    def __init__(self):
        time.sleep(60)

actor_list = []
for i in range(11):  # I am running this example on a 12 core system
    actor_list.append(TestActor.remote())

time.sleep(180)

@ericl
Contributor

ericl commented May 24, 2020 via email

@ericl
Contributor

ericl commented Jun 5, 2020

@J1810Z in the example you had, does the warning have any adverse effect? I'm able to reproduce the warning by injecting a sleep(10) into the __init__ method of worker.py::Worker, but the following script still eventually runs and finishes.

import ray
import time

ray.init(num_cpus=12)

@ray.remote(num_cpus=1)
class TestActor:
    def __init__(self):
        time.sleep(60)

    def f(self):
        pass

actor_list = []
for i in range(11):  # I am running this example on a 12 core system
    actor_list.append(TestActor.remote())

ray.get([a.f.remote() for a in actor_list])
print("OK")

This PR should make it so we warn less frequently in case of slow worker start: https://github.com/ray-project/ray/pull/8810/files

@J1810Z
Author

J1810Z commented Jun 7, 2020

With the warning, startup is significantly slowed down. However, the behavior is quite inconsistent: sometimes it takes around 30s until the script starts, and sometimes I just canceled the run because nothing happened for several minutes. With the sleep-time workaround, overall startup is much faster.

@ericl
Contributor

ericl commented Jun 8, 2020

Hmm, I wonder if there is some other underlying issue with starting workers that is just manifesting this warning as a symptom here. @J1810Z can you reproduce, and when you encounter a "real hang", grab the output of /tmp/ray/session_latest/logs/debug_state.txt? It should include a bunch of stats including the worker pool size:

WorkerPool:
- num PYTHON workers: 8
- num PYTHON drivers: 1

This will help narrow down whether it's an issue starting workers or in the scheduler. It would be also great to have an entire zip of the session logs directory if possible.
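
For anyone following along, a small sketch of collecting what is being asked for here; the paths follow the comment above, and the parsing is only illustrative:

import pathlib
import shutil

logs = pathlib.Path("/tmp/ray/session_latest/logs")
text = (logs / "debug_state.txt").read_text()
start = text.find("WorkerPool:")
print(text[start:start + 400])  # show the WorkerPool stats

shutil.make_archive("ray_session_logs", "zip", logs)  # -> ray_session_logs.zip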

@J1810Z
Author

J1810Z commented Jun 18, 2020

Sorry for the late reply! Yes, I am trying to do that over this weekend.

@Schweini-PEK

Reporting the same issue from a Berkeley researcher. The code used to run well on Savio until June, without updating any packages. The Ray version was 0.8.2, and I have also tried 0.8.5. It still works on my MacBook, though. Adding time.sleep() works for me so far.

@ericl
Contributor

ericl commented Jun 19, 2020

@Schweini-PEK could you try grabbing the log data mentioned two comments up too?

@annaluo676
Contributor

Got the same error with Amazon SageMaker. I was able to reproduce the issue consistently with the homogeneous scaling part in this notebook example.

Docker image: custom docker built on top of ray 0.8.2 (Dockerfile)
instance type: ml.p2.8xlarge
instance count: 2

With num_gpus=15, this is what I got:

Resources requested: 61/64 CPUs, 15/16 GPUs, 0.0/899.46 GiB heap, 0.0/25.68 GiB objects
...
2020-06-25 03:19:40,072#011WARNING worker.py:1058 -- The actor or task with ID ffffffffffffffff45b95b1c0100 is infeasible and cannot currently be scheduled. 
It requires {GPU: 15.000000}, {CPU: 1.000000} for execution and {GPU: 15.000000}, {CPU: 1.000000} for placement, however there are no nodes in the cluster that can provide the requested resources. 
To resolve this issue, consider reducing the resource requests of this task or add nodes that can fit the task.

The experiments worked fine with num_gpus <= 8.

Adding a time.sleep() doesn't solve the issue.

@GoingMyWay

> (quoting annaluo676's report above)

Hi, Ms Luo, have you tried ray 0.9 dev?

@annaluo676
Contributor

annaluo676 commented Jun 25, 2020

> Hi, Ms Luo, have you tried ray 0.9 dev?

Unfortunately, I was not able to; I have to reproduce it with SageMaker using the public sagemaker-ray image.

p.s. I cannot reproduce this with CPU resources. For example, with multiple CPU instances, setting num_workers = total_cpus - 1 works fine, with 1 CPU left for the scheduler. In all experiments there is only 1 trial.

@ericl
Contributor

ericl commented Jun 25, 2020 via email

@annaluo676
Contributor

There are two p2.8xlarge instances (instance count: 2), leading to 16 GPUs in total.
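
For context, a p2.8xlarge has 8 GPUs, so even though the two-node cluster has 16 GPUs in total, no single node can satisfy one actor requesting num_gpus=15; a Ray actor's or task's resource request has to fit on one node. A minimal sketch of splitting the request so each piece fits on an 8-GPU node (hypothetical actor, not the notebook's code):

import ray

ray.init(address="auto")  # connect to the existing 2-node cluster

@ray.remote(num_gpus=8)
class GpuWorker:
    def ping(self):
        return "ok"

workers = [GpuWorker.remote() for _ in range(2)]  # one actor per p2.8xlarge node
print(ray.get([w.ping.remote() for w in workers]))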

@ericl
Contributor

ericl commented Jun 25, 2020 via email

@ericl
Contributor

ericl commented Jul 13, 2020

Closing this issue as it has become confused with other problems. Please open a new bug if a reproduction is possible.

@ericl ericl closed this as completed Jul 13, 2020
@ericl ericl removed the needs-repro-script (Issue needs a runnable script to be reproduced) label Jul 1, 2021
@ericl ericl changed the title The actor or task with ID [] is pending and cannot currently be scheduled Improve error message: The actor or task with ID [] is pending and cannot currently be scheduled Jul 1, 2021
@ericl ericl added the enhancement (Request for new feature and/or capability) label Jul 1, 2021
@ericl ericl added this to the Core Backlog milestone Jul 1, 2021
@ericl ericl reopened this Jul 1, 2021
@ericl ericl changed the title Improve error message: The actor or task with ID [] is pending and cannot currently be scheduled Integrate with autoscaler to improve error message: The actor or task with ID [] is pending and cannot currently be scheduled Jul 1, 2021
@ericl ericl added the usability label Jul 1, 2021
@ericl
Contributor

ericl commented Jul 1, 2021

Duplicates #8326

@ericl ericl closed this as completed Jul 1, 2021
@wuisawesome
Contributor

Wait @ericl did you intentionally close this as a duplicate of itself?

@ericl
Contributor

ericl commented Jul 1, 2021

Sorry, it duplicates #15933

@ericl ericl removed the usability label Jul 8, 2021