
[tune] The actor or task cannot be scheduled right now #13905

Closed
BBDrive opened this issue Feb 4, 2021 · 15 comments
Labels
enhancement Request for new feature and/or capability

Comments


BBDrive commented Feb 4, 2021

I have enough resources, but I still get this warning:

The actor or task with ID 124a2b0fc855a8f8ffffffff01000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {accelerator_type:P5000: 1.000000}, {node:172.31.226.37: 1.000000}, {memory: 71.142578 GiB}, {object_store_memory: 23.779297 GiB}, {GPU: 0.250000}. In total there are 7 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

How should I deal with this problem?
Thanks.

BBDrive added the enhancement label on Feb 4, 2021
richardliaw commented:

Try using reuse_actors=True?
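
For reference, this flag goes directly into the tune.run() call (a minimal sketch; my_trainable stands in for the user's own trainable):

from ray import tune

# reuse_actors=True keeps trial actors alive and reuses them for new trials
# instead of tearing them down and scheduling a fresh actor every time.
tune.run(
    my_trainable,                                  # placeholder for the user's trainable
    num_samples=1000,
    resources_per_trial={"cpu": 10, "gpu": 0.25},
    reuse_actors=True,
)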


BBDrive commented Feb 4, 2021

Unfortunately this does not solve the problem.

Memory usage on this node: 14.6/125.6 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: None
Resources requested: 10/80 CPUs, 0.25/2 GPUs, 0.0/71.14 GiB heap, 0.0/23.78 GiB objects (0/1.0 accelerator_type:P5000)
Result logdir: /home/user/kaylenproject/discrete_continuous/ray_results/DEFAULT_2021-02-04_17-14-02
Number of trials: 1/1000 (1 RUNNING)
+------------------+----------+-------+----------+------------------+------------+----------------+-------+
| Trial name       | status   | loc   |    gamma |   minibatch_size |    penalty |   penalty_dist | pyg   |
|------------------+----------+-------+----------+------------------+------------+----------------+-------|
| DEFAULT_52a587a4 | RUNNING  |       | 0.995058 |              256 | 0.00456211 |             30 | mypyg |
+------------------+----------+-------+----------+------------------+------------+----------------+-------+
2021-02-04 17:14:10,769	WARNING worker.py:1034 -- The actor or task with ID 0feb9ab18c7d53d4ffffffff01000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {accelerator_type:P5000: 1.000000}, {node:172.31.226.37: 1.000000}, {memory: 71.142578 GiB}, {object_store_memory: 23.779297 GiB}. In total there are 8 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.


krfricke commented Feb 4, 2021

Can you give us a bit more context? What version of Ray are you using, and can you share parts of the code you're using to run your training? The call to tune.run() would be interesting to see.


BBDrive commented Feb 4, 2021

I tried Ray 1.1.0 and 2.0.0; both have this problem.
My code is as follows.

from functools import partial

import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest.hyperopt import HyperOptSearch

# main() is my own RL training function; args are its parsed CLI arguments.
ray.init(num_gpus=2, num_cpus=80)
config = {'minibatch_size': tune.choice([128, 256]),
          'gamma': tune.loguniform(0.995, 0.9999),
          'penalty': tune.loguniform(0.001, 0.1),
          'pyg': tune.choice(['pyg', 'mypyg']),
          'penalty_dist': tune.choice([10, 20, 30])
          }
tune.run(
    partial(main, args=args),
    search_alg=HyperOptSearch(),
    scheduler=AsyncHyperBandScheduler(),
    local_dir="~/kaylenproject/discrete_continuous/ray_results",
    num_samples=1000,
    config=config,
    metric='safe_v',
    mode='max',
    resources_per_trial={"gpu": 0.25, "cpu": 10},
)

The main function is a reinforcement learning algorithm I wrote myself.


BBDrive commented Feb 4, 2021

I reduced the number of GPUs per trial and did not specify the number of CPUs per trial. It works:

resources_per_trial={"gpu": 0.1}


krfricke commented Feb 4, 2021

So is your problem solved by this? Just for completeness' sake, how many CPUs are actually on your machine?


BBDrive commented Feb 4, 2021

80 CPUs.


krfricke commented Feb 4, 2021

I see. So usually this should work. Another question would be if you're scheduling other remote Ray jobs in your trainable (main), as that would occupy cluster resources that then can't be used by other trials. Is this something you're doing?


BBDrive commented Feb 4, 2021

Yeah, there are other remote Ray jobs in main.
Aren't the 10 CPUs defined in resources_per_trial provided for the remote jobs in the main function to use?


krfricke commented Feb 4, 2021

Not exactly: the 10 CPUs are reserved just for the main function of the trainable. If this main function requests additional resources itself, you need to use the extra_* keys.

E.g.:

resources_per_trial={
    "cpu": 1,
    "extra_cpu": 9,
    "extra_gpu": 0.25
}

This would reserve a total of 10 CPUs and 0.25 GPUs per trial. The main function will be allocated 1 CPU, and 9 CPUs and 0.25 GPUs are left for the remote tasks and actors the main function schedules itself.

See also here: https://docs.ray.io/en/latest/tune/tutorials/overview.html#how-do-i-set-resources

Please note that in the future we will deprecate support for the extra_* arguments in favor of placement groups. This will take another couple of weeks though, so you should be safe to use them as is.
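
A placement-group-based version of the same request might look roughly like this (a sketch, assuming the PlacementGroupFactory API from newer Ray releases; the first bundle is reserved for the trainable itself, later bundles for whatever it schedules):

from ray.tune.utils.placement_groups import PlacementGroupFactory

# Roughly equivalent to {"cpu": 1, "extra_cpu": 9, "extra_gpu": 0.25}:
# bundle 0 -> the trainable's main function, bundle 1 -> its own tasks/actors.
resources_per_trial = PlacementGroupFactory([
    {"CPU": 1},
    {"CPU": 9, "GPU": 0.25},
])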


krfricke commented Feb 4, 2021

That said, how you allocate the resources depends on your main function, which I don't have insight into. If you're starting a number of remote CPU workers, these resources need to be included in the extra_cpu key. If the GPU learner is a remote actor, the request should be done with extra_gpu, but if the learner lives in the main function, gpu should be used instead.
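
A minimal sketch of that split (placeholder names; assuming the rollout workers are remote CPU actors and the GPU learner stays in the main function):

import ray

@ray.remote(num_cpus=1)
class RolloutWorker:
    """Placeholder for the user's remote CPU workers."""
    def sample(self):
        ...

def main(config, args=None):
    # Runs inside the trial. The "cpu"/"gpu" amounts from resources_per_trial
    # are reserved for this function itself (e.g. the learner on the GPU).
    workers = [RolloutWorker.remote() for _ in range(9)]  # needs extra_cpu: 9
    ...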


BBDrive commented Feb 4, 2021

I get it.
Thank you very much.


krfricke commented Feb 4, 2021

You're welcome! Please feel free to re-open if any issues remain.

krfricke closed this as completed on Feb 4, 2021

mirekphd commented Sep 3, 2022

I think Ray needs a method to clear the queue of unwanted actors, both in the API and in the Dashboard. I can open a new issue if you like, but the reason is the same as in this issue: the actor queue stays filled after model training, and you cannot make a prediction with the trained model because all CPU resources (none of which are actually in use) remain locked. In most cases this lasts until the client Python kernel is restarted, and sometimes the Ray head and worker servers have to be restarted as well. This includes a case similar to the one reported by the OP, encountered while training and scoring ML models with lightgbm_ray/xgboost_ray.


mirekphd commented Sep 3, 2022

Notice that lowercase "cpu" and "gpu" keys do not affect the CPU/GPU settings at all. See what happens if you pass a lowercase "cpu" key in the dict given to the resources_per_trial argument while at the same time setting the 'num_cpus' argument to 1:

(scheduler +1h14m52s) Error: No available node types can fulfill resource request {'cpu': 1.0, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.

Ray expects all-caps keys ("CPU" and "GPU") in the dict passed to resources_per_trial, but if you pass them, it rejects them in favor of the 'num_cpus' and 'num_gpus' args:
ValueError: Use the 'num_cpus' and 'num_gpus' keyword instead of 'CPU' and 'GPU' in 'resources' keyword

So the workaround quoted below may not work at all (as in my case), and we need a well-understood method to release locked or otherwise unwanted resources (actors etc.), something like gc.collect():

resources_per_trial={"gpu": 0.1}
