
[tune] The actor or task cannot be scheduled right now #13905

Closed
BBDrive opened this issue Feb 4, 2021 · 15 comments
Labels
enhancement Request for new feature and/or capability

Comments


BBDrive commented Feb 4, 2021

I have enough resources, but I still get this warning:

The actor or task with ID 124a2b0fc855a8f8ffffffff01000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {accelerator_type:P5000: 1.000000}, {node:172.31.226.37: 1.000000}, {memory: 71.142578 GiB}, {object_store_memory: 23.779297 GiB}, {GPU: 0.250000}. In total there are 7 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

How should I deal with this problem?
Thanks.

BBDrive added the enhancement label on Feb 4, 2021
richardliaw commented:

Try using reuse_actors=True?
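
For reference, this flag goes directly into the tune.run() call (a minimal sketch; my_trainable stands in for the user's own trainable):

from ray import tune

# reuse_actors=True keeps trial actors alive and reuses them for new trials
# instead of tearing them down and scheduling a fresh actor every time.
tune.run(
    my_trainable,                                  # placeholder for the user's trainable
    num_samples=1000,
    resources_per_trial={"cpu": 10, "gpu": 0.25},
    reuse_actors=True,
)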


BBDrive commented Feb 4, 2021

Unfortunately this does not solve the problem.

Memory usage on this node: 14.6/125.6 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: None
Resources requested: 10/80 CPUs, 0.25/2 GPUs, 0.0/71.14 GiB heap, 0.0/23.78 GiB objects (0/1.0 accelerator_type:P5000)
Result logdir: /home/user/kaylenproject/discrete_continuous/ray_results/DEFAULT_2021-02-04_17-14-02
Number of trials: 1/1000 (1 RUNNING)
+------------------+----------+-------+----------+------------------+------------+----------------+-------+
| Trial name       | status   | loc   |    gamma |   minibatch_size |    penalty |   penalty_dist | pyg   |
|------------------+----------+-------+----------+------------------+------------+----------------+-------|
| DEFAULT_52a587a4 | RUNNING  |       | 0.995058 |              256 | 0.00456211 |             30 | mypyg |
+------------------+----------+-------+----------+------------------+------------+----------------+-------+
2021-02-04 17:14:10,769	WARNING worker.py:1034 -- The actor or task with ID 0feb9ab18c7d53d4ffffffff01000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {accelerator_type:P5000: 1.000000}, {node:172.31.226.37: 1.000000}, {memory: 71.142578 GiB}, {object_store_memory: 23.779297 GiB}. In total there are 8 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.


krfricke commented Feb 4, 2021

Can you give us a bit more context? What version of Ray are you using, and can you share parts of the code you're using to run your training? The call to tune.run() would be interesting to see.


BBDrive commented Feb 4, 2021

I tried Ray 1.1.0 and 2.0.0; both have this problem.
My code is as follows.

from functools import partial

import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest.hyperopt import HyperOptSearch

# main() is my own RL training function; args are its parsed CLI arguments.
ray.init(num_gpus=2, num_cpus=80)
config = {'minibatch_size': tune.choice([128, 256]),
          'gamma': tune.loguniform(0.995, 0.9999),
          'penalty': tune.loguniform(0.001, 0.1),
          'pyg': tune.choice(['pyg', 'mypyg']),
          'penalty_dist': tune.choice([10, 20, 30])
          }
tune.run(
    partial(main, args=args),
    search_alg=HyperOptSearch(),
    scheduler=AsyncHyperBandScheduler(),
    local_dir="~/kaylenproject/discrete_continuous/ray_results",
    num_samples=1000,
    config=config,
    metric='safe_v',
    mode='max',
    resources_per_trial={"gpu": 0.25, "cpu": 10},
)

The main function is a reinforcement learning algorithm I wrote myself.


BBDrive commented Feb 4, 2021

I reduced the number of GPUs per trial and did not specify the number of CPUs per trial. It works:

resources_per_trial={"gpu": 0.1}


krfricke commented Feb 4, 2021

So is your problem solved by this? Just for completeness' sake, how many CPUs are actually on your machine?


BBDrive commented Feb 4, 2021

80 CPUs.


krfricke commented Feb 4, 2021

I see. So usually this should work. Another question would be if you're scheduling other remote Ray jobs in your trainable (main), as that would occupy cluster resources that then can't be used by other trials. Is this something you're doing?


BBDrive commented Feb 4, 2021

Yeah, there are other remote Ray jobs in main.
Aren't the 10 CPUs defined in resources_per_trial provided for the remote jobs in the main function to use?


krfricke commented Feb 4, 2021

Not exactly: the 10 CPUs are reserved just for the main function of the trainable. If this main function requests additional resources itself, you need to use the extra_* keys.

E.g.:

resources_per_trial={
    "cpu": 1,
    "extra_cpu": 9,
    "extra_gpu": 0.25
}

This would reserve a total of 10 CPUs and 0.25 GPUs per trial. The main function will be allocated 1 CPU, and 9 CPUs and 0.25 GPUs are left for the remote tasks and actors the main function schedules itself.

See also here: https://docs.ray.io/en/latest/tune/tutorials/overview.html#how-do-i-set-resources

Please note that in the future we will deprecate support for the extra_* arguments in favor of placement groups. This will take another couple of weeks though, so you should be safe to use them as is.
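
A placement-group-based version of the same request might look roughly like this (a sketch, assuming the PlacementGroupFactory API from newer Ray releases; the first bundle is reserved for the trainable itself, later bundles for whatever it schedules):

from ray.tune.utils.placement_groups import PlacementGroupFactory

# Roughly equivalent to {"cpu": 1, "extra_cpu": 9, "extra_gpu": 0.25}:
# bundle 0 -> the trainable's main function, bundle 1 -> its own tasks/actors.
resources_per_trial = PlacementGroupFactory([
    {"CPU": 1},
    {"CPU": 9, "GPU": 0.25},
])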


krfricke commented Feb 4, 2021

That said, how you allocate the resources depends on your main function, which I don't have insight into. If you're starting a number of remote CPU workers, these resources need to be included in the extra_cpu key. If the GPU learner is a remote actor, the request should be done with extra_gpu, but if the learner lives in the main function, gpu should be used instead.
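
A minimal sketch of that split (placeholder names; assuming the rollout workers are remote CPU actors and the GPU learner stays in the main function):

import ray

@ray.remote(num_cpus=1)
class RolloutWorker:
    """Placeholder for the user's remote CPU workers."""
    def sample(self):
        ...

def main(config, args=None):
    # Runs inside the trial. The "cpu"/"gpu" amounts from resources_per_trial
    # are reserved for this function itself (e.g. the learner on the GPU).
    workers = [RolloutWorker.remote() for _ in range(9)]  # needs extra_cpu: 9
    ...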


BBDrive commented Feb 4, 2021

I get it.
Thank you very much.


krfricke commented Feb 4, 2021

You're welcome! Please feel free to re-open if any issues remain.

krfricke closed this as completed on Feb 4, 2021

mirekphd commented Sep 3, 2022

I think Ray needs a method to clear the queue of unwanted actors, both in the API and in the Dashboard. I can open a new issue if you like, but the reason is the same as in this issue: the actor queue stays filled after model training, and you cannot make a prediction with the trained model because all CPU resources (none of which are actually in use) remain locked. In most cases this lasts until the client Python kernel is restarted, and sometimes the Ray head and worker servers have to be restarted as well. This includes a case similar to the one reported by the OP, encountered while training and scoring ML models with lightgbm_ray/xgboost_ray.


mirekphd commented Sep 3, 2022

Notice that lowercase "cpu" and "gpu" keys do not affect the CPU/GPU settings at all. See what happens if you pass a lowercase "cpu" key in the dict given to the resources_per_trial argument while at the same time setting the 'num_cpus' argument to 1:

(scheduler +1h14m52s) Error: No available node types can fulfill resource request {'cpu': 1.0, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.

Ray expects all-caps keys ("CPU" and "GPU") in the dict passed to resources_per_trial, but if you pass them, it rejects them in favor of the 'num_cpus' and 'num_gpus' args:
ValueError: Use the 'num_cpus' and 'num_gpus' keyword instead of 'CPU' and 'GPU' in 'resources' keyword

So the workaround quoted below may not work at all (as in my case), and we need a well-understood method to release locked or otherwise unwanted resources (actors etc.), something like gc.collect():

resources_per_trial={"gpu": 0.1}
