
[Core] It is not allowed to specify both num_cpus and num_gpus for map tasks #33908

Open
v4if opened this issue Mar 30, 2023 · 9 comments

Labels: data (Ray Data-related issues), enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks)

v4if commented Mar 30, 2023

What happened + What you expected to happen

It is not allowed to specify both num_cpus and num_gpus for map tasks. When only num_gpus is specified, num_cpus appears to default to 1, so actors stay pending due to insufficient CPU resources. However, GPU compute is often the performance bottleneck of the pipeline. How can actor concurrency be increased while GPU resources are still available?
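
For context, a minimal sketch of the call that gets rejected (adapted from the reproduction script below; the values are illustrative, and the exact exception type is not shown in this issue, only the error message in the title):

import ray


class Identity:
    def __call__(self, batch):
        return batch


ds = ray.data.range_table(100)
# Combining both resource arguments in one map_batches call fails with
# "It is not allowed to specify both num_cpus and num_gpus for map tasks".
ds = ds.map_batches(
    Identity,
    num_cpus=0.5,
    num_gpus=0.01,
    compute="actors",
)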

ray status

 {'CPU': 1.0, 'GPU': 0.01}: 4+ pending tasks/actors

run log

Resource usage vs limits: 16.0/16.0 CPU, 0.2/1.0 GPU, 0.0 MiB/13.49 GiB object_store_memory 0:   0%|                                    | 0/1 [14:11<?, ?it/s]
ReadRange: 16 active, 8598 queued 1:  14%|██████████▊                                                                   | 1386/10000 [14:11<01:06, 130.05it/s]
MapBatches(ModelPredict): 30 active, 0 queued, 16 actors (4 pending) [0 locality hits, 1386 misses] 2:  14%|█▍         | 1356/10000 [14:25<1:08:53,  2.09it/s]
output: 0 queued 3:  14%|████████████▋                                                                                 | 1356/10000 [14:25<1:08:56,  2.09it/s]

Versions / Dependencies

ray, version 3.0.0.dev0

cluster_resources

{'memory': 256000000000.0, 'node:172.18.0.196': 1.0, 'object_store_memory': 57921323827.0, 'GPU': 1.0, 'accelerator_type:T4': 1.0, 'node:172.16.1.16': 1.0, 'CPU': 16.0}

Reproduction script

import ray
import time


class ModelPredict:
    def __call__(self, df):
        # Stand-in for GPU inference.
        time.sleep(10)
        return df


ds = ray.data.range_table(10000, parallelism=10000)
ds = ds.map_batches(
    ModelPredict,
    # num_cpus=0.5,  # uncommenting this together with num_gpus triggers the error in the title
    num_gpus=0.01,
    compute="actors",
    batch_size=1,
)
for batch in ds.iterator().iter_batches(batch_size=1):
    ...

Issue Severity

High: It blocks me from completing my task.

v4if added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage: priority, bug/not-bug, and owning component) labels on Mar 30, 2023
hora-anyscale added the core (Issues that should be addressed in Ray Core) label on Mar 31, 2023
hora-anyscale changed the title from "[Datasets] It is not allowed to specify both num_cpus and num_gpus for map tasks" to "[Core] It is not allowed to specify both num_cpus and num_gpus for map tasks" on Mar 31, 2023
hora-anyscale added the P1 (Issue that should be fixed within a few weeks) label and removed the triage label on Mar 31, 2023
clarng (Contributor) commented Mar 31, 2023

This seems to be a Ray Data issue, since the resources are specified through the Datasets API, which uses Ray Core internally.

clarng added the data (Ray Data-related issues) label and removed the core (Issues that should be addressed in Ray Core) label on Mar 31, 2023
choiikkyu commented Apr 28, 2023

Same issue for me. Did you solve it?

@msminhas93

Any update on this @hora-anyscale @clarng?

@hora-anyscale (Contributor)

cc: @xieus

anyscalesam added the enhancement (Request for new feature and/or capability) label and removed the bug (Something that is supposed to be working; but isn't) label on Nov 8, 2023
sdcope3 commented Jan 12, 2024

Any progress on this issue? Is the implication that if num_gpus is defined, the associated task is constrained to 1 CPU?

@seastar105

@raulchen Any progress on this issue? Or is there an alternative way to map workers to a fractional GPU plus several CPUs?

@danickzhu

@raulchen GPU utilization is bottlenecked by num_cpus (currently 1) for the mapper task. Do you have any suggestions?

Superskyyy (Contributor) commented Jul 5, 2024

I believe this is intentional behavior to avoid deadlocks, but there could be workarounds. I'm planning to look into it in July.
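
While this is unresolved, a minimal sketch of one possible direction (not an endorsed workaround; the pool size and resource fractions below are hypothetical) is to run the GPU stage through Ray Core actors directly, since ray.remote does allow requesting both a CPU and a GPU fraction on one actor:

import ray

# Unlike map_batches at the time of this issue, ray.remote accepts both
# a CPU and a GPU fraction for a single actor.
@ray.remote(num_cpus=0.5, num_gpus=0.01)
class Predictor:
    def predict(self, batch):
        # Placeholder for real GPU inference.
        return batch

actors = [Predictor.remote() for _ in range(8)]  # hypothetical pool size
futures = [actors[i % len(actors)].predict.remote({"x": i}) for i in range(32)]
results = ray.get(futures)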

pzdkn commented Aug 13, 2024

Is it possible to use placement_groups here?

I tried:

from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

pg = ray.util.placement_group([{"CPU": 1}, {"GPU": 1}] * num_workers, strategy="PACK")
predictions = ds_val.map_batches(predictor_cls, scheduling_strategy=PlacementGroupSchedulingStrategy(pg, placement_group_capture_child_tasks=True))

It seems, however, that the resources are not available to the actor.
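
One variant that might be worth trying (a sketch only, not something confirmed in this thread; num_workers, ds_val, and predictor_cls are the names from the snippet above, and the 0.01 GPU fraction is taken from the reproduction script) is to put the CPU and the GPU fraction into the same bundle, so that a single actor's request can fit inside one bundle, and to wait for the group to be ready before mapping:

import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

# One bundle per worker, with the CPU and the GPU fraction together.
pg = ray.util.placement_group([{"CPU": 1, "GPU": 0.01}] * num_workers, strategy="PACK")
ray.get(pg.ready())  # wait until the bundles are reserved

predictions = ds_val.map_batches(
    predictor_cls,
    num_gpus=0.01,  # per-actor GPU fraction, matching the bundle above
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        pg, placement_group_capture_child_tasks=True
    ),
)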
