
[Ray data] [stable diffusion batch inference] CPU resources in cluster cannot be fully utilized when running stable diffusion batch inference task. #44556

Open
mct2611 opened this issue Apr 8, 2024 · 0 comments
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues P2 Important issue, but not time-critical

Comments


mct2611 commented Apr 8, 2024

What happened + What you expected to happen

Hi, I want to use my cluster's CPU resources to run the stable diffusion inference demo; I do not have GPUs. I assumed that, through the Ray framework, CPUs could also be used to execute inference tasks.

I set up a Ray cluster from two WSL instances: WSL A has 12 CPUs and serves as the head node, and WSL B has 12 CPUs and serves as the worker node. Running the 'ray status' command shows:
======== Autoscaler status: 2024-03-22 00:59:14.244899 ========
Node status
Active:
1 node_88349db0fa0ccd3086db2f5a4c79ab9a527acb4aca4c023cb8120c8b
1 node_5cb133607c13b47fa48631b86114996f49a7ced083a5bcbeafbc20b8
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources
Usage:
0.0/24.0 CPU
0B/43.54GiB memory
0B/21.04GiB object_store_memory

Demands:
(no resource demands)
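For context, a two-node cluster like this is typically brought up with the Ray CLI as sketched below; <head-node-ip> is a placeholder for WSL A's address.

```shell
# On WSL A (head node): start Ray and expose the GCS port.
ray start --head --port=6379

# On WSL B (worker node): join the cluster via the head node's address.
ray start --address='<head-node-ip>:6379'

# Verify that both nodes and all 24 CPUs are registered.
ray status
```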

Then I ran the stable diffusion batch inference demo with the pipe and device parameters set to 'cpu', as the script below shows, and set num_cpus=16. I expected the Ray cluster to use 16 of the 24 CPUs to run the task. However, it raised this error:

(autoscaler +6s) Error: No available node types can fulfill resource request {'CPU': 16.0}. Add suitable node types to this cluster to resolve this issue.

It only works when I set num_cpus <= 12 (WSL A's total CPU count), and then only one of the two nodes executes the task.

The documentation says num_cpus is the number of CPUs to reserve for each parallel map worker, and concurrency is the number of Ray workers to use concurrently. So I tried concurrency=2 and num_cpus=8, thinking 2*8=16 CPUs might work. However, the error occurred again during inference.
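The first error above is consistent with how Ray reserves resources: num_cpus is reserved per map worker on a single node, so a 16-CPU request cannot fit on either 12-CPU node even though the cluster holds 24 CPUs in total. A toy sketch of that per-node constraint (can_schedule is a hypothetical illustration, not a Ray API):

```python
def can_schedule(node_cpus, num_cpus, concurrency):
    """Greedy bin-packing sketch: try to place `concurrency` actors,
    each reserving `num_cpus` CPUs on one node, onto nodes whose free
    CPU counts are given by `node_cpus`."""
    free = list(node_cpus)
    for _ in range(concurrency):
        for i, f in enumerate(free):
            if f >= num_cpus:
                free[i] -= num_cpus
                break
        else:
            # No single node has num_cpus free CPUs left.
            return False
    return True

# Two 12-CPU nodes: one actor needing 16 CPUs fits on no single node,
# even though the cluster has 24 CPUs in total.
print(can_schedule([12, 12], num_cpus=16, concurrency=1))  # False

# Two actors needing 8 CPUs each can be placed (at most one and a half
# per node), so this request is satisfiable in principle.
print(can_schedule([12, 12], num_cpus=8, concurrency=2))   # True
```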

So my question is: how can I make full use of the cluster's CPU resources to execute a single inference task?

Versions / Dependencies

ray 2.9.3
Python 3.10.12
WSL2

Reproduction script

model_id = "stabilityai/stable-diffusion-2-1"
prompt = "a photo of an astronaut riding a horse on mars"

import ray
import ray.data
import pandas as pd

ds = ray.data.from_pandas(pd.DataFrame([prompt], columns=['prompt']))

class PredictCallable:
    def __init__(self, model_id: str, revision: str = None):
        from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
        import torch

        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id, torch_dtype=torch.float
        )
        self.pipe.scheduler = DPMSolverMultistepScheduler.from_config(
            self.pipe.scheduler.config
        )
        self.pipe = self.pipe.to("cpu")

    def __call__(self, batch: pd.DataFrame) -> dict:
        import torch
        import numpy as np

        # Set a different seed for every image in the batch
        self.pipe.generator = [
            torch.Generator(device="cpu").manual_seed(i) for i in range(len(batch))
        ]
        images = self.pipe(list(batch["prompt"])).images
        return {"images": np.array(images, dtype=object)}

preds = ds.map_batches(
    PredictCallable,
    fn_constructor_kwargs=dict(model_id=model_id),
    concurrency=1,
    num_cpus=16,
    batch_size=1,
    batch_format='pandas',
)

results = preds.take_all()

Issue Severity

High: It blocks me from completing my task.

@mct2611 mct2611 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 8, 2024
@anyscalesam anyscalesam added the data Ray Data-related issues label May 13, 2024
@c21 c21 added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 15, 2024