
[Ray data] [stable diffusion batch inference] CPU resources in cluster cannot be fully utilized when running stable diffusion batch inference task. #44556

Open
mct2611 opened this issue Apr 8, 2024 · 0 comments
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues P2 Important issue, but not time-critical

Comments


mct2611 commented Apr 8, 2024

What happened + What you expected to happen

Hi, I want to use my cluster's CPU resources to run the stable diffusion inference demo; I do not have GPUs. I assumed that, through the Ray framework, CPUs could also be used to execute inference tasks.

I set up a Ray cluster from two WSL instances: WSL A has 12 CPUs and serves as the head node, and WSL B has 12 CPUs and serves as the worker node. Running the 'ray status' command shows:
======== Autoscaler status: 2024-03-22 00:59:14.244899 ========
Node status
Active:
1 node_88349db0fa0ccd3086db2f5a4c79ab9a527acb4aca4c023cb8120c8b
1 node_5cb133607c13b47fa48631b86114996f49a7ced083a5bcbeafbc20b8
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources
Usage:
0.0/24.0 CPU
0B/43.54GiB memory
0B/21.04GiB object_store_memory

Demands:
(no resource demands)
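For context, a two-node cluster like this is typically brought up with the Ray CLI as sketched below; <head-node-ip> is a placeholder for WSL A's address.

```shell
# On WSL A (head node): start Ray and expose the GCS port.
ray start --head --port=6379

# On WSL B (worker node): join the cluster via the head node's address.
ray start --address='<head-node-ip>:6379'

# Verify that both nodes and all 24 CPUs are registered.
ray status
```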

Then I ran the stable diffusion batch inference demo with the pipe and device parameters set to 'cpu', as the script below shows, and set num_cpus=16. I expected the Ray cluster to use 16 of the 24 CPUs to run the task. However, it raised this error:

(autoscaler +6s) Error: No available node types can fulfill resource request {'CPU': 16.0}. Add suitable node types to this cluster to resolve this issue.

It only works when I set num_cpus <= 12 (WSL A's total CPU count), and then only one of the two nodes executes the task.

The documentation says num_cpus is the number of CPUs to reserve for each parallel map worker, and concurrency is the number of Ray workers to use concurrently. So I tried concurrency=2 and num_cpus=8, thinking 2*8=16 CPUs might work. However, the error occurred again during inference.
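The first error above is consistent with how Ray reserves resources: num_cpus is reserved per map worker on a single node, so a 16-CPU request cannot fit on either 12-CPU node even though the cluster holds 24 CPUs in total. A toy sketch of that per-node constraint (can_schedule is a hypothetical illustration, not a Ray API):

```python
def can_schedule(node_cpus, num_cpus, concurrency):
    """Greedy bin-packing sketch: try to place `concurrency` actors,
    each reserving `num_cpus` CPUs on one node, onto nodes whose free
    CPU counts are given by `node_cpus`."""
    free = list(node_cpus)
    for _ in range(concurrency):
        for i, f in enumerate(free):
            if f >= num_cpus:
                free[i] -= num_cpus
                break
        else:
            # No single node has num_cpus free CPUs left.
            return False
    return True

# Two 12-CPU nodes: one actor needing 16 CPUs fits on no single node,
# even though the cluster has 24 CPUs in total.
print(can_schedule([12, 12], num_cpus=16, concurrency=1))  # False

# Two actors needing 8 CPUs each can be placed (at most one and a half
# per node), so this request is satisfiable in principle.
print(can_schedule([12, 12], num_cpus=8, concurrency=2))   # True
```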

So my question is: how can I make full use of the cluster's CPU resources to execute a single inference task?

Versions / Dependencies

ray 2.9.3
Python 3.10.12
WSL2

Reproduction script

model_id = "stabilityai/stable-diffusion-2-1"
prompt = "a photo of an astronaut riding a horse on mars"

import ray
import ray.data
import pandas as pd

ds = ray.data.from_pandas(pd.DataFrame([prompt], columns=['prompt']))

class PredictCallable:
    def __init__(self, model_id: str, revision: str = None):
        from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
        import torch

        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id, torch_dtype=torch.float
        )
        self.pipe.scheduler = DPMSolverMultistepScheduler.from_config(
            self.pipe.scheduler.config
        )
        self.pipe = self.pipe.to("cpu")

    def __call__(self, batch: pd.DataFrame) -> dict:
        import torch
        import numpy as np

        # Set a different seed for every image in the batch
        self.pipe.generator = [
            torch.Generator(device="cpu").manual_seed(i) for i in range(len(batch))
        ]
        images = self.pipe(list(batch["prompt"])).images
        return {"images": np.array(images, dtype=object)}

preds = ds.map_batches(
    PredictCallable,
    fn_constructor_kwargs=dict(model_id=model_id),
    concurrency=1,
    num_cpus=16,
    batch_size=1,
    batch_format='pandas',
)

results = preds.take_all()

Issue Severity

High: It blocks me from completing my task.

@mct2611 mct2611 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 8, 2024
@anyscalesam anyscalesam added the data Ray Data-related issues label May 13, 2024
@c21 c21 added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 15, 2024