[serve][llm] add cpu support to ray serve#58334
kouroshHakha merged 16 commits into ray-project:master
Conversation
|
@eicherseiji please help review |
Code Review
This pull request aims to add CPU support for vLLM in Ray Serve by changing the default device from GPU to CPU when no accelerator is specified. The change to the use_gpu method correctly implements this. However, there is a critical logic flaw in get_initialization_kwargs that incorrectly configures the distributed_executor_backend for CPU execution, which will lead to issues. I've provided a detailed comment and a suggested fix for this. Additionally, the warning message for CPU installation could be more informative.
if not self.accelerator_type:
    # By default, GPU resources are used
    return True
# Use cpu if gpu not provided or none provided
I don’t think we want to change the default here. We can either add a “NONE” accelerator type, or add “use_cpu” to the LLMConfig and pipe it through.
I’m leaning toward adding a NONE accelerator type and checking for it here.
I can do that, but if the accelerator_type is not specified in LLMConfig, the default value is None. Wouldn't it still be the same logic?
@eicherseiji quick question: the function name is "use_gpu"; does it make sense to return True even when we don't have a GPU on the cluster?
self.accelerator_type isn't necessarily an indicator of what hardware is on the cluster, it's more of a request for a certain hardware type. TBH I'm not the biggest fan of it, I think it's better replaced by placement_group_config.
But in the meantime, if accelerator_type is left unset, we want to preserve the behavior that GPU is assumed. CPU support is not included in the basic vLLM install, or even prebuilt wheel, so I think this is reasonable.
So, @srinarayan-srikanthan I think we should add an explicit use_gpu field to LLMConfig
By default it should be set to None, and can be passed down to VLLMEngineConfig to be used to override this property.
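A rough sketch of that proposal, using stand-in class names since the exact wiring into LLMConfig and VLLMEngineConfig is an assumption rather than the merged code:
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLMConfigSketch:  # stand-in for LLMConfig, illustration only
    # Hypothetical top-level switch; None preserves today's "assume GPU" default.
    use_gpu: Optional[bool] = None

@dataclass
class VLLMEngineConfigSketch:  # stand-in for VLLMEngineConfig, illustration only
    accelerator_type: Optional[str] = None
    use_gpu_override: Optional[bool] = None  # piped down from LLMConfigSketch.use_gpu

    @property
    def use_gpu(self) -> bool:
        if self.use_gpu_override is not None:
            return self.use_gpu_override
        # An unset flag keeps the existing behavior of assuming GPU.
        return True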
@eicherseiji, I think that is reasonable. CPU support will be included in the basic vLLM install very soon; in the meantime, do you want me to go ahead with adding a use_gpu field to the LLMConfig?
@srinarayan-srikanthan yes, let's add use_gpu in the meantime.
Should I take the use_gpu route or add CPU as an accelerator type?
@kouroshHakha IMO exposing use_gpu to LLMConfig as a top-level switch will be cleaner than overloading the accelerator_type field/enum, because "CPU" accelerator type really means accelerator type None.
For CPU-only mode we also need to flip distributed_executor_backend to mp and edit counts for placement bundles. For these we will check accelerator_type == "CPU", which will be effectively equivalent to use_gpu=False while introducing an orthogonal semantic to accelerator_type.
@srinarayan-srikanthan let's see what @kouroshHakha has to say and go from there. Thanks again for the PR.
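For illustration, the two CPU-mode adjustments mentioned above (flipping distributed_executor_backend to mp and changing the placement-bundle resources) might look roughly like this sketch; the function names and bundle structure are illustrative, not the merged code:
def cpu_mode_engine_kwargs(engine_kwargs: dict) -> dict:
    # vLLM's default executor assumes GPUs; CPU-only serving falls back to the multiprocessing backend.
    return {**engine_kwargs, "distributed_executor_backend": "mp"}

def placement_bundles(use_gpu: bool, num_workers: int) -> list:
    # GPU mode requests one GPU per worker bundle; CPU-only mode requests CPUs instead.
    resource = "GPU" if use_gpu else "CPU"
    return [{resource: 1} for _ in range(num_workers)]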
@eicherseiji Okay, I have made the change by adding use_cpu; once that's set to true and the accelerator type is none, it will fall back to mp. Can we reopen this PR? I am unable to do so.
|
Thanks for the PR! Would you be open to including some docs (basically just record the steps to test) so folks know how to use the feature? |
|
@eicherseiji Sure, I can add some documents. Do you have a suggestion on where to add the documents? I tested using the sample here: https://github.com/ray-project/ray/blob/ray-2.50.1/doc/source/ray-overview/examples/e2e-rag/notebooks/03_Deploy_LLM_with_Ray_Serve.ipynb |
louie-tsai
left a comment
looks good to me. thanks for the effort to enable CPU for Ray Serve
nrghosh
left a comment
Thanks @srinarayan-srikanthan, +1 to Seiji's last comment
If you run ./ci/lint/lint.sh pre_commit code_format, we can unblock the unit tests and get them running. (see: https://buildkite.com/ray-project/microcheck/builds/30264/steps/table)
Thanks for your contribution
kouroshHakha
left a comment
I think the most straightforward and backward compatible way to support CPU is to add a new accelerator_type="CPU" to the list and treat it as special.
Changing accelerator_type=None from meaning GPU to non-GPU is a backward-compatibility break that is not desired.
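For illustration, the accelerator_type="CPU" alternative might look like the following sketch (the special value and helper are hypothetical; the PR ultimately went with a use_cpu field instead):
from typing import Optional

CPU_ACCELERATOR = "CPU"  # hypothetical special accelerator_type value

def use_gpu(accelerator_type: Optional[str]) -> bool:
    # None keeps its existing meaning (assume GPU), so existing configs are unaffected.
    return accelerator_type != CPU_ACCELERATOR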
|
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
|
This pull request has been automatically closed because there has been no further activity in the last 14 days. Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for your contribution! |
Signed-off-by: root <root@ip-10-0-11-206.ec2.internal>
|
@eicherseiji and @kouroshHakha please help review and merge changes |
nrghosh
left a comment
defer to @eicherseiji on style, having a top level param makes sense to me vs. overriding/adding to accelerator_type when it's not an accelerator.
Thanks for the contribution @srinarayan-srikanthan
|
@nrghosh, agree, that's why I took the llm_config route, thanks ;) |
kouroshHakha
left a comment
@srinarayan-srikanthan thanks for contribution. Some small comments that need addressing before merging:
if self.use_cpu is True:
    return False
if self.use_cpu is False:
    return True
better way of writing this:
if isinstance(self.use_cpu, bool): return not self.use_cpu
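In context, the whole property might then read something like the sketch below (a stand-in dataclass, assuming an unset flag keeps the existing default of GPU):
from dataclasses import dataclass
from typing import Optional

@dataclass
class EngineConfigSketch:  # illustrative stand-in for VLLMEngineConfig
    use_cpu: Optional[bool] = None

    @property
    def use_gpu(self) -> bool:
        # An explicit use_cpu flag wins when set; otherwise keep the pre-existing default of assuming GPU.
        if isinstance(self.use_cpu, bool):
            return not self.use_cpu
        return True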
I agree, will modify this
if not self.use_gpu:
    # For CPU mode, always use "mp" backend
    engine_kwargs["distributed_executor_backend"] = "mp"
    logger.warning("install vllm package for cpu to ensure seamless execution")
shouldn't we check for this and raise an error if there is mismatch from expectation?
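(Such a check might look roughly like the sketch below; the helper is hypothetical and not part of the PR.)
import importlib.util

def ensure_vllm_available_for_cpu() -> None:
    # Hypothetical helper: fail fast if CPU mode is requested but vLLM is not even importable.
    # (A real check would also need to verify that the installed vLLM build supports CPU.)
    if importlib.util.find_spec("vllm") is None:
        raise RuntimeError(
            "CPU mode requested but vLLM is not installed; "
            "install a CPU-capable vLLM build before serving."
        )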
The vLLM CPU pip package was not available earlier, but it was merged this weekend, so vLLM wheels will support CPU. This is not needed; the ray package should handle this when installing vLLM. We can create a separate PR for that if ray does not handle it. Comments? @kouroshHakha
Will this be in 0.13.0 then? I think we can merge this and then upgrade vllm in a follow-up PR, which is happening here: #59440
This is the PR that got merged, vllm-project/vllm#28848. I need to verify, but given it got merged, it should be.
Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>
|
Btw @srinarayan-srikanthan have you been able to test this? |
|
@eicherseiji yes, I was able to validate the changes and was able to serve models on CPU |
|
Thanks @srinarayan-srikanthan. Overall looking good on my end as well. Last thing for me: can you add some basic tests validating the new config behavior? No need to start an engine for these tests. |
… over accelerator_type Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>
|
@eicherseiji I have added two basic checks, please check if this looks good. |
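Tests along these lines might look roughly like the sketch below; the VLLMEngineConfig.from_llm_config constructor name and the no-argument get_initialization_kwargs call are assumptions based on the discussion, and the actual checks added in the PR may differ:
from ray.serve.llm import LLMConfig
from ray.llm._internal.serve.engines.vllm.vllm_models import VLLMEngineConfig

def _config(**kwargs) -> LLMConfig:
    return LLMConfig(
        model_loading_config={
            "model_id": "qwen-0.5b",
            "model_source": "Qwen/Qwen2.5-0.5B-Instruct",
        },
        **kwargs,
    )

def test_use_cpu_disables_gpu_and_forces_mp_backend():
    # from_llm_config is an assumed constructor name.
    engine_config = VLLMEngineConfig.from_llm_config(_config(use_cpu=True))
    assert engine_config.use_gpu is False
    assert engine_config.get_initialization_kwargs()["distributed_executor_backend"] == "mp"

def test_default_still_assumes_gpu():
    engine_config = VLLMEngineConfig.from_llm_config(_config())
    assert engine_config.use_gpu is True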
Bug: Incomplete GPU list causes wrong CPU mode for B200
The use_gpu property checks against a hardcoded list of GPU types that is missing NVIDIA_B200 (defined in accelerators.py) and other valid GPU accelerators like AMD and Intel GPUs. Before this PR, use_gpu wasn't used to select the distributed executor backend or placement bundles. Now, when a user sets accelerator_type="B200", the property returns False because B200 isn't in the list, causing the system to incorrectly configure CPU mode (distributed_executor_backend="mp" and CPU-only bundles) instead of GPU mode. This is a regression for users with unlisted GPU accelerators.
(See python/ray/llm/_internal/serve/engines/vllm/vllm_models.py, lines 298-314 and lines 134-148, at commit 0206cc4.)
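For illustration, the failure mode described above boils down to a config like the following sketch:
from ray.serve.llm import LLMConfig

# Reproduction of the scenario described above: B200 is a valid GPU accelerator,
# but a hardcoded GPU list that omits it makes the engine config fall back to CPU mode.
llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",
    },
    accelerator_type="B200",
)
# With the hardcoded list, the derived engine config reports use_gpu == False and
# would configure distributed_executor_backend="mp" plus CPU-only bundles.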
eicherseiji
left a comment
lgtm, thanks. @kouroshHakha in the future, we can consider removing the internal use_gpu flag:
|
@srinarayan-srikanthan PR is failing lint. I'd fix it this way:
|
|
Thank you @eicherseiji, made the changes, should pass the test now. |
Bug: Accelerator hints added to bundles in CPU mode
When use_cpu=True is set but accelerator_type is also specified (possibly by configuration mistake), the custom placement_group_config path still adds accelerator type hints to bundles. The default bundles path at lines 267-274 correctly checks use_gpu before adding accelerator hints, but this path at line 260 only checks if self.accelerator_type: without considering use_gpu. This inconsistency causes GPU accelerator hints to be added to placement bundles even in CPU mode when custom placement configs are used, potentially causing incorrect scheduling behavior.
(See python/ray/llm/_internal/serve/engines/vllm/vllm_models.py, lines 259-262, at commit d0e6b87.)
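A sketch of the consistent guard described above; the bundle structure and the accelerator resource-key convention are assumptions for illustration:
def add_accelerator_hints(bundles, accelerator_type, use_gpu):
    # Only attach accelerator hints when we are actually in GPU mode, mirroring the
    # check that the default-bundles path already performs.
    if accelerator_type and use_gpu:
        for bundle in bundles:
            # Ray expresses accelerator requests as "accelerator_type:<NAME>" custom resources
            # (the 0.001 fractional amount is an assumption for illustration).
            bundle.setdefault(f"accelerator_type:{accelerator_type}", 0.001)
    return bundles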
…roups Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>
|
@eicherseiji and @kouroshHakha anything else needed to merge this PR? |
kouroshHakha
left a comment
There was a problem hiding this comment.
Could we add some release tests on CPU-only machines @eicherseiji?
|
@kouroshHakha yeah, should I take those as a follow-up PR?
@kouroshHakha FYI the change works well with:
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",
    },
    engine_kwargs={"dtype": "float16"},
    use_cpu=True,
)
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

Install command: |
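For completeness, a deployment like this can then be queried with any OpenAI-compatible client; a minimal sketch, assuming Ray Serve's default HTTP address of http://localhost:8000 and the model_id configured above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="qwen-0.5b",  # matches model_id in the LLMConfig above
    messages=[{"role": "user", "content": "Hello from a CPU-only deployment!"}],
)
print(response.choices[0].message.content)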
Signed-off-by: root <root@ip-10-0-11-206.ec2.internal> Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com> Co-authored-by: root <root@ip-10-0-11-206.ec2.internal>

Description
Enable serving vLLM on CPU when the accelerator provided is none, or when there are explicitly only CPU workers in a cluster.
Related issues
"Fixes #53603 ","Related to #56636 ".
Additional information
Enabled the mp runtime for CPU, and set use_gpu to false by default when no accelerator is provided.
In addition to these changes, vLLM needs to be built for CPU instead of using the vLLM pip package, as shown in the vLLM installation page: https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html#intelamd-x86