[serve][llm] add cpu support to ray serve#58334
kouroshHakha merged 16 commits into ray-project:master
Conversation
|
@eicherseiji please help review |
Code Review
This pull request aims to add CPU support for vLLM in Ray Serve by changing the default device from GPU to CPU when no accelerator is specified. The change to the use_gpu method correctly implements this. However, there is a critical logic flaw in get_initialization_kwargs that incorrectly configures the distributed_executor_backend for CPU execution, which will lead to issues. I've provided a detailed comment and a suggested fix for this. Additionally, the warning message for CPU installation could be more informative.
if not self.accelerator_type:
    # By default, GPU resources are used
    return True
# Use cpu if gpu not provided or none provided
I don’t think we want to change the default here. We can either add a “NONE” accelerator type, or add “use_cpu” to the LLMConfig and pipe it through.
I’m leaning toward adding a NONE accelerator type and checking for it here.
I can do that, but if the accelerator_type is not specified in LLMConfig, the default value is None. Wouldn't it still be the same logic?
@eicherseiji quick question: the function name is "use_gpu"; does it make sense to return True even when we don't have a GPU on the cluster?
self.accelerator_type isn't necessarily an indicator of what hardware is on the cluster, it's more of a request for a certain hardware type. TBH I'm not the biggest fan of it, I think it's better replaced by placement_group_config.
But in the meantime, if accelerator_type is left unset, we want to preserve the behavior that GPU is assumed. CPU support is not included in the basic vLLM install, or even prebuilt wheel, so I think this is reasonable.
So, @srinarayan-srikanthan I think we should add an explicit use_gpu field to LLMConfig
By default it should be set to None, and can be passed down to VLLMEngineConfig to be used to override this property.
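A rough sketch of that proposal, using stand-in class names since the exact wiring into LLMConfig and VLLMEngineConfig is an assumption rather than the merged code:
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLMConfigSketch:  # stand-in for LLMConfig, illustration only
    # Hypothetical top-level switch; None preserves today's "assume GPU" default.
    use_gpu: Optional[bool] = None

@dataclass
class VLLMEngineConfigSketch:  # stand-in for VLLMEngineConfig, illustration only
    accelerator_type: Optional[str] = None
    use_gpu_override: Optional[bool] = None  # piped down from LLMConfigSketch.use_gpu

    @property
    def use_gpu(self) -> bool:
        if self.use_gpu_override is not None:
            return self.use_gpu_override
        # An unset flag keeps the existing behavior of assuming GPU.
        return True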
@eicherseiji, I think that is reasonable. CPU support will be included in the basic vLLM install very soon; in the meantime, do you want me to go ahead with adding a use_gpu field to the LLMConfig?
@srinarayan-srikanthan yes, let's add use_gpu in the meantime.
Should I take the use_gpu route or add CPU as an accelerator type?
@kouroshHakha IMO exposing use_gpu to LLMConfig as a top-level switch will be cleaner than overloading the accelerator_type field/enum, because "CPU" accelerator type really means accelerator type None.
For CPU-only mode we also need to flip distributed_executor_backend to mp and edit counts for placement bundles. For these we will check accelerator_type == "CPU", which will be effectively equivalent to use_gpu=False while introducing an orthogonal semantic to accelerator_type.
@srinarayan-srikanthan let's see what @kouroshHakha has to say and go from there. Thanks again for the PR.
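For illustration, the two CPU-mode adjustments mentioned above (flipping distributed_executor_backend to mp and changing the placement-bundle resources) might look roughly like this sketch; the function names and bundle structure are illustrative, not the merged code:
def cpu_mode_engine_kwargs(engine_kwargs: dict) -> dict:
    # vLLM's default executor assumes GPUs; CPU-only serving falls back to the multiprocessing backend.
    return {**engine_kwargs, "distributed_executor_backend": "mp"}

def placement_bundles(use_gpu: bool, num_workers: int) -> list:
    # GPU mode requests one GPU per worker bundle; CPU-only mode requests CPUs instead.
    resource = "GPU" if use_gpu else "CPU"
    return [{resource: 1} for _ in range(num_workers)]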
@eicherseiji Okay, I have made the change by adding use_cpu; once that's set to true and the accelerator type is none, it will fall back to mp. Can we reopen this PR? I am unable to do so.
|
Thanks for the PR! Would you be open to including some docs (basically just record the steps to test) so folks know how to use the feature? |
|
@eicherseiji Sure, I can add some documents. Do you have a suggestion on where to add the documents? I tested using the sample here: https://github.com/ray-project/ray/blob/ray-2.50.1/doc/source/ray-overview/examples/e2e-rag/notebooks/03_Deploy_LLM_with_Ray_Serve.ipynb |
louie-tsai
left a comment
looks good to me. thanks for the effort to enable CPU for Ray Serve
nrghosh
left a comment
Thanks @srinarayan-srikanthan, +1 to Seiji's last comment
If you run ./ci/lint/lint.sh pre_commit code_format, we can unblock the unit tests and get them running. (see: https://buildkite.com/ray-project/microcheck/builds/30264/steps/table)
Thanks for your contribution
kouroshHakha
left a comment
I think the most straightforward and backward compatible way to support CPU is to add a new accelerator_type="CPU" to the list and treat it as special.
Changing accelerator_type=None from meaning GPU to non-GPU is a backward-compatibility break that is not desired.
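For illustration, the accelerator_type="CPU" alternative might look like the following sketch (the special value and helper are hypothetical; the PR ultimately went with a use_cpu field instead):
from typing import Optional

CPU_ACCELERATOR = "CPU"  # hypothetical special accelerator_type value

def use_gpu(accelerator_type: Optional[str]) -> bool:
    # None keeps its existing meaning (assume GPU), so existing configs are unaffected.
    return accelerator_type != CPU_ACCELERATOR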
|
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
|
This pull request has been automatically closed because there has been no further activity in the last 14 days. Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for your contribution! |
Signed-off-by: root <root@ip-10-0-11-206.ec2.internal>
|
@eicherseiji and @kouroshHakha please help review and merge changes |
nrghosh
left a comment
defer to @eicherseiji on style, having a top level param makes sense to me vs. overriding/adding to accelerator_type when it's not an accelerator.
Thanks for the contribution @srinarayan-srikanthan
|
@nrghosh, agree, that's why I took the llm_config route, thanks ;) |
kouroshHakha
left a comment
@srinarayan-srikanthan thanks for contribution. Some small comments that need addressing before merging:
if self.use_cpu is True:
    return False
if self.use_cpu is False:
    return True
better way of writing this:
if isinstance(self.use_cpu, bool): return not self.use_cpu
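In context, the whole property might then read something like the sketch below (a stand-in dataclass, assuming an unset flag keeps the existing default of GPU):
from dataclasses import dataclass
from typing import Optional

@dataclass
class EngineConfigSketch:  # illustrative stand-in for VLLMEngineConfig
    use_cpu: Optional[bool] = None

    @property
    def use_gpu(self) -> bool:
        # An explicit use_cpu flag wins when set; otherwise keep the pre-existing default of assuming GPU.
        if isinstance(self.use_cpu, bool):
            return not self.use_cpu
        return True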
I agree, will modify this
if not self.use_gpu:
    # For CPU mode, always use "mp" backend
    engine_kwargs["distributed_executor_backend"] = "mp"
    logger.warning("install vllm package for cpu to ensure seamless execution")
shouldn't we check for this and raise an error if there is mismatch from expectation?
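(Such a check might look roughly like the sketch below; the helper is hypothetical and not part of the PR.)
import importlib.util

def ensure_vllm_available_for_cpu() -> None:
    # Hypothetical helper: fail fast if CPU mode is requested but vLLM is not even importable.
    # (A real check would also need to verify that the installed vLLM build supports CPU.)
    if importlib.util.find_spec("vllm") is None:
        raise RuntimeError(
            "CPU mode requested but vLLM is not installed; "
            "install a CPU-capable vLLM build before serving."
        )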
The vLLM CPU pip package was not available earlier, but it was merged this weekend, so vLLM wheels will support CPU. This is not needed; the ray package should handle this when installing vLLM. We can create a separate PR for that if ray does not handle it. Comments? @kouroshHakha
Will this be in 0.13.0 then? I think we can merge this and then upgrade vllm in a follow-up PR, which is happening here: #59440
This is the PR that got merged, vllm-project/vllm#28848. I need to verify, but given it got merged, it should be.
Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>
|
Btw @srinarayan-srikanthan have you been able to test this? |
|
@eicherseiji yes, I was able to validate the changes and was able to serve models on CPU |
|
Thanks @srinarayan-srikanthan. Overall looking good on my end as well. Last thing for me: can you add some basic tests validating the new config behavior? No need to start an engine for these tests. |
… over accelerator_type Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>
|
@eicherseiji I have added two basic checks, please check if this looks good. |
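Tests along these lines might look roughly like the sketch below; the VLLMEngineConfig.from_llm_config constructor name and the no-argument get_initialization_kwargs call are assumptions based on the discussion, and the actual checks added in the PR may differ:
from ray.serve.llm import LLMConfig
from ray.llm._internal.serve.engines.vllm.vllm_models import VLLMEngineConfig

def _config(**kwargs) -> LLMConfig:
    return LLMConfig(
        model_loading_config={
            "model_id": "qwen-0.5b",
            "model_source": "Qwen/Qwen2.5-0.5B-Instruct",
        },
        **kwargs,
    )

def test_use_cpu_disables_gpu_and_forces_mp_backend():
    # from_llm_config is an assumed constructor name.
    engine_config = VLLMEngineConfig.from_llm_config(_config(use_cpu=True))
    assert engine_config.use_gpu is False
    assert engine_config.get_initialization_kwargs()["distributed_executor_backend"] == "mp"

def test_default_still_assumes_gpu():
    engine_config = VLLMEngineConfig.from_llm_config(_config())
    assert engine_config.use_gpu is True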
Bug: Incomplete GPU list causes wrong CPU mode for B200
The use_gpu property checks against a hardcoded list of GPU types that is missing NVIDIA_B200 (defined in accelerators.py) and other valid GPU accelerators like AMD and Intel GPUs. Before this PR, use_gpu wasn't used to select the distributed executor backend or placement bundles. Now, when a user sets accelerator_type="B200", the property returns False because B200 isn't in the list, causing the system to incorrectly configure CPU mode (distributed_executor_backend="mp" and CPU-only bundles) instead of GPU mode. This is a regression for users with unlisted GPU accelerators.
(See python/ray/llm/_internal/serve/engines/vllm/vllm_models.py, lines 298-314 and lines 134-148, at commit 0206cc4.)
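For illustration, the failure mode described above boils down to a config like the following sketch:
from ray.serve.llm import LLMConfig

# Reproduction of the scenario described above: B200 is a valid GPU accelerator,
# but a hardcoded GPU list that omits it makes the engine config fall back to CPU mode.
llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",
    },
    accelerator_type="B200",
)
# With the hardcoded list, the derived engine config reports use_gpu == False and
# would configure distributed_executor_backend="mp" plus CPU-only bundles.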
eicherseiji
left a comment
lgtm, thanks. @kouroshHakha in the future, we can consider removing the internal use_gpu flag:
|
@srinarayan-srikanthan PR is failing lint. I'd fix it this way:
|
|
Thank you @eicherseiji, made the changes, should pass the test now. |
Bug: Accelerator hints added to bundles in CPU mode
When use_cpu=True is set but accelerator_type is also specified (possibly by configuration mistake), the custom placement_group_config path still adds accelerator type hints to bundles. The default bundles path at lines 267-274 correctly checks use_gpu before adding accelerator hints, but this path at line 260 only checks if self.accelerator_type: without considering use_gpu. This inconsistency causes GPU accelerator hints to be added to placement bundles even in CPU mode when custom placement configs are used, potentially causing incorrect scheduling behavior.
(See python/ray/llm/_internal/serve/engines/vllm/vllm_models.py, lines 259-262, at commit d0e6b87.)
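A sketch of the consistent guard described above; the bundle structure and the accelerator resource-key convention are assumptions for illustration:
def add_accelerator_hints(bundles, accelerator_type, use_gpu):
    # Only attach accelerator hints when we are actually in GPU mode, mirroring the
    # check that the default-bundles path already performs.
    if accelerator_type and use_gpu:
        for bundle in bundles:
            # Ray expresses accelerator requests as "accelerator_type:<NAME>" custom resources
            # (the 0.001 fractional amount is an assumption for illustration).
            bundle.setdefault(f"accelerator_type:{accelerator_type}", 0.001)
    return bundles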
…roups Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>
|
@eicherseiji and @kouroshHakha anything else needed to merge this PR? |
kouroshHakha
left a comment
There was a problem hiding this comment.
Could we add some release tests on CPU-only machines @eicherseiji?
|
@kouroshHakha yeah, should I take those as a follow-up PR?
@kouroshHakha FYI the change works well with:
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",
    },
    engine_kwargs={"dtype": "float16"},
    use_cpu=True,
)
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

Install command: |
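For completeness, a deployment like this can then be queried with any OpenAI-compatible client; a minimal sketch, assuming Ray Serve's default HTTP address of http://localhost:8000 and the model_id configured above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="qwen-0.5b",  # matches model_id in the LLMConfig above
    messages=[{"role": "user", "content": "Hello from a CPU-only deployment!"}],
)
print(response.choices[0].message.content)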
Signed-off-by: root <root@ip-10-0-11-206.ec2.internal> Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com> Co-authored-by: root <root@ip-10-0-11-206.ec2.internal>

Description
Enable serving vLLM on CPU when the accelerator provided is none, or when there are explicitly only CPU workers in a cluster.
Related issues
"Fixes #53603 ","Related to #56636 ".
Additional information
Enabled the mp runtime for CPU, and set use_gpu to false by default when no accelerator is provided.
In addition to these changes, vLLM needs to be built for CPU instead of using the vLLM pip package, as shown in the vLLM installation page: https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html#intelamd-x86