
Conversation

@johnnynunez (Contributor) commented Nov 18, 2025

Fixes the DGX Spark vLLM build issue.

Purpose

Continues #26844, which got no response from its owner.

cc @mgoin

@johnnynunez changed the title from "Guard SM100 CUTLASS MoE macro to SM100 builds v2" to "[NVIDIA] Guard SM100 CUTLASS MoE macro to SM100 builds v2" Nov 18, 2025
Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to fix build issues on DGX Spark by ensuring that SM100-specific CUTLASS MoE kernels are only built for SM100 architectures. The changes correctly remove SM120 architectures from some of the build configurations in CMakeLists.txt. While the changes are correct, the fix appears to be incomplete. I've identified other sections in CMakeLists.txt for SM100 kernels that still incorrectly include SM120 architectures. I've left a specific comment pointing to these locations. Applying the fix consistently across the file will prevent future build problems. Overall, this is a good step towards improving build correctness.
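
As a side note, a quick way to check which SM architecture a given device reports (and therefore whether the SM100-only kernels apply to it) is to query its compute capability; this is just a diagnostic sketch, not part of the change:

# Compute capability as reported by the driver
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# The same information from PyTorch, as a (major, minor) tuple for device 0
python -c "import torch; print(torch.cuda.get_device_capability())"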

johnnynunez and others added 4 commits November 18, 2025 12:54
Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Signed-off-by: Johnny <johnnynuca14@gmail.com>
@mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label Nov 18, 2025
@wrmedford (Contributor) left a comment

Looks good to me, and preserves functionality on sm110a across its rename.

@mgoin (Member) left a comment

Thank you!

github-project-automation bot moved this to In review in NVIDIA Nov 19, 2025
@vllm-bot merged commit 49ef847 into vllm-project:main Nov 19, 2025
87 of 89 checks passed
github-project-automation bot moved this from In review to Done in NVIDIA Nov 19, 2025
khluu pushed a commit that referenced this pull request Nov 19, 2025
Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Signed-off-by: Johnny <johnnynuca14@gmail.com>
(cherry picked from commit 49ef847)
Victor49152 pushed a commit to Victor49152/vllm that referenced this pull request Nov 20, 2025
…ct#28938)

Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Signed-off-by: Johnny <johnnynuca14@gmail.com>
@ericcurtin (Contributor)

I see we released a new wheel with this fix for DGX Spark. Should we expect the aarch64 wheel to be compatible with DGX Spark and run accelerated workloads soon?

https://github.com/vllm-project/vllm/releases/tag/v0.11.2

@ericcurtin (Contributor)

Coffee time:

https://buymeacoffee.com/johnnycano

bhagyashrigai pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Nov 20, 2025
…ct#28938)

Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Signed-off-by: Johnny <johnnynuca14@gmail.com>
Signed-off-by: Bhagyashri <Bhagyashri.Gaikwad2@ibm.com>
@johnnynunez (Contributor, Author)

> I see we released a new wheel with this fix for DGX Spark. Should we expect the aarch64 wheel to be compatible with DGX Spark and run accelerated workloads soon?
>
> https://github.com/vllm-project/vllm/releases/tag/v0.11.2

Yes, it is compatible.

@bbrowning (Contributor)

I happen to have a DGX Spark that I use daily for vLLM dev work anyway, so I tried the v0.11.2 release on it:

mkdir -p ~/tmp/vllm-v0.11.2
cd ~/tmp/vllm-v0.11.2
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install "vllm==v0.11.2" --torch-backend=auto

That all works fine, and pulls in CUDA 13 libs as expected:

...
 + nvidia-cublas==13.0.0.19                                                                                                                                                                                                                                                                     
 + nvidia-cuda-cupti==13.0.48
 + nvidia-cuda-nvrtc==13.0.48
 + nvidia-cuda-runtime==13.0.48
 + nvidia-cudnn-cu13==9.13.0.50
 + nvidia-cudnn-frontend==1.16.0
 + nvidia-cufft==12.0.0.15
 + nvidia-cufile==1.15.0.42
 + nvidia-curand==10.4.0.35
 + nvidia-cusolver==12.0.3.29
 + nvidia-cusparse==12.6.2.49
 + nvidia-cusparselt-cu13==0.8.0
 + nvidia-cutlass-dsl==4.2.1
 + nvidia-ml-py==13.580.82
 + nvidia-nccl-cu13==2.27.7
 + nvidia-nvjitlink==13.0.39
 + nvidia-nvshmem-cu13==3.3.24
 + nvidia-nvtx==13.0.39
 ...

But, when trying a simple test to serve openai/gpt-oss-20b (something I regularly do on vLLM builds from source), I get:

Traceback (most recent call last):
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/__init__.py", line 3, in <module>
    from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/benchmark/latency.py", line 5, in <module>
    from vllm.benchmarks.latency import add_cli_args, main                                                                                      
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/benchmarks/latency.py", line 17, in <module>              
    from vllm.engine.arg_utils import EngineArgs          
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 35, in <module>  
    from vllm.attention.backends.registry import AttentionBackendEnum
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/attention/__init__.py", line 4, in <module>
    from vllm.attention.backends.abstract import (                   
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/attention/backends/abstract.py", line 9, in <module>
    from vllm.model_executor.layers.linear import ColumnParallelLinear
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/model_executor/__init__.py", line 4, in <module>    
    from vllm.model_executor.parameter import BasevLLMParameter, PackedvLLMParameter
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/model_executor/parameter.py", line 11, in <module>
    from vllm.distributed import (                                                                                                              
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/distributed/__init__.py", line 4, in <module>     
    from .communication_op import *
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/distributed/communication_op.py", line 9, in <module>
    from .parallel_state import get_tp_group
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 250, in <module>
    direct_register_custom_op(              
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/utils/torch_utils.py", line 640, in direct_register_custom_op
    from vllm.platforms import current_platform
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/platforms/__init__.py", line 257, in __getattr__             
    _current_platform = resolve_obj_by_qualname(platform_cls_qualname)() 
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                          
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/utils/import_utils.py", line 89, in resolve_obj_by_qualname
    module = importlib.import_module(module_name)                     
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                               
  File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                 
  File "/home/bbrowning/tmp/vllm-v0.11.2/.venv/lib/python3.12/site-packages/vllm/platforms/cuda.py", line 16, in <module>
    import vllm._C  # noqa                                     
    ^^^^^^^^^^^^^^                                                                                                                              
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
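
The error suggests the prebuilt wheel links against the CUDA 12 runtime (libcudart.so.12), while this environment only has the CUDA 13 libraries listed above. A quick way to confirm that (the exact extension filename may differ between wheels):

# CUDA toolkit version the installed torch wheel was built against
python -c "import torch; print(torch.version.cuda)"

# Shared-library dependencies of the vLLM native extension
ldd .venv/lib/python3.12/site-packages/vllm/_C*.so | grep -i cuda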

Do I need to do something differently with my install command to get the released wheel working?

@ericcurtin (Contributor) commented Nov 20, 2025

@bbrowning I think this gets you past that:

uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

Then you end up with:

libtorch_cuda.so

missing...

@bbrowning (Contributor)

@ericcurtin That's one of the steps I take when building from source, along with several others including python use_existing_torch.py so that installing vLLM from source does not overwrite my torch. But, trying to consume the release wheels, the following still results in the same error:

uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
uv pip install "vllm==v0.11.2" --torch-backend=auto

That second command to install vLLM overwrites the torch, torchvision, and torchaudio I just installed above it.

I'm sure I can get the release installing from source with these kinds of steps. But since there was some indication that the released wheel might just work on DGX Spark, I was trying to do that.

@bbrowning (Contributor)

I was able to install release v0.11.2 via these commands on a DGX Spark:

mkdir -p ~/tmp/vllm-v0.11.2
cd ~/tmp/vllm-v0.11.2
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
uv pip install "vllm==v0.11.2" --no-binary vllm --torch-backend=auto

That compiled from source without issue. So, while we don't have any wheel releases that work yet for DGX Spark, release v0.11.2 does install on the system without extra hacks.
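
As a quick sanity check after the install (just a suggestion):

# Should import cleanly and report 0.11.2 plus a CUDA 13.x toolkit
python -c "import vllm, torch; print(vllm.__version__, torch.version.cuda)"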

@ericcurtin (Contributor)

Does vllm serve "HuggingFaceTB/SmolLM-135M-Instruct" work with this installation technique?

@johnnynunez (Contributor, Author)

> Does vllm serve "HuggingFaceTB/SmolLM-135M-Instruct" work with this installation technique?

Just try it; right now the best backend for Spark is FlashInfer.
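
For example, assuming the usual VLLM_ATTENTION_BACKEND environment variable and any small model as a placeholder:

# Force the FlashInfer attention backend instead of the auto-selected one
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve HuggingFaceTB/SmolLM-135M-Instruct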

@bbrowning (Contributor)

> Does vllm serve "HuggingFaceTB/SmolLM-135M-Instruct" work with this installation technique?

Yes, it starts up without issue and I was able to send a simple chat completion request to it just to see some kind of generation working.

@ericcurtin (Contributor)

> Does vllm serve "HuggingFaceTB/SmolLM-135M-Instruct" work with this installation technique?
>
> Yes, it starts up without issue and I was able to send a simple chat completion request to it just to see some kind of generation working.

Most of the time, I have been installing without these flags:

--no-binary --torch-backend=auto

I wonder if that is the difference...

Iterations are slow at my house (an iteration takes about an hour with my bad bandwidth); thanks for answering.

@ericcurtin (Contributor)

I'd appreciate it if someone put together a simple:

FROM ubuntu:24.04

RUN
RUN

with commands like:

mkdir -p ~/tmp/vllm-v0.11.2
cd ~/tmp/vllm-v0.11.2
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
uv pip install "vllm==v0.11.2" --no-binary vllm --torch-backend=auto

that works. I feel like I've tried this 100 times, hitting new errors each time, and failed :'(

@bbrowning (Contributor)

@ericcurtin Oh, I'm doing this directly on my Spark and not inside a container. A container will need additional steps, but let me see what I can figure out.

@bbrowning (Contributor)

@ericcurtin I was able to build and run a functioning v0.11.2 container directly on my DGX Spark with the Dockerfile at https://gist.github.com/bbrowning/e2efe77b617b741a23ed31333a7ecba9 - it takes the first bits of the official vLLM container and installs from the release instead of from source, removing as many of the unnecessary bits as I could find to simplify things for this one use case.

Make sure to pass --gpus=all when running the built container. The entrypoint is set to vllm serve just like the official container, so pass whatever args you need after that.
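
For example (the image tag is a placeholder, and vLLM listens on port 8000 by default):

docker build -t vllm-spark:v0.11.2 .
docker run --gpus=all -p 8000:8000 vllm-spark:v0.11.2 HuggingFaceTB/SmolLM-135M-Instruct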
