
[Bug]: Unable to run Qwen3-Next-80B-A3B-Instruct using B200 and Flashinfer backend #25811

@mihirp1998

Your current environment

The output of python collect_env.py

/home/ubuntu/.local/share/mamba/envs/pho/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 3.22.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.8.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.11 | packaged by conda-forge | (main, Jun  4 2025, 14:45:31) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.8.0-60-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.8.93
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : 
GPU 0: NVIDIA B200
GPU 1: NVIDIA B200
GPU 2: NVIDIA B200
GPU 3: NVIDIA B200
GPU 4: NVIDIA B200
GPU 5: NVIDIA B200
GPU 6: NVIDIA B200
GPU 7: NVIDIA B200

Nvidia driver version        : 570.148.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               208
On-line CPU(s) list:                  0-207
Vendor ID:                            GenuineIntel
Model name:                           INTEL(R) XEON(R) PLATINUM 8570
CPU family:                           6
Model:                                207
Thread(s) per core:                   2
Core(s) per socket:                   26
Socket(s):                            4
Stepping:                             2
BogoMIPS:                             4200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq dtes64 vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                       VT-x
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            6.5 MiB (208 instances)
L1i cache:                            6.5 MiB (208 instances)
L2 cache:                             416 MiB (104 instances)
L3 cache:                             64 MiB (4 instances)
NUMA node(s):                         4
NUMA node0 CPU(s):                    0-51
NUMA node1 CPU(s):                    52-103
NUMA node2 CPU(s):                    104-155
NUMA node3 CPU(s):                    156-207
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Mitigation; TSX disabled

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.3.1.post1
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.14.1
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-ml-py==13.580.82
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pynvml==13.0.1
[pip3] pyzmq==27.1.0
[pip3] torch==2.8.0
[pip3] torchaudio==2.8.0
[pip3] torchdata==0.11.0
[pip3] torchvision==0.23.0
[pip3] transformers==4.56.2
[pip3] triton==3.4.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.11.0rc2.dev12+g3958b96bf (git sha: 3958b96bf)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   NIC11   NIC12   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    PIX     0-51    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     NODE    0-51    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    PIX     SYS     SYS     52-103  1               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     NODE    SYS     SYS     52-103  1               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    PIX     SYS     SYS     SYS     SYS     104-155 2               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     NODE    SYS     SYS     SYS     SYS     104-155 2               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS     156-207 3               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS     156-207 3               N/A
NIC0    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PHB      X      PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PHB     PHB      X      PHB     PHB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB      X      PHB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     NODE    PIX     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS     SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     SYS     SYS     PIX     NODE    SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS     SYS     SYS     SYS     SYS
NIC7    SYS     SYS     SYS     SYS     NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS     SYS     SYS
NIC8    SYS     SYS     SYS     SYS     PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS     SYS     SYS
NIC9    SYS     SYS     NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS
NIC10   SYS     SYS     PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS
NIC11   NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      NODE
NIC12   PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11
  NIC12: mlx5_12

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/nvidia
CUDA_HOME=/usr/local/cuda
CUDA_VISIBLE_DEVICES=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

I get the following error when I run the code below. Note that when I change the attention backend from FlashInfer to something else, I don't get this error:


    from vllm import LLM, SamplingParams

    num_gpus = 8  # not defined in the original snippet; this host has 8x B200
    base_model = "Qwen/Qwen3-Next-80B-A3B-Instruct"
    model = LLM(model=base_model, tensor_parallel_size=num_gpus, max_model_len=None)

    # Sampling parameters (defaults apart from the token limit)
    sampling_params = SamplingParams(
        max_tokens=2048,
    )

    # Example prompt
    prompt = [{"role": "user", "content": "Who is the president of US?"}]

    # Generate response
    outputs = model.chat([prompt], sampling_params)

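For context, the snippet above does not show how the attention backend was selected; presumably it was set via the VLLM_ATTENTION_BACKEND environment variable before constructing the LLM. A minimal sketch of the toggle (FLASH_ATTN is just one example of "something else" that does not crash):

    import os

    # Select the attention backend before vLLM builds the engine.
    os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"    # triggers the RuntimeError on B200
    # os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"  # no error with this backend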
Full Error:

(EngineCore_DP0 pid=3339204) RuntimeError: Expect (32 <= headDim <= 2048) && (numTokensPerPage <= 128), got headDimPerCtaV=%d, headDimQk=%d, headDimV=%d, numTokensPerPage=%d256256256544

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3339204)   File "/home/ubuntu/.local/share/mamba/envs/pho/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
(EngineCore_DP0 pid=3339204)     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=3339204)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3339204)   File "/lambda/nfs/cluster-mle-cluster/phd_projects/vllm/vllm/attention/layer.py", line 613, in unified_attention_with_output
(EngineCore_DP0 pid=3339204)     self.impl.forward(self,
(EngineCore_DP0 pid=3339204)   File "/lambda/nfs/cluster-mle-cluster/phd_projects/vllm/vllm/v1/attention/backends/flashinfer.py", line 941, in forward
(EngineCore_DP0 pid=3339204)     trtllm_batch_context_with_kv_cache(
(EngineCore_DP0 pid=3339204)   File "/home/ubuntu/.local/share/mamba/envs/pho/lib/python3.12/site-packages/flashinfer/prefill.py", line 3426, in trtllm_batch_context_with_kv_cache
(EngineCore_DP0 pid=3339204)     run_func(
(EngineCore_DP0 pid=3339204)   File "/home/ubuntu/.local/share/mamba/envs/pho/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
(EngineCore_DP0 pid=3339204)     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=3339204)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3339204) RuntimeError: Expect (32 <= headDim <= 2048) && (numTokensPerPage <= 128), got headDimPerCtaV=%d, headDimQk=%d, headDimV=%d, numTokensPerPage=%d256256256544
    return self.generate(
           ^^^^^^^^^^^^^^
  File "/lambda/nfs/cluster-mle-cluster/phd_projects/vllm/vllm/entrypoints/llm.py", line 401, in generate
    outputs = self._run_engine(use_tqdm=use_tqdm)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lambda/nfs/cluster-mle-cluster/phd_projects/vllm/vllm/entrypoints/llm.py", line 1600, in _run_engine
    step_outputs = self.llm_engine.step()
                   ^^^^^^^^^^^^^^^^^^^^^^
  File "/lambda/nfs/cluster-mle-cluster/phd_projects/vllm/vllm/v1/engine/llm_engine.py", line 265, in step
    outputs = self.engine_core.get_output()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lambda/nfs/cluster-mle-cluster/phd_projects/vllm/vllm/v1/engine/core_client.py", line 670, in get_output
    raise self._format_exception(outputs) from None
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore_DP0 pid=3339204) Exception ignored in atexit callback: <bound method finalize._exitfunc of <class 'weakref.finalize'>>
(EngineCore_DP0 pid=3339204) Traceback (most recent call last):
(EngineCore_DP0 pid=3339204)   File "/home/ubuntu/.local/share/mamba/envs/pho/lib/python3.12/weakref.py", line 666, in _exitfunc
(EngineCore_DP0 pid=3339204)     f()
(EngineCore_DP0 pid=3339204)   File "/home/ubuntu/.local/share/mamba/envs/pho/lib/python3.12/weakref.py", line 590, in __call__
(EngineCore_DP0 pid=3339204)     return info.func(*info.args, **(info.kwargs or {}))
(EngineCore_DP0 pid=3339204)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3339204)   File "/home/ubuntu/.local/share/mamba/envs/pho/lib/python3.12/site-packages/torch/library.py", line 482, in _del_library
(EngineCore_DP0 pid=3339204)     m.reset()
(EngineCore_DP0 pid=3339204)   File "/lambda/nfs/cluster-mle-cluster/phd_projects/vllm/vllm/v1/engine/core.py", line 679, in signal_handler
(EngineCore_DP0 pid=3339204)     raise SystemExit()
(EngineCore_DP0 pid=3339204) SystemExit: 
[rank0]:[W928 00:04:00.546019218 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Processed prompts:   0%|                                                                                                                              | 0/1 [00:53<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
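A note on reading the garbled message: the kernel's %d placeholders are never substituted, so the four values are concatenated verbatim at the end of the string. They most plausibly split as headDimPerCtaV=256, headDimQk=256, headDimV=256, numTokensPerPage=544, in which case it is the page size, not the head dimension, that violates the asserted constraint:

    # Hypothetical decode of "256256256544" into the four %d values
    head_dim, num_tokens_per_page = 256, 544
    print(32 <= head_dim <= 2048)      # True  -> head dim satisfies the bound
    print(num_tokens_per_page <= 128)  # False -> 544 tokens per page exceeds the limit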

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
