Your current environment
The output of python collect_env.py
Collecting environment information...
==============================
System Info
==============================
OS : Ubuntu 24.04.2 LTS (x86_64)
GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version : Could not collect
CMake version : version 3.31.6
Libc version : glibc-2.39
==============================
PyTorch Info
==============================
PyTorch version : 2.8.0a0+5228986c39.nv25.05
Is debug build : False
CUDA used to build PyTorch : 12.9
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : 12.9.41
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Nvidia driver version : 581.57
cuDNN version : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.10.1
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 18
On-line CPU(s) list: 0-17
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) Ultra 9 285K
CPU family: 6
Model: 198
Thread(s) per core: 1
Core(s) per socket: 18
Socket(s): 1
Stepping: 2
BogoMIPS: 7372.79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni vnmi umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 864 KiB (18 instances)
L1i cache: 1.1 MiB (18 instances)
L2 cache: 54 MiB (18 instances)
L3 cache: 36 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-17
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.5.2
[pip3] mypy_extensions==1.1.0
[pip3] numpy==1.26.4
[pip3] nvidia-cudnn-frontend==1.16.0
[pip3] nvidia-cutlass-dsl==4.3.0
[pip3] nvidia-dali-cuda120==1.49.0
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-modelopt==0.27.1
[pip3] nvidia-modelopt-core==0.27.1
[pip3] nvidia-nvcomp-cu12==4.2.0.14
[pip3] nvidia-nvimgcodec-cu12==0.5.0.13
[pip3] nvidia-nvjpeg-cu12==12.4.0.16
[pip3] nvidia-nvjpeg2k-cu12==0.8.1.40
[pip3] nvidia-nvtiff-cu12==0.5.0.67
[pip3] nvidia-resiliency-ext==0.3.0
[pip3] onnx==1.17.0
[pip3] optree==0.15.0
[pip3] pynvml==12.0.0
[pip3] pytorch-triton==3.3.0+git96316ce52.nvinternal
[pip3] pyzmq==26.4.0
[pip3] torch==2.8.0a0+5228986c39.nv25.5
[pip3] torch_tensorrt==2.8.0a0
[pip3] torchprofile==0.0.4
[pip3] torchvision==0.22.0a0
[pip3] transformers==5.0.0.dev0
[pip3] triton==3.5.1
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.11.2.dev218+gf716a1537.d20251124 (git sha: f716a1537, date: 20251124)
vLLM Build Flags:
CUDA Archs: 12.0 12.1; ROCm: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
==============================
Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
CUBLAS_VERSION=12.9.0.13
NVIDIA_REQUIRE_CUDA=cuda>=9.0
TORCH_CUDA_ARCH_LIST=12.0 12.1
NCCL_VERSION=2.26.5
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
TORCH_NCCL_USE_COMM_NONBLOCKING=0
CUDA_ARCH_LIST=7.5 8.0 8.6 9.0 10.0 12.0
NVIDIA_PRODUCT_NAME=PyTorch
CUDA_VERSION=12.9.0.043
PYTORCH_VERSION=2.8.0a0+5228986
PYTORCH_BUILD_NUMBER=0
CUBLASMP_VERSION=0.4.0.789
CUDNN_FRONTEND_VERSION=1.11.0
CUDNN_VERSION=9.10.1.4
PYTORCH_HOME=/opt/pytorch/pytorch
LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_BUILD_ID=170559088
CUDA_DRIVER_VERSION=575.51.03
PYTORCH_BUILD_VERSION=2.8.0a0+5228986
CUDA_HOME=/usr/local/cuda
CUDA_MODULE_LOADING=LAZY
NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=
NVIDIA_PYTORCH_VERSION=25.05
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
PS E:\Ollama-MMLU-Pro> docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -e VLLM_USE_FLASHINFER_MOE_FP4=1 -e VLLM_USE_FLASHINFER_MOE_FP8=1 -e VLLM_FLASHINFER_MOE_BACKEND=latency -e VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB='{"2":32,"4":32,"8":8}' -v E:/text-generation-webui-1.14/user_data/models:/model vllm-blackwell:latest python -m vllm.entrypoints.openai.api_server --model /model/Llama4-Scout17B-NVFP4 --trust-remote-code --host 0.0.0.0 --port 8000 --max-model-len 131072 --gpu-memory-utilization 0.95 --served-model-name Llama4-Scout17B-NVFP4 --chat-template /model/Llama4-Scout17B-NVFP4/chat_template.jinja --kv-cache-dtype fp8 --no-enable-prefix-caching true --async-scheduling --mm-encoder-tp-mode data --enable-auto-tool-choice --tool-call-parser llama4_pythonic --quantization modelopt_fp4
=============
== PyTorch ==
NVIDIA Release 25.05 (build 170559088)
PyTorch Version 2.8.0a0+5228986
Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).
INFO 11-25 03:10:36 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 11-25 03:10:36 [api_server.py:2056] vLLM API server version 0.11.2.dev218+gf716a1537.d20251124
(APIServer pid=1) INFO 11-25 03:10:36 [utils.py:253] non-default args: {'model_tag': 'true', 'host': '0.0.0.0', 'chat_template': '/model/Llama4-Scout17B-NVFP4/chat_template.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'llama4_pythonic', 'model': '/model/Llama4-Scout17B-NVFP4', 'trust_remote_code': True, 'max_model_len': 131072, 'quantization': 'modelopt_fp4', 'served_model_name': ['Llama4-Scout17B-NVFP4'], 'gpu_memory_utilization': 0.95, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': False, 'mm_encoder_tp_mode': 'data', 'async_scheduling': True}
(APIServer pid=1) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) rope_parameters's high_freq_factor field must be greater than low_freq_factor, got high_freq_factor=1.0 and low_freq_factor=1.0
(APIServer pid=1) INFO 11-25 03:10:52 [model.py:630] Resolved architecture: Llama4ForConditionalGeneration
(APIServer pid=1) INFO 11-25 03:10:52 [model.py:1745] Using max model len 131072
(APIServer pid=1) INFO 11-25 03:10:53 [cache.py:195] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=1) INFO 11-25 03:10:54 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) WARNING 11-25 03:10:54 [modelopt.py:812] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=1) WARNING 11-25 03:10:54 [vllm.py:754] There is a latency regression when using chunked local attention with the hybrid KV cache manager. Disabling it, by default. To enable it, set the environment VLLM_ALLOW_CHUNKED_LOCAL_ATTN_WITH_HYBRID_KV_CACHE=1.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "", line 198, in _run_module_as_main
(APIServer pid=1) File "", line 88, in _run_code
(APIServer pid=1) File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 2178, in
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 2106, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 2125, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 196, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 237, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/utils/func_utils.py", line 116, in inner
(APIServer pid=1) return fn(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 219, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 114, in init
(APIServer pid=1) tokenizer = init_tokenizer_from_configs(self.model_config)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/transformers_utils/tokenizer.py", line 295, in init_tokenizer_from_configs
(APIServer pid=1) return get_tokenizer(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/transformers_utils/tokenizer.py", line 221, in get_tokenizer
(APIServer pid=1) tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 1149, in from_pretrained
(APIServer pid=1) return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 2164, in from_pretrained
(APIServer pid=1) return cls._from_pretrained(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 2471, in _from_pretrained
(APIServer pid=1) if _is_local and _config.model_type not in [
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^
(APIServer pid=1) AttributeError: 'dict' object has no attribute 'model_type'
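For reference, a minimal sketch of the failing step, assuming the error reproduces with a bare tokenizer load outside of vLLM (the traceback ends inside transformers' tokenizer loading, before the engine itself starts). The path is the same local checkpoint directory passed to `--model` above; this is not a verified reproduction, just the call that the traceback points at:

```python
# Minimal repro sketch (assumption: the AttributeError originates in
# transformers' tokenizer loading and does not need the vLLM server at all).
from transformers import AutoTokenizer

# Same local checkpoint directory that was passed to --model in the command above.
tokenizer = AutoTokenizer.from_pretrained(
    "/model/Llama4-Scout17B-NVFP4",
    trust_remote_code=True,
)
# If the assumption holds, this raises:
#   AttributeError: 'dict' object has no attribute 'model_type'
# from tokenization_utils_base._from_pretrained, as in the traceback above.
```

If the same error appears here, that would suggest the problem is in how the checkpoint's config is handled during tokenizer loading rather than in any of the vLLM server flags used above.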
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.