Your current environment
The output of python collect_env.py
Collecting environment information...
==============================
System Info
==============================
OS : Ubuntu 24.04.2 LTS (x86_64)
GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version : Could not collect
CMake version : version 3.31.6
Libc version : glibc-2.39
==============================
PyTorch Info
==============================
PyTorch version : 2.8.0a0+5228986c39.nv25.05
Is debug build : False
CUDA used to build PyTorch : 12.9
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : 12.9.41
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Nvidia driver version : 581.57
cuDNN version : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.10.1
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 18
On-line CPU(s) list: 0-17
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) Ultra 9 285K
CPU family: 6
Model: 198
Thread(s) per core: 1
Core(s) per socket: 18
Socket(s): 1
Stepping: 2
BogoMIPS: 7372.79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni vnmi umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 864 KiB (18 instances)
L1i cache: 1.1 MiB (18 instances)
L2 cache: 54 MiB (18 instances)
L3 cache: 36 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-17
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.5.2
[pip3] mypy_extensions==1.1.0
[pip3] numpy==1.26.4
[pip3] nvidia-cudnn-frontend==1.16.0
[pip3] nvidia-cutlass-dsl==4.3.0
[pip3] nvidia-dali-cuda120==1.49.0
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-modelopt==0.27.1
[pip3] nvidia-modelopt-core==0.27.1
[pip3] nvidia-nvcomp-cu12==4.2.0.14
[pip3] nvidia-nvimgcodec-cu12==0.5.0.13
[pip3] nvidia-nvjpeg-cu12==12.4.0.16
[pip3] nvidia-nvjpeg2k-cu12==0.8.1.40
[pip3] nvidia-nvtiff-cu12==0.5.0.67
[pip3] nvidia-resiliency-ext==0.3.0
[pip3] onnx==1.17.0
[pip3] optree==0.15.0
[pip3] pynvml==12.0.0
[pip3] pytorch-triton==3.3.0+git96316ce52.nvinternal
[pip3] pyzmq==26.4.0
[pip3] torch==2.8.0a0+5228986c39.nv25.5
[pip3] torch_tensorrt==2.8.0a0
[pip3] torchprofile==0.0.4
[pip3] torchvision==0.22.0a0
[pip3] transformers==5.0.0.dev0
[pip3] triton==3.5.1
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.11.2.dev218+gf716a1537.d20251124 (git sha: f716a1537, date: 20251124)
vLLM Build Flags:
CUDA Archs: 12.0 12.1; ROCm: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
==============================
Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
CUBLAS_VERSION=12.9.0.13
NVIDIA_REQUIRE_CUDA=cuda>=9.0
TORCH_CUDA_ARCH_LIST=12.0 12.1
NCCL_VERSION=2.26.5
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
TORCH_NCCL_USE_COMM_NONBLOCKING=0
CUDA_ARCH_LIST=7.5 8.0 8.6 9.0 10.0 12.0
NVIDIA_PRODUCT_NAME=PyTorch
CUDA_VERSION=12.9.0.043
PYTORCH_VERSION=2.8.0a0+5228986
PYTORCH_BUILD_NUMBER=0
CUBLASMP_VERSION=0.4.0.789
CUDNN_FRONTEND_VERSION=1.11.0
CUDNN_VERSION=9.10.1.4
PYTORCH_HOME=/opt/pytorch/pytorch
LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_BUILD_ID=170559088
CUDA_DRIVER_VERSION=575.51.03
PYTORCH_BUILD_VERSION=2.8.0a0+5228986
CUDA_HOME=/usr/local/cuda
CUDA_MODULE_LOADING=LAZY
NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=
NVIDIA_PYTORCH_VERSION=25.05
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
PS E:\Ollama-MMLU-Pro> docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -e VLLM_USE_FLASHINFER_MOE_FP4=1 -e VLLM_USE_FLASHINFER_MOE_FP8=1 -e VLLM_FLASHINFER_MOE_BACKEND=latency -e VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB='{"2":32,"4":32,"8":8}' -v E:/text-generation-webui-1.14/user_data/models:/model vllm-blackwell:latest python -m vllm.entrypoints.openai.api_server --model /model/Llama4-Scout17B-NVFP4 --trust-remote-code --host 0.0.0.0 --port 8000 --max-model-len 131072 --gpu-memory-utilization 0.95 --served-model-name Llama4-Scout17B-NVFP4 --chat-template /model/Llama4-Scout17B-NVFP4/chat_template.jinja --kv-cache-dtype fp8 --no-enable-prefix-caching true --async-scheduling --mm-encoder-tp-mode data --enable-auto-tool-choice --tool-call-parser llama4_pythonic --quantization modelopt_fp4
=============
== PyTorch ==
NVIDIA Release 25.05 (build 170559088)
PyTorch Version 2.8.0a0+5228986
Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).
INFO 11-25 03:10:36 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 11-25 03:10:36 [api_server.py:2056] vLLM API server version 0.11.2.dev218+gf716a1537.d20251124
(APIServer pid=1) INFO 11-25 03:10:36 [utils.py:253] non-default args: {'model_tag': 'true', 'host': '0.0.0.0', 'chat_template': '/model/Llama4-Scout17B-NVFP4/chat_template.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'llama4_pythonic', 'model': '/model/Llama4-Scout17B-NVFP4', 'trust_remote_code': True, 'max_model_len': 131072, 'quantization': 'modelopt_fp4', 'served_model_name': ['Llama4-Scout17B-NVFP4'], 'gpu_memory_utilization': 0.95, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': False, 'mm_encoder_tp_mode': 'data', 'async_scheduling': True}
(APIServer pid=1) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) rope_parameters's high_freq_factor field must be greater than low_freq_factor, got high_freq_factor=1.0 and low_freq_factor=1.0
(APIServer pid=1) INFO 11-25 03:10:52 [model.py:630] Resolved architecture: Llama4ForConditionalGeneration
(APIServer pid=1) INFO 11-25 03:10:52 [model.py:1745] Using max model len 131072
(APIServer pid=1) INFO 11-25 03:10:53 [cache.py:195] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=1) INFO 11-25 03:10:54 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) WARNING 11-25 03:10:54 [modelopt.py:812] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=1) WARNING 11-25 03:10:54 [vllm.py:754] There is a latency regression when using chunked local attention with the hybrid KV cache manager. Disabling it, by default. To enable it, set the environment VLLM_ALLOW_CHUNKED_LOCAL_ATTN_WITH_HYBRID_KV_CACHE=1.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "", line 198, in _run_module_as_main
(APIServer pid=1) File "", line 88, in _run_code
(APIServer pid=1) File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 2178, in
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 2106, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 2125, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 196, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 237, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/utils/func_utils.py", line 116, in inner
(APIServer pid=1) return fn(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 219, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 114, in init
(APIServer pid=1) tokenizer = init_tokenizer_from_configs(self.model_config)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/transformers_utils/tokenizer.py", line 295, in init_tokenizer_from_configs
(APIServer pid=1) return get_tokenizer(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/workspace/vllm/vllm/transformers_utils/tokenizer.py", line 221, in get_tokenizer
(APIServer pid=1) tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 1149, in from_pretrained
(APIServer pid=1) return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 2164, in from_pretrained
(APIServer pid=1) return cls._from_pretrained(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 2471, in _from_pretrained
(APIServer pid=1) if _is_local and _config.model_type not in [
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^
(APIServer pid=1) AttributeError: 'dict' object has no attribute 'model_type'
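For reference, a minimal sketch of the failing step, assuming the error reproduces with a bare tokenizer load outside of vLLM (the traceback ends inside transformers' tokenizer loading, before the engine itself starts). The path is the same local checkpoint directory passed to `--model` above; this is not a verified reproduction, just the call that the traceback points at:

```python
# Minimal repro sketch (assumption: the AttributeError originates in
# transformers' tokenizer loading and does not need the vLLM server at all).
from transformers import AutoTokenizer

# Same local checkpoint directory that was passed to --model in the command above.
tokenizer = AutoTokenizer.from_pretrained(
    "/model/Llama4-Scout17B-NVFP4",
    trust_remote_code=True,
)
# If the assumption holds, this raises:
#   AttributeError: 'dict' object has no attribute 'model_type'
# from tokenization_utils_base._from_pretrained, as in the traceback above.
```

If the same error appears here, that would suggest the problem is in how the checkpoint's config is handled during tokenizer loading rather than in any of the vLLM server flags used above.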
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.