[Bug]: Loading mistral-7B-instruct-v03 KeyError: 'layers.0.attention.wk.weight' #4989

@timbmg

Description

Your current environment

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Rocky Linux 8.8 (Green Obsidian) (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.28

Python version: 3.9.13 (main, Oct 13 2022, 21:15:33) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-513.9.1.el8_9.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H100 PCIe
Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9534 64-Core Processor
Stepping: 1
CPU MHz: 2450.000
CPU max MHz: 3718.0659
CPU min MHz: 1500.0000
BogoMIPS: 4900.22
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
[conda] vllm-nccl-cu12 2.18.1.0.4.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    NIC0    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0     X      SYS     5-6,133-134     0                N/A
NIC0    SYS      X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_bond_0

🐛 Describe the bug

I am trying to load the new Mistral-7B-Instruct-v0.3 model. However, loading fails with KeyError: 'layers.0.attention.wk.weight'. Curiously, it seems to use the llama model loader (see stack trace); I am not sure whether that is intended.
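Minimal repro (sketch; model_path points at my local download of mistralai/Mistral-7B-Instruct-v0.3, and max_model_len / gpu_memory_utilization are placeholder values — the failure happens during weight loading, before those should matter):

from vllm import LLM

model_path = "Mistral-7B-Instruct-v0.3"  # local download of the HF repo

llm = LLM(
    model=model_path,
    dtype="float16",
    max_model_len=4096,           # placeholder
    gpu_memory_utilization=0.9,   # placeholder
)

Full traceback: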

KeyError                                  Traceback (most recent call last)
Cell In[13], line 43
     40 else:
     41     raise ValueError(model)
---> 43 llm = LLM(
     44     model=model_path, 
     45     dtype="float16",
     46     max_model_len=max_model_len,
     47     gpu_memory_utilization=gpu_memory_utilization,
     48     **kwargs
     49 )

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/entrypoints/llm.py:123, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
    102     kwargs["disable_log_stats"] = True
    103 engine_args = EngineArgs(
    104     model=model,
    105     tokenizer=tokenizer,
   (...)
    121     **kwargs,
    122 )
--> 123 self.llm_engine = LLMEngine.from_engine_args(
    124     engine_args, usage_context=UsageContext.LLM_CLASS)
    125 self.request_counter = Counter()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/engine/llm_engine.py:292, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
    289     executor_class = GPUExecutor
    291 # Create the LLM engine.
--> 292 engine = cls(
    293     **engine_config.to_dict(),
    294     executor_class=executor_class,
    295     log_stats=not engine_args.disable_log_stats,
    296     usage_context=usage_context,
    297 )
    298 return engine

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/engine/llm_engine.py:160, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config, decoding_config, executor_class, log_stats, usage_context)
    156 self.seq_counter = Counter()
    157 self.generation_config_fields = _load_generation_config_dict(
    158     model_config)
--> 160 self.model_executor = executor_class(
    161     model_config=model_config,
    162     cache_config=cache_config,
    163     parallel_config=parallel_config,
    164     scheduler_config=scheduler_config,
    165     device_config=device_config,
    166     lora_config=lora_config,
    167     vision_language_config=vision_language_config,
    168     speculative_config=speculative_config,
    169     load_config=load_config,
    170 )
    172 self._initialize_kv_caches()
    174 # If usage stat is enabled, collect relevant info.

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/executor/executor_base.py:41, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config)
     38 self.vision_language_config = vision_language_config
     39 self.speculative_config = speculative_config
---> 41 self._init_executor()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/executor/gpu_executor.py:23, in GPUExecutor._init_executor(self)
     17 """Initialize the worker and load the model.
     18 
     19 If speculative decoding is enabled, we instead create the speculative
     20 worker.
     21 """
     22 if self.speculative_config is None:
---> 23     self._init_non_spec_worker()
     24 else:
     25     self._init_spec_worker()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/executor/gpu_executor.py:69, in GPUExecutor._init_non_spec_worker(self)
     67 self.driver_worker = self._create_worker()
     68 self.driver_worker.init_device()
---> 69 self.driver_worker.load_model()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/worker/worker.py:118, in Worker.load_model(self)
    117 def load_model(self):
--> 118     self.model_runner.load_model()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/worker/model_runner.py:164, in ModelRunner.load_model(self)
    162 def load_model(self) -> None:
    163     with CudaMemoryProfiler() as m:
--> 164         self.model = get_model(
    165             model_config=self.model_config,
    166             device_config=self.device_config,
    167             load_config=self.load_config,
    168             lora_config=self.lora_config,
    169             vision_language_config=self.vision_language_config,
    170             parallel_config=self.parallel_config,
    171             scheduler_config=self.scheduler_config,
    172         )
    174     self.model_memory_usage = m.consumed_memory
    175     logger.info("Loading model weights took %.4f GB",
    176                 self.model_memory_usage / float(2**30))

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/model_executor/model_loader/__init__.py:19, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, vision_language_config)
     13 def get_model(
     14         *, model_config: ModelConfig, load_config: LoadConfig,
     15         device_config: DeviceConfig, parallel_config: ParallelConfig,
     16         scheduler_config: SchedulerConfig, lora_config: Optional[LoRAConfig],
     17         vision_language_config: Optional[VisionLanguageConfig]) -> nn.Module:
     18     loader = get_model_loader(load_config)
---> 19     return loader.load_model(model_config=model_config,
     20                              device_config=device_config,
     21                              lora_config=lora_config,
     22                              vision_language_config=vision_language_config,
     23                              parallel_config=parallel_config,
     24                              scheduler_config=scheduler_config)

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py:224, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, vision_language_config, parallel_config, scheduler_config)
    221 with torch.device(device_config.device):
    222     model = _initialize_model(model_config, self.load_config,
    223                               lora_config, vision_language_config)
--> 224 model.load_weights(
    225     self._get_weights_iterator(model_config.model,
    226                                model_config.revision,
    227                                fall_back_to_pt=getattr(
    228                                    model,
    229                                    "fall_back_to_pt_during_load",
    230                                    True)), )
    231 for _, module in model.named_modules():
    232     quant_method = getattr(module, "quant_method", None)

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/model_executor/models/llama.py:415, in LlamaForCausalLM.load_weights(self, weights)
    413 if name.endswith(".bias") and name not in params_dict:
    414     continue
--> 415 param = params_dict[name]
    416 weight_loader = getattr(param, "weight_loader",
    417                         default_weight_loader)
    418 weight_loader(param, loaded_weight)

KeyError: 'layers.0.attention.wk.weight'
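For what it's worth: the v0.3 repo on the Hub ships both the HF-format shards (model-*.safetensors) and a Mistral-native consolidated.safetensors, and the failing key layers.0.attention.wk.weight matches the native naming rather than the HF naming (model.layers.0.self_attn.k_proj.weight). My guess is that the loader's safetensors glob also picks up the consolidated file. As far as I can tell, the llama code path itself is expected, since vLLM serves Mistral's architecture through its Llama implementation. A quick way to check where the offending key comes from (sketch; the path is my local download):

from safetensors import safe_open

# List the first tensor names stored in the Mistral-native checkpoint.
with safe_open("Mistral-7B-Instruct-v0.3/consolidated.safetensors",
               framework="pt", device="cpu") as f:
    print(sorted(f.keys())[:10])  # expect e.g. 'layers.0.attention.wk.weight'

If that is the cause, re-downloading without the consolidated file might work around it (sketch, assuming huggingface_hub's snapshot_download with ignore_patterns):

from huggingface_hub import snapshot_download

# Fetch only the HF-format weights; skip the Mistral-native checkpoint.
snapshot_download(
    "mistralai/Mistral-7B-Instruct-v0.3",
    local_dir="Mistral-7B-Instruct-v0.3",
    ignore_patterns=["consolidated.safetensors"],
)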
