Your current environment
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Rocky Linux 8.8 (Green Obsidian) (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.28
Python version: 3.9.13 (main, Oct 13 2022, 21:15:33) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-513.9.1.el8_9.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H100 PCIe
Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9534 64-Core Processor
Stepping: 1
CPU MHz: 2450.000
CPU max MHz: 3718.0659
CPU min MHz: 1500.0000
BogoMIPS: 4900.22
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
[conda] vllm-nccl-cu12 2.18.1.0.4.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS 5-6,133-134 0 N/A
NIC0 SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
🐛 Describe the bug
I am trying to load the new Mistral-7B-Instruct-v0.3 model, but loading fails with KeyError: 'layers.0.attention.wk.weight'. Curiously, it appears to go through the Llama model loader (see the stack trace below); I am not sure whether that is intended.
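For reference, a minimal sketch of the call that triggers the error; the model path and the max_model_len / gpu_memory_utilization values below are placeholders for the variables used in my notebook:

```python
from vllm import LLM

# Placeholder: local path to the downloaded mistralai/Mistral-7B-Instruct-v0.3 checkpoint.
model_path = "/path/to/Mistral-7B-Instruct-v0.3"

llm = LLM(
    model=model_path,
    dtype="float16",
    max_model_len=4096,          # placeholder value
    gpu_memory_utilization=0.9,  # placeholder value
)
```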
KeyError Traceback (most recent call last)
Cell In[13], line 43
40 else:
41 raise ValueError(model)
---> 43 llm = LLM(
44 model=model_path,
45 dtype="float16",
46 max_model_len=max_model_len,
47 gpu_memory_utilization=gpu_memory_utilization,
48 **kwargs
49 )
File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/entrypoints/llm.py:123, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
102 kwargs["disable_log_stats"] = True
103 engine_args = EngineArgs(
104 model=model,
105 tokenizer=tokenizer,
(...)
121 **kwargs,
122 )
--> 123 self.llm_engine = LLMEngine.from_engine_args(
124 engine_args, usage_context=UsageContext.LLM_CLASS)
125 self.request_counter = Counter()
File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/engine/llm_engine.py:292, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
289 executor_class = GPUExecutor
291 # Create the LLM engine.
--> 292 engine = cls(
293 **engine_config.to_dict(),
294 executor_class=executor_class,
295 log_stats=not engine_args.disable_log_stats,
296 usage_context=usage_context,
297 )
298 return engine
File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/engine/llm_engine.py:160, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config, decoding_config, executor_class, log_stats, usage_context)
156 self.seq_counter = Counter()
157 self.generation_config_fields = _load_generation_config_dict(
158 model_config)
--> 160 self.model_executor = executor_class(
161 model_config=model_config,
162 cache_config=cache_config,
163 parallel_config=parallel_config,
164 scheduler_config=scheduler_config,
165 device_config=device_config,
166 lora_config=lora_config,
167 vision_language_config=vision_language_config,
168 speculative_config=speculative_config,
169 load_config=load_config,
170 )
172 self._initialize_kv_caches()
174 # If usage stat is enabled, collect relevant info.
File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/executor/executor_base.py:41, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config)
38 self.vision_language_config = vision_language_config
39 self.speculative_config = speculative_config
---> 41 self._init_executor()
File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/executor/gpu_executor.py:23, in GPUExecutor._init_executor(self)
17 """Initialize the worker and load the model.
18
19 If speculative decoding is enabled, we instead create the speculative
20 worker.
21 """
22 if self.speculative_config is None:
---> 23 self._init_non_spec_worker()
24 else:
25 self._init_spec_worker()
File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/executor/gpu_executor.py:69, in GPUExecutor._init_non_spec_worker(self)
67 self.driver_worker = self._create_worker()
68 self.driver_worker.init_device()
---> 69 self.driver_worker.load_model()
File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/worker/worker.py:118, in Worker.load_model(self)
117 def load_model(self):
--> 118 self.model_runner.load_model()
File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/worker/model_runner.py:164, in ModelRunner.load_model(self)
162 def load_model(self) -> None:
163 with CudaMemoryProfiler() as m:
--> 164 self.model = get_model(
165 model_config=self.model_config,
166 device_config=self.device_config,
167 load_config=self.load_config,
168 lora_config=self.lora_config,
169 vision_language_config=self.vision_language_config,
170 parallel_config=self.parallel_config,
171 scheduler_config=self.scheduler_config,
172 )
174 self.model_memory_usage = m.consumed_memory
175 logger.info("Loading model weights took %.4f GB",
176 self.model_memory_usage / float(2**30))
File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/model_executor/model_loader/__init__.py:19, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, vision_language_config)
13 def get_model(
14 *, model_config: ModelConfig, load_config: LoadConfig,
15 device_config: DeviceConfig, parallel_config: ParallelConfig,
16 scheduler_config: SchedulerConfig, lora_config: Optional[LoRAConfig],
17 vision_language_config: Optional[VisionLanguageConfig]) -> nn.Module:
18 loader = get_model_loader(load_config)
---> 19 return loader.load_model(model_config=model_config,
20 device_config=device_config,
21 lora_config=lora_config,
22 vision_language_config=vision_language_config,
23 parallel_config=parallel_config,
24 scheduler_config=scheduler_config)
File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py:224, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, vision_language_config, parallel_config, scheduler_config)
221 with torch.device(device_config.device):
222 model = _initialize_model(model_config, self.load_config,
223 lora_config, vision_language_config)
--> 224 model.load_weights(
225 self._get_weights_iterator(model_config.model,
226 model_config.revision,
227 fall_back_to_pt=getattr(
228 model,
229 "fall_back_to_pt_during_load",
230 True)), )
231 for _, module in model.named_modules():
232 quant_method = getattr(module, "quant_method", None)
File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/model_executor/models/llama.py:415, in LlamaForCausalLM.load_weights(self, weights)
413 if name.endswith(".bias") and name not in params_dict:
414 continue
--> 415 param = params_dict[name]
416 weight_loader = getattr(param, "weight_loader",
417 default_weight_loader)
418 weight_loader(param, loaded_weight)
KeyError: 'layers.0.attention.wk.weight'
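The missing key looks like a weight name from Mistral's native (consolidated) checkpoint layout rather than the Hugging Face-style names (e.g. model.layers.0.self_attn.k_proj.weight) that the Llama loader expects. A quick diagnostic sketch for inspecting which tensor names the safetensors files in the model directory actually contain (the path is a placeholder, and this assumes the checkpoint is stored as .safetensors):

```python
from pathlib import Path
from safetensors import safe_open

model_dir = Path("/path/to/Mistral-7B-Instruct-v0.3")  # placeholder path

# Print the first few tensor names in each safetensors file, to see whether
# they use HF-style names or Mistral-native names like layers.0.attention.wk.weight.
for f in sorted(model_dir.glob("*.safetensors")):
    with safe_open(f, framework="pt") as st:
        names = list(st.keys())[:5]
    print(f.name, names)
```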