
[Bug]: Error on inference with LoRa request (safetensors format) #6333

Open
tsvisab opened this issue Jul 11, 2024 · 2 comments
Labels
bug Something isn't working

Comments


tsvisab commented Jul 11, 2024

Your current environment

```text
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.18.4
Libc version: glibc-2.31

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.10.0-30-cloud-amd64-x86_64-with-glibc2.31
Is CUDA available: N/A
CUDA runtime version: 11.3.109
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: 
GPU 0: NVIDIA L4
GPU 1: NVIDIA L4
GPU 2: NVIDIA L4
GPU 3: NVIDIA L4

Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
Address sizes:                        46 bits physical, 48 bits virtual
CPU(s):                               48
On-line CPU(s) list:                  0-47
Thread(s) per core:                   2
Core(s) per socket:                   24
Socket(s):                            1
NUMA node(s):                         1
Vendor ID:                            GenuineIntel
CPU family:                           6
Model:                                85
Model name:                           Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:                             7
CPU MHz:                              2200.226
BogoMIPS:                             4400.45
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            768 KiB
L1i cache:                            768 KiB
L2 cache:                             24 MiB
L3 cache:                             38.5 MiB
NUMA node0 CPU(s):                    0-47
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.24.4
[conda] numpy                     1.24.4                   pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     PHB     0-47    0               N/A
GPU1    PHB      X      PHB     PHB     0-47    0               N/A
GPU2    PHB     PHB      X      PHB     0-47    0               N/A
GPU3    PHB     PHB     PHB      X      0-47    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

Llama 3 8B with a tuned LoRA adapter fails on inference.

Model init:

```python
# Imports inferred from the snippet (vLLM ~0.5.x module layout):
import asyncio
import os

import ray
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.lora.request import LoRARequest
from vllm.usage.usage_lib import UsageContext
from vllm.utils import random_uuid


class VLLMClassifierBase(KLlmHuggingfaceClassifier):
    def __init__(self, model_config: VLLMClassifierConfig):
        logger.info(f"Model config: {model_config}")
        super().__init__()
        ray_tmp_dir = "/dev/shm/tmp/ray"
        os.makedirs(ray_tmp_dir, exist_ok=True)
        ray.init(_temp_dir=ray_tmp_dir, num_gpus=model_config.tensor_parallel_size)

        self.load_model(model_config)

    def load_model(self, model_config):

        self.lora_adapter_path = None
        logger.info(f"Lora adapter path: {model_config.gcs_lora_adapter_path}")
        if model_config.gcs_lora_adapter_path:
            local_lora_adapter_path = "/home/jupyter/huggingface-models/training/artifacts-240711_1124/"
            # copy_from_gcs_uri(model_config.gcs_lora_adapter_path, local_lora_adapter_path)
            self.lora_adapter_path = local_lora_adapter_path

        os.environ["NCCL_DEBUG"] = "INFO"

        download_dir = "/dev/shm/cache/hub"
        os.makedirs(download_dir, exist_ok=True)

        logger.info(f"Loading model: {model_config.hf_model_path}")
        engine_args = AsyncEngineArgs(
            model=model_config.hf_model_path,
            quantization=model_config.quantization if model_config.quantization else None,
            dtype="auto",
            tensor_parallel_size=model_config.tensor_parallel_size,
            enforce_eager=model_config.enforce_eager,
            disable_custom_all_reduce=model_config.disable_custom_all_reduce,
            worker_use_ray=bool(model_config.tensor_parallel_size > 1),
            engine_use_ray=bool(model_config.tensor_parallel_size > 1),
            enable_lora=bool(self.lora_adapter_path is not None),
            download_dir=download_dir,
            gpu_memory_utilization=model_config.gpu_memory_utilization,
            max_model_len=model_config.max_model_len,
            enable_prefix_caching=model_config.enable_prefix_caching,
            max_lora_rank=64,
        )

        self.model = AsyncLLMEngine.from_engine_args(engine_args, usage_context=UsageContext.API_SERVER)
        logger.info(f"Successfully loaded model: {vars(self.model)}")
        loop = asyncio.get_event_loop()

        get_tokenizer_future = asyncio.ensure_future(self.model.get_tokenizer())

        loop.run_until_complete(get_tokenizer_future)

        self.tokenizer = get_tokenizer_future.result()
```

Model inference:

```python
    async def _predict_single_instance(self, model_input: KLLMModelInput):
        prompt = self._to_chat_prompt(model_input.chat_history)
        sampling_params = SamplingParams(
            n=model_input.n,
            temperature=model_input.temperature,
            top_p=model_input.top_p,
            max_tokens=model_input.max_new_tokens,
        )
        request_id = random_uuid()
        lora_request = None
        if self.lora_adapter_path:
            lora_request = LoRARequest("sql_adapter", 1, self.lora_adapter_path)

        # Run the async function within the event loop
        text_outputs = await self.get_result_from_generator(
            self.model.generate(prompt, sampling_params, request_id, lora_request=lora_request)
        )
        return KLLMModelOutput(generated_ids=text_outputs)
```
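The helper `get_result_from_generator` is not shown in the issue. A minimal sketch of what it plausibly does, assuming vLLM's `AsyncLLMEngine.generate` yields `RequestOutput` objects and the final one carries the finished generations:

```python
# Hypothetical reconstruction of the unshown helper: drain the async
# generator returned by AsyncLLMEngine.generate and keep the last
# RequestOutput, which holds the completed generations.
async def get_result_from_generator(self, results_generator):
    final_output = None
    async for request_output in results_generator:
        final_output = request_output
    # One CompletionOutput per requested sample (n) lives in .outputs.
    return [output.text for output in final_output.outputs]
```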



Logs:


> m_engine.AsyncLLMEngine object at 0x7f3117d5d600>>)
> handle: <Handle functools.partial(<function _log_task_completion at 0x7f2fdbc01e10>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f3117d5d600>>)>
> Traceback (most recent call last):
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 43, in _log_task_completion
>     return_value = task.result()
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 595, in run_engine_loop
>     result = task.result()
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 538, in engine_step
>     request_outputs = await self.engine.step.remote()  # type: ignore
> ray.exceptions.RayTaskError(IndexError): ray::_AsyncLLMEngine.step() (pid=55187, ip=10.148.0.7, actor_id=75c03eff87f9a5f009b2815b01000000, repr=<vllm.engine.async_llm_engine._AsyncLLMEngine object at 0x7fe99db5a350>)
>   File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 451, in result
>     return self.__get_result()
>   File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
>     raise self._exception
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 861, in step
>     output = self.model_executor.execute_model(
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 76, in execute_model
>     return self._driver_execute_model(execute_model_req)
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 224, in _driver_execute_model
>     return self.driver_worker.execute_method("execute_model",
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 349, in execute_method
>     raise e
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 340, in execute_method
>     return executor(*args, **kwargs)
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 271, in execute_model
>     output = self.model_runner.execute_model(
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
>     return func(*args, **kwargs)
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1189, in execute_model
>     self.set_active_loras(model_input.lora_requests,
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 887, in set_active_loras
>     self.lora_manager.set_active_loras(lora_requests, lora_mapping)
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 138, in set_active_loras
>     self._apply_loras(lora_requests)
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 270, in _apply_loras
>     self.add_lora(lora)
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 285, in add_lora
>     self._lora_manager.activate_lora(lora_request.lora_int_id)
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/lora/models.py", line 804, in activate_lora
>     result = super().activate_lora(lora_id)
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/lora/models.py", line 495, in activate_lora
>     module.set_lora(index, module_lora.lora_a, module_lora.lora_b,
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/lora/layers.py", line 826, in set_lora
>     lora_b = self.slice_lora_b(lora_b)
>   File "/home/jupyter/.cache/pypoetry/virtualenvs/k-llm-huggingface-6NB9Z0db-py3.10/lib/python3.10/site-packages/vllm/lora/layers.py", line 801, in slice_lora_b
>     lora_b_q = lora_b[0][:, self.q_proj_shard_size *
> IndexError: too many indices for tensor of dimension 1
tsvisab added the bug label Jul 11, 2024
tsvisab changed the title from "[Bug]:" to "[Bug]: Error on inference with LoRa request" Jul 11, 2024
tsvisab commented Jul 12, 2024

archive_name.tar.gz
Attached is the LoRA adapter.

tsvisab changed the title from "[Bug]: Error on inference with LoRa request" to "[Bug]: Error on inference with LoRa request (safetensors format)" Jul 12, 2024
tsvisab commented Jul 12, 2024

After a little investigation, it seems vLLM uses safetensors to load the adapter but gets back empty tensors from `f.get_tensor()` in some cases; I still have no idea how to fix this.
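
A minimal diagnostic sketch (not from the issue) for inspecting the saved adapter directly with the `safetensors` API; the filename is the usual PEFT default and is an assumption:

```python
# Hypothetical check: list every tensor in the adapter file and flag
# empty or 1-D entries, which would break vLLM's lora_b[0][:, ...] slicing.
from safetensors import safe_open

ADAPTER_FILE = "adapter_model.safetensors"  # assumed PEFT default filename

with safe_open(ADAPTER_FILE, framework="pt") as f:
    for key in f.keys():
        tensor = f.get_tensor(key)
        if tensor.numel() == 0 or tensor.dim() < 2:
            print(f"suspicious: {key} shape={tuple(tensor.shape)}")
```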

More context:

- I trained the adapter using trl's SFTTrainer.
- As a workaround I'm going to set save_safetensors=False in SFTConfig (sketch below).
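
A minimal sketch of that workaround, assuming a standard trl setup; the output path, model id, and dataset are placeholders, not from the issue:

```python
# Hypothetical workaround sketch: save the adapter with torch.save
# (adapter_model.bin) instead of the safetensors format.
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="lora-out",       # placeholder path
    save_safetensors=False,      # write adapter_model.bin, not .safetensors
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # placeholder; the issue fine-tunes Llama 3 8B
    args=config,
    train_dataset=train_dataset,         # assumed to be defined elsewhere
)
trainer.train()
```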
