
Multi-LoRA inference error: [TensorRT-LLm Error][fpA_intB Runner] Failed to run cutlass fpA_intB gemm. Error: Error Internal #584

@PAOPAO6

Description

System Info

gpu: A100
trtllm v0.11.0
trtllm-backend v0.11.0
image: Triton Inference Server: 24.07-trtllm-python-py3
model: llama-7b
lora: Japanese-Alpaca-LoRA-7b-v0

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. convert the checkpoint
    BASE_LLAMA_MODEL=./Qwen1.5-72B-Chat-GPTQ-Int4
    python3 convert_checkpoint.py \
      --model_dir ${BASE_LLAMA_MODEL} \
      --output_dir ./tllm_checkpoint_1gpu \
      --dtype float16 \
      --use_weight_only \
      --weight_only_precision int4_gptq \
      --per_group

  2. build the engine
    trtllm-build \
      --checkpoint_dir ./tllm_checkpoint_1gpu \
      --output_dir /opt/app/triton_model_repo/tensorrt_llm/1 \
      --gemm_plugin float16 \
      --max_batch_size 1 \
      --max_input_len 2048 \
      --max_output_len 100 \
      --gpt_attention_plugin float16 \
      --paged_kv_cache enable \
      --remove_input_padding enable \
      --use_paged_context_fmha enable \
      --use_custom_all_reduce disable \
      --lora_plugin float16 \
      --lora_target_modules attn_q attn_k attn_v attn_dense \
      --max_lora_rank 128

  3. start the server
    modelpath=/opt/app/triton_model_repo/
    CUDA_VISIBLE_DEVICES=0 /opt/tritonserver/bin/tritonserver \
      --model-repository=${modelpath} \
      --backend-config=python,shm-region-prefix-name=prefix0_

  4. run the client
    python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
      --request-output-len 10 \
      --text "You are a travel assistant." \
      --tokenizer-dir ./Qwen1.5-72B-Chat-GPTQ-Int4 \
      --lora-path "" \
      --lora-task-id 0

Execution error conditions:
Inputs whose text is fewer than 5 tokens complete normally; inputs longer than 5 tokens crash the server with the error below.

Expected behavior

The request completes and returns generated text, with no crash.

actual behavior

terminate called after throwing an instance of 'std::runtime_error'
what(): [TensorRT-LLm Error][fpA_intB Runner] Failed to run cutlass fpA_intB gemm. Error: Error Internal
Signal (6) received.
0# 0x0000563C4BC2B80D in /opt/tritonserver/bin/tritonserver
1# 0x00007F8C6CF14520 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# pthread_kill in /usr/lib/x86_64-linux-gnu/libc.so.6
3# raise in /usr/lib/x86_64-linux-gnu/libc.so.6
4# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
5# 0x00007F8C6D19DB9E in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x00007F8C6D1A920C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
7# 0x00007F8C6D1A81E9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
8# __gxx_personality_v0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
9# 0x00007F8C6F66A884 in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
10# _Unwind_Resume in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
11# 0x00007F8B68C73CC3 in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.10
12# 0x00007F8B68C8B8A3 in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.10
13# tensorrt_llm::plugins::WeightOnlyGroupwiseQuantMatmulPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.10
14# 0x00007F8B35FCFA8C in /usr/local/tensorrt/lib/libnvinfer.so.10
15# 0x00007F8B35F74657 in /usr/local/tensorrt/lib/libnvinfer.so.10
16# 0x00007F8B35F760C1 in /usr/local/tensorrt/lib/libnvinfer.so.10
17# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
18# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
19# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
20# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
21# tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
22# tensorrt_llm::executor::Executor::Impl::executionLoop() in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
23# 0x00007F8C6D1D7253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
24# 0x00007F8C6CF66AC3 in /usr/lib/x86_64-linux-gnu/libc.so.6
25# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Aborted (core dumped)

additional notes

Same failure condition as described under Reproduction: inputs longer than five tokens trigger the crash.

Metadata

Labels: bug (Something isn't working)