Description
System Info
- GPU: A100
- TensorRT-LLM: v0.11.0
- tensorrtllm_backend: v0.11.0
- Image: Triton Inference Server 24.07-trtllm-python-py3
- Model: llama-7b
- LoRA: Japanese-Alpaca-LoRA-7b-v0
Who can help?
No response
Information
- The official example scripts
- My own modified scripts

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction

- Convert the checkpoint:

```shell
BASE_LLAMA_MODEL=./Qwen1.5-72B-Chat-GPTQ-Int4
python3 convert_checkpoint.py \
    --model_dir ${BASE_LLAMA_MODEL} \
    --output_dir ./tllm_checkpoint_1gpu \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group
```
- Build the engine with trtllm-build:

```shell
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint_1gpu \
    --output_dir /opt/app/triton_model_repo/tensorrt_llm/1 \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 100 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --use_paged_context_fmha enable \
    --use_custom_all_reduce disable \
    --lora_plugin float16 \
    --lora_target_modules attn_q attn_k attn_v attn_dense \
    --max_lora_rank 128
```
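Not part of the original report, but before starting Triton it can help to confirm the engine was actually built with the LoRA plugin and target modules requested above. A minimal sketch that reads the engine's `config.json`; the `build_config`/`plugin_config`/`lora_config` key layout is an assumption based on a typical TensorRT-LLM v0.11 engine config and may differ between versions:

```python
import json
import tempfile
from pathlib import Path


def lora_settings(engine_dir: str) -> dict:
    """Read the engine's config.json and pull out LoRA-related build settings.

    The key layout used here is assumed from a typical TensorRT-LLM v0.11
    engine config and may differ between versions.
    """
    cfg = json.loads(Path(engine_dir, "config.json").read_text())
    build = cfg.get("build_config", {})
    return {
        "lora_plugin": build.get("plugin_config", {}).get("lora_plugin"),
        "lora_target_modules": build.get("lora_config", {}).get("lora_target_modules", []),
        "max_lora_rank": build.get("lora_config", {}).get("max_lora_rank"),
    }


# Demo on a stand-in config; the real file lives next to the built engine,
# e.g. /opt/app/triton_model_repo/tensorrt_llm/1/config.json.
with tempfile.TemporaryDirectory() as d:
    Path(d, "config.json").write_text(json.dumps({
        "build_config": {
            "plugin_config": {"lora_plugin": "float16"},
            "lora_config": {
                "lora_target_modules": ["attn_q", "attn_k", "attn_v", "attn_dense"],
                "max_lora_rank": 128,
            },
        }
    }))
    print(lora_settings(d))
```

If `lora_plugin` comes back `None`, the engine was built without `--lora_plugin` and LoRA requests will not work.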
- Send a request with the client:

```shell
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --request-output-len 10 \
    --text "You are a travel assistant." \
    --tokenizer-dir ./Qwen1.5-72B-Chat-GPTQ-Int4 \
    --lora-path "" \
    --lora-task-id 0
```
- Start the server:

```shell
modelpath=/opt/app/triton_model_repo/
CUDA_VISIBLE_DEVICES=0 /opt/tritonserver/bin/tritonserver \
    --model-repository=$modelpath \
    --backend-config=python,shm-region-prefix-name=prefix0_
```
Execution error conditions: if the input text tokenizes to fewer than 5 tokens, the request completes normally; if it tokenizes to more than 5 tokens, execution fails.
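To pin down the exact threshold, one way is to probe the server with prompts of increasing length and record the first length that fails. A sketch with a pluggable `send` function; `send` is a stand-in for a real client call (e.g. wrapping inflight_batcher_llm_client.py), and word count is used as a rough token-count proxy:

```python
from typing import Callable, Optional


def first_failing_length(send: Callable[[str], bool], max_tokens: int = 32) -> Optional[int]:
    """Return the smallest prompt length (in words, a rough token proxy)
    for which `send` reports failure, or None if every length succeeds.

    `send` should return True on success and False (or raise) on failure.
    """
    for n in range(1, max_tokens + 1):
        prompt = " ".join(["hello"] * n)
        try:
            ok = send(prompt)
        except Exception:
            ok = False
        if not ok:
            return n
    return None


# Demo with a mock that mimics the reported behaviour (fails above 5 tokens).
mock_send = lambda prompt: len(prompt.split()) <= 5
print(first_failing_length(mock_send))  # -> 6
```

Replacing `mock_send` with a real request wrapper would confirm whether the boundary is exactly 5 tokens or depends on padding/batching.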
Expected behavior

The request completes without error.
Actual behavior

```text
terminate called after throwing an instance of 'std::runtime_error'
what(): [TensorRT-LLm Error][fpA_intB Runner] Failed to run cutlass fpA_intB gemm. Error: Error Internal
Signal (6) received.
0# 0x0000563C4BC2B80D in /opt/tritonserver/bin/tritonserver
1# 0x00007F8C6CF14520 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# pthread_kill in /usr/lib/x86_64-linux-gnu/libc.so.6
3# raise in /usr/lib/x86_64-linux-gnu/libc.so.6
4# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
5# 0x00007F8C6D19DB9E in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x00007F8C6D1A920C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
7# 0x00007F8C6D1A81E9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
8# __gxx_personality_v0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
9# 0x00007F8C6F66A884 in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
10# _Unwind_Resume in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
11# 0x00007F8B68C73CC3 in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.10
12# 0x00007F8B68C8B8A3 in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.10
13# tensorrt_llm::plugins::WeightOnlyGroupwiseQuantMatmulPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.10
14# 0x00007F8B35FCFA8C in /usr/local/tensorrt/lib/libnvinfer.so.10
15# 0x00007F8B35F74657 in /usr/local/tensorrt/lib/libnvinfer.so.10
16# 0x00007F8B35F760C1 in /usr/local/tensorrt/lib/libnvinfer.so.10
17# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
18# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
19# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
20# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
21# tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
22# tensorrt_llm::executor::Executor::Impl::executionLoop() in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
23# 0x00007F8C6D1D7253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
24# 0x00007F8C6CF66AC3 in /usr/lib/x86_64-linux-gnu/libc.so.6
25# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
Aborted (core dumped)
```