Description
System Info
- GPU: A100
- TensorRT-LLM: v0.11.0
- tensorrtllm_backend: v0.11.0
- Image: Triton Inference Server 24.07-trtllm-python-py3
- Model: llama-7b
- LoRA: Japanese-Alpaca-LoRA-7b-v0
Who can help?
No response
Information
- The official example scripts
- My own modified scripts

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction

- Convert the checkpoint:

```shell
BASE_LLAMA_MODEL=./Qwen1.5-72B-Chat-GPTQ-Int4
python3 convert_checkpoint.py \
    --model_dir ${BASE_LLAMA_MODEL} \
    --output_dir ./tllm_checkpoint_1gpu \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group
```
- Build the engine with trtllm-build:

```shell
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint_1gpu \
    --output_dir /opt/app/triton_model_repo/tensorrt_llm/1 \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 100 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --use_paged_context_fmha enable \
    --use_custom_all_reduce disable \
    --lora_plugin float16 \
    --lora_target_modules attn_q attn_k attn_v attn_dense \
    --max_lora_rank 128
```
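Not part of the original report, but before starting Triton it can help to confirm the engine was actually built with the LoRA plugin and target modules requested above. A minimal sketch that reads the engine's `config.json`; the `build_config`/`plugin_config`/`lora_config` key layout is an assumption based on a typical TensorRT-LLM v0.11 engine config and may differ between versions:

```python
import json
import tempfile
from pathlib import Path


def lora_settings(engine_dir: str) -> dict:
    """Read the engine's config.json and pull out LoRA-related build settings.

    The key layout used here is assumed from a typical TensorRT-LLM v0.11
    engine config and may differ between versions.
    """
    cfg = json.loads(Path(engine_dir, "config.json").read_text())
    build = cfg.get("build_config", {})
    return {
        "lora_plugin": build.get("plugin_config", {}).get("lora_plugin"),
        "lora_target_modules": build.get("lora_config", {}).get("lora_target_modules", []),
        "max_lora_rank": build.get("lora_config", {}).get("max_lora_rank"),
    }


# Demo on a stand-in config; the real file lives next to the built engine,
# e.g. /opt/app/triton_model_repo/tensorrt_llm/1/config.json.
with tempfile.TemporaryDirectory() as d:
    Path(d, "config.json").write_text(json.dumps({
        "build_config": {
            "plugin_config": {"lora_plugin": "float16"},
            "lora_config": {
                "lora_target_modules": ["attn_q", "attn_k", "attn_v", "attn_dense"],
                "max_lora_rank": 128,
            },
        }
    }))
    print(lora_settings(d))
```

If `lora_plugin` comes back `None`, the engine was built without `--lora_plugin` and LoRA requests will not work.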
- Send a request with the client:

```shell
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --request-output-len 10 \
    --text "You are a travel assistant." \
    --tokenizer-dir ./Qwen1.5-72B-Chat-GPTQ-Int4 \
    --lora-path "" \
    --lora-task-id 0
```
- Start the server:

```shell
modelpath=/opt/app/triton_model_repo/
CUDA_VISIBLE_DEVICES=0 /opt/tritonserver/bin/tritonserver \
    --model-repository=$modelpath \
    --backend-config=python,shm-region-prefix-name=prefix0_
```
Execution error conditions: if the input text tokenizes to fewer than 5 tokens, the request completes normally; if it tokenizes to more than 5 tokens, execution fails.
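To pin down the exact threshold, one way is to probe the server with prompts of increasing length and record the first length that fails. A sketch with a pluggable `send` function; `send` is a stand-in for a real client call (e.g. wrapping inflight_batcher_llm_client.py), and word count is used as a rough token-count proxy:

```python
from typing import Callable, Optional


def first_failing_length(send: Callable[[str], bool], max_tokens: int = 32) -> Optional[int]:
    """Return the smallest prompt length (in words, a rough token proxy)
    for which `send` reports failure, or None if every length succeeds.

    `send` should return True on success and False (or raise) on failure.
    """
    for n in range(1, max_tokens + 1):
        prompt = " ".join(["hello"] * n)
        try:
            ok = send(prompt)
        except Exception:
            ok = False
        if not ok:
            return n
    return None


# Demo with a mock that mimics the reported behaviour (fails above 5 tokens).
mock_send = lambda prompt: len(prompt.split()) <= 5
print(first_failing_length(mock_send))  # -> 6
```

Replacing `mock_send` with a real request wrapper would confirm whether the boundary is exactly 5 tokens or depends on padding/batching.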
Expected behavior

The request completes without error.
Actual behavior

```text
terminate called after throwing an instance of 'std::runtime_error'
what(): [TensorRT-LLm Error][fpA_intB Runner] Failed to run cutlass fpA_intB gemm. Error: Error Internal
Signal (6) received.
0# 0x0000563C4BC2B80D in /opt/tritonserver/bin/tritonserver
1# 0x00007F8C6CF14520 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# pthread_kill in /usr/lib/x86_64-linux-gnu/libc.so.6
3# raise in /usr/lib/x86_64-linux-gnu/libc.so.6
4# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
5# 0x00007F8C6D19DB9E in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x00007F8C6D1A920C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
7# 0x00007F8C6D1A81E9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
8# __gxx_personality_v0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
9# 0x00007F8C6F66A884 in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
10# _Unwind_Resume in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
11# 0x00007F8B68C73CC3 in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.10
12# 0x00007F8B68C8B8A3 in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.10
13# tensorrt_llm::plugins::WeightOnlyGroupwiseQuantMatmulPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so.10
14# 0x00007F8B35FCFA8C in /usr/local/tensorrt/lib/libnvinfer.so.10
15# 0x00007F8B35F74657 in /usr/local/tensorrt/lib/libnvinfer.so.10
16# 0x00007F8B35F760C1 in /usr/local/tensorrt/lib/libnvinfer.so.10
17# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
18# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
19# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
20# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
21# tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
22# tensorrt_llm::executor::Executor::Impl::executionLoop() in /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so
23# 0x00007F8C6D1D7253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
24# 0x00007F8C6CF66AC3 in /usr/lib/x86_64-linux-gnu/libc.so.6
25# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
Aborted (core dumped)
```