Status: Closed
Labels: bug (Something isn't working)
Description
System Info
CPU: x86-64, 16 cores
GPU: A100-SXM4-80GB
OS: Ubuntu 22.04.4
Driver: 555.42.06
Docker image: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I ran alternating load tests at concurrencies of 100, 150, and 200. The program ran normally at first, but after more than an hour and a half the following errors occurred:
E0728 17:26:10.112496 154 model.py:106] Traceback (most recent call last):
File "/triton_deploy/triton_server/tcl_model/inflight_batcher_llm/tensorrt_llm_bls/1/model.py", line 94, in execute
for res in res_gen:
File "/triton_deploy/triton_server/tcl_model/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py", line 190, in decode
gen_response = self._generate_non_streaming(
File "/triton_deploy/triton_server/tcl_model/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py", line 286, in _generate_non_streaming
r = self._exec_triton_request_single(triton_req)
File "/triton_deploy/triton_server/tcl_model/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py", line 136, in _exec_triton_request_single
raise pb_utils.TritonModelException(responses.error().message())
c_python_backend_utils.TritonModelException: Executor failed process requestId 163510 due to the following error: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:219)
1 0x7f3a740272b5 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2 0x7f3986329b3e /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x6b7b3e) [0x7f3986329b3e]
3 0x7f3988277299 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::claimLeafBlock(tensorrt_llm::batch_manager::kv_cache_manager::KVCacheBlock&) + 25
4 0x7f398827cd9b tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::loadOrAllocateBlocks(std::__cxx11::list<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, int, int) + 283
5 0x7f398827d4fe tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::addSequence(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, int, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 366
6 0x7f398827deb6 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::addSequence(int, int, int, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 742
7 0x7f398829939e tensorrt_llm::batch_manager::RuntimeBuffers::setFromInputs(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int, int, tensorrt_llm::batch_manager::DecoderBuffers&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::rnn_state_manager::RnnStateManager*, std::map<unsigned long, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > > > > > const&, tensorrt_llm::runtime::TllmRuntime const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&) + 5198
8 0x7f398829d933 tensorrt_llm::batch_manager::RuntimeBuffers::prepareStep[abi:cxx11](std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int, int, tensorrt_llm::batch_manager::DecoderBuffers&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::rnn_state_manager::RnnStateManager*, std::map<unsigned long, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > > > > > const&, tensorrt_llm::runtime::TllmRuntime const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&) + 163
9 0x7f39882a8d3c tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupContext(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) + 172
10 0x7f39882a9090 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 544
11 0x7f39882b66a3 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1731
12 0x7f39882d9cdb tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 91
13 0x7f39882dc9eb tensorrt_llm::executor::Executor::Impl::executionLoop() + 475
14 0x7f3a7e3d8253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f3a7e3d8253]
15 0x7f3a7e167ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f3a7e167ac3]
16 0x7f3a7e1f8a04 clone + 68

The build script is:
input_dir=/models/tcl_finetune/qwen4b/
input_temp_dir=/models/trtllm_model/qwen4b-trt/temp
output_dir=/models/trtllm_model/qwen4b-trt/1gpu-fp16/
calib_dataset_path=/dataset/quant/VA/quant_VA.json
python3 qwen_convert_trtllm.py --model_dir $input_dir \
--output_dir $input_temp_dir \
--dtype float16 \
--calib_dataset $calib_dataset_path
trtllm-build --checkpoint_dir $input_temp_dir \
--output_dir $output_dir \
--remove_input_padding enable \
--gemm_plugin float16 \
--gpt_attention_plugin float16 \
--paged_kv_cache enable \
--use_paged_context_fmha enable \
--max_batch_size 128 \
--max_input_len 2048 \
--max_num_tokens 32768 \
--tokens_per_block 16 \
--use_fused_mlp \
--context_fmha enable \
--context_fmha_fp32_acc enable \
--multi_block_mode enable \
--use_custom_all_reduce enable \
--strongly_typed
The tensorrt_llm_bls config.pbtxt is:

name: "tensorrt_llm_bls"
backend: "python"
max_batch_size: 8
model_transaction_policy {
decoupled: false
}
input [
{
name: "text_input"
data_type: TYPE_STRING
dims: [ -1 ]
},
{
name: "max_tokens"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "bad_words"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "stop_words"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "end_id"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "pad_id"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "top_k"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "top_p"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "length_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "min_length"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "presence_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "frequency_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
optional: true
},
{
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_context_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_generation_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_width"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "stream"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
},
{
name: "prompt_embedding_table"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
},
{
name: "prompt_vocab_size"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "embedding_bias_words"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "embedding_bias_weights"
data_type: TYPE_FP32
dims: [ -1 ]
optional: true
},
{
name: "num_draft_tokens",
data_type: TYPE_INT32,
dims: [ 1 ]
optional: true
},
{
name: "use_draft_logits",
data_type: TYPE_BOOL,
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
}
]
output [
{
name: "text_output"
data_type: TYPE_STRING
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "context_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "generation_logits"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
}
]
parameters: {
key: "accumulate_tokens"
value: {
string_value: "${accumulate_tokens}"
}
}
parameters: {
key: "tensorrt_llm_model_name"
value: {
string_value: "tensorrt_llm"
}
}
parameters: {
key: "tensorrt_llm_draft_model_name"
value: {
string_value: "${tensorrt_llm_draft_model_name}"
}
}
instance_group [
{
count: 128
kind : KIND_CPU
}
]

The only change I made was setting the instance_group count to 128, equal to max_batch_size in the build script.
Why does this error occur? Is it because I have too few CPU resources? Does the number of CPU cores need to equal 128 for the service to run normally?
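For reference, my rough mental model of why the instance count shouldn't have to match the core count: a BLS instance spends most of its time blocked waiting on the downstream tensorrt_llm model, and a blocked instance does not occupy a core. A toy sketch in plain Python (not Triton internals; `fake_bls_instance` and the 0.1 s sleep are stand-ins for an instance awaiting the downstream model):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_bls_instance(_):
    # Stand-in for a BLS instance blocked on the downstream tensorrt_llm
    # model; while sleeping/waiting it consumes essentially no CPU.
    time.sleep(0.1)
    return "done"

start = time.time()
# 64 "instances" waiting concurrently, far more than 16 cores:
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(fake_bls_instance, range(64)))
elapsed = time.time() - start

# All 64 waits overlap, so total wall time stays near 0.1 s
# regardless of how many cores the machine has.
print(f"{len(results)} waits finished in {elapsed:.2f}s")
```

If this model is right, 16 cores can service far more than 16 mostly-idle instances, and the failure would be something else breaking first (here, the KV-cache manager) rather than CPU starvation.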
Expected behavior
The service keeps running normally at concurrencies of 100, 150, and 200.
Actual behavior
The service hangs, emitting the errors above.
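For context, the failing assertion fires inside the KV-cache manager's block allocator. I can't tell whether it is plain block exhaustion or a block-reuse bookkeeping issue, but a back-of-envelope sketch shows how quickly paged-KV block demand grows with concurrency. Only tokens_per_block and max_input_len come from my build flags; the 512-token output length and the per-level demand figures are my own assumptions for illustration:

```python
import math

TOKENS_PER_BLOCK = 16   # from --tokens_per_block 16
MAX_INPUT_LEN = 2048    # from --max_input_len 2048

def blocks_needed(input_len: int, output_len: int) -> int:
    """KV-cache blocks one sequence occupies once fully generated."""
    return math.ceil((input_len + output_len) / TOKENS_PER_BLOCK)

# Worst-case single request: full 2048-token input plus an assumed
# 512 generated tokens.
per_request = blocks_needed(MAX_INPUT_LEN, 512)  # 160 blocks

# Aggregate worst-case demand at the tested concurrency levels:
for concurrency in (100, 150, 200):
    print(f"concurrency {concurrency}: up to {concurrency * per_request} blocks")
```

Whether the worst case (32,000 blocks at concurrency 200 under these assumptions) exceeds the pool depends on free GPU memory and the kv_cache_free_gpu_mem_fraction setting, which I have not pinned down here.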
Additional notes
None.