bugs in v0.10.0 version with tensorrtllm_backend #550

@white-wolf-tech

Description

System Info

CPU: X86-64, 16 cores
GPU: A100-SXM4-80GB
system: ubuntu 22.04.4
driver: 555.42.06
docker_images: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I ran load tests alternating between concurrencies of 100, 150, and 200. The service ran normally at first, but after more than an hour and a half the following errors occurred:
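For reference, the load test can be reproduced with a simple concurrent client along these lines (a minimal sketch, not my exact script; the endpoint URL and payload field names are assumptions based on the BLS model's `text_input`/`max_tokens` inputs below):

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed Triton HTTP generate endpoint for the BLS model.
TRITON_URL = "http://localhost:8000/v2/models/tensorrt_llm_bls/generate"

def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    # Field names match the "text_input" / "max_tokens" inputs in the BLS config.
    return {"text_input": prompt, "max_tokens": max_tokens}

def send_one(prompt: str) -> int:
    # Requires the `requests` package and a running Triton server.
    import requests
    resp = requests.post(TRITON_URL, json=build_payload(prompt), timeout=120)
    return resp.status_code

def run_load(prompts, concurrency: int = 100):
    # Keep `concurrency` requests in flight, as in the 100/150/200 runs.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send_one, prompts))
```

The crash appears only after sustained load, so the prompt list needs to be long enough to keep the server busy for over an hour.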

E0728 17:26:10.112496 154 model.py:106] Traceback (most recent call last):
  File "/triton_deploy/triton_server/tcl_model/inflight_batcher_llm/tensorrt_llm_bls/1/model.py", line 94, in execute
    for res in res_gen:
  File "/triton_deploy/triton_server/tcl_model/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py", line 190, in decode
    gen_response = self._generate_non_streaming(
  File "/triton_deploy/triton_server/tcl_model/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py", line 286, in _generate_non_streaming
    r = self._exec_triton_request_single(triton_req)
  File "/triton_deploy/triton_server/tcl_model/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py", line 136, in _exec_triton_request_single
    raise pb_utils.TritonModelException(responses.error().message())
c_python_backend_utils.TritonModelException: Executor failed process requestId 163510 due to the following error: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:219)
1       0x7f3a740272b5 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7f3986329b3e /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x6b7b3e) [0x7f3986329b3e]
3       0x7f3988277299 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::claimLeafBlock(tensorrt_llm::batch_manager::kv_cache_manager::KVCacheBlock&) + 25
4       0x7f398827cd9b tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::loadOrAllocateBlocks(std::__cxx11::list<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, int, int) + 283
5       0x7f398827d4fe tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::addSequence(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, int, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 366
6       0x7f398827deb6 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::addSequence(int, int, int, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 742
7       0x7f398829939e tensorrt_llm::batch_manager::RuntimeBuffers::setFromInputs(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int, int, tensorrt_llm::batch_manager::DecoderBuffers&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::rnn_state_manager::RnnStateManager*, std::map<unsigned long, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > > > > > const&, tensorrt_llm::runtime::TllmRuntime const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&) + 5198
8       0x7f398829d933 tensorrt_llm::batch_manager::RuntimeBuffers::prepareStep[abi:cxx11](std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int, int, tensorrt_llm::batch_manager::DecoderBuffers&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::rnn_state_manager::RnnStateManager*, std::map<unsigned long, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > > > > > const&, tensorrt_llm::runtime::TllmRuntime const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&) + 163
9       0x7f39882a8d3c tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupContext(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) + 172
10      0x7f39882a9090 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 544
11      0x7f39882b66a3 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1731
12      0x7f39882d9cdb tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 91
13      0x7f39882dc9eb tensorrt_llm::executor::Executor::Impl::executionLoop() + 475
14      0x7f3a7e3d8253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f3a7e3d8253]
15      0x7f3a7e167ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f3a7e167ac3]
16      0x7f3a7e1f8a04 clone + 68

The build script is:

    input_dir=/models/tcl_finetune/qwen4b/
    input_temp_dir=/models/trtllm_model/qwen4b-trt/temp
    output_dir=/models/trtllm_model/qwen4b-trt/1gpu-fp16/
    calib_dataset_path=/dataset/quant/VA/quant_VA.json

    python3 qwen_convert_trtllm.py --model_dir $input_dir \
                                --output_dir $input_temp_dir \
                                --dtype float16 \
                                --calib_dataset $calib_dataset_path

    trtllm-build --checkpoint_dir $input_temp_dir \
                --output_dir $output_dir \
                --remove_input_padding enable \
                --gemm_plugin float16 \
                --gpt_attention_plugin float16 \
                --paged_kv_cache  enable \
                --use_paged_context_fmha enable \
                --max_batch_size 128 \
                --max_input_len 2048 \
                --max_num_tokens 32768 \
                --tokens_per_block 16 \
                --use_fused_mlp \
                --context_fmha enable \
                --context_fmha_fp32_acc enable \
                --multi_block_mode enable \
                --use_custom_all_reduce enable \
                --strongly_typed
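Since the assertion fires inside the paged KV cache manager, a back-of-the-envelope check on the cache settings above may be useful (pure arithmetic from the `trtllm-build` flags; `--max_output_len` is not shown, so generation-phase blocks are not counted):

```python
# Values taken from the trtllm-build flags above.
max_batch_size = 128
max_input_len = 2048
max_num_tokens = 32768   # tokens processed per engine step
tokens_per_block = 16    # paged KV cache block size

# Blocks needed just to hold one full-length prompt (ceiling division).
blocks_per_prompt = -(-max_input_len // tokens_per_block)
print(blocks_per_prompt)  # 128 blocks per 2048-token prompt

# At max batch size, the context phase alone references this many blocks,
# before any generated tokens are accounted for.
print(max_batch_size * blocks_per_prompt)  # 16384
```

With `tokens_per_block 16` this is a large number of small blocks under sustained high concurrency, which is the regime where the `mNextBlocks.empty()` assertion fires.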

The tensorrt_llm_bls config is:

name: "tensorrt_llm_bls"
backend: "python"
max_batch_size: 8

model_transaction_policy {
  decoupled: false
}

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "max_tokens"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "bad_words"
    data_type: TYPE_STRING
    dims: [ -1 ]
    optional: true
  },
  {
    name: "stop_words"
    data_type: TYPE_STRING
    dims: [ -1 ]
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "length_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_context_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_generation_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
      name: "embedding_bias_words"
      data_type: TYPE_STRING
      dims: [ -1 ]
      optional: true
  },
  {
      name: "embedding_bias_weights"
      data_type: TYPE_FP32
      dims: [ -1 ]
      optional: true
  },
  {
      name: "num_draft_tokens",
      data_type: TYPE_INT32,
      dims: [ 1 ]
      optional: true
  },
  {
      name: "use_draft_logits",
      data_type: TYPE_BOOL,
      dims: [ 1 ]
      reshape: { shape: [ ] }
      optional: true
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }
]

parameters: {
  key: "accumulate_tokens"
  value: {
    string_value: "${accumulate_tokens}"
  }
}
parameters: {
  key: "tensorrt_llm_model_name"
  value: {
    string_value: "tensorrt_llm"
  }
}
parameters: {
  key: "tensorrt_llm_draft_model_name"
  value: {
    string_value: "${tensorrt_llm_draft_model_name}"
  }
}
instance_group [
  {
    count: 128
    kind : KIND_CPU
  }
]

The only change I made was setting the instance_group count to 128, equal to max_batch_size in the build script.

Why does this happen? Is it because I have too few CPU resources? Does the number of CPU cores need to equal 128 for the service to run normally?
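On the CPU-core question: each BLS instance is a Python model instance that mostly waits on the downstream tensorrt_llm model, so the instances are not one-core-per-instance workers. A quick way to compare instance count against the cores actually visible inside the container (a generic sketch, not Triton-specific):

```python
import os

bls_instance_count = 128  # count from the instance_group above

# Logical cores visible to this process. sched_getaffinity respects
# cgroup/affinity limits (relevant inside a Docker container); fall back
# to cpu_count() on platforms without it.
try:
    cores = len(os.sched_getaffinity(0))
except AttributeError:
    cores = os.cpu_count()

print(f"{bls_instance_count} BLS instances on {cores} cores "
      f"({bls_instance_count / cores:.1f} instances per core)")
```

Oversubscribing cores with I/O-bound BLS instances is normal; it would show up as latency, not as the KV cache assertion failure above.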

Expected behavior

The service should run normally at concurrencies of 100, 150, and 200.

actual behavior

The service hangs with the errors shown above.

additional notes

None.
