Assertion failed: Invalid tensor name: decoder_input_lengths #508

@HowardChenRV

Description

System Info

  • docker image: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
  • tensorrt_llm: 0.9.0
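
One mismatch worth ruling out (an added note, not part of the original report): the 24.05 Triton container may bundle a newer TensorRT-LLM than the 0.9.0 release used to build this engine. The version shipped inside the container can be printed with:

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"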

Who can help?

@kaiyux @byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

docker run --rm -it --gpus all --net host --shm-size=64g \
  --ulimit stack=67108864 \
  -v /share/datasets/tmp_share/chenyonghua/models/tensorrt_engines_v0.9.0/llama/Llama-2-7b-chat_TP1/tensorrtllm_backend:/tensorrtllm_backend \
  -v /share/datasets/public_models/Llama-2-7b-chat-hf:/share/datasets/public_models/Llama-2-7b-chat-hf \
  nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 bash

cd /tensorrtllm_backend

export CUDA_VISIBLE_DEVICES=0

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo
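
Before sending the request, it can help to confirm that the server reports ready; /v2/health/ready is part of Triton's standard HTTP API, so this check (added here, not in the original steps) should print 200 once the models are loaded:

curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready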

curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
  "text_input": "How do I count to nine in French?",
  "parameters": {
    "max_tokens": 100,
    "bad_words": [""],
    "stop_words": [""]
  }
}'
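
As a further diagnostic (not in the original report), the input tensors that the deployed tensorrt_llm model actually declares can be read back through Triton's model-configuration endpoint; the model name tensorrt_llm assumes the standard triton_model_repo layout:

curl -s localhost:8000/v2/models/tensorrt_llm/config | python3 -m json.tool

If decoder_input_lengths appears among the inputs for a decoder-only model such as Llama, the config template likely comes from a different tensorrtllm_backend version than the engine.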

Expected behavior

I would expect the TensorRT engine to work with the Triton Inference Server and to get a correct response from the generate request.

actual behavior

client response:

{"error":"in ensemble 'ensemble', [TensorRT-LLM][ERROR] Assertion failed: Invalid tensor name: decoder_input_lengths (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/../tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/inferenceRequest.h:269)\n1 0x7fdc94c6eba4 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 100\n2 0x7fdc94c6ed59 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm_common.so(+0x19d59) [0x7fdc94c6ed59]\n3 0x7fdc94c75da8 triton::backend::inflight_batcher_llm::WorkItem::Initialize(TRITONBACKEND_Request*, unsigned long, bool) + 56\n4 0x7fdc94c76041 triton::backend::inflight_batcher_llm::WorkItem::WorkItem(TRITONBACKEND_Request*, bool) + 113\n5 0x7fdc94c77fb2 triton::backend::inflight_batcher_llm::WorkItemsQueue::pushBatch(std::vector<triton::backend::inflight_batcher_llm::WorkItemsQueue::RequestWrapper, std::allocatortriton::backend::inflight_batcher_llm::WorkItemsQueue::RequestWrapper >&, unsigned long, std::function<void (std::shared_ptrtriton::backend::inflight_batcher_llm::WorkItem)> const&) + 210\n6 0x7fdc94c7c130 triton::backend::inflight_batcher_llm::ModelInstanceState::enqueue(TRITONBACKEND_Request**, unsigned int) + 208\n7 0x7fdc94cc1c03 TRITONBACKEND_ModelInstanceExecute + 227\n8 0x7fdca0d19814 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1aa814) [0x7fdca0d19814]\n9 0x7fdca0d19b7b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1aab7b) [0x7fdca0d19b7b]\n10 0x7fdca0e2f76d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2c076d) [0x7fdca0e2f76d]\n11 0x7fdca0d1dfb4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1aefb4) [0x7fdca0d1dfb4]\n12 0x7fdca05dc253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fdca05dc253]\n13 0x7fdca036bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fdca036bac3]\n14 0x7fdca03fca04 clone + 68"}

server log:

I0620 09:19:22.915296 878 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA H800"
I0620 09:19:22.986074 878 metrics.cc:770] "Collecting CPU metrics"
I0620 09:19:22.986387 878 tritonserver.cc:2557]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.46.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /tensorrtllm_backend/triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| model_config_name | |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0620 09:19:22.989899 878 grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
I0620 09:19:22.990119 878 http_server.cc:4692] "Started HTTPService at 0.0.0.0:8000"
I0620 09:19:23.031496 878 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
[[ 1 1128 437 306 2302 304 14183 297 5176 29973]]
[[10]]
[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: Invalid tensor name: decoder_input_lengths (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/../tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/inferenceRequest.h:269)
1 0x7fdc94c6eba4 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2 0x7fdc94c6ed59 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm_common.so(+0x19d59) [0x7fdc94c6ed59]
3 0x7fdc94c75da8 triton::backend::inflight_batcher_llm::WorkItem::Initialize(TRITONBACKEND_Request*, unsigned long, bool) + 56
4 0x7fdc94c76041 triton::backend::inflight_batcher_llm::WorkItem::WorkItem(TRITONBACKEND_Request*, bool) + 113
5 0x7fdc94c77fb2 triton::backend::inflight_batcher_llm::WorkItemsQueue::pushBatch(std::vector<triton::backend::inflight_batcher_llm::WorkItemsQueue::RequestWrapper, std::allocator<triton::backend::inflight_batcher_llm::WorkItemsQueue::RequestWrapper> >&, unsigned long, std::function<void (std::shared_ptr<triton::backend::inflight_batcher_llm::WorkItem>)> const&) + 210
6 0x7fdc94c7c130 triton::backend::inflight_batcher_llm::ModelInstanceState::enqueue(TRITONBACKEND_Request**, unsigned int) + 208
7 0x7fdc94cc1c03 TRITONBACKEND_ModelInstanceExecute + 227
8 0x7fdca0d19814 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1aa814) [0x7fdca0d19814]
9 0x7fdca0d19b7b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1aab7b) [0x7fdca0d19b7b]
10 0x7fdca0e2f76d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2c076d) [0x7fdca0e2f76d]
11 0x7fdca0d1dfb4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1aefb4) [0x7fdca0d1dfb4]
12 0x7fdca05dc253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fdca05dc253]
13 0x7fdca036bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fdca036bac3]
14 0x7fdca03fca04 clone + 68
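
The assertion is thrown from inferenceRequest.h when the batch manager is handed an input tensor whose name it does not recognize, so the stray decoder_input_lengths most plausibly originates from the tensorrt_llm model's config.pbtxt rather than from the request body. Where it is declared can be located with (a diagnostic command; the repository path is taken from the reproduction above):

grep -rn "decoder_input_lengths" /tensorrtllm_backend/triton_model_repo/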

additional notes

  • model: llama2-7b-chat
