Assertion failed: Invalid tensor name: decoder_input_lengths #508

@HowardChenRV

Description

System Info

  • docker image: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
  • tensorrt_llm: 0.9.0
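
One mismatch worth ruling out (an added note, not part of the original report): the 24.05 Triton container may bundle a newer TensorRT-LLM than the 0.9.0 release used to build this engine. The version shipped inside the container can be printed with:

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"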

Who can help?

@kaiyux @byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

docker run --rm -it --gpus all --net host --shm-size=64g \
  --ulimit stack=67108864 \
  -v /share/datasets/tmp_share/chenyonghua/models/tensorrt_engines_v0.9.0/llama/Llama-2-7b-chat_TP1/tensorrtllm_backend:/tensorrtllm_backend \
  -v /share/datasets/public_models/Llama-2-7b-chat-hf:/share/datasets/public_models/Llama-2-7b-chat-hf \
  nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 bash

cd /tensorrtllm_backend

export CUDA_VISIBLE_DEVICES=0

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo
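
Before sending the request, it can help to confirm that the server reports ready; /v2/health/ready is part of Triton's standard HTTP API, so this check (added here, not in the original steps) should print 200 once the models are loaded:

curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready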

curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
  "text_input": "How do I count to nine in French?",
  "parameters": {
    "max_tokens": 100,
    "bad_words": [""],
    "stop_words": [""]
  }
}'
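
As a further diagnostic (not in the original report), the input tensors that the deployed tensorrt_llm model actually declares can be read back through Triton's model-configuration endpoint; the model name tensorrt_llm assumes the standard triton_model_repo layout:

curl -s localhost:8000/v2/models/tensorrt_llm/config | python3 -m json.tool

If decoder_input_lengths appears among the inputs for a decoder-only model such as Llama, the config template likely comes from a different tensorrtllm_backend version than the engine.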

Expected behavior

I would expect the TensorRT engine to work with the Triton Inference Server and to get a correct response from the generate request.

actual behavior

client response:

{"error":"in ensemble 'ensemble', [TensorRT-LLM][ERROR] Assertion failed: Invalid tensor name: decoder_input_lengths (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/../tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/inferenceRequest.h:269)\n1 0x7fdc94c6eba4 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 100\n2 0x7fdc94c6ed59 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm_common.so(+0x19d59) [0x7fdc94c6ed59]\n3 0x7fdc94c75da8 triton::backend::inflight_batcher_llm::WorkItem::Initialize(TRITONBACKEND_Request*, unsigned long, bool) + 56\n4 0x7fdc94c76041 triton::backend::inflight_batcher_llm::WorkItem::WorkItem(TRITONBACKEND_Request*, bool) + 113\n5 0x7fdc94c77fb2 triton::backend::inflight_batcher_llm::WorkItemsQueue::pushBatch(std::vector<triton::backend::inflight_batcher_llm::WorkItemsQueue::RequestWrapper, std::allocatortriton::backend::inflight_batcher_llm::WorkItemsQueue::RequestWrapper >&, unsigned long, std::function<void (std::shared_ptrtriton::backend::inflight_batcher_llm::WorkItem)> const&) + 210\n6 0x7fdc94c7c130 triton::backend::inflight_batcher_llm::ModelInstanceState::enqueue(TRITONBACKEND_Request**, unsigned int) + 208\n7 0x7fdc94cc1c03 TRITONBACKEND_ModelInstanceExecute + 227\n8 0x7fdca0d19814 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1aa814) [0x7fdca0d19814]\n9 0x7fdca0d19b7b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1aab7b) [0x7fdca0d19b7b]\n10 0x7fdca0e2f76d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2c076d) [0x7fdca0e2f76d]\n11 0x7fdca0d1dfb4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1aefb4) [0x7fdca0d1dfb4]\n12 0x7fdca05dc253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fdca05dc253]\n13 0x7fdca036bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fdca036bac3]\n14 0x7fdca03fca04 clone + 68"}

server log:

I0620 09:19:22.915296 878 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA H800"
I0620 09:19:22.986074 878 metrics.cc:770] "Collecting CPU metrics"
I0620 09:19:22.986387 878 tritonserver.cc:2557]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.46.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /tensorrtllm_backend/triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| model_config_name | |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0620 09:19:22.989899 878 grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
I0620 09:19:22.990119 878 http_server.cc:4692] "Started HTTPService at 0.0.0.0:8000"
I0620 09:19:23.031496 878 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
[[ 1 1128 437 306 2302 304 14183 297 5176 29973]]
[[10]]
[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: Invalid tensor name: decoder_input_lengths (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/../tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/inferenceRequest.h:269)
1 0x7fdc94c6eba4 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2 0x7fdc94c6ed59 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm_common.so(+0x19d59) [0x7fdc94c6ed59]
3 0x7fdc94c75da8 triton::backend::inflight_batcher_llm::WorkItem::Initialize(TRITONBACKEND_Request*, unsigned long, bool) + 56
4 0x7fdc94c76041 triton::backend::inflight_batcher_llm::WorkItem::WorkItem(TRITONBACKEND_Request*, bool) + 113
5 0x7fdc94c77fb2 triton::backend::inflight_batcher_llm::WorkItemsQueue::pushBatch(std::vector<triton::backend::inflight_batcher_llm::WorkItemsQueue::RequestWrapper, std::allocator<triton::backend::inflight_batcher_llm::WorkItemsQueue::RequestWrapper> >&, unsigned long, std::function<void (std::shared_ptr<triton::backend::inflight_batcher_llm::WorkItem>)> const&) + 210
6 0x7fdc94c7c130 triton::backend::inflight_batcher_llm::ModelInstanceState::enqueue(TRITONBACKEND_Request**, unsigned int) + 208
7 0x7fdc94cc1c03 TRITONBACKEND_ModelInstanceExecute + 227
8 0x7fdca0d19814 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1aa814) [0x7fdca0d19814]
9 0x7fdca0d19b7b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1aab7b) [0x7fdca0d19b7b]
10 0x7fdca0e2f76d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2c076d) [0x7fdca0e2f76d]
11 0x7fdca0d1dfb4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1aefb4) [0x7fdca0d1dfb4]
12 0x7fdca05dc253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fdca05dc253]
13 0x7fdca036bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fdca036bac3]
14 0x7fdca03fca04 clone + 68
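
The assertion is thrown from inferenceRequest.h when the batch manager is handed an input tensor whose name it does not recognize, so the stray decoder_input_lengths most plausibly originates from the tensorrt_llm model's config.pbtxt rather than from the request body. Where it is declared can be located with (a diagnostic command; the repository path is taken from the reproduction above):

grep -rn "decoder_input_lengths" /tensorrtllm_backend/triton_model_repo/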

additional notes

  • model: llama2-7b-chat
