T5 model: Encountered an error when fetching new request: Prompt length (200) exceeds maximum input length (1) #565

@jayakommuru

Description

System Info

GPU: NVIDIA L4 (24 GB memory)
TensorRT-LLM version: v0.10.0
Container: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3

Who can help?

@byshiue @schetlur-nv

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

docker run -v /home/jaya_kommuru/:/home/jaya_kommuru/ -it --gpus=all --net=host --ipc=host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout tags/v0.10.0
cd examples/enc_dec/

git clone https://huggingface.co/google-t5/t5-small /tmp/hf_models/t5-small

export MODEL_NAME=t5-small
export MODEL_TYPE=t5 # or bart
export HF_MODEL_PATH=/tmp/hf_models/${MODEL_NAME}
export UNIFIED_CKPT_PATH=/tmp/ckpt/${MODEL_NAME}
export ENGINE_PATH=/tmp/engines/${MODEL_NAME}

python convert_checkpoint.py --model_type ${MODEL_TYPE} --model_dir ${HF_MODEL_PATH} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16
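
A quick sanity check (my own addition, not part of the official steps): confirm the conversion wrote the tp1/pp1 encoder and decoder folders that the trtllm-build commands below expect.

# Hypothetical check: verify the converted checkpoint layout before building.
import os

ckpt_root = "/tmp/ckpt/t5-small"  # same as UNIFIED_CKPT_PATH above
for sub in ("tp1/pp1/encoder", "tp1/pp1/decoder"):
    path = os.path.join(ckpt_root, sub)
    print(path, "->", "ok" if os.path.isdir(path) else "MISSING")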

trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH}/tp1/pp1/encoder --output_dir ${ENGINE_PATH}/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --max_batch_size 64 --gemm_plugin float16 --bert_attention_plugin float16 --gpt_attention_plugin float16 --context_fmha disable

trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH}/tp1/pp1/decoder --output_dir ${ENGINE_PATH}/decoder --moe_plugin disable --enable_xqa disable --max_batch_size 64 --gemm_plugin float16 --bert_attention_plugin float16 --gpt_attention_plugin float16 --context_fmha disable --max_input_len 1 --max_encoder_input_len 2048
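
Note the decoder is built with --max_input_len 1: for T5 the decoder starts generation from a single start token, while the article itself goes to the encoder, bounded by --max_encoder_input_len 2048. To double-check which limits actually got baked into each engine, here is a small sketch of my own; it assumes trtllm-build v0.10.0 records them in a config.json under a "build_config" key next to each engine.

# Inspect the build limits recorded for each engine (assumed config.json layout).
import json

for part in ("encoder", "decoder"):
    with open(f"/tmp/engines/t5-small/{part}/config.json") as f:
        build = json.load(f).get("build_config", {})
    print(part, {k: build.get(k)
                 for k in ("max_batch_size", "max_input_len", "max_encoder_input_len")})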

cd ../../../

git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git checkout tags/v0.10.0

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH}/decoder,encoder_engine_dir:${ENGINE_PATH}/encoder,max_tokens_in_paged_kv_cache:4096,max_attention_window_size:4096,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,enable_chunked_context:False,max_queue_size:0

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_MODEL_PATH},triton_max_batch_size:64,preprocessing_instance_count:1

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_MODEL_PATH},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
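
Before launching, it is worth verifying that fill_template.py actually substituted the engine paths; a leftover ${...} placeholder in the tensorrt_llm config.pbtxt would break model loading. A minimal check of my own:

# Hypothetical check: confirm the engine paths were filled into config.pbtxt.
path = "all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt"
text = open(path).read()
print("decoder path present:", "/tmp/engines/t5-small/decoder" in text)
print("encoder path present:", "/tmp/engines/t5-small/encoder" in text)
print("unfilled placeholders:", text.count("${"))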

pip install SentencePiece
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=all_models/inflight_batcher_llm/
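
Before sending the request below, the server can be polled on Triton's standard HTTP readiness endpoint (a small sketch of my own, not part of the repro):

# Wait until Triton reports ready on its standard health endpoint.
import time
import requests

for _ in range(30):
    try:
        if requests.get("http://localhost:8000/v2/health/ready").status_code == 200:
            print("server ready")
            break
    except requests.ConnectionError:
        pass
    time.sleep(2)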

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Summarize the following news article: (CNN)Following last year'\''s successful U.K. tour, Prince and 3rdEyeGirl are bringing the Hit & Run Tour to the U.S. for the first time. The first -- and so far only -- scheduled show will take place in Louisville, Kentucky, the hometown of 3rdEyeGirl drummer Hannah Welton. Slated for March 14, tickets will go on sale Monday, March 9 at 10 a.m. local time. Prince crowns dual rock charts . A venue has yet to be announced. When the Hit & Run worked its way through the U.K. in 2014, concert venues were revealed via Twitter prior to each show. Portions of the ticket sales will be donated to various Louisville charities. See the original story at Billboard.com. ©2015 Billboard. All Rights Reserved.", "max_tokens": 1024, "bad_words": "", "stop_words": ""}'
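
The same request can be sent from Python with requests, which sidesteps the shell-quoting pitfalls of the curl line above (my own equivalent, not from the docs):

# Equivalent of the curl call; json= handles all escaping of the article text.
import requests

payload = {
    "text_input": "Summarize the following news article: ...",  # full article text as in the curl command above
    "max_tokens": 1024,
    "bad_words": "",
    "stop_words": "",
}
r = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload)
print(r.status_code, r.text)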

Expected behavior

The curl request should return a generated summary of the article.

Actual behavior

Instead, it fails with the following error:
{"error":"in ensemble 'ensemble', Executor failed process requestId 4 due to the following error: Encountered an error when fetching new request: Prompt length (200) exceeds maximum input length (1). (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:201)\n1 0x7f15ca33587f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x6c387f) [0x7f15ca33587f]\n2 0x7f15cc2dcae2 tensorrt_llm::executor::Executor::Impl::executionLoop() + 722\n3 0x7f16bd1d8253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f16bd1d8253]\n4 0x7f16bcf67ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f16bcf67ac3]\n5 0x7f16bcff8a04 clone + 68"}

Additional notes

I have been following this example: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/encoder_decoder.md
