T5 model: Encountered an error when fetching new request: Prompt length (200) exceeds maximum input length (1) #565

@jayakommuru

Description

System Info

GPU: NVIDIA L4 (24 GB memory)
TensorRT-LLM version: v0.10.0
Container: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3

Who can help?

@byshiue @schetlur-nv

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

docker run -v /home/jaya_kommuru/:/home/jaya_kommuru/ -it --gpus=all --net=host --ipc=host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout tags/v0.10.0
cd examples/enc_dec/

git clone https://huggingface.co/google-t5/t5-small /tmp/hf_models/t5-small

export MODEL_NAME=t5-small
export MODEL_TYPE=t5 # or bart
export HF_MODEL_PATH=/tmp/hf_models/${MODEL_NAME}
export UNIFIED_CKPT_PATH=/tmp/ckpt/${MODEL_NAME}
export ENGINE_PATH=/tmp/engines/${MODEL_NAME}

python convert_checkpoint.py --model_type ${MODEL_TYPE} --model_dir ${HF_MODEL_PATH} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16
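
A quick sanity check (my own addition, not part of the official steps): confirm the conversion wrote the tp1/pp1 encoder and decoder folders that the trtllm-build commands below expect.

# Hypothetical check: verify the converted checkpoint layout before building.
import os

ckpt_root = "/tmp/ckpt/t5-small"  # same as UNIFIED_CKPT_PATH above
for sub in ("tp1/pp1/encoder", "tp1/pp1/decoder"):
    path = os.path.join(ckpt_root, sub)
    print(path, "->", "ok" if os.path.isdir(path) else "MISSING")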

trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH}/tp1/pp1/encoder --output_dir ${ENGINE_PATH}/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --max_batch_size 64 --gemm_plugin float16 --bert_attention_plugin float16 --gpt_attention_plugin float16 --context_fmha disable

trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH}/tp1/pp1/decoder --output_dir ${ENGINE_PATH}/decoder --moe_plugin disable --enable_xqa disable --max_batch_size 64 --gemm_plugin float16 --bert_attention_plugin float16 --gpt_attention_plugin float16 --context_fmha disable --max_input_len 1 --max_encoder_input_len 2048
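
Note the decoder is built with --max_input_len 1: for T5 the decoder starts generation from a single start token, while the article itself goes to the encoder, bounded by --max_encoder_input_len 2048. To double-check which limits actually got baked into each engine, here is a small sketch of my own; it assumes trtllm-build v0.10.0 records them in a config.json under a "build_config" key next to each engine.

# Inspect the build limits recorded for each engine (assumed config.json layout).
import json

for part in ("encoder", "decoder"):
    with open(f"/tmp/engines/t5-small/{part}/config.json") as f:
        build = json.load(f).get("build_config", {})
    print(part, {k: build.get(k)
                 for k in ("max_batch_size", "max_input_len", "max_encoder_input_len")})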

cd ../../../

git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git checkout tags/v0.10.0

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH}/decoder,encoder_engine_dir:${ENGINE_PATH}/encoder,max_tokens_in_paged_kv_cache:4096,max_attention_window_size:4096,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,enable_chunked_context:False,max_queue_size:0

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_MODEL_PATH},triton_max_batch_size:64,preprocessing_instance_count:1

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_MODEL_PATH},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
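
Before launching, it is worth verifying that fill_template.py actually substituted the engine paths; a leftover ${...} placeholder in the tensorrt_llm config.pbtxt would break model loading. A minimal check of my own:

# Hypothetical check: confirm the engine paths were filled into config.pbtxt.
path = "all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt"
text = open(path).read()
print("decoder path present:", "/tmp/engines/t5-small/decoder" in text)
print("encoder path present:", "/tmp/engines/t5-small/encoder" in text)
print("unfilled placeholders:", text.count("${"))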

pip install SentencePiece
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=all_models/inflight_batcher_llm/
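
Before sending the request below, the server can be polled on Triton's standard HTTP readiness endpoint (a small sketch of my own, not part of the repro):

# Wait until Triton reports ready on its standard health endpoint.
import time
import requests

for _ in range(30):
    try:
        if requests.get("http://localhost:8000/v2/health/ready").status_code == 200:
            print("server ready")
            break
    except requests.ConnectionError:
        pass
    time.sleep(2)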

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Summarize the following news article: (CNN)Following last year'\''s successful U.K. tour, Prince and 3rdEyeGirl are bringing the Hit & Run Tour to the U.S. for the first time. The first -- and so far only -- scheduled show will take place in Louisville, Kentucky, the hometown of 3rdEyeGirl drummer Hannah Welton. Slated for March 14, tickets will go on sale Monday, March 9 at 10 a.m. local time. Prince crowns dual rock charts . A venue has yet to be announced. When the Hit & Run worked its way through the U.K. in 2014, concert venues were revealed via Twitter prior to each show. Portions of the ticket sales will be donated to various Louisville charities. See the original story at Billboard.com. ©2015 Billboard. All Rights Reserved.", "max_tokens": 1024, "bad_words": "", "stop_words": ""}'
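
The same request can be sent from Python with requests, which sidesteps the shell-quoting pitfalls of the curl line above (my own equivalent, not from the docs):

# Equivalent of the curl call; json= handles all escaping of the article text.
import requests

payload = {
    "text_input": "Summarize the following news article: ...",  # full article text as in the curl command above
    "max_tokens": 1024,
    "bad_words": "",
    "stop_words": "",
}
r = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload)
print(r.status_code, r.text)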

Expected behavior

The curl request should return a generated summary of the article.

Actual behavior

Instead, it fails with the following error:
{"error":"in ensemble 'ensemble', Executor failed process requestId 4 due to the following error: Encountered an error when fetching new request: Prompt length (200) exceeds maximum input length (1). (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:201)\n1 0x7f15ca33587f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x6c387f) [0x7f15ca33587f]\n2 0x7f15cc2dcae2 tensorrt_llm::executor::Executor::Impl::executionLoop() + 722\n3 0x7f16bd1d8253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f16bd1d8253]\n4 0x7f16bcf67ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f16bcf67ac3]\n5 0x7f16bcff8a04 clone + 68"}

Additional notes

I have been following this example: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/encoder_decoder.md
