Description
System Info
CPU architecture: x86_64
CPU/Host memory size: 32G
GPU properties: SM86
GPU name: NVIDIA A10
GPU memory size: 24G
Clock frequencies used: 1695MHz
Libraries
TensorRT-LLM: v0.9.0
TensorRT: 9.3.0.post12.dev1 (dpkg -l | grep nvinfer reports 8.6.3)
CUDA: 12.3
Container used: 24.04-trtllm-python-py3
NVIDIA driver version: 535.161.08
OS: Ubuntu 22.04
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
docker exec -it trtllm1 /bin/bash
mamba deactivate
mamba deactivate
# git from correct branch
git clone -b v0.9.0 https://github.com/NVIDIA/TensorRT-LLM.git
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
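# (for completeness) the checkpoint passed to trtllm-build below was produced beforehand
# with the LLaMA conversion script from the examples; a minimal sketch only, the HF model
# path, checkpoint location, and tp_size here are assumptions
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir /path/to/llama3-8B-Instruct-hf \
    --output_dir Work/TensorRT-LLM/examples/llama/tllm_checkpoint_1gpu_tp1 \
    --dtype float16 \
    --tp_size 1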
# build trt engines
cd TensorRT-LLM
trtllm-build --checkpoint_dir ../Work/TensorRT-LLM/examples/llama/tllm_checkpoint_1gpu_tp1 \
--output_dir ./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
--remove_input_padding enable \
--gpt_attention_plugin float16 --gemm_plugin float16 \
--context_fmha enable --paged_kv_cache enable \
--streamingllm enable \
--use_paged_context_fmha enable --enable_chunked_context \
--use_context_fmha_for_generation enable \
--max_input_len 512 --max_output_len 512 \
--max_batch_size 64
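# quick check (assumed output layout): the build directory should now contain rank0.engine and config.json
ls ./tmp/llama/8B/trt_engines/fp16/1-gpu/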
# copy rank0.engine & config.json
cd ../tensorrtllm_backend
cp ../TensorRT-LLM/tmp/llama/8B/trt_engines/fp16/1-gpu/* all_models/inflight_batcher_llm/tensorrt_llm/1/
# model configuration
export HF_LLAMA_MODEL=/path/to/llama3-8B-Instruct-hf
export ENGINE_PATH=/path/to/tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,repetition_penalty:0.9,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,decoding_mode:top_p,enable_chunked_context:True,batch_scheduler_policy:max_utilization,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:v1,enable_trt_overlap:True,max_queue_delay_microseconds:0
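# optional sanity check (parameter names assumed from the backend's config.pbtxt template):
# confirm the substituted values, e.g. the batch size, model type, and engine path
grep -n -E "max_batch_size|gpt_model_type|gpt_model_path" all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt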
# launch triton-server
python3 scripts/launch_triton_server.py --model_repo=all_models/inflight_batcher_llm --world_size 1
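# wait until the server is up (standard Triton HTTP health endpoint); should print 200 when ready
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready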
# send request via curl
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "what are flowers","max_tokens": 100,"bad_words":[""],"stop_words":["<|eot_id|>"]}'
Expected behavior
The model answers the question normally.
actual behavior
The output is a short phrase (usually fewer than three words) repeated over and over.

additional notes
When I simply run inference locally with TensorRT-LLM, as shown in the example from the TensorRT-LLM repository:
python3 ../run.py --tokenizer_dir ./tmp/llama/8B/ \
    --engine_dir=./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
    --input_text "How to build tensorrt engine?" \
    --max_output_len 100
The model can answer normally.