
Got repeated answers while deploying LLaMA3-Instruct-8B model in Triton server #487

@AndyZZt

System Info

CPU architecture: x86_64
CPU/Host memory size: 32G
GPU properties: SM86
GPU name: NVIDIA A10
GPU memory size: 24G
Clock frequencies used: 1695MHz

Libraries

TensorRT-LLM: v0.9.0
TensorRT: 9.3.0.post12.dev1 (dpkg -l | grep nvinfer reports 8.6.3)
CUDA: 12.3
Container used: 24.04-trtllm-python-py3
NVIDIA driver version: 535.161.08
OS: Ubuntu 22.04

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

docker exec -it trtllm1 /bin/bash
mamba deactivate
mamba deactivate

# git from correct branch
git clone -b v0.9.0 https://github.com/NVIDIA/TensorRT-LLM.git  
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git  

# build trt engines
cd TensorRT-LLM
trtllm-build --checkpoint_dir ../Work/TensorRT-LLM/examples/llama/tllm_checkpoint_1gpu_tp1 \
            --output_dir ./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
            --remove_input_padding enable \
            --gpt_attention_plugin float16 --gemm_plugin float16 \
            --context_fmha enable --paged_kv_cache enable \
            --streamingllm enable \
            --use_paged_context_fmha enable --enable_chunked_context \
            --use_context_fmha_for_generation enable \
            --max_input_len 512 --max_output_len 512 \
            --max_batch_size 64

# copy rank0.engine & config.json
cd ../tensorrtllm_backend
cp ../TensorRT-LLM/tmp/llama/8B/trt_engines/fp16/1-gpu/* all_models/inflight_batcher_llm/tensorrt_llm/1/

# model configuration
export HF_LLAMA_MODEL=/path/to/llama3-8B-Instruct-hf
export ENGINE_PATH=/path/to/tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,repetition_penalty:0.9,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,decoding_mode:top_p,enable_chunked_context:True,batch_scheduler_policy:max_utilization,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:v1,enable_trt_overlap:True,max_queue_delay_microseconds:0

# launch triton-server
python3 scripts/launch_triton_server.py --model_repo=all_models/inflight_batcher_llm --world_size 1
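
# Optional sanity check (not part of the original repro): Triton's standard HTTP
# readiness endpoint returns 200 once all models have loaded.
curl -sf localhost:8000/v2/health/ready && echo "server ready"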

# send request via curl
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "what are flowers","max_tokens": 100,"bad_words":[""],"stop_words":["<|eot_id|>"]}'
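
# Sketch, not in the original report: since tensorrt_llm_bls was configured with
# decoupled_mode:True above, that model must be queried through Triton's streaming
# endpoint (generate_stream) rather than generate.
curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate_stream -d '{"text_input": "what are flowers","max_tokens": 100,"bad_words":[""],"stop_words":["<|eot_id|>"]}'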

Expected behavior

The model should answer the question normally.

Actual behavior

The response repeats the same short phrase, usually fewer than three words, over and over.
[screenshot: repeated model output]

Additional notes

When I simply run TensorRT-LLM locally for inference, as the example in the TensorRT-LLM repository shows:

python3 ../run.py --tokenizer_dir ./tmp/llama/8B/ \
    --engine_dir ./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
    --input_text "How to build tensorrt engine?" \
    --max_output_len 100

the model answers normally.
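
Although the local run.py call above also uses a raw prompt, it may be worth re-testing the Triton path with LLaMA 3's chat template applied, since Instruct models commonly collapse into short repeated output on untemplated input. A sketch of a templated request (the special tokens below are the standard LLaMA 3 Instruct format; everything else matches the curl call from the reproduction steps):

# sketch: same generate request, with the LLaMA 3 Instruct chat template applied
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nwhat are flowers<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","max_tokens": 100,"stop_words":["<|eot_id|>"]}'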
