Description
System Info
CPU architecture: x86_64
CPU/Host memory size: 32G
GPU properties: SM86
GPU name: NVIDIA A10
GPU memory size: 24G
Clock frequencies used: 1695MHz
Libraries
TensorRT-LLM: v0.9.0
TensorRT: 9.3.0.post12.dev1 (dpkg -l | grep nvinfer reports 8.6.3)
CUDA: 12.3
Container used: 24.04-trtllm-python-py3
NVIDIA driver version: 535.161.08
OS: Ubuntu 22.04
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
docker exec -it trtllm1 /bin/bash
mamba deactivate
mamba deactivate
# git from correct branch
git clone -b v0.9.0 https://github.com/NVIDIA/TensorRT-LLM.git
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
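# (for completeness) the checkpoint passed to trtllm-build below was produced beforehand
# with the LLaMA conversion script from the examples; a minimal sketch only, the HF model
# path, checkpoint location, and tp_size here are assumptions
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir /path/to/llama3-8B-Instruct-hf \
    --output_dir Work/TensorRT-LLM/examples/llama/tllm_checkpoint_1gpu_tp1 \
    --dtype float16 \
    --tp_size 1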
# build trt engines
cd TensorRT-LLM
trtllm-build --checkpoint_dir ../Work/TensorRT-LLM/examples/llama/tllm_checkpoint_1gpu_tp1 \
--output_dir ./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
--remove_input_padding enable \
--gpt_attention_plugin float16 --gemm_plugin float16 \
--context_fmha enable --paged_kv_cache enable \
--streamingllm enable \
--use_paged_context_fmha enable --enable_chunked_context \
--use_context_fmha_for_generation enable \
--max_input_len 512 --max_output_len 512 \
--max_batch_size 64
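# quick check (assumed output layout): the build directory should now contain rank0.engine and config.json
ls ./tmp/llama/8B/trt_engines/fp16/1-gpu/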
# copy rank0.engine & config.json
cd ../tensorrtllm_backend
cp ../TensorRT-LLM/tmp/llama/8B/trt_engines/fp16/1-gpu/* all_models/inflight_batcher_llm/tensorrt_llm/1/
# model configuration
export HF_LLAMA_MODEL=/path/to/llama3-8B-Instruct-hf
export ENGINE_PATH=/path/to/tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,repetition_penalty:0.9,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,decoding_mode:top_p,enable_chunked_context:True,batch_scheduler_policy:max_utilization,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:v1,enable_trt_overlap:True,max_queue_delay_microseconds:0
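# optional sanity check (parameter names assumed from the backend's config.pbtxt template):
# confirm the substituted values, e.g. the batch size, model type, and engine path
grep -n -E "max_batch_size|gpt_model_type|gpt_model_path" all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt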
# launch triton-server
python3 scripts/launch_triton_server.py --model_repo=all_models/inflight_batcher_llm --world_size 1
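# wait until the server is up (standard Triton HTTP health endpoint); should print 200 when ready
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready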
# send request via curl
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "what are flowers","max_tokens": 100,"bad_words":[""],"stop_words":["<|eot_id|>"]}'
Expected behavior
The model answers the question normally.
actual behavior
The output is a short phrase (usually fewer than three words) repeated over and over.

additional notes
When I simply run inference locally with TensorRT-LLM, as shown in the example from the TensorRT-LLM repository:
python3 ../run.py --tokenizer_dir ./tmp/llama/8B/ \
    --engine_dir=./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
    --input_text "How to build tensorrt engine?" \
    --max_output_len 100
The model can answer normally.