Whisper - Missing parameters for triton deployment using tensorrt_llm backend #672

@eleapttn

Description

System Info

Hello,

I'm trying to deploy Whisper large-v3 with Triton and the tensorrtllm backend, following this readme: https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.16.0/docs/whisper.md

Context

  • hardware: L40S
  • version of tensorrtllm_backend: v0.16.0
  • checkpoint conversion done (success)
  • TensorRT-LLM engines building done (success)

Issues

However, I run into issues at step 3 (Prepare Tritonserver configs): the following script references parameters that are never defined, so the config file cannot be filled:

python3 tools/fill_template.py -i model_repo_whisper/tensorrt_llm/config.pbtxt \
triton_backend:${BACKEND},\
engine_dir:${DECODER_ENGINE_PATH},\
encoder_engine_dir:${ENCODER_ENGINE_PATH},\
decoupled_mode:${DECOUPLED_MODE},\
max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},\
max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},\
batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},\
batching_strategy:${BATCHING_STRATEGY},\
kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},\
exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},\
triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},\
max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},\
max_beam_width:${MAX_BEAM_WIDTH},\
enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},\
normalize_log_probs:${NORMALIZE_LOG_PROBS},\
enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},\
gpu_device_ids:${GPU_DEVICE_IDS},\
decoding_mode:${DECODING_MODE},\
max_queue_size:${MAX_QUEUE_SIZE},\
enable_context_fmha_fp32_acc:${ENABLE_CONTEXT_FMHA_FP32_ACC},\
cross_kv_cache_fraction:${CROSS_KV_CACHE_FRACTION},\
encoder_input_features_data_type:TYPE_FP16

My questions are:

  • Why do we need a tensorrt_llm "model" to run the Triton server for whisper_bls?
  • If it is required, how should these parameters be set for a Whisper model?

Thank you 🙂

Who can help?

@juney-nvidia

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

In the https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.16.0/docs/whisper.md, at step 3:

BACKEND=tensorrtllm
DECOUPLED_MODE=false
DECODER_ENGINE_PATH=${output_dir}/decoder
ENCODER_ENGINE_PATH=${output_dir}/encoder
MAX_TOKENS_IN_KV_CACHE=24000
BATCHING_STRATEGY=inflight_fused_batching
KV_CACHE_FREE_GPU_MEM_FRACTION=0.5
EXCLUDE_INPUT_IN_OUTPUT=True
TRITON_MAX_BATCH_SIZE=8
MAX_QUEUE_DELAY_MICROSECONDS=0
MAX_BEAM_WIDTH=1
MAX_QUEUE_SIZE="0"
ENABLE_KV_CACHE_REUSE=false
ENABLE_CHUNKED_CONTEXT=false
CROSS_KV_CACHE_FRACTION="0.5"
n_mels=128
zero_pad=false

python3 tools/fill_template.py -i model_repo_whisper/tensorrt_llm/config.pbtxt \
triton_backend:${BACKEND},\
engine_dir:${DECODER_ENGINE_PATH},\
encoder_engine_dir:${ENCODER_ENGINE_PATH},\
decoupled_mode:${DECOUPLED_MODE},\
max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},\
max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},\
batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},\
batching_strategy:${BATCHING_STRATEGY},\
kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},\
exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},\
triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},\
max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},\
max_beam_width:${MAX_BEAM_WIDTH},\
enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},\
normalize_log_probs:${NORMALIZE_LOG_PROBS},\
enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},\
gpu_device_ids:${GPU_DEVICE_IDS},\
decoding_mode:${DECODING_MODE},\
max_queue_size:${MAX_QUEUE_SIZE},\
enable_context_fmha_fp32_acc:${ENABLE_CONTEXT_FMHA_FP32_ACC},\
cross_kv_cache_fraction:${CROSS_KV_CACHE_FRACTION},\
encoder_input_features_data_type:TYPE_FP16
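One way to see which variables the command references but step 3 never defines (in my case MAX_ATTENTION_WINDOW_SIZE, BATCH_SCHEDULER_POLICY, NORMALIZE_LOG_PROBS, GPU_DEVICE_IDS, DECODING_MODE, ENABLE_CONTEXT_FMHA_FP32_ACC) is to check each one before running the script. A minimal sketch; the variable list is copied by hand from the invocation above:

```shell
# Check every variable the fill_template.py invocation references and
# report the ones that are unset in the current shell.
missing=""
for v in BACKEND DECODER_ENGINE_PATH ENCODER_ENGINE_PATH DECOUPLED_MODE \
         MAX_TOKENS_IN_KV_CACHE MAX_ATTENTION_WINDOW_SIZE \
         BATCH_SCHEDULER_POLICY BATCHING_STRATEGY \
         KV_CACHE_FREE_GPU_MEM_FRACTION EXCLUDE_INPUT_IN_OUTPUT \
         TRITON_MAX_BATCH_SIZE MAX_QUEUE_DELAY_MICROSECONDS MAX_BEAM_WIDTH \
         ENABLE_KV_CACHE_REUSE NORMALIZE_LOG_PROBS ENABLE_CHUNKED_CONTEXT \
         GPU_DEVICE_IDS DECODING_MODE MAX_QUEUE_SIZE \
         ENABLE_CONTEXT_FMHA_FP32_ACC CROSS_KV_CACHE_FRACTION; do
    # ${VAR+set} expands to "set" only when VAR is defined (even if empty)
    eval "is_set=\${$v+set}"
    [ -n "$is_set" ] || missing="$missing $v"
done
echo "undefined:$missing"
</imports>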

Expected behavior

Variable not found when running the script:

python3 tools/fill_template.py -i model_repo_whisper/tensorrt_llm/config.pbtxt ...

Or in tritonserver logs:

[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:337] Error parsing text-format inference.ModelConfig: 105:16: Expected integer or identifier, got: $
E0102 18:16:16.688605 46342 model_repository_manager.cc:1460] "Poll failed for model directory 'tensorrt_llm': failed to read text proto from /workspace/model_repo/l40s/openai_whisper-large-v3_int8/tensorrt_llm/config.pbtxt"
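The protobuf error ("Expected integer or identifier, got: $") means at least one ${...} placeholder was left unsubstituted in the generated config.pbtxt. A quick way to list the leftovers; the heredoc file below is only a stand-in for the real config at the path in the log:

```shell
# List every ${...} placeholder left unfilled in a config.pbtxt; Triton's
# protobuf parser fails on the first "$" it hits, as in the log above.
# /tmp/config_sample.pbtxt stands in for the real generated config.
cat > /tmp/config_sample.pbtxt <<'EOF'
parameters: {
  key: "max_attention_window_size"
  value: { string_value: "${max_attention_window_size}" }
}
EOF
grep -o '\${[a-z_]*}' /tmp/config_sample.pbtxt | sort -u
```

Each name it prints is a parameter that still needs a key:value pair in the fill_template.py call.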

Actual behavior

Parameters are missing when filling config.pbtxt.

Additional notes

I tried adding the parameters below, but other parameters were still missing:

MAX_ATTENTION_WINDOW_SIZE=448
BATCH_SCHEDULER_POLICY=max_utilization
NORMALIZE_LOG_PROBS=false
GPU_DEVICE_IDS=""
DECODING_MODE=""
ENABLE_CONTEXT_FMHA_FP32_ACC=true
