Description
System Info
GPU: NVIDIA H100 80GB HBM3
| server_version | 2.52.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I have followed the official example from:
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tensorrtllm_backend/README.html#prepare-the-model-repository
I have set up the inflight_batcher_llm model repository from here:
https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm
and I am using the OpenAI frontend in front of it:
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client_guide/openai_readme.html
- Download the model from HF:
  https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8
- Run the checkpoint conversion script:
  python3 convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
      --output_dir ${UNIFIED_CKPT_PATH} \
      --dtype float16
- Build the TensorRT-LLM engine:
  trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
      --remove_input_padding enable \
      --gpt_attention_plugin float16 \
      --context_fmha enable \
      --gemm_plugin float16 \
      --output_dir ${ENGINE_PATH} \
      --kv_cache_type paged \
      --max_batch_size 1024
- Create the model repository from the inflight_batcher_llm templates here:
  https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm
- Start Triton with the OpenAI API frontend:
  python3 /root/.cache/deps/triton_repo/server/python/openai/openai_frontend/main.py --model-repository $MODEL_REPO --tokenizer $TOKENIZER_PATH --tritonserver-log-verbose-level=1
- Use genai-perf to send concurrent requests to the ensemble model (an illustrative client doing the same is sketched right after this list):
  genai-perf profile ...
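For reference, here is a minimal sketch of how I send the same kind of concurrent load to the OpenAI-compatible frontend without genai-perf. The frontend address (http://localhost:9000) is an assumption about my setup; the "ensemble" model name matches the model referenced above. This only illustrates the load pattern, not the exact genai-perf traffic.

```python
# Illustrative only: send N concurrent chat-completion requests to the
# OpenAI-compatible frontend. The port (9000) is an assumption; the model
# name "ensemble" matches the model used in this issue.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:9000/v1/chat/completions"  # assumed frontend address
CONCURRENCY = 64

def one_request(i: int) -> int:
    payload = {
        "model": "ensemble",
        "messages": [{"role": "user", "content": f"Tell me a short fact #{i}."}],
        "max_tokens": 128,
        "stream": False,
    }
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    statuses = list(pool.map(one_request, range(CONCURRENCY)))

print(f"{statuses.count(200)} / {len(statuses)} requests succeeded")
```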
The config.pbtxt file for the tensorrt_llm model:
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 1024
model_transaction_policy {
decoupled: true
}
dynamic_batching {
preferred_batch_size: [ 1024 ]
max_queue_delay_microseconds: 1000000
}
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
allow_ragged_batch: true
optional: true
},
{
name: "encoder_input_features"
data_type: TYPE_FP16
dims: [ -1, -1 ]
allow_ragged_batch: true
optional: true
},
{
name: "encoder_output_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "input_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "request_output_len"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "num_return_sequences"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "draft_input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "decoder_input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "decoder_input_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
reshape: { shape: [ ] }
},
{
name: "draft_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "draft_acceptance_threshold"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "pad_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "stop_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "bad_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "embedding_bias"
data_type: TYPE_FP32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "beam_width"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_k"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_min"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_decay"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_reset_ids"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "early_stopping"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "min_length"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_search_diversity_rate"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "presence_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "frequency_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_context_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_generation_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_kv_cache_reuse_stats"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "exclude_input_in_output"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "stop"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "streaming"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "prompt_embedding_table"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "prompt_table_extra_ids"
data_type: TYPE_UINT64
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "prompt_vocab_size"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "cross_attention_mask"
data_type: TYPE_BOOL
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "lora_task_id"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "lora_weights"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "lora_config"
data_type: TYPE_INT32
dims: [ -1, 3 ]
optional: true
allow_ragged_batch: true
},
{
name: "context_phase_params"
data_type: TYPE_UINT8
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "skip_cross_attn_blocks"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
allow_ragged_batch: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_INT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "context_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "generation_logits"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
},
{
name: "batch_index"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "sequence_index"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "context_phase_params"
data_type: TYPE_UINT8
dims: [ -1 ]
},
{
name: "kv_cache_alloc_new_blocks"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "kv_cache_reused_blocks"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "kv_cache_alloc_total_blocks"
data_type: TYPE_INT32
dims: [ 1 ]
}
]
instance_group [
{
count: 1
kind : KIND_CPU
}
]
parameters: {
key: "max_beam_width"
value: {
string_value: "${max_beam_width}"
}
}
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {
string_value: "no"
}
}
parameters: {
key: "gpt_model_type"
value: {
string_value: "inflight_fused_batching"
}
}
parameters: {
key: "gpt_model_path"
value: {
string_value: "/root/.cache/deps/transformation/engines/llama_new1/1b/"
}
}
parameters: {
key: "encoder_model_path"
value: {
string_value: "${encoder_engine_dir}"
}
}
parameters: {
key: "max_tokens_in_paged_kv_cache"
value: {
string_value: "${max_tokens_in_paged_kv_cache}"
}
}
parameters: {
key: "max_attention_window_size"
value: {
string_value: "${max_attention_window_size}"
}
}
parameters: {
key: "sink_token_length"
value: {
string_value: "${sink_token_length}"
}
}
parameters: {
key: "batch_scheduler_policy"
value: {
string_value: "${batch_scheduler_policy}"
}
}
parameters: {
key: "kv_cache_free_gpu_mem_fraction"
value: {
string_value: "${kv_cache_free_gpu_mem_fraction}"
}
}
parameters: {
key: "cross_kv_cache_fraction"
value: {
string_value: "${cross_kv_cache_fraction}"
}
}
parameters: {
key: "kv_cache_host_memory_bytes"
value: {
string_value: "${kv_cache_host_memory_bytes}"
}
}
parameters: {
key: "kv_cache_onboard_blocks"
value: {
string_value: "${kv_cache_onboard_blocks}"
}
}
parameters: {
key: "exclude_input_in_output"
value: {
string_value: "${exclude_input_in_output}"
}
}
parameters: {
key: "cancellation_check_period_ms"
value: {
string_value: "${cancellation_check_period_ms}"
}
}
parameters: {
key: "stats_check_period_ms"
value: {
string_value: "${stats_check_period_ms}"
}
}
parameters: {
key: "iter_stats_max_iterations"
value: {
string_value: "${iter_stats_max_iterations}"
}
}
parameters: {
key: "request_stats_max_iterations"
value: {
string_value: "${request_stats_max_iterations}"
}
}
parameters: {
key: "enable_kv_cache_reuse"
value: {
string_value: "${enable_kv_cache_reuse}"
}
}
parameters: {
key: "normalize_log_probs"
value: {
string_value: "${normalize_log_probs}"
}
}
parameters: {
key: "enable_chunked_context"
value: {
string_value: "${enable_chunked_context}"
}
}
parameters: {
key: "gpu_device_ids"
value: {
string_value: "${gpu_device_ids}"
}
}
parameters: {
key: "participant_ids"
value: {
string_value: "${participant_ids}"
}
}
parameters: {
key: "lora_cache_optimal_adapter_size"
value: {
string_value: "${lora_cache_optimal_adapter_size}"
}
}
parameters: {
key: "lora_cache_max_adapter_size"
value: {
string_value: "${lora_cache_max_adapter_size}"
}
}
parameters: {
key: "lora_cache_gpu_memory_fraction"
value: {
string_value: "${lora_cache_gpu_memory_fraction}"
}
}
parameters: {
key: "lora_cache_host_memory_bytes"
value: {
string_value: "${lora_cache_host_memory_bytes}"
}
}
parameters: {
key: "decoding_mode"
value: {
string_value: "${decoding_mode}"
}
}
parameters: {
key: "executor_worker_path"
value: {
string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker"
}
}
parameters: {
key: "medusa_choices"
value: {
string_value: "${medusa_choices}"
}
}
parameters: {
key: "eagle_choices"
value: {
string_value: "${eagle_choices}"
}
}
parameters: {
key: "gpu_weights_percent"
value: {
string_value: "${gpu_weights_percent}"
}
}
parameters: {
key: "enable_context_fmha_fp32_acc"
value: {
string_value: "${enable_context_fmha_fp32_acc}"
}
}
parameters: {
key: "multi_block_mode"
value: {
string_value: "${multi_block_mode}"
}
}
parameters: {
key: "cuda_graph_mode"
value: {
string_value: "${cuda_graph_mode}"
}
}
parameters: {
key: "cuda_graph_cache_size"
value: {
string_value: "${cuda_graph_cache_size}"
}
}
parameters: {
key: "speculative_decoding_fast_logits"
value: {
string_value: "${speculative_decoding_fast_logits}"
}
}
If you need the other configs (ensemble, preprocessing, etc.), let me know.
Expected behavior
I would expect to see some batching happening, but after sending concurrent requests the metrics endpoint shows me:
nv_inference_request_success{model="ensemble",version="1"} 232
nv_inference_request_success{model="postprocessing",version="1"} 25181
nv_inference_request_success{model="preprocessing",version="1"} 232
nv_inference_request_success{model="tensorrt_llm_bls",version="1"} 0
nv_inference_request_success{model="tensorrt_llm",version="1"} 232
nv_inference_request_failure{model="ensemble",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="ensemble",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="ensemble",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="ensemble",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="preprocessing",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="preprocessing",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="preprocessing",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm_bls",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm_bls",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm_bls",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="postprocessing",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm_bls",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="postprocessing",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="preprocessing",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="postprocessing",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="postprocessing",reason="OTHER",version="1"} 0
nv_inference_count{model="ensemble",version="1"} 232
nv_inference_count{model="postprocessing",version="1"} 25181
nv_inference_count{model="preprocessing",version="1"} 232
nv_inference_count{model="tensorrt_llm_bls",version="1"} 0
nv_inference_count{model="tensorrt_llm",version="1"} 232
nv_inference_exec_count{model="ensemble",version="1"} 232
nv_inference_exec_count{model="postprocessing",version="1"} 11041
nv_inference_exec_count{model="preprocessing",version="1"} 232
nv_inference_exec_count{model="tensorrt_llm_bls",version="1"} 0
nv_inference_exec_count{model="tensorrt_llm",version="1"} 232
nv_inference_request_duration_us{model="ensemble",version="1"} 369358274
nv_inference_request_duration_us{model="postprocessing",version="1"} 63622688
nv_inference_request_duration_us{model="preprocessing",version="1"} 254386
nv_inference_request_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_request_duration_us{model="tensorrt_llm",version="1"} 368632046
nv_inference_queue_duration_us{model="ensemble",version="1"} 1293
nv_inference_queue_duration_us{model="postprocessing",version="1"} 20113756
nv_inference_queue_duration_us{model="preprocessing",version="1"} 25294
nv_inference_queue_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_queue_duration_us{model="tensorrt_llm",version="1"} 173117398
nv_inference_compute_input_duration_us{model="ensemble",version="1"} 3036172
nv_inference_compute_input_duration_us{model="postprocessing",version="1"} 999827
nv_inference_compute_input_duration_us{model="preprocessing",version="1"} 10749
nv_inference_compute_input_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_compute_input_duration_us{model="tensorrt_llm",version="1"} 2012983
nv_inference_compute_infer_duration_us{model="ensemble",version="1"} 197515178
nv_inference_compute_infer_duration_us{model="postprocessing",version="1"} 3992872
nv_inference_compute_infer_duration_us{model="preprocessing",version="1"} 121548
nv_inference_compute_infer_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_compute_infer_duration_us{model="tensorrt_llm",version="1"} 193387997
and these TensorRT-LLM metrics:
nv_cpu_memory_used_bytes 180916031488
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="context",version="1"} 0
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="scheduled",version="1"} 0
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="max",version="1"} 640
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="active",version="1"} 0
nv_trt_llm_runtime_memory_metrics{memory_type="pinned",model="tensorrt_llm",version="1"} 3556783876
nv_trt_llm_runtime_memory_metrics{memory_type="gpu",model="tensorrt_llm",version="1"} 61362490955
nv_trt_llm_runtime_memory_metrics{memory_type="cpu",model="tensorrt_llm",version="1"} 21532
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="reused",model="tensorrt_llm",version="1"} 0
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="tokens_per",model="tensorrt_llm",version="1"} 64
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm",version="1"} 0
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="free",model="tensorrt_llm",version="1"} 6605
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm",version="1"} 6605
nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="paused_requests",model="tensorrt_llm",version="1"} 0
nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="micro_batch_id",model="tensorrt_llm",version="1"} 0
nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="generation_requests",model="tensorrt_llm",version="1"} 0
nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="total_context_tokens",model="tensorrt_llm",version="1"} 0
nv_trt_llm_general_metrics{general_type="iteration_counter",model="tensorrt_llm",version="1"} 4785
nv_trt_llm_general_metrics{general_type="timestamp",model="tensorrt_llm",version="1"} 1734880624970454
Actual behavior
I would expect
nv_inference_exec_count{model="ensemble",version="1"}
to be smaller than
nv_inference_request_success{model="ensemble",version="1"}
but both are 232, i.e. each request was executed on its own and no batching happened.
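As a quick way to quantify this, the requests-per-execution ratio can be computed directly from the metrics endpoint. The sketch below assumes the default Triton metrics endpoint at http://localhost:8002/metrics; the address is an assumption and may differ when the server is launched through the OpenAI frontend.

```python
# Illustrative only: compute requests-per-execution from Triton's Prometheus
# metrics. The metrics URL is an assumption (default Triton metrics port 8002).
import re
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # assumed metrics endpoint

def scrape(metric: str, model: str, text: str) -> float:
    # Matches lines like: nv_inference_count{model="ensemble",version="1"} 232
    pattern = rf'{metric}\{{[^}}]*model="{model}"[^}}]*\}}\s+([0-9.eE+\-]+)'
    match = re.search(pattern, text)
    return float(match.group(1)) if match else 0.0

text = urllib.request.urlopen(METRICS_URL).read().decode("utf-8")
for model in ("ensemble", "tensorrt_llm", "postprocessing"):
    count = scrape("nv_inference_count", model, text)
    execs = scrape("nv_inference_exec_count", model, text)
    ratio = count / execs if execs else float("nan")
    # A ratio > 1 means several requests were grouped into one execution.
    print(f"{model}: {count:.0f} requests / {execs:.0f} executions = {ratio:.2f}")
```

On the numbers above this gives 232 / 232 = 1.00 for ensemble and tensorrt_llm (no batching), versus 25181 / 11041 ≈ 2.28 for postprocessing.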
Additional notes
Some Triton logs after server start:
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 864.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 15380 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.18 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.51 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.11 GiB, available: 57.33 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 6605
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 2048
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 51.60 GiB for max tokens in paged KV cache (422720).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
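(For reference, the numbers above are consistent: 6605 KV cache blocks × 64 tokens per block = 422,720 max tokens in the paged KV cache, matching the 51.60 GiB allocation reported.)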
Some Triton logs after server stop:
Shutting down Triton OpenAI-Compatible Frontend...
Shutting down Triton Inference Server...
I1222 18:24:43.350231 2239 server.cc:305] "Waiting for in-flight requests to complete."
I1222 18:24:43.350291 2239 server.cc:321] "Timeout 30: Found 0 model versions that have in-flight inferences"
I1222 18:24:43.351900 2239 dynamic_batch_scheduler.cc:445] "Stopping dynamic-batcher thread for postprocessing..."
I1222 18:24:43.351925 2239 server.cc:336] "All models are stopped, unloading models"
I1222 18:24:43.351948 2239 server.cc:345] "Timeout 30: Found 5 live models and 0 in-flight non-inference requests"
I1222 18:24:43.351952 2239 server.cc:351] "ensemble v1: UNLOADING"
I1222 18:24:43.351959 2239 server.cc:351] "postprocessing v1: UNLOADING"
I1222 18:24:43.351961 2239 server.cc:351] "preprocessing v1: UNLOADING"
I1222 18:24:43.351964 2239 server.cc:351] "tensorrt_llm v1: UNLOADING"
I1222 18:24:43.351967 2239 server.cc:351] "tensorrt_llm_bls v1: UNLOADING"
I1222 18:24:43.352050 2239 dynamic_batch_scheduler.cc:445] "Stopping dynamic-batcher thread for tensorrt_llm..."
I1222 18:24:43.352085 2239 backend_model_instance.cc:807] "Stopping backend thread for tensorrt_llm_bls_0_0..."
I1222 18:24:43.352109 2239 backend_model_instance.cc:807] "Stopping backend thread for postprocessing_0_1..."
I1222 18:24:43.352157 2239 model_lifecycle.cc:636] "successfully unloaded 'ensemble' version 1"
I1222 18:24:43.352168 2239 backend_model_instance.cc:807] "Stopping backend thread for preprocessing_0_2..."
I1222 18:24:43.352270 2239 python_be.cc:2387] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I1222 18:24:43.352367 2239 python_be.cc:2387] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I1222 18:24:43.352394 2239 python_be.cc:2387] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I1222 18:24:43.352402 2239 backend_model_instance.cc:807] "Stopping backend thread for tensorrt_llm_0_0..."
[TensorRT-LLM][INFO] Orchestrator sendReq thread exiting
[TensorRT-LLM][INFO] Orchestrator recv thread exiting
[TensorRT-LLM][INFO] Leader recvReq thread exiting
[TensorRT-LLM][INFO] Leader sendThread exiting
I1222 18:24:43.524265 2239 model_lifecycle.cc:636] "successfully unloaded 'tensorrt_llm' version 1"
[TensorRT-LLM][INFO] Refreshed the MPI local session
Let me know if you need any more information. I would like to understand why the
dynamic_batching {
preferred_batch_size: [ 1024 ]
max_queue_delay_microseconds: 1000000
}
block is not set by default by the example scripts, and whether I need to set it for in-flight batching.