Closed
Labels: bug (Something isn't working)
Description
System Info
x86_64, Debian, GPU A100
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Compile Llama 3.1 8B Instruct into a TensorRT-LLM engine
- Fill the config template with:

```shell
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
```
- Start Triton with all_models/inflight_batcher_llm/ensemble
- Use a multi-threaded client to send multiple requests in parallel
- Check the logs
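For step 4, the kind of multi-threaded client used is roughly the sketch below. It assumes Triton's HTTP generate endpoint (`/v2/models/ensemble/generate`); the URL, prompts, and `max_tokens` value are illustrative placeholders, not taken from the report:

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

# Illustrative values; the actual client, URL, and prompts were not part of this report.
TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"
PROMPTS = [f"Question {i}: what is inflight batching?" for i in range(8)]

def build_payload(prompt: str) -> bytes:
    # Field names follow the trtllm ensemble's generate schema.
    return json.dumps({"text_input": prompt, "max_tokens": 128}).encode()

def send(prompt: str) -> str:
    req = Request(TRITON_URL, data=build_payload(prompt),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return resp.read().decode()

# Guarded so the snippet is import-safe without a running server:
# 8 concurrent requests, matching "Active Request Count":8 in the log below.
if __name__ == "__main__" and os.environ.get("RUN_TRITON_CLIENT"):
    with ThreadPoolExecutor(max_workers=8) as pool:
        for out in pool.map(send, PROMPTS):
            print(out)
```

With 8 requests in flight at once, the expectation is that inflight fused batching schedules more than one of them per iteration.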
Expected behavior
When Active Request Count is greater than 1, Scheduled Requests should also be greater than 1
Actual behavior
The log contains:
```
I1002 23:45:42.282246 136 model_instance_state.cc:1115] "{\"Active Request Count\":8,\"Iteration Counter\":6189,\"Max Request Count\":8,\"Runtime CPU Memory Usage\":3060,\"Runtime GPU Memory Usage\":1427313739,\"Runtime Pinned Memory Usage\":637534388,\"Timestamp\":\"10-02-2024 23:45:42.275675\",\"Context Requests\":0,\"Generation Requests\":1,\"MicroBatch ID\":0,\"Paused Requests\":0,\"Scheduled Requests\":1,\"Total Context Tokens\":0,\"Free KV cache blocks\":25,\"Max KV cache blocks\":40,\"Tokens per KV cache block\":64,\"Used KV cache blocks\":15,\"Reused KV cache blocks\":0}"
```
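Pulling the relevant counters out of that iteration-stats line (the JSON below is the log payload verbatim, unescaped) makes the mismatch explicit:

```python
import json

# Iteration-stats JSON exactly as printed by model_instance_state.cc above.
log_json = '{"Active Request Count":8,"Iteration Counter":6189,"Max Request Count":8,"Runtime CPU Memory Usage":3060,"Runtime GPU Memory Usage":1427313739,"Runtime Pinned Memory Usage":637534388,"Timestamp":"10-02-2024 23:45:42.275675","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":25,"Max KV cache blocks":40,"Tokens per KV cache block":64,"Used KV cache blocks":15,"Reused KV cache blocks":0}'

stats = json.loads(log_json)
active = stats["Active Request Count"]    # 8 requests are queued in the executor
scheduled = stats["Scheduled Requests"]   # but only 1 is placed in the batch
print(f"active={active} scheduled={scheduled}")  # active=8 scheduled=1
```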
Additional notes
Why is Scheduled Requests always 1?
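One thing worth checking from those same counters: with `max_tokens_in_paged_kv_cache:2560` and 64 tokens per block, the engine has only 2560 / 64 = 40 KV cache blocks in total, which matches `"Max KV cache blocks":40` in the log. If the scheduler must reserve enough free blocks to cover each request's maximum sequence length before admitting it, a small block pool could keep concurrency at 1. This is a hypothesis about the capacity scheduling policy, not a confirmed diagnosis; the arithmetic itself, using only numbers from the config and log above:

```python
# Back-of-envelope check using only values from the fill_template command and the log.
max_tokens_in_paged_kv_cache = 2560   # from the fill_template command
tokens_per_block = 64                 # "Tokens per KV cache block" in the log

max_blocks = max_tokens_in_paged_kv_cache // tokens_per_block
print(max_blocks)  # 40, matching "Max KV cache blocks":40

# In the captured iteration: 15 blocks in use by the single scheduled request, 25 free.
used_blocks, free_blocks = 15, 25
tokens_reserved = used_blocks * tokens_per_block
print(tokens_reserved)  # 960 tokens of KV cache held by that one request
assert used_blocks + free_blocks == max_blocks
```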