Closed
Labels
bug (Something isn't working)
Description
System Info
I tried the same Llama 3.1 8B model with the same inputs on Triton with tensorrtllm_backend and locally with TensorRT-LLM/examples/run.py. I set --max_output_len=1 and disabled the KV cache for both. The Triton run needs 11s to complete, but TensorRT-LLM/examples/run.py needs only 0.27s.
What may be the reason?
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int4_awq --output_dir ./tmp/llama/8B/trt_engines/int4_awq/1-gpu --gpt_attention_plugin auto --gemm_plugin auto --max_num_tokens 32768 --max_batch_size 8 --logits_dtype=float32 --gather_generation_logits --kv_cache_type=disabled
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:32768,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --run_profiling --tokenizer_dir ./llama-3.1-8b-probability-finetuned --input_file long_input.txt --no_prompt_template --output_logits_npy=logits.npy --num_beams=1
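When comparing the two paths, it helps to time several requests after a warmup so that one-time costs (engine load, tokenizer init, first-request compilation) are not attributed to either path. A minimal timing harness is sketched below; the `send_request` callable is an assumption, standing in for whatever performs one round trip (a tritonclient call for the Triton path, or an in-process generate call for run.py), and is not taken from this report:

```python
import statistics
import time


def time_requests(send_request, warmup=2, iters=5):
    """Time a zero-argument request callable, discarding warmup iterations.

    `send_request` performs one inference round trip (hypothetical; plug in
    your Triton client call or local generate call here).
    Returns (median_seconds, list_of_all_timings).
    """
    # Warmup runs absorb one-time initialization cost.
    for _ in range(warmup):
        send_request()
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        send_request()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings
```

Reporting the median over several post-warmup iterations for both paths makes an 11s-vs-0.27s gap easier to attribute: if the gap persists after warmup, it is steady-state overhead rather than startup cost.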
Expected behavior
They should have similar latency.
Actual behavior
The Triton one needs 11s to complete, but TensorRT-LLM/examples/run.py only needs 0.27s.
Additional notes
N/A