With same engine, trtllm backend is 40x slower than TensorRT-LLM/examples/run.py #642

@ShuaiShao93

Description

System Info

I tried the same Llama 3.1 8B model and the same inputs on Triton with tensorrtllm_backend and locally with TensorRT-LLM/examples/run.py. I set --max_output_len=1 and disabled the KV cache for both. The Triton run takes 11 s to complete, but TensorRT-LLM/examples/run.py takes only 0.27 s.

What may be the reason?

Who can help?

@kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int4_awq \
    --output_dir ./tmp/llama/8B/trt_engines/int4_awq/1-gpu \
    --gpt_attention_plugin auto \
    --gemm_plugin auto \
    --max_num_tokens 32768 \
    --max_batch_size 8 \
    --logits_dtype=float32 \
    --gather_generation_logits \
    --kv_cache_type=disabled

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:32768,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

python3 TensorRT-LLM/examples/run.py \
    --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu \
    --max_output_len 1 \
    --run_profiling \
    --tokenizer_dir ./llama-3.1-8b-probability-finetuned \
    --input_file long_input.txt \
    --no_prompt_template \
    --output_logits_npy=logits.npy \
    --num_beams=1
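For what it's worth, both paths were timed end-to-end the same way; a minimal sketch of the measurement is below. `send_request` is a placeholder, not code from either client: substitute the actual Triton client call or the local run.py invocation when reproducing the comparison.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Placeholder for either backend's request function; the sleep stands in
# for the actual inference call when comparing Triton vs. run.py.
def send_request():
    time.sleep(0.01)
    return "ok"


result, elapsed = timed(send_request)
print(f"{result}: {elapsed:.3f}s")
```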

Expected behavior

Both paths should have similar latency for the same engine and inputs.

Actual behavior

The Triton run takes 11 s to complete, but TensorRT-LLM/examples/run.py takes only 0.27 s.

Additional notes

N/A

Labels

    bug (Something isn't working)
