Closed
Labels
bug (Something isn't working)
Description
System Info
I tried the same Llama 3.1 8B model with the same inputs on Triton with tensorrtllm_backend and locally with TensorRT-LLM/examples/run.py. I set --max_output_len=1 and disabled the KV cache for both. The Triton run needs 11s to complete, but TensorRT-LLM/examples/run.py needs only 0.27s.
What may be the reason?
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int4_awq --output_dir ./tmp/llama/8B/trt_engines/int4_awq/1-gpu --gpt_attention_plugin auto --gemm_plugin auto --max_num_tokens 32768 --max_batch_size 8 --logits_dtype=float32 --gather_generation_logits --kv_cache_type=disabled
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:32768,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --run_profiling --tokenizer_dir ./llama-3.1-8b-probability-finetuned --input_file long_input.txt --no_prompt_template --output_logits_npy=logits.npy --num_beams=1
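When comparing the two paths, it helps to time several requests after a warmup so that one-time costs (engine load, tokenizer init, first-request compilation) are not attributed to either path. A minimal timing harness is sketched below; the `send_request` callable is an assumption, standing in for whatever performs one round trip (a tritonclient call for the Triton path, or an in-process generate call for run.py), and is not taken from this report:

```python
import statistics
import time


def time_requests(send_request, warmup=2, iters=5):
    """Time a zero-argument request callable, discarding warmup iterations.

    `send_request` performs one inference round trip (hypothetical; plug in
    your Triton client call or local generate call here).
    Returns (median_seconds, list_of_all_timings).
    """
    # Warmup runs absorb one-time initialization cost.
    for _ in range(warmup):
        send_request()
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        send_request()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings
```

Reporting the median over several post-warmup iterations for both paths makes an 11s-vs-0.27s gap easier to attribute: if the gap persists after warmup, it is steady-state overhead rather than startup cost.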
Expected behavior
They should have similar latency.
Actual behavior
The Triton one needs 11s to complete, but TensorRT-LLM/examples/run.py only needs 0.27s.
Additional notes
N/A