
[Bug]: inter-token latency is lower than TPOT in serving benchmark result #6531

Open
Jeffwan opened this issue Jul 18, 2024 · 2 comments

Jeffwan commented Jul 18, 2024

Your current environment

vLLM v0.5.2. The environment is not relevant to this issue, so I will skip the collection step.

🐛 Describe the bug

I am running benchmark tests and noticed a potential problem.

The inter-token latency (ITL) is lower than TPOT. ITL takes TTFT into account, so it should be higher than TPOT; however, the data shows the opposite. I have not looked at the code yet; I will try to figure this out.
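
For reference, a minimal sketch of how I would expect the two metrics to relate for a single streamed request. The definitions below (TPOT excluding the first token, ITL averaging per-chunk gaps including the first one) are assumptions for illustration only, not the actual benchmark_serving.py code:

```python
# Hypothetical single-request example; numbers and definitions are illustrative.
request_start = 0.0
token_times = [3.6, 3.7, 3.8, 3.9, 4.0]  # arrival time (s) of each output token

ttft = token_times[0] - request_start
e2e_latency = token_times[-1] - request_start

# TPOT: time per output token excluding the first token.
tpot = (e2e_latency - ttft) / (len(token_times) - 1)

# ITL (assumed): mean gap between consecutive chunks, with the first gap
# measured from request start, i.e. including TTFT.
gaps = [token_times[0] - request_start] + [
    b - a for a, b in zip(token_times, token_times[1:])
]
itl = sum(gaps) / len(gaps)

print(f"TTFT={ttft:.2f}s TPOT={tpot:.3f}s ITL={itl:.3f}s")  # here ITL > TPOT
```

Under those assumptions ITL would always come out above TPOT, which is why the benchmark numbers below look inverted to me.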

root@fb5250e2ae4c:/workspace# python3 vllm/benchmarks/benchmark_serving.py     --backend vllm     --dataset-name sharegpt     --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json     --model meta-llama/Llama-2-7b-chat-hf     --num-prompts 200     --endpoint /v1/completions     --tokenizer meta-llama/Llama-2-7b-chat-hf     --save-result     2>&1 | tee benchmark_serving.txt
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer='meta-llama/Llama-2-7b-chat-hf', best_of=1, use_beam_search=False, num_prompts=200, sharegpt_output_len=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=True, metadata=None, result_dir=None, result_filename=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:12<00:00,  2.74it/s]
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  72.96     
Total input tokens:                      49490     
Total generated tokens:                  41078     
Request throughput (req/s):              2.74      
Input token throughput (tok/s):          678.34    
Output token throughput (tok/s):         563.04    
---------------Time to First Token----------------
Mean TTFT (ms):                          3594.18   
Median TTFT (ms):                        3685.95   
P99 TTFT (ms):                           7361.98   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          186.90    
Median TPOT (ms):                        121.63    
P99 TPOT (ms):                           966.47    
---------------Inter-token Latency----------------
Mean ITL (ms):                           121.20    
Median ITL (ms):                         92.91     
P99 ITL (ms):                            310.89    
==================================================
Jeffwan added the bug (Something isn't working) label on Jul 18, 2024

hyhuang00 commented Jul 18, 2024

Observed similar results in my experiments. It seems that TPOT is calculated with the final "[Done]" latency included, whereas ITL does not include that final latency, as shown here. I would like some more explanation of the difference between these metrics.
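
If that is what is happening, a hypothetical illustration (assumed behavior, not a reading of the actual benchmark code) of how counting the final "[Done]" chunk only in TPOT would push TPOT above ITL:

```python
# Hypothetical numbers; the metric definitions here are assumptions for
# illustration, not the actual benchmark_serving.py implementation.
ttft = 0.5                      # time to first token (s)
token_gaps = [0.1] * 9          # gaps between the 10 streamed output tokens (s)
done_gap = 1.0                  # extra wait until the final "[Done]" chunk (s)

# ITL: mean of the recorded inter-chunk gaps; no gap is recorded for "[Done]".
itl = sum(token_gaps) / len(token_gaps)              # 0.100 s

# TPOT: (end-to-end latency - TTFT) / (output tokens - 1), where the
# end-to-end latency runs until the "[Done]" chunk arrives.
e2e_latency = ttft + sum(token_gaps) + done_gap
tpot = (e2e_latency - ttft) / (10 - 1)               # ~0.211 s

print(f"ITL={itl:.3f}s  TPOT={tpot:.3f}s")           # TPOT > ITL
```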

yzlnew commented Aug 1, 2024

Can confirm in my experiments: especially for a marlin24 model, ITL is much lower than TPOT, while TPOT for marlin24 is much higher than for a normal GPTQ model with the marlin kernel.
