
[Bug]: inter-token latency is lower than TPOT in serving benchmark result #6531

Open
Jeffwan opened this issue Jul 18, 2024 · 2 comments

Jeffwan commented Jul 18, 2024

Your current environment

vLLM v0.5.2. The environment is not relevant to this issue, so I will skip the collection step.

🐛 Describe the bug

I am running benchmark tests and noticed a potential problem.

The inter-token latency (ITL) is lower than TPOT. ITL takes TTFT into account, so it should be higher than TPOT; however, the data shows the opposite. I have not looked at the code yet; I will try to figure this out.
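
For reference, a minimal sketch of how I would expect the two metrics to relate for a single streamed request. The definitions below (TPOT excluding the first token, ITL averaging per-chunk gaps including the first one) are assumptions for illustration only, not the actual benchmark_serving.py code:

```python
# Hypothetical single-request example; numbers and definitions are illustrative.
request_start = 0.0
token_times = [3.6, 3.7, 3.8, 3.9, 4.0]  # arrival time (s) of each output token

ttft = token_times[0] - request_start
e2e_latency = token_times[-1] - request_start

# TPOT: time per output token excluding the first token.
tpot = (e2e_latency - ttft) / (len(token_times) - 1)

# ITL (assumed): mean gap between consecutive chunks, with the first gap
# measured from request start, i.e. including TTFT.
gaps = [token_times[0] - request_start] + [
    b - a for a, b in zip(token_times, token_times[1:])
]
itl = sum(gaps) / len(gaps)

print(f"TTFT={ttft:.2f}s TPOT={tpot:.3f}s ITL={itl:.3f}s")  # here ITL > TPOT
```

Under those assumptions ITL would always come out above TPOT, which is why the benchmark numbers below look inverted to me.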

root@fb5250e2ae4c:/workspace# python3 vllm/benchmarks/benchmark_serving.py     --backend vllm     --dataset-name sharegpt     --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json     --model meta-llama/Llama-2-7b-chat-hf     --num-prompts 200     --endpoint /v1/completions     --tokenizer meta-llama/Llama-2-7b-chat-hf     --save-result     2>&1 | tee benchmark_serving.txt
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer='meta-llama/Llama-2-7b-chat-hf', best_of=1, use_beam_search=False, num_prompts=200, sharegpt_output_len=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=True, metadata=None, result_dir=None, result_filename=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:12<00:00,  2.74it/s]
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  72.96     
Total input tokens:                      49490     
Total generated tokens:                  41078     
Request throughput (req/s):              2.74      
Input token throughput (tok/s):          678.34    
Output token throughput (tok/s):         563.04    
---------------Time to First Token----------------
Mean TTFT (ms):                          3594.18   
Median TTFT (ms):                        3685.95   
P99 TTFT (ms):                           7361.98   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          186.90    
Median TPOT (ms):                        121.63    
P99 TPOT (ms):                           966.47    
---------------Inter-token Latency----------------
Mean ITL (ms):                           121.20    
Median ITL (ms):                         92.91     
P99 ITL (ms):                            310.89    
==================================================
Jeffwan added the bug (Something isn't working) label on Jul 18, 2024

hyhuang00 commented Jul 18, 2024

Observed similar results in my experiments. It seems that TPOT is calculated with the final "[Done]" latency included, whereas ITL does not include that final latency, as shown here. I would like some more explanation of the difference between these metrics.
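
If that is what is happening, a hypothetical illustration (assumed behavior, not a reading of the actual benchmark code) of how counting the final "[Done]" chunk only in TPOT would push TPOT above ITL:

```python
# Hypothetical numbers; the metric definitions here are assumptions for
# illustration, not the actual benchmark_serving.py implementation.
ttft = 0.5                      # time to first token (s)
token_gaps = [0.1] * 9          # gaps between the 10 streamed output tokens (s)
done_gap = 1.0                  # extra wait until the final "[Done]" chunk (s)

# ITL: mean of the recorded inter-chunk gaps; no gap is recorded for "[Done]".
itl = sum(token_gaps) / len(token_gaps)              # 0.100 s

# TPOT: (end-to-end latency - TTFT) / (output tokens - 1), where the
# end-to-end latency runs until the "[Done]" chunk arrives.
e2e_latency = ttft + sum(token_gaps) + done_gap
tpot = (e2e_latency - ttft) / (10 - 1)               # ~0.211 s

print(f"ITL={itl:.3f}s  TPOT={tpot:.3f}s")           # TPOT > ITL
```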

yzlnew commented Aug 1, 2024

Can confirm in my experiments: especially for a marlin24 model, ITL is much lower than TPOT, while TPOT for marlin24 is much higher than for a normal GPTQ model with the marlin kernel.
