
bump version to v0.4.2 #4600

Merged 1 commit into vllm-project:main on May 5, 2024

Conversation

@simon-mo (Collaborator) commented on May 4, 2024

TL;DR on perf:

  • No perf regression from v0.4.1.
  • Chunked prefill lowers both TTFT and ITL (inter-token latency; reported as TPOT in the tables below) under high load, by batching more work per step. See the comparison at the end.

Benchmark

Setup: compare v0.4.1 and this commit on Llama 3 8B and 70B, on H100 GPUs (70B with tensor parallelism of 8, per the -tp 8 flag below).

Scripts (8B first, then 70B)

python benchmark_throughput.py --input-len 1000 --output-len 100 --model meta-llama/Meta-Llama-3-8B-Instruct
python benchmark_latency.py --model meta-llama/Meta-Llama-3-8B-Instruct
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --disable-log-requests
python benchmark_serving.py --model meta-llama/Meta-Llama-3-8B-Instruct --backend openai --host 0.0.0.0 --port 8000 --endpoint /v1/completions --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name sharegpt --request-rate 16 --num-prompts 200

python benchmark_throughput.py --input-len 1000 --output-len 100 --model /mnt/localdisk/simon-llama3-70b -tp 8
python benchmark_latency.py --model /mnt/localdisk/simon-llama3-70b -tp 8
python -m vllm.entrypoints.openai.api_server --model /mnt/localdisk/simon-llama3-70b --disable-log-requests -tp 8
python benchmark_serving.py --model /mnt/localdisk/simon-llama3-70b --backend openai --host 0.0.0.0 --port 8000 --endpoint /v1/completions --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name sharegpt --request-rate 16 --num-prompts 200
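
For reading the tables below: TPOT excludes the first token, so per request it relates to end-to-end latency and TTFT roughly as sketched here (this is the definition implied by the table labels, not necessarily benchmark_serving.py's exact code):

def tpot_ms(e2e_latency_ms: float, ttft_ms: float, n_output_tokens: int) -> float:
    """Time per output token (ms), excluding the first token.

    Rearranged: e2e_latency ≈ ttft + (n_output_tokens - 1) * tpot.
    """
    assert n_output_tokens > 1, "TPOT is undefined for single-token outputs"
    return (e2e_latency_ms - ttft_ms) / (n_output_tokens - 1)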

v0.4.1

8B

Throughput: 20.72 requests/s, 22789.42 tokens/s
Avg latency: 1.0524114217337532 seconds
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  20.24
Total input tokens:                      42659
Total generated tokens:                  41455
Request throughput (req/s):              9.88
Input token throughput (tok/s):          2107.49
Output token throughput (tok/s):         2048.01
---------------Time to First Token----------------
Mean TTFT (ms):                          25.13
Median TTFT (ms):                        23.66
P99 TTFT (ms):                           105.47
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.20
Median TPOT (ms):                        18.74
P99 TPOT (ms):                           25.95
==================================================

70B

Throughput: 14.47 requests/s, 15921.83 tokens/s
Avg latency: 1.8583771677222103 seconds
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  27.82
Total input tokens:                      42659
Total generated tokens:                  35448
Request throughput (req/s):              7.19
Input token throughput (tok/s):          1533.62
Output token throughput (tok/s):         1274.38
---------------Time to First Token----------------
Mean TTFT (ms):                          51.87
Median TTFT (ms):                        58.24
P99 TTFT (ms):                           107.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          46.65
Median TPOT (ms):                        44.87
P99 TPOT (ms):                           91.39
==================================================

v0.4.2

8B

Throughput: 20.76 requests/s, 22839.97 tokens/s
Avg latency: 1.050389444703857 seconds
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  20.02
Total input tokens:                      42659
Total generated tokens:                  41053
Request throughput (req/s):              9.99
Input token throughput (tok/s):          2131.22
Output token throughput (tok/s):         2050.99
---------------Time to First Token----------------
Mean TTFT (ms):                          24.68
Median TTFT (ms):                        23.36
P99 TTFT (ms):                           46.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.72
Median TPOT (ms):                        18.26
P99 TPOT (ms):                           24.32
==================================================

70B

Throughput: 14.52 requests/s, 15974.74 tokens/s
Avg latency: 1.883905064908322 seconds
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  27.84
Total input tokens:                      42659
Total generated tokens:                  35616
Request throughput (req/s):              7.18
Input token throughput (tok/s):          1532.43
Output token throughput (tok/s):         1279.42
---------------Time to First Token----------------
Mean TTFT (ms):                          51.50
Median TTFT (ms):                        58.76
P99 TTFT (ms):                           104.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          46.55
Median TPOT (ms):                        44.03
P99 TPOT (ms):                           97.85
==================================================

Chunked Prefill Comparison

python benchmark_serving.py --model /mnt/localdisk/simon-llama3-70b --backend openai --host 0.0.0.0 --port 8000 --endpoint /v1/completions --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name sharegpt --request-rate 32 --num-prompts 500
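
The server-launch commands for the two runs are not shown; presumably the baseline reused the api_server invocation above, and the chunked run added vLLM's --enable-chunked-prefill flag, along these lines (a sketch, not the exact command used):

python -m vllm.entrypoints.openai.api_server --model /mnt/localdisk/simon-llama3-70b --disable-log-requests -tp 8 --enable-chunked-prefill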

w/o chunked

============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  42.94
Total input tokens:                      100895
Total generated tokens:                  89101
Request throughput (req/s):              11.64
Input token throughput (tok/s):          2349.55
Output token throughput (tok/s):         2074.90
---------------Time to First Token----------------
Mean TTFT (ms):                          460.49
Median TTFT (ms):                        75.54
P99 TTFT (ms):                           5324.58
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          100.80
Median TPOT (ms):                        84.39
P99 TPOT (ms):                           244.96
==================================================

w/ chunked

============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  40.72
Total input tokens:                      100895
Total generated tokens:                  88896
Request throughput (req/s):              12.28
Input token throughput (tok/s):          2477.66
Output token throughput (tok/s):         2183.00
---------------Time to First Token----------------
Mean TTFT (ms):                          335.62
Median TTFT (ms):                        182.52
P99 TTFT (ms):                           2345.78
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          71.03
Median TPOT (ms):                        70.43
P99 TPOT (ms):                           105.58
==================================================
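
Comparing the two runs: chunked prefill cuts mean TTFT by roughly 27% (460 ms → 336 ms), P99 TTFT by roughly 56% (5.32 s → 2.35 s), and mean TPOT by roughly 30% (100.8 ms → 71.0 ms). Median TTFT rises (75.5 ms → 182.5 ms), likely because each prefill is now split across multiple scheduler steps, but tail latency improves substantially.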

@simon-mo changed the title from "bump version to v0.4.1" to "bump version to v0.4.2" on May 4, 2024
@simon-mo merged commit 8d8357c into vllm-project:main on May 5, 2024
59 checks passed
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request May 6, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request May 7, 2024
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024