
bump version to v0.4.2 #4600

Merged 1 commit into vllm-project:main on May 5, 2024

Conversation

@simon-mo (Collaborator) commented on May 4, 2024

TL;DR on perf:

  • No perf regression from v0.4.1.
  • Chunked prefill lowers both TTFT and ITL (inter-token latency; reported as TPOT in the tables below) under high load, by batching more work per step. See the comparison at the end.

Benchmark

Setup: compare v0.4.1 and this commit on Llama 3 8B and 70B, on H100 GPUs (70B with tensor parallelism of 8, per the -tp 8 flag below).

Scripts (8B first, then 70B)

python benchmark_throughput.py --input-len 1000 --output-len 100 --model meta-llama/Meta-Llama-3-8B-Instruct
python benchmark_latency.py --model meta-llama/Meta-Llama-3-8B-Instruct
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --disable-log-requests
python benchmark_serving.py --model meta-llama/Meta-Llama-3-8B-Instruct --backend openai --host 0.0.0.0 --port 8000 --endpoint /v1/completions --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name sharegpt --request-rate 16 --num-prompts 200

python benchmark_throughput.py --input-len 1000 --output-len 100 --model /mnt/localdisk/simon-llama3-70b -tp 8
python benchmark_latency.py --model /mnt/localdisk/simon-llama3-70b -tp 8
python -m vllm.entrypoints.openai.api_server --model /mnt/localdisk/simon-llama3-70b --disable-log-requests -tp 8
python benchmark_serving.py --model /mnt/localdisk/simon-llama3-70b --backend openai --host 0.0.0.0 --port 8000 --endpoint /v1/completions --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name sharegpt --request-rate 16 --num-prompts 200
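
For reading the tables below: TPOT excludes the first token, so per request it relates to end-to-end latency and TTFT roughly as sketched here (this is the definition implied by the table labels, not necessarily benchmark_serving.py's exact code):

def tpot_ms(e2e_latency_ms: float, ttft_ms: float, n_output_tokens: int) -> float:
    """Time per output token (ms), excluding the first token.

    Rearranged: e2e_latency ≈ ttft + (n_output_tokens - 1) * tpot.
    """
    assert n_output_tokens > 1, "TPOT is undefined for single-token outputs"
    return (e2e_latency_ms - ttft_ms) / (n_output_tokens - 1)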

v0.4.1

8B

Throughput: 20.72 requests/s, 22789.42 tokens/s
Avg latency: 1.0524114217337532 seconds
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  20.24
Total input tokens:                      42659
Total generated tokens:                  41455
Request throughput (req/s):              9.88
Input token throughput (tok/s):          2107.49
Output token throughput (tok/s):         2048.01
---------------Time to First Token----------------
Mean TTFT (ms):                          25.13
Median TTFT (ms):                        23.66
P99 TTFT (ms):                           105.47
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.20
Median TPOT (ms):                        18.74
P99 TPOT (ms):                           25.95
==================================================

70B

Throughput: 14.47 requests/s, 15921.83 tokens/s
Avg latency: 1.8583771677222103 seconds
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  27.82
Total input tokens:                      42659
Total generated tokens:                  35448
Request throughput (req/s):              7.19
Input token throughput (tok/s):          1533.62
Output token throughput (tok/s):         1274.38
---------------Time to First Token----------------
Mean TTFT (ms):                          51.87
Median TTFT (ms):                        58.24
P99 TTFT (ms):                           107.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          46.65
Median TPOT (ms):                        44.87
P99 TPOT (ms):                           91.39
==================================================

v0.4.2

8B

Throughput: 20.76 requests/s, 22839.97 tokens/s
Avg latency: 1.050389444703857 seconds
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  20.02
Total input tokens:                      42659
Total generated tokens:                  41053
Request throughput (req/s):              9.99
Input token throughput (tok/s):          2131.22
Output token throughput (tok/s):         2050.99
---------------Time to First Token----------------
Mean TTFT (ms):                          24.68
Median TTFT (ms):                        23.36
P99 TTFT (ms):                           46.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.72
Median TPOT (ms):                        18.26
P99 TPOT (ms):                           24.32
==================================================

70B

Throughput: 14.52 requests/s, 15974.74 tokens/s
Avg latency: 1.883905064908322 seconds
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  27.84
Total input tokens:                      42659
Total generated tokens:                  35616
Request throughput (req/s):              7.18
Input token throughput (tok/s):          1532.43
Output token throughput (tok/s):         1279.42
---------------Time to First Token----------------
Mean TTFT (ms):                          51.50
Median TTFT (ms):                        58.76
P99 TTFT (ms):                           104.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          46.55
Median TPOT (ms):                        44.03
P99 TPOT (ms):                           97.85
==================================================

Chunked Prefill Comparison

python benchmark_serving.py --model /mnt/localdisk/simon-llama3-70b --backend openai --host 0.0.0.0 --port 8000 --endpoint /v1/completions --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name sharegpt --request-rate 32 --num-prompts 500
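
The server-launch commands for the two runs are not shown; presumably the baseline reused the api_server invocation above, and the chunked run added vLLM's --enable-chunked-prefill flag, along these lines (a sketch, not the exact command used):

python -m vllm.entrypoints.openai.api_server --model /mnt/localdisk/simon-llama3-70b --disable-log-requests -tp 8 --enable-chunked-prefill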

w/o chunked

============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  42.94
Total input tokens:                      100895
Total generated tokens:                  89101
Request throughput (req/s):              11.64
Input token throughput (tok/s):          2349.55
Output token throughput (tok/s):         2074.90
---------------Time to First Token----------------
Mean TTFT (ms):                          460.49
Median TTFT (ms):                        75.54
P99 TTFT (ms):                           5324.58
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          100.80
Median TPOT (ms):                        84.39
P99 TPOT (ms):                           244.96
==================================================

w/ chunked

============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  40.72
Total input tokens:                      100895
Total generated tokens:                  88896
Request throughput (req/s):              12.28
Input token throughput (tok/s):          2477.66
Output token throughput (tok/s):         2183.00
---------------Time to First Token----------------
Mean TTFT (ms):                          335.62
Median TTFT (ms):                        182.52
P99 TTFT (ms):                           2345.78
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          71.03
Median TPOT (ms):                        70.43
P99 TPOT (ms):                           105.58
==================================================
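
Comparing the two runs: chunked prefill cuts mean TTFT by roughly 27% (460 ms → 336 ms), P99 TTFT by roughly 56% (5.32 s → 2.35 s), and mean TPOT by roughly 30% (100.8 ms → 71.0 ms). Median TTFT rises (75.5 ms → 182.5 ms), likely because each prefill is now split across multiple scheduler steps, but tail latency improves substantially.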

@simon-mo changed the title from "bump version to v0.4.1" to "bump version to v0.4.2" on May 4, 2024
@simon-mo merged commit 8d8357c into vllm-project:main on May 5, 2024
59 checks passed
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request May 6, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request May 7, 2024
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024