
Serving Benchmark Refactoring #2433

Merged
merged 40 commits into vllm-project:main from benchmark-refactor
Feb 13, 2024

Conversation

ywang96
Collaborator

@ywang96 ywang96 commented Jan 13, 2024

The goal of this PR is to refactor the current online serving benchmark script to make it easier to use and contribute to as well as include more features. Some major items are

  • Refactor backend query function out of main benchmark script so adding more backend support is easier/cleaner.
    • TGI
    • vLLM
    • TensorRT-LLM
    • OpenAI Completions
    • DeepSpeed-MII
  • Token-level throughput information instead of request-level
  • Add median/P99 TPOT in addition to average TPOT
  • Add option to save results to a json file
  • Add TTFT measurement (a sketch of how these metrics can be aggregated follows this list)
    • Note: deepspeed-mii does not have official support for streaming as of Jan 29, 2024 and has a PR in progress.
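
As a point of reference for the metric items above, here is a minimal sketch of how TTFT and TPOT can be aggregated into mean/median/P99 figures. The function and variable names are illustrative assumptions, not the PR's actual code:

import numpy as np

def summarize(ttfts, latencies, output_lens):
    """Aggregate per-request timings (seconds) into millisecond metrics.

    ttfts[i]:       time to first token for request i
    latencies[i]:   end-to-end latency for request i
    output_lens[i]: number of generated tokens for request i
    """
    ttfts = np.asarray(ttfts)
    # TPOT excludes the first token, which is already covered by TTFT.
    tpots = np.asarray([
        (lat - ttft) / max(n - 1, 1)
        for lat, ttft, n in zip(latencies, ttfts, output_lens)
    ])
    return {
        "mean_ttft_ms": 1000 * ttfts.mean(),
        "median_ttft_ms": 1000 * np.median(ttfts),
        "p99_ttft_ms": 1000 * np.percentile(ttfts, 99),
        "mean_tpot_ms": 1000 * tpots.mean(),
        "median_tpot_ms": 1000 * np.median(tpots),
        "p99_tpot_ms": 1000 * np.percentile(tpots, 99),
    }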

Some other items that can be included are:

  • Allow sampling input & output lengths from a distribution (currently we have a fixed < 1024 input and < 2048 input + output setup; see the sketch after this list)
  • Add a latency benchmark (similar to benchmark_latency.py where we run synchronous requests against the server to measure the best-scenario latency performance of the engine/backend)
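
To illustrate the length-sampling item above, here is a rough sketch; the lognormal distribution and its parameters are assumptions for illustration only, not a decision made in this PR:

import numpy as np

def sample_lengths(num_prompts, max_input_len=1024, max_total_len=2048, seed=0):
    """Draw (input_len, output_len) pairs from a distribution instead of
    relying on fixed per-request caps."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(num_prompts):
        # Lognormal lengths loosely mimic real chat traffic; clip to the caps.
        input_len = int(min(rng.lognormal(mean=5.0, sigma=1.0), max_input_len))
        output_len = int(min(rng.lognormal(mean=5.5, sigma=1.0),
                             max_total_len - input_len))
        pairs.append((max(input_len, 1), max(output_len, 1)))
    return pairs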

@zhaoyang-star
Contributor

zhaoyang-star commented Jan 16, 2024

Adding DeepSpeed-MII as an alternate backend would be appreciated.

@ywang96
Collaborator Author

ywang96 commented Jan 16, 2024

Adding DeepSpeed-MII as an alternate backend would be appreciated.

Yeah, that's definitely doable. Since this benchmark runs against a server, I'm considering taking this as the default way to deploy a model server with DeepSpeed-MII.

@ywang96 ywang96 marked this pull request as ready for review January 17, 2024 19:29
@ywang96
Collaborator Author

ywang96 commented Jan 17, 2024

Here's a sample output from running this version of the benchmark script:

Traffic request rate: inf
Successful requests: 10
Benchmark duration: 20.108469 s
Total input tokens: 1522
Total generated tokens: 2211
Request throughput: 0.50 requests/s
Input token throughput: 75.69 tokens/s
Output token throughput: 109.95 tokens/s
Mean latency per output token: 61.17 ms
Median latency per output token: 40.94 ms
P99 latency per output token: 145.59 ms

A few remarks:

  1. Since TensorRT-LLM doesn't come with an API server, I took Triton as the default serving backend for TRT-LLM.
  2. For this PR I didn't want to introduce any breaking changes yet, so we may leave any potential item to be added in a later PR.

@LiuXiaoxuanPKU Could you take a first pass on this PR and see if there's anything wrong (mostly the design)? I can iterate on it to refine this PR (e.g., adding scripts for launching servers) once we agree on the design.

return output


ASYNC_REQUEST_FUNCS = {


nit: Do we want to organize these into a class? It may make this a lot cleaner, and we could then define an interface for all future backend benchmarks, which would keep the main script free of major changes when new backends are added.

Collaborator Author

My thought is to keep this particular file as flexible as possible: If someone wants to add support for a new backend "ABC", the only thing they need to do is to add an async function for "ABC" that performs an online inference on a given prompt (could be http or grpc) then add "ABC" to ASYNC_REQUEST_FUNCS without needing to touch the main benchmark script.

We could refactor these async request functions into a class, but again that would still require implementations of class methods to send requests & parse outputs, which is essentially what these functions are doing.
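
As a rough illustration of that pattern (the function signature, payload, and response fields below are assumptions for a hypothetical backend "ABC", not the PR's exact interface):

import aiohttp

async def async_request_abc(api_url, prompt, max_tokens):
    """Send one online inference request to the hypothetical backend "ABC"
    and parse its response into whatever the benchmark loop expects."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    async with aiohttp.ClientSession() as session:
        async with session.post(api_url, json=payload) as resp:
            data = await resp.json()
    return {"generated_text": data.get("text", "")}

# Registering the function is the only change needed outside this file;
# the main benchmark script just looks the backend up by name.
ASYNC_REQUEST_FUNCS = {
    # "vllm": async_request_vllm, "tgi": async_request_tgi, ...
    "abc": async_request_abc,
}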

@LiuXiaoxuanPKU
Collaborator

Hi Roger, thanks for the PR. Yeah, the design looks good to us, please go ahead. The only thing we want to confirm is that the performance numbers are similar before and after refactoring.

@ywang96
Collaborator Author

ywang96 commented Jan 21, 2024

Thank you for the response! Sounds good; I'll keep iterating on the PR, and as we discussed offline, TTFT will be added to the measured metrics.

@ywang96
Collaborator Author

ywang96 commented Jan 30, 2024

@LiuXiaoxuanPKU @simon-mo Here's an output from running the main branch version of benchmark_serving.py and from this branch on Mixtral 8x7B (served on 4x A100-80G with vLLM v0.2.7) with the ShareGPT dataset.

main branch

Namespace(backend='vllm', protocol='http', host='localhost', port=8000, endpoint='/generate', model=None, dataset='ShareGPT_V3_unfiltered_cleaned_split.json', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', best_of=1, use_beam_search=False, num_prompts=100, request_rate=1.0, seed=0, trust_remote_code=False)
Total time: 111.56 s
Throughput: 0.90 requests/s
Average latency: 6.09 s
Average latency per token: 0.01 s
Average latency per output token: 0.03 s

This branch

Namespace(backend='vllm', version='0.2.7', base_url=None, host='localhost', port=8000, endpoint='/generate', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', best_of=1, use_beam_search=False, num_prompts=100, request_rate=1.0, seed=0, trust_remote_code=False, save_result=False)
Traffic request rate: 1.0
Successful requests: 100
Benchmark duration: 103.926134 s
Total input tokens: 23521
Total generated tokens: 22873
Request throughput: 0.96 requests/s
Input token throughput: 226.32 tokens/s
Output token throughput: 220.09 tokens/s
Mean TTFT: 108.50 ms
Median TTFT: 90.62 ms
P99 TTFT: 197.14 ms
Mean TPOT: 29.21 ms
Median TPOT: 28.18 ms
P99 TPOT: 42.40 ms
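
For context on the "Traffic request rate" lines above: a finite rate is commonly realized by spacing requests with exponentially distributed gaps (a Poisson arrival process), while inf fires all requests at once. A minimal sketch under that assumption (not necessarily the exact code in this PR):

import asyncio
import numpy as np

async def generate_requests(prompts, request_rate):
    """Yield prompts paced to an average of `request_rate` requests/s."""
    for prompt in prompts:
        yield prompt
        if request_rate == float("inf"):
            continue  # burst mode: send everything back-to-back
        # Exponential inter-arrival gaps give a Poisson arrival process.
        await asyncio.sleep(np.random.exponential(1.0 / request_rate))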

One note is that DeepSpeed-MII currently does not support streaming, so TTFT will be 0 as a placeholder (I've commented about this in the code itself too).

I can share more results, but let me know what you think. Thanks!

@zhuohan123 zhuohan123 mentioned this pull request Jan 31, 2024
@ywang96
Collaborator Author

ywang96 commented Feb 12, 2024

Hi @simon-mo! I've refactored the script with dataclasses and edited the serving benchmark portion in the CI. A few last questions I have in mind:

  1. I noticed the benchmark now runs on top of the OpenAI API server instead of the /generate API server. Would you say the OpenAI API server should be used for serving with vLLM by default? (If so, we can just map vllm to the generic OpenAI request function.)
  2. Initially I put the serving scripts in their own serving directory to make development easier, but I've moved these files back to the top level under benchmarks. Let me know what you think.
  3. Should we add benchmarks for other backends (TGI, DeepSpeed-MII, etc.) to CI or run them as a separate process? I'm indifferent, so I'll leave that to you to decide - happy to help either way!
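
For the dataclass refactor mentioned above, here is a rough sketch of the kind of structure that could carry the aggregated results; the field names are illustrative assumptions, not the PR's actual definitions:

from dataclasses import dataclass

@dataclass
class BenchmarkMetrics:
    """Aggregated results that can also be dumped to the optional JSON file."""
    completed: int
    total_input_tokens: int
    total_output_tokens: int
    request_throughput: float
    mean_ttft_ms: float
    median_ttft_ms: float
    p99_ttft_ms: float
    mean_tpot_ms: float
    median_tpot_ms: float
    p99_tpot_ms: float

A structure like this can be serialized with dataclasses.asdict() when saving results to JSON is requested.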

@simon-mo simon-mo merged commit a4211a4 into vllm-project:main Feb 13, 2024
18 checks passed
jvmncs pushed a commit to jvmncs/vllm that referenced this pull request Feb 14, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 20, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 22, 2024
@ywang96 ywang96 deleted the benchmark-refactor branch March 4, 2024 01:14
xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 4, 2024