
[RFC]: Postmerge performance suite #4926

simon-mo opened this issue May 20, 2024 · 4 comments

Motivation.

We want to start tracking vLLM's performance numbers on more realistic workloads. Thanks to our sponsors (#4925), we are assembling a pool of hardware resources to run the tests on.

The goals of this test suite are to:

  1. Track regressions
  2. Track our progress on optimizations

Proposed Change.

We will start by running the following benchmarks:

  • Llama 8B on A100, H100
  • Llama 70B on 4xA100, 4xH100, 8xA100, 8xH100
  • Mixtral 8x7B on 8xH100
  • Mixtral 8x22B on 8xH100

We will run with the following parameters:

  • chunked prefill enabled
  • fp8

We will run the following tests:

  • Benchmark latency
  • Benchmark throughput with 1000 prompts (ShareGPT)
  • Benchmark serving with 1000 prompts (ShareGPT)

We will also compare with TGI and TRT-LLM.
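
To make the matrix concrete, here is a minimal sketch of how the sweep could be driven. The model identifiers, the script paths, and the chunked-prefill / fp8 flag names are placeholders for illustration, not a definitive pipeline definition.

```python
# Sketch only: the matrix mirrors the lists above. Model names, benchmark
# script paths, and the chunked-prefill / fp8 flags are assumed placeholders.
import itertools
import subprocess

MATRIX = [
    # (model, tensor_parallel_size)
    ("meta-llama/Meta-Llama-3-8B", 1),       # A100, H100
    ("meta-llama/Meta-Llama-3-70B", 4),      # 4xA100, 4xH100
    ("meta-llama/Meta-Llama-3-70B", 8),      # 8xA100, 8xH100
    ("mistralai/Mixtral-8x7B-v0.1", 8),      # 8xH100
    ("mistralai/Mixtral-8x22B-v0.1", 8),     # 8xH100
]

TESTS = ["latency", "throughput", "serving"]  # one benchmark script per test

DRY_RUN = True  # flip to False only on a machine with the right GPUs

def run_benchmark(model: str, tp: int, test: str) -> None:
    cmd = [
        "python", f"benchmarks/benchmark_{test}.py",
        "--model", model,
        "--tensor-parallel-size", str(tp),
        "--enable-chunked-prefill",   # assumed flag name
        "--kv-cache-dtype", "fp8",    # assumed flag name
    ]
    if DRY_RUN:
        print("would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)

for (model, tp), test in itertools.product(MATRIX, TESTS):
    run_benchmark(model, tp, test)
```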

Feedback Period.

Step 1: Ensure hardware availability
Step 2: Set up a pipeline for Llama 8B on H100 as a proof of concept
Step 3: Monitor the results and build a dashboard
Step 4: Scale to other tests as resources come online.
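
As a rough illustration of the regression-tracking part of step 3, the sketch below compares a fresh run against a stored baseline and flags large movements. The JSON layout, metric names, and 5% threshold are all assumptions, not an existing artifact format.

```python
# Hypothetical regression check: compare a new benchmark run against a stored
# baseline. The file format and metric names are assumptions for illustration.
import json
from pathlib import Path

THRESHOLD = 0.05  # flag anything that moves by more than 5% in the bad direction

# Higher is better for throughput; lower is better for latency.
HIGHER_IS_BETTER = {"request_throughput": True, "p99_latency_s": False}

def find_regressions(baseline_path: Path, current_path: Path) -> list[str]:
    baseline = json.loads(baseline_path.read_text())
    current = json.loads(current_path.read_text())
    regressions = []
    for metric, higher_better in HIGHER_IS_BETTER.items():
        if metric not in baseline or metric not in current:
            continue
        old, new = baseline[metric], current[metric]
        change = (new - old) / old
        regressed = change < -THRESHOLD if higher_better else change > THRESHOLD
        if regressed:
            regressions.append(f"{metric}: {old:.3f} -> {new:.3f} ({change:+.1%})")
    return regressions

if __name__ == "__main__":
    for line in find_regressions(Path("baseline.json"), Path("current.json")):
        print("REGRESSION:", line)
```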

CC List.

No response

Any Other Things.

Suggestions welcome.

simon-mo (Collaborator, Author) commented:

Other notes:

  • I'm not sure how best to test spec decode (in what setting and with which workload) or prefix caching (same questions).

AaronFriel commented:

For prefix caching, I think few-shot evals where each eval shares the same prompt but the Q&A pairs are distinct would be appropriate. Measuring requests/second or generated tokens/second would give users an idea of how quickly an agent guided by a static prompt will run.

I haven't dug into all of the eval frameworks, but I understand many of them are few-shot. MMLU, for example, I believe uses the same 5 examples.

MMLUs/second would be an amusing (if hard to use) measurement.
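
To illustrate the idea, here is a rough sketch of a shared-prefix workload for exercising prefix caching. The prompt construction, model name, and the `enable_prefix_caching` argument are assumptions, not a prescribed benchmark.

```python
# Hypothetical prefix-caching workload: every request shares the same few-shot
# prefix and only the final question differs, so measuring requests/second with
# and without prefix caching isolates its benefit.
import time

FEW_SHOT_PREFIX = "\n\n".join(
    f"Q: example question {i}\nA: example answer {i}" for i in range(5)
)

def build_prompts(questions: list[str]) -> list[str]:
    """Each prompt = shared 5-shot prefix + a distinct question."""
    return [f"{FEW_SHOT_PREFIX}\n\nQ: {q}\nA:" for q in questions]

def benchmark(llm, questions: list[str]) -> float:
    prompts = build_prompts(questions)
    start = time.perf_counter()
    llm.generate(prompts)          # offline batch API; sampling params omitted
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed  # requests per second

# Usage sketch (assumed constructor arguments, not run here):
#   from vllm import LLM
#   llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_prefix_caching=True)
#   print(benchmark(llm, [f"question {i}" for i in range(1000)]))
```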

zifeitong (Contributor) commented May 21, 2024:

> Benchmark serving with 1000 prompts (ShareGPT)

It would be great to have one more dataset skewed towards larger prompts / prefilling, e.g. RAG.

> We will run with the following parameters:
>
>   • chunked prefill enabled
>   • fp8

Add a 4-bit quantization method (Marlin, GPTQ)?
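
As a sketch of what the prefill-heavy, RAG-style dataset suggested above could look like, the generator below produces requests skewed toward long inputs with short outputs. The token-count ranges and request format are illustrative assumptions.

```python
# Hypothetical generator for a prefill-heavy, RAG-like workload: long prompts
# (standing in for retrieved context) with short generations.
import random

def make_rag_requests(n: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    requests = []
    for _ in range(n):
        # Skew toward long inputs, e.g. 2k-8k "context" words, short answers.
        prompt_len = rng.randint(2_000, 8_000)
        output_len = rng.randint(32, 256)
        requests.append(
            {
                "prompt": "context " * prompt_len,  # stand-in for retrieved passages
                "prompt_len": prompt_len,
                "output_len": output_len,
            }
        )
    return requests

if __name__ == "__main__":
    reqs = make_rag_requests(1000)
    avg_in = sum(r["prompt_len"] for r in reqs) / len(reqs)
    avg_out = sum(r["output_len"] for r in reqs) / len(reqs)
    print(f"avg prompt length ~{avg_in:.0f}, avg output length ~{avg_out:.0f}")
```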

simon-mo (Collaborator, Author) commented:

@youkaichao mentioned we should also test the multiprocessing backend variant.
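
One way this could slot into the matrix sketch above is to run each multi-GPU configuration once per executor backend. The `--distributed-executor-backend` flag name below is an assumption, not a confirmed CLI option.

```python
# Sketch: add a backend axis so each multi-GPU configuration is benchmarked
# under both the Ray and multiprocessing executors. The flag name is assumed.
BACKENDS = ["ray", "mp"]

def backend_variants(base_cmd: list[str]) -> list[list[str]]:
    """Return one benchmark command per distributed executor backend."""
    return [base_cmd + ["--distributed-executor-backend", backend]
            for backend in BACKENDS]

# Usage sketch:
#   for cmd in backend_variants(["python", "benchmarks/benchmark_serving.py",
#                                "--model", "meta-llama/Meta-Llama-3-70B",
#                                "--tensor-parallel-size", "4"]):
#       print("would run:", " ".join(cmd))
```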
