
[RFC]: Postmerge performance suite #4926

simon-mo opened this issue May 20, 2024 · 4 comments

Motivation.

We want to start tracking vLLM's performance numbers on more realistic workloads. Thanks to our sponsors (#4925), we are assembling a pool of hardware resources to run the tests on.

The goals of this test suite are to:

  1. Track regressions
  2. Track our progress on optimizations

Proposed Change.

We will start by running the following benchmarks:

  • Llama 8B on A100, H100
  • Llama 70B on 4xA100, 4xH100, 8xA100, 8xH100
  • Mixtral 8x7B on 8xH100
  • Mixtral 8x22B on 8xH100

We will run with the following parameters:

  • chunked prefill enabled
  • fp8

We will run the following tests:

  • Benchmark latency
  • Benchmark throughput with 1000 prompts (ShareGPT)
  • Benchmark serving with 1000 prompts (ShareGPT)

We will also compare with TGI and TRT-LLM.
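
To make the matrix concrete, here is a minimal sketch of how the sweep could be driven. The model identifiers, the script paths, and the chunked-prefill / fp8 flag names are placeholders for illustration, not a definitive pipeline definition.

```python
# Sketch only: the matrix mirrors the lists above. Model names, benchmark
# script paths, and the chunked-prefill / fp8 flags are assumed placeholders.
import itertools
import subprocess

MATRIX = [
    # (model, tensor_parallel_size)
    ("meta-llama/Meta-Llama-3-8B", 1),       # A100, H100
    ("meta-llama/Meta-Llama-3-70B", 4),      # 4xA100, 4xH100
    ("meta-llama/Meta-Llama-3-70B", 8),      # 8xA100, 8xH100
    ("mistralai/Mixtral-8x7B-v0.1", 8),      # 8xH100
    ("mistralai/Mixtral-8x22B-v0.1", 8),     # 8xH100
]

TESTS = ["latency", "throughput", "serving"]  # one benchmark script per test

DRY_RUN = True  # flip to False only on a machine with the right GPUs

def run_benchmark(model: str, tp: int, test: str) -> None:
    cmd = [
        "python", f"benchmarks/benchmark_{test}.py",
        "--model", model,
        "--tensor-parallel-size", str(tp),
        "--enable-chunked-prefill",   # assumed flag name
        "--kv-cache-dtype", "fp8",    # assumed flag name
    ]
    if DRY_RUN:
        print("would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)

for (model, tp), test in itertools.product(MATRIX, TESTS):
    run_benchmark(model, tp, test)
```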

Feedback Period.

Step 1: Ensure hardware availability
Step 2: Set up a pipeline for Llama 8B on H100 as a proof of concept
Step 3: Monitor the results and build a dashboard
Step 4: Scale to other tests as resources come online.
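
As a rough illustration of the regression-tracking part of step 3, the sketch below compares a fresh run against a stored baseline and flags large movements. The JSON layout, metric names, and 5% threshold are all assumptions, not an existing artifact format.

```python
# Hypothetical regression check: compare a new benchmark run against a stored
# baseline. The file format and metric names are assumptions for illustration.
import json
from pathlib import Path

THRESHOLD = 0.05  # flag anything that moves by more than 5% in the bad direction

# Higher is better for throughput; lower is better for latency.
HIGHER_IS_BETTER = {"request_throughput": True, "p99_latency_s": False}

def find_regressions(baseline_path: Path, current_path: Path) -> list[str]:
    baseline = json.loads(baseline_path.read_text())
    current = json.loads(current_path.read_text())
    regressions = []
    for metric, higher_better in HIGHER_IS_BETTER.items():
        if metric not in baseline or metric not in current:
            continue
        old, new = baseline[metric], current[metric]
        change = (new - old) / old
        regressed = change < -THRESHOLD if higher_better else change > THRESHOLD
        if regressed:
            regressions.append(f"{metric}: {old:.3f} -> {new:.3f} ({change:+.1%})")
    return regressions

if __name__ == "__main__":
    for line in find_regressions(Path("baseline.json"), Path("current.json")):
        print("REGRESSION:", line)
```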

CC List.

No response

Any Other Things.

Suggestions welcome.

simon-mo (Collaborator, Author) commented:

Other notes:

  • I'm not sure how best to test spec decode (in what setting and with which workload) or prefix caching (same questions).

AaronFriel commented:

For prefix caching, I think few-shot evals where each eval shares the same prompt but the Q&A pairs are distinct would be appropriate. Measuring requests/second or generated tokens/second would give users an idea of how quickly an agent guided by a static prompt will run.

I haven't dug into all of the eval frameworks, but I understand many of them are few-shot. MMLU, for example, I believe uses the same 5 examples.

MMLUs/second would be an amusing (if hard to use) measurement.
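
To illustrate the idea, here is a rough sketch of a shared-prefix workload for exercising prefix caching. The prompt construction, model name, and the `enable_prefix_caching` argument are assumptions, not a prescribed benchmark.

```python
# Hypothetical prefix-caching workload: every request shares the same few-shot
# prefix and only the final question differs, so measuring requests/second with
# and without prefix caching isolates its benefit.
import time

FEW_SHOT_PREFIX = "\n\n".join(
    f"Q: example question {i}\nA: example answer {i}" for i in range(5)
)

def build_prompts(questions: list[str]) -> list[str]:
    """Each prompt = shared 5-shot prefix + a distinct question."""
    return [f"{FEW_SHOT_PREFIX}\n\nQ: {q}\nA:" for q in questions]

def benchmark(llm, questions: list[str]) -> float:
    prompts = build_prompts(questions)
    start = time.perf_counter()
    llm.generate(prompts)          # offline batch API; sampling params omitted
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed  # requests per second

# Usage sketch (assumed constructor arguments, not run here):
#   from vllm import LLM
#   llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_prefix_caching=True)
#   print(benchmark(llm, [f"question {i}" for i in range(1000)]))
```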

zifeitong (Contributor) commented May 21, 2024:

> Benchmark serving with 1000 prompts (ShareGPT)

It would be great to have one more dataset skewed towards larger prompts / prefilling, e.g. RAG.

> We will run with the following parameters:
>
>   • chunked prefill enabled
>   • fp8

Add a 4-bit quantization method (Marlin, GPTQ)?
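
As a sketch of what the prefill-heavy, RAG-style dataset suggested above could look like, the generator below produces requests skewed toward long inputs with short outputs. The token-count ranges and request format are illustrative assumptions.

```python
# Hypothetical generator for a prefill-heavy, RAG-like workload: long prompts
# (standing in for retrieved context) with short generations.
import random

def make_rag_requests(n: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    requests = []
    for _ in range(n):
        # Skew toward long inputs, e.g. 2k-8k "context" words, short answers.
        prompt_len = rng.randint(2_000, 8_000)
        output_len = rng.randint(32, 256)
        requests.append(
            {
                "prompt": "context " * prompt_len,  # stand-in for retrieved passages
                "prompt_len": prompt_len,
                "output_len": output_len,
            }
        )
    return requests

if __name__ == "__main__":
    reqs = make_rag_requests(1000)
    avg_in = sum(r["prompt_len"] for r in reqs) / len(reqs)
    avg_out = sum(r["output_len"] for r in reqs) / len(reqs)
    print(f"avg prompt length ~{avg_in:.0f}, avg output length ~{avg_out:.0f}")
```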

simon-mo (Collaborator, Author) commented:

@youkaichao mentioned we should also test the multiprocessing backend variant.
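
One way this could slot into the matrix sketch above is to run each multi-GPU configuration once per executor backend. The `--distributed-executor-backend` flag name below is an assumption, not a confirmed CLI option.

```python
# Sketch: add a backend axis so each multi-GPU configuration is benchmarked
# under both the Ray and multiprocessing executors. The flag name is assumed.
BACKENDS = ["ray", "mp"]

def backend_variants(base_cmd: list[str]) -> list[list[str]]:
    """Return one benchmark command per distributed executor backend."""
    return [base_cmd + ["--distributed-executor-backend", backend]
            for backend in BACKENDS]

# Usage sketch:
#   for cmd in backend_variants(["python", "benchmarks/benchmark_serving.py",
#                                "--model", "meta-llama/Meta-Llama-3-70B",
#                                "--tensor-parallel-size", "4"]):
#       print("would run:", " ".join(cmd))
```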
