[RFC]: Postmerge performance suite #4926
Motivation.
We want to start tracking vLLM's performance numbers on more realistic workloads. Thanks to our sponsors (#4925), we are getting a pool of hardware resources ready to run the tests on.
The goal of this test suite is to continuously track those numbers and surface performance regressions after each merge.
Proposed Change.
We will start by running the following benchmarks:
We will run them with the following parameters:
We will run the following tests:
We will also compare against TGI (Text Generation Inference) and TRT-LLM (TensorRT-LLM).
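As a rough illustration of the kind of measurement this suite would automate, here is a minimal offline throughput sketch using vLLM's Python API. The model name, prompt set, and sampling settings are placeholder assumptions, not the suite's actual configuration:

```python
# Minimal throughput sketch using vLLM's offline Python API.
# Model name, prompts, and sampling settings are illustrative placeholders.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # a Llama 8B, per Step 2
params = SamplingParams(temperature=0.0, max_tokens=128)

prompts = ["Summarize the plot of Hamlet in three sentences."] * 64

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s, "
      f"{len(prompts) / elapsed:.2f} requests/s")
```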
Feedback Period.
Step 1: Ensure hardware availability
Step 2: Set up a pipeline for Llama 8B on H100 as a proof of concept
Step 3: Monitor the results and build a dashboard
Step 4: Scale to other tests as resources come online.
CC List.
No response
Any Other Things.
Suggestions welcome.

Other notes from the comments:
- For prefix caching, few-shot evals where each request shares the same prompt but the Q&A pairs are distinct would be appropriate. Measuring requests/second or generated tokens/second will give users an idea of how quickly an agent guided by a static prompt will run. I haven't dug into all of the eval frameworks, but I understand many of them are few-shot; MMLU, for example, I think uses the same 5 examples. MMLUs/second would be an amusing (if hard to use) metric. (A sketch of this kind of measurement follows this list.)
- It would be great to have one more dataset skewed towards larger prompts / prefilling, e.g. RAG.
- Add a 4-bit quantization method (Marlin, GPTQ)? (See the second sketch below.)
- @youkaichao mentioned we should also test the multiprocessing backend variant.
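To make the prefix-caching note concrete, here is a hedged sketch of a shared-prefix benchmark: every request reuses the same few-shot prefix so the prefix's KV cache can be shared across requests. `enable_prefix_caching` is vLLM's engine flag for this (experimental at the time of this RFC); the model and few-shot examples are illustrative assumptions:

```python
# Hedged sketch of a few-shot, shared-prefix benchmark: all requests share
# one static few-shot prefix; only the final question differs.
import time

from vllm import LLM, SamplingParams

FEW_SHOT_PREFIX = (
    "Q: What is 2 + 2?\nA: 4\n\n"
    "Q: What is the capital of France?\nA: Paris\n\n"
)  # stand-in for MMLU-style 5-shot examples

questions = [f"Q: What is {i} times {i}?\nA:" for i in range(100)]

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",
          enable_prefix_caching=True)  # reuse the KV cache of the shared prefix
params = SamplingParams(temperature=0.0, max_tokens=16)

start = time.perf_counter()
llm.generate([FEW_SHOT_PREFIX + q for q in questions], params)
elapsed = time.perf_counter() - start

print(f"{len(questions) / elapsed:.2f} requests/s with a shared static prompt")
```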
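And for the quantization note, a minimal sketch of how a 4-bit GPTQ variant could enter the test matrix. vLLM's `LLM` constructor accepts a `quantization` argument (e.g. `"gptq"`, `"marlin"`); the checkpoint named here is an illustrative community GPTQ model, not a suite-mandated one:

```python
# Hedged sketch: running a 4-bit GPTQ checkpoint through the same benchmark
# path. The model name is an illustrative placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq")
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```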