Implement Lightweight Scheduler Simulation Tests for Inference Gateway #709
Comments
@nirrozenbaum I believe you mentioned you have built a vLLM simulation that we can use for this issue. cc @kfswain
Yes, my colleagues at IBM do have a vllm-sim under development. If I'm not mistaken, they plan on open-sourcing it around mid-May.
Heya @nirrozenbaum, does Shmuel have a GH handle so we can assign this? We discussed this a few weeks ago and would love to have access to a tool like this! Thanks!
Btw, earlier today another colleague of mine from the same team successfully ran the e2e tests using the vllm-simulator image (she used simulator-deployment.yaml instead of the gpu-deployment). I will search for her handle and tag her here too.
@nirrozenbaum, do you feel there is enough context in this issue for action to be taken?
The issue contains more than enough details, yes.
What would you like to be added:
We propose adding lightweight and fast presubmit tests to simulate and validate the performance of our scheduling algorithm. These tests would simulate Inference Gateway behavior with multiple vLLM endpoints on specified GPU hardware configurations, allowing us to detect performance regressions early. Ideally, testing scenarios should be fully configurable (see the rough configuration sketch after the test cases below). Below are examples of the desired tests:
Test Case 1:
Dataset: ShareGPT (focusing only on input/output token distributions)
Hardware: 6x H100 80 GB GPU model servers
Model: llama 3.1-8b
Procedure: Vary the QPS until KV cache saturation occurs, and measure the normalized time per output token as we send requests via the Gateway. Specifically, collect data points for KV cache utilization between 50% and 100%, incrementing in 5% steps.
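For illustration, here is a minimal Go sketch of the sweep procedure described above: ramp the offered QPS, and at each 5% KV cache utilization bucket between 50% and 100% record the normalized time per output token (NTPOT). The callbacks (setQPS, kvCacheUtilization, ntpot) are hypothetical hooks into the simulator, not an existing API.

```go
package simulation

// sweepKVCacheUtilization is a rough sketch, not an existing harness: it ramps
// the offered QPS and records normalized time per output token (NTPOT) at each
// 5% KV cache utilization bucket from 50% to 100%. The three callbacks are
// hypothetical hooks into the simulator.
func sweepKVCacheUtilization(
	setQPS func(qps float64),
	kvCacheUtilization func() float64, // current utilization in [0, 1]
	ntpot func() float64, // normalized time per output token, e.g. ms/token
) map[int]float64 {
	results := map[int]float64{} // utilization bucket (%) -> NTPOT
	qps := 1.0
	const maxQPS = 1000.0
	for bucket := 50; bucket <= 100 && qps < maxQPS; {
		setQPS(qps)
		if kvCacheUtilization()*100 >= float64(bucket) {
			results[bucket] = ntpot()
			bucket += 5 // move to the next 5% step
		} else {
			qps *= 1.1 // not saturated enough yet; keep raising the load
		}
	}
	return results
}
```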
Test Case 2:
Dataset: ShareGPT (focusing only on input/output token distributions)
Hardware: 6x H100 80 GB GPU model servers, max LoRA rank = 8
Model: llama 3.1-8b
LoRA Configurations: 6 rank-8 LoRAs, e.g., nvidia/llama-3.1-nemoguard-8b-topic-control, with uniform traffic distribution.
Procedure: Vary the QPS until KV cache saturation occurs, and measure the normalized time per output token. Specifically, collect data points for KV cache utilization between 50% and 100%, incrementing by 5% steps.
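Since the two test cases differ only in their LoRA section, a fully configurable scenario could be expressed declaratively. The Go sketch below shows one possible shape for such a configuration; the Scenario and LoRAConfig types and their fields are assumptions for illustration, not an existing API in this repository.

```go
package simulation

import "fmt"

// Scenario is a hypothetical declarative description of one simulation test
// case; none of these types exist in the repository today.
type Scenario struct {
	Dataset        string       // e.g. "sharegpt" (input/output token distributions only)
	Model          string       // e.g. "llama-3.1-8b"
	Replicas       int          // number of simulated model servers
	GPU            string       // e.g. "H100-80GB"
	MaxLoRARank    int          // 0 means LoRA disabled
	LoRAs          []LoRAConfig // adapters served, with their traffic split
	KVUtilStartPct int          // e.g. 50
	KVUtilEndPct   int          // e.g. 100
	KVUtilStepPct  int          // e.g. 5
}

// LoRAConfig names one adapter and its share of traffic.
type LoRAConfig struct {
	Name          string  // e.g. "nvidia/llama-3.1-nemoguard-8b-topic-control"
	TrafficWeight float64 // fraction of requests routed to this adapter
}

// testCase2 mirrors the second scenario above: 6 servers, 6 rank-8 LoRAs with
// a uniform traffic distribution.
var testCase2 = Scenario{
	Dataset:        "sharegpt",
	Model:          "llama-3.1-8b",
	Replicas:       6,
	GPU:            "H100-80GB",
	MaxLoRARank:    8,
	LoRAs:          uniformLoRAs(6, "nvidia/llama-3.1-nemoguard-8b-topic-control"),
	KVUtilStartPct: 50,
	KVUtilEndPct:   100,
	KVUtilStepPct:  5,
}

// uniformLoRAs builds n illustrative adapter entries with equal traffic
// weights; in practice each entry would name a distinct adapter.
func uniformLoRAs(n int, base string) []LoRAConfig {
	out := make([]LoRAConfig, n)
	for i := range out {
		out[i] = LoRAConfig{
			Name:          fmt.Sprintf("%s-%d", base, i),
			TrafficWeight: 1.0 / float64(n),
		}
	}
	return out
}
```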
We previously implemented a basic vLLM simulation, but since then, significant changes have occurred in our scheduler—especially with the introduction of chunked prefill.
Why is this needed:
Implementing these simulation-based tests is crucial to identify and prevent regressions in the scheduler algorithm. Real GPU or TPU resources are impractical for presubmits due to cost and availability constraints. Additionally, as the complexity of our algorithms increases, simulation allows cost-effective and scalable testing of more sophisticated scenarios.