Implement Lightweight Scheduler Simulation Tests for Inference Gateway #709


Open · Tracked by #681
kaushikmitr opened this issue Apr 18, 2025 · 8 comments
Labels
needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@kaushikmitr (Contributor) commented Apr 18, 2025

What would you like to be added:

We propose adding lightweight, fast presubmit tests that simulate and validate the performance of our scheduling algorithm. These tests would simulate Inference Gateway behavior with multiple vLLM endpoints on specified GPU hardware configurations, allowing us to detect performance regressions early. Ideally, testing scenarios should be fully configurable (a configuration sketch follows the two test cases below). Below are examples of the desired tests:

Test Case 1:

Dataset: ShareGPT (focusing only on input/output token distributions)

Hardware: 6x H100 80 GB GPU model servers

Model: Llama 3.1-8B

Procedure: Vary the QPS until KV cache saturation occurs, and measure the normalized time per output token (NTPOT) as requests are sent via the Gateway. Specifically, collect data points for KV cache utilization between 50% and 100% in 5% increments (a sketch of this sweep follows below).

Test Case 2:

Dataset: ShareGPT (focusing only on input/output token distributions)

Hardware: 6x H100 80 GB GPU model servers, max LoRA rank = 8

Model: Llama 3.1-8B

LoRA Configuration: six rank-8 LoRA adapters (e.g., nvidia/llama-3.1-nemoguard-8b-topic-control) with a uniform traffic distribution.

Procedure: Vary the QPS until KV cache saturation occurs, and measure the normalized time per output token. As in Test Case 1, collect data points for KV cache utilization between 50% and 100% in 5% increments. A sketch of how such a scenario could be expressed as configuration follows below.

We previously implemented a basic vLLM simulation, but significant changes have occurred in our scheduler since then, especially the introduction of chunked prefill.

Why is this needed:

Implementing these simulation-based tests is crucial for identifying and preventing regressions in the scheduling algorithm. Real GPU or TPU resources are impractical for presubmits due to cost and availability constraints. Additionally, as the complexity of our algorithms increases, simulation allows cost-effective and scalable testing of more sophisticated scenarios.

@kaushikmitr changed the title from "Implement Lightweight Scheduler Simulation Tests for vLLM" to "Implement Lightweight Scheduler Simulation Tests for Inference Gateway" Apr 18, 2025
@kaushikmitr (Contributor, Author)

@nirrozenbaum I believe you mentioned you have built a vLLM simulation that we could use for this issue. cc @kfswain

@nirrozenbaum (Contributor)

Yes, my colleagues at IBM have a vllm-sim under development. If I'm not mistaken, they plan to open-source it around mid-May.

@kfswain (Collaborator) commented Apr 21, 2025

Heya @nirrozenbaum, does Shmuel have a GitHub handle so we can assign this? We discussed this a few weeks ago and would love to have access to a tool like this! Thanks!

@nirrozenbaum (Contributor)

Of course. My colleagues who worked on the vLLM simulator are Shmuel and Maya; their GitHub handles are shmuelk and mayabar. For some reason I can't tag them.

@nirrozenbaum (Contributor)

BTW, earlier today another colleague of mine from the same team successfully ran the e2e tests using the vllm-simulator image (using simulator-deployment.yaml instead of the gpu-deployment). I'll look up her handle and tag her here too.

@nirrozenbaum (Contributor)

@irar2

@kfswain mentioned this issue Apr 23, 2025
@kfswain (Collaborator) commented Apr 24, 2025

@nirrozenbaum do you feel there is enough context in this issue for action to be taken?

@kfswain added the needs-triage label Apr 24, 2025
@nirrozenbaum (Contributor)

Yes, the issue contains more than enough detail. The only question is if/when any of them have cycles to work on this item. Alternatively, I verified that vllm-sim will be open-sourced around mid-May; if none of them have cycles by then, anyone can pick up this item after that, since the simulator will be publicly available.
