Implement Lightweight Scheduler Simulation Tests for Inference Gateway #709
Comments
@nirrozenbaum I believe you mentioned you have built a vLLM simulation that we can use for this issue. cc @kfswain
Yes, my colleagues at IBM do have a vllm-sim under development. If I'm not mistaken, they plan on open-sourcing it around mid-May.
Heya @nirrozenbaum, does Shmuel have a GH handle so we can assign this? We discussed this a few weeks ago and would love to have access to a tool like this! Thanks!
Btw, earlier today another colleague of mine from the same team successfully ran the e2e tests using the vllm-simulator image (she used simulator-deployment.yaml instead of the gpu-deployment). I will search for her handle and tag her here too.
@nirrozenbaum, do you feel there is enough context in this issue for action to be taken?
The issue contains more than enough details, yes.
What would you like to be added:
We propose adding lightweight and fast presubmit tests to simulate and validate the performance of our scheduling algorithm. These tests would simulate Inference Gateway behavior with multiple vLLM endpoints on specified GPU hardware configurations, allowing us to detect performance regressions early. Ideally, testing scenarios should be fully configurable (see the rough configuration sketch after the test cases below). Below are examples of the desired tests:
Test Case 1:
Dataset: ShareGPT (focusing only on input/output token distributions)
Hardware: 6x H100 80 GB GPU model servers
Model: llama 3.1-8b
Procedure: Vary the QPS until KV cache saturation occurs, and measure the normalized time per output token as we send requests via the Gateway. Specifically, collect data points for KV cache utilization between 50% and 100%, incrementing in 5% steps.
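For illustration, here is a minimal Go sketch of the sweep procedure described above: ramp the offered QPS, and at each 5% KV cache utilization bucket between 50% and 100% record the normalized time per output token (NTPOT). The callbacks (setQPS, kvCacheUtilization, ntpot) are hypothetical hooks into the simulator, not an existing API.

```go
package simulation

// sweepKVCacheUtilization is a rough sketch, not an existing harness: it ramps
// the offered QPS and records normalized time per output token (NTPOT) at each
// 5% KV cache utilization bucket from 50% to 100%. The three callbacks are
// hypothetical hooks into the simulator.
func sweepKVCacheUtilization(
	setQPS func(qps float64),
	kvCacheUtilization func() float64, // current utilization in [0, 1]
	ntpot func() float64, // normalized time per output token, e.g. ms/token
) map[int]float64 {
	results := map[int]float64{} // utilization bucket (%) -> NTPOT
	qps := 1.0
	const maxQPS = 1000.0
	for bucket := 50; bucket <= 100 && qps < maxQPS; {
		setQPS(qps)
		if kvCacheUtilization()*100 >= float64(bucket) {
			results[bucket] = ntpot()
			bucket += 5 // move to the next 5% step
		} else {
			qps *= 1.1 // not saturated enough yet; keep raising the load
		}
	}
	return results
}
```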
Test Case 2:
Dataset: ShareGPT (focusing only on input/output token distributions)
Hardware: 6x H100 80 GB GPU model servers, max LoRA rank = 8
Model: llama 3.1-8b
LoRA Configurations: 6 rank-8 LoRAs, e.g., nvidia/llama-3.1-nemoguard-8b-topic-control, with uniform traffic distribution.
Procedure: Vary the QPS until KV cache saturation occurs, and measure the normalized time per output token. Specifically, collect data points for KV cache utilization between 50% and 100%, incrementing by 5% steps.
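Since the two test cases differ only in their LoRA section, a fully configurable scenario could be expressed declaratively. The Go sketch below shows one possible shape for such a configuration; the Scenario and LoRAConfig types and their fields are assumptions for illustration, not an existing API in this repository.

```go
package simulation

import "fmt"

// Scenario is a hypothetical declarative description of one simulation test
// case; none of these types exist in the repository today.
type Scenario struct {
	Dataset        string       // e.g. "sharegpt" (input/output token distributions only)
	Model          string       // e.g. "llama-3.1-8b"
	Replicas       int          // number of simulated model servers
	GPU            string       // e.g. "H100-80GB"
	MaxLoRARank    int          // 0 means LoRA disabled
	LoRAs          []LoRAConfig // adapters served, with their traffic split
	KVUtilStartPct int          // e.g. 50
	KVUtilEndPct   int          // e.g. 100
	KVUtilStepPct  int          // e.g. 5
}

// LoRAConfig names one adapter and its share of traffic.
type LoRAConfig struct {
	Name          string  // e.g. "nvidia/llama-3.1-nemoguard-8b-topic-control"
	TrafficWeight float64 // fraction of requests routed to this adapter
}

// testCase2 mirrors the second scenario above: 6 servers, 6 rank-8 LoRAs with
// a uniform traffic distribution.
var testCase2 = Scenario{
	Dataset:        "sharegpt",
	Model:          "llama-3.1-8b",
	Replicas:       6,
	GPU:            "H100-80GB",
	MaxLoRARank:    8,
	LoRAs:          uniformLoRAs(6, "nvidia/llama-3.1-nemoguard-8b-topic-control"),
	KVUtilStartPct: 50,
	KVUtilEndPct:   100,
	KVUtilStepPct:  5,
}

// uniformLoRAs builds n illustrative adapter entries with equal traffic
// weights; in practice each entry would name a distinct adapter.
func uniformLoRAs(n int, base string) []LoRAConfig {
	out := make([]LoRAConfig, n)
	for i := range out {
		out[i] = LoRAConfig{
			Name:          fmt.Sprintf("%s-%d", base, i),
			TrafficWeight: 1.0 / float64(n),
		}
	}
	return out
}
```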
We previously implemented a basic vLLM simulation, but since then, significant changes have occurred in our scheduler—especially with the introduction of chunked prefill.
Why is this needed:
Implementing these simulation-based tests is crucial to identify and prevent regressions in the scheduler algorithm. Real GPU or TPU resources are impractical for presubmits due to cost and availability constraints. Additionally, as the complexity of our algorithms increases, simulation allows cost-effective and scalable testing of more sophisticated scenarios.