ROSS

ROSS is an offline simulator for LLM inference performance. Given a model, a hardware platform, and a workload description, it predicts throughput and latency metrics (TTFT, TPOT, ITL) without running a real inference server. It supports both SGLang and vLLM backends, colocated and disaggregated parallelism, and Pareto-front search over parallel configurations.

Getting Started

Repository layout

collector/          Platform profiling scripts + pre-collected hardware profiles
common/             Shared model/feature/config classes used by both simulators
docs/               Reference documentation (bench_config.md, profiling.md)
ross/               Main simulator package
  sgl_sim/            SGLang simulator backend
  vllm_sim/           vLLM simulator backend
  pareto/             Pareto-front analysis utilities
  config/             Example JSON config files
  ross_predict.py     Entry point for offline prediction sweeps
modeling/           Pre-trained XGBoost regression models
test/               Unit and integration tests

Installation

ROSS runs on a CPU-only host; no GPU is required for simulation.

pip install xgboost==3.2.0 scikit-learn pandas tqdm plotext

Pre-trained regression models for H200 and B200 are loaded at runtime from the directory specified by the modeling_dir field in your config file (or via the --modeling-dir CLI flag). If you want to target a new GPU, follow the profiling workflow described in the Advanced Features section and point modeling_dir at the resulting model directory.

A first prediction

Create a minimal JSON config:

{
    "backend":   "sglang",
    "model":     "Meta-Llama-3.1-70B",
    "parallel":  "1:1:8",
    "batch":     [64],
    "num_prompt": 512,
    "rate":      ["1", "2", "inf"],
    "inputs":    ["sharegpt@0_0"],
    "platforms": [{"gpu": "H200", "version": "0.6.6"}],
    "output":    "~/.etc",
    "datapath":  "~/.etc",
    "model_search_paths": "/path/to/models",
    "modeling_dir":       "/path/to/modeling"
}

Run the simulator:

python ross/ross_predict.py --config my_config.json

Output includes per-request and aggregate latency/throughput metrics. Add --record-path results.csv to also persist a CSV log.
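
For scripted runs, the config and command line above can be produced from Python. This is only a sketch: write_config and build_command are hypothetical helpers, not part of ROSS, and the two /path/to/... values are placeholders that must point at real directories. Only flags documented in this README are used.

```python
import json
import sys
import tempfile
from pathlib import Path
from typing import Optional

# Field names and values copied from the minimal config above.
MINIMAL_CONFIG = {
    "backend": "sglang",
    "model": "Meta-Llama-3.1-70B",
    "parallel": "1:1:8",
    "batch": [64],
    "num_prompt": 512,
    "rate": ["1", "2", "inf"],
    "inputs": ["sharegpt@0_0"],
    "platforms": [{"gpu": "H200", "version": "0.6.6"}],
    "output": "~/.etc",
    "datapath": "~/.etc",
    "model_search_paths": "/path/to/models",   # placeholder
    "modeling_dir": "/path/to/modeling",       # placeholder
}


def write_config(config: dict, path: Path) -> Path:
    """Serialize the config dict to a JSON file and return its path."""
    path.write_text(json.dumps(config, indent=4))
    return path


def build_command(config_path: Path, record_path: Optional[str] = None) -> list:
    """Assemble the ross_predict.py invocation (hypothetical helper)."""
    cmd = [sys.executable, "ross/ross_predict.py", "--config", str(config_path)]
    if record_path is not None:
        cmd += ["--record-path", record_path]
    return cmd


config_path = write_config(MINIMAL_CONFIG, Path(tempfile.mkdtemp()) / "my_config.json")
command = build_command(config_path, record_path="results.csv")
print(" ".join(command))
# To launch the run from the repository root: subprocess.run(command, check=True)
```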

Basic Usage

The `ross_predict.py` entry point

ross_predict.py is the single entry point for all offline prediction sweeps. Configuration is read from a JSON file and may be overridden per-flag on the command line. Priority order (later wins):

built-in default  <  --config FILE  <  command-line flags
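
The precedence rule amounts to a three-layer dictionary merge. The sketch below illustrates the idea only; resolve_settings and the DEFAULTS values are hypothetical, since the real built-in defaults are not listed in this README.

```python
from typing import Any, Dict, Optional

# Hypothetical defaults, chosen only for illustration.
DEFAULTS: Dict[str, Any] = {"backend": "sglang", "batch": [64], "rate": ["inf"]}


def resolve_settings(
    defaults: Dict[str, Any],
    file_config: Dict[str, Any],
    cli_flags: Dict[str, Optional[Any]],
) -> Dict[str, Any]:
    """Merge three layers; later layers override earlier ones."""
    merged = dict(defaults)
    merged.update(file_config)
    # Flags the user did not pass are represented as None and must not override.
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged


settings = resolve_settings(
    DEFAULTS,
    file_config={"backend": "vllm", "batch": [32]},
    cli_flags={"backend": None, "batch": [128], "rate": None},
)
print(settings)
```

Here backend comes from the config file, batch from the command line, and rate from the defaults.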

Common CLI flags

Flag	Description
`--config FILE`	JSON config file (recommended for sweeps)
`--backend sglang|vllm`	Simulator backend
`--model`	Model name or absolute path
`--parallel`	Parallelism spec, e.g. `1:1:8` or `1:1:4@1:1:4`
`--batch`	Max batch size per GPU
`--rate`	Comma-separated request rates (`inf` for max tput)
`--input`	Dataset + sequence-length spec
`--modeling-dir`	Path to XGBoost model directory
`--model-search-paths`	Comma-separated model search roots
`--record-path FILE`	Write metrics to CSV
`--eval`	Load trace logs and compute prediction error
`--get-pareto-front`	Enumerate configs and compute Pareto front
`--debug`	Verbose logging

A complete list of CLI flags, JSON config fields, and engine-specific arguments is documented in docs/bench_config.md.

Parallelism format

# Prefill-decode colocation
dp:pp:tp                       e.g. 1:1:8   (1 DP × 1 PP × 8 TP = 8 GPUs)

# Prefill-decode disaggregation
dp:pp:tp@dp:pp:tp              e.g. 1:1:4@1:1:4 (4 prefill + 4 decode GPUs)

# Multiple configurations (comma-separated)
1:1:8,1:1:4,1:1:4@1:1:4
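
To make the grammar concrete, here is a small parser for the spec strings above that also derives the GPU count. It is a sketch of the format as described in this section, not ROSS's own parser; ParallelGroup, ParallelSpec, and the parse_* functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ParallelGroup:
    dp: int
    pp: int
    tp: int

    @property
    def gpus(self) -> int:
        return self.dp * self.pp * self.tp


@dataclass(frozen=True)
class ParallelSpec:
    prefill: ParallelGroup            # serves prefill and decode when colocated
    decode: Optional[ParallelGroup]   # None means colocated

    @property
    def total_gpus(self) -> int:
        return self.prefill.gpus + (self.decode.gpus if self.decode else 0)


def _parse_group(text: str) -> ParallelGroup:
    parts = text.strip().split(":")
    if len(parts) != 3:
        raise ValueError(f"expected dp:pp:tp, got {text!r}")
    dp, pp, tp = (int(p) for p in parts)
    if min(dp, pp, tp) < 1:
        raise ValueError(f"parallel degrees must be >= 1, got {text!r}")
    return ParallelGroup(dp, pp, tp)


def parse_parallel(spec: str) -> ParallelSpec:
    """Parse 'dp:pp:tp' (colocated) or 'dp:pp:tp@dp:pp:tp' (disaggregated)."""
    groups = [_parse_group(g) for g in spec.split("@")]
    if len(groups) == 1:
        return ParallelSpec(groups[0], None)
    if len(groups) == 2:
        return ParallelSpec(groups[0], groups[1])
    raise ValueError(f"at most one '@' is allowed, got {spec!r}")


def parse_parallel_list(specs: str) -> List[ParallelSpec]:
    """Parse a comma-separated list of specs."""
    return [parse_parallel(s) for s in specs.split(",") if s.strip()]


for spec in parse_parallel_list("1:1:8,1:1:4,1:1:4@1:1:4"):
    print(spec, spec.total_gpus)
```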

Input (workload) format

dataset[@isl_osl]

dataset  ∈ sharegpt | repoqa | aime
isl, osl = integer (exact length; 0 = use the dataset's natural length, no cap)

Examples:
  sharegpt                 # dataset defaults (ISL≈500, OSL≈100)
  sharegpt@0_0             # use each sample's natural ISL and OSL, no cap
  sharegpt@0_100           # natural ISL, OSL forced to 100
  repoqa@4096_1024         # ISL=4096, OSL=1024
  aime@512_8192            # ISL=512, OSL=8192

Setting isl or osl to 0 tells the dataset loader to keep the original sequence length from the source dataset instead of truncating or padding to a fixed value. This is the recommended setting when you want the simulator to see the workload's real length distribution (e.g. sharegpt@0_0).
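
The workload spec can be parsed the same way. The following sketch follows the dataset[@isl_osl] grammar described here; WorkloadSpec and parse_input_spec are hypothetical names, and representing "no suffix" as None is an assumption made for illustration.

```python
from typing import NamedTuple, Optional

KNOWN_DATASETS = {"sharegpt", "repoqa", "aime"}


class WorkloadSpec(NamedTuple):
    dataset: str
    isl: Optional[int]  # None: no suffix given, dataset defaults apply; 0: natural length
    osl: Optional[int]


def parse_input_spec(spec: str) -> WorkloadSpec:
    """Parse 'dataset' or 'dataset@isl_osl'."""
    dataset, sep, lengths = spec.strip().partition("@")
    if dataset not in KNOWN_DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}")
    if not sep:
        return WorkloadSpec(dataset, None, None)
    isl_text, underscore, osl_text = lengths.partition("_")
    if not underscore:
        raise ValueError(f"expected isl_osl after '@', got {lengths!r}")
    isl, osl = int(isl_text), int(osl_text)
    if isl < 0 or osl < 0:
        raise ValueError("isl and osl must be non-negative")
    return WorkloadSpec(dataset, isl, osl)


for text in ["sharegpt", "sharegpt@0_0", "sharegpt@0_100", "repoqa@4096_1024"]:
    print(text, "->", parse_input_spec(text))
```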

Validating against a real trace

When real benchmark traces are available, add --eval to compute per-configuration percentage error (PE) for E2E latency, TTFT, TPOT, and ITL against the recorded ground truth:

python ross/ross_predict.py --config my_config.json --eval
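
As an illustration of the metric, percentage error can be computed per metric as below. This sketch assumes a signed error relative to the measured value; ROSS may instead report an absolute or differently normalized figure, and the metric keys and numbers are hypothetical.

```python
from typing import Dict


def percentage_error(predicted: float, measured: float) -> float:
    """Signed percentage error of a prediction relative to the measurement."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return 100.0 * (predicted - measured) / measured


def error_report(predicted: Dict[str, float], measured: Dict[str, float]) -> Dict[str, float]:
    """Percentage error for every metric present in both dictionaries."""
    return {
        name: percentage_error(predicted[name], measured[name])
        for name in measured
        if name in predicted
    }


# Hypothetical values in milliseconds, for illustration only.
predicted = {"e2e_latency": 2100.0, "ttft": 110.0, "tpot": 19.0, "itl": 20.5}
measured = {"e2e_latency": 2000.0, "ttft": 100.0, "tpot": 20.0, "itl": 20.0}

for name, pe in error_report(predicted, measured).items():
    print(f"{name}: {pe:+.1f}%")
```

A positive value means the simulator over-predicted the metric; a negative value means it under-predicted.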

Advanced Features

Pareto-front search over parallelism

Pass --get-pareto-front to enumerate all valid parallel configurations (colocated and PD-disaggregated) for a given model and hardware budget and plot the Pareto frontier of tokens/s/user vs tokens/s/GPU:

python ross/ross_predict.py --config my_config.json --get-pareto-front

A 228-candidate sweep for a 32B model on an 8-GPU B200 cluster completes in about 70 minutes on a CPU-only server — roughly 1,258× cheaper than exhaustive on-hardware evaluation. --get-pareto-front is mutually exclusive with --eval.
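
Given per-configuration predictions, the frontier can be extracted with a sort-and-scan. The sketch below assumes both axes are maximized; it is not the code in ross/pareto/, and the Candidate type and sample numbers are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    parallel: str
    tokens_per_s_per_user: float
    tokens_per_s_per_gpu: float


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates not dominated on both axes (higher is better)."""
    # Sort by per-user throughput, best first; break ties by per-GPU throughput.
    ordered = sorted(
        candidates,
        key=lambda c: (-c.tokens_per_s_per_user, -c.tokens_per_s_per_gpu),
    )
    front: List[Candidate] = []
    best_gpu = float("-inf")
    for cand in ordered:
        # Everything seen so far is at least as good per user, so a candidate
        # survives only if it strictly improves per-GPU throughput.
        if cand.tokens_per_s_per_gpu > best_gpu:
            front.append(cand)
            best_gpu = cand.tokens_per_s_per_gpu
    return front


# Hypothetical predictions, for illustration only.
candidates = [
    Candidate("1:1:8", 60.0, 300.0),
    Candidate("2:1:4", 45.0, 520.0),
    Candidate("1:1:4@1:1:4", 50.0, 410.0),
    Candidate("4:1:2", 40.0, 500.0),   # dominated by 2:1:4
    Candidate("8:1:1", 30.0, 610.0),
]

for cand in pareto_front(candidates):
    print(cand)
```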

Prefill–decode disaggregation

Disaggregated deployments are expressed by splitting the parallelism spec with an @:

python ross/ross_predict.py \
    --config base.json \
    --parallel "1:1:4@1:1:4"

ROSS models the KV-cache transfer between the prefill and decode workers as part of the virtual-clock critical path.
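
For intuition about the size of that transfer, the sketch below estimates the KV-cache volume of one request using the standard formula (one K and one V tensor per layer). It is not ROSS's transfer model: the function names are hypothetical, the link bandwidth is a made-up figure, and tensor-parallel sharding is ignored. The layer, head, and dimension values are those of Llama-3.1-70B.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_element: int = 2,   # 2 bytes for fp16/bf16
) -> int:
    """Total KV-cache size for num_tokens tokens across all layers."""
    # Factor 2 accounts for the separate K and V tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element * num_tokens


def transfer_seconds(
    num_bytes: int,
    bandwidth_bytes_per_s: float,
    fixed_latency_s: float = 0.0,
) -> float:
    """Simple latency-plus-bandwidth estimate of the transfer time."""
    return fixed_latency_s + num_bytes / bandwidth_bytes_per_s


# Llama-3.1-70B: 80 layers, 8 KV heads, head dimension 128; 4096-token prompt.
size = kv_cache_bytes(num_tokens=4096, num_layers=80, num_kv_heads=8, head_dim=128)
# Hypothetical 50 GB/s prefill-to-decode link.
seconds = transfer_seconds(size, bandwidth_bytes_per_s=50e9)
print(f"{size / 2**30:.2f} GiB, {seconds * 1e3:.1f} ms")
```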

Multi-dimensional sweeps

Any sweep dimension — backend, model, parallelism, batch size, request rate, dataset, platform, engine argument — can be a list. ROSS iterates the Cartesian product and writes one row per cell to the CSV record:

{
    "backend":  ["vllm", "sglang"],
    "model":    ["Qwen2.5-72B-Instruct", "Llama-3.1-70B"],
    "parallel": ["1:1:8", "1:1:4@1:1:4"],
    "batch":    [32, 64],
    "rate":     ["1", "2", "4", "inf"],
    "inputs":   ["sharegpt@500_100", "repoqa@4096_1024"],
    "platforms": [
        {"gpu": "H200", "version": "0.6.6.post1"},
        {"gpu": "B200", "version": "0.7.0"}
    ],
    "model_search_paths": "/path/to/models",
    "modeling_dir":       "/path/to/modeling"
}
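
The number of cells in such a sweep is the product of the list lengths. The sketch below enumerates the naive Cartesian product of the config above with itertools.product; iter_cells is a hypothetical helper, and ROSS may skip or merge combinations that this simple enumeration keeps.

```python
import itertools
from typing import Any, Dict, Iterator

SWEEP: Dict[str, Any] = {
    "backend": ["vllm", "sglang"],
    "model": ["Qwen2.5-72B-Instruct", "Llama-3.1-70B"],
    "parallel": ["1:1:8", "1:1:4@1:1:4"],
    "batch": [32, 64],
    "rate": ["1", "2", "4", "inf"],
    "inputs": ["sharegpt@500_100", "repoqa@4096_1024"],
    "platforms": [
        {"gpu": "H200", "version": "0.6.6.post1"},
        {"gpu": "B200", "version": "0.7.0"},
    ],
}


def iter_cells(sweep: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per point in the Cartesian product of the sweep axes."""
    keys = list(sweep)
    # A scalar value is treated as a single-element axis.
    axes = [v if isinstance(v, list) else [v] for v in sweep.values()]
    for combo in itertools.product(*axes):
        yield dict(zip(keys, combo))


cells = list(iter_cells(SWEEP))
print(len(cells))   # 2 * 2 * 2 * 2 * 4 * 2 * 2 = 256
print(cells[0])
```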

Engine-specific arguments

Forward framework-specific knobs through ross_extra (config file) or --args (CLI). Each entry may contain lists to trigger an inner sweep.

"ross_extra": [
    {
        "backend":              "sglang",
        "mem_fraction_static":  [0.85, 0.9],
        "chunked_prefill_size": [8192, 16384]
    },
    {
        "backend":                "vllm",
        "gpu_memory_utilization": [0.9],
        "max_num_batched_tokens": [8192, 16384]
    }
]

CLI equivalent:

--args "sglang@mem_fraction_static=0.9,chunked_prefill_size=8192"
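
The relationship between the two forms can be sketched as follows: one function parses the --args string shown above, and another expands a ross_extra entry into its inner sweep. Both functions are hypothetical, and the string grammar (backend@key=value,key=value) is inferred from the single example in this section; see docs/bench_config.md for the authoritative format.

```python
import itertools
from typing import Any, Dict, Iterator


def _coerce(text: str) -> Any:
    """Interpret a value as int, then float, falling back to the raw string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def parse_engine_args(arg: str) -> Dict[str, Any]:
    """Parse 'backend@key=value,key=value' into a ross_extra-style entry."""
    backend, sep, pairs = arg.partition("@")
    if not sep:
        raise ValueError(f"expected backend@key=value,..., got {arg!r}")
    entry: Dict[str, Any] = {"backend": backend.strip()}
    for pair in pairs.split(","):
        key, eq, value = pair.partition("=")
        if not eq:
            raise ValueError(f"expected key=value, got {pair!r}")
        entry[key.strip()] = _coerce(value.strip())
    return entry


def expand_entry(entry: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Expand list-valued fields of one entry into concrete combinations."""
    keys = [k for k in entry if k != "backend"]
    axes = [entry[k] if isinstance(entry[k], list) else [entry[k]] for k in keys]
    for combo in itertools.product(*axes):
        yield {"backend": entry["backend"], **dict(zip(keys, combo))}


sglang_entry = {
    "backend": "sglang",
    "mem_fraction_static": [0.85, 0.9],
    "chunked_prefill_size": [8192, 16384],
}
print(parse_engine_args("sglang@mem_fraction_static=0.9,chunked_prefill_size=8192"))
print(list(expand_entry(sglang_entry)))
```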

Parallel workers

Large sweeps are parallelized across CPUs via --max-workers and --threads-per-worker. Both default to auto-selection based on the host's CPU count and scale linearly with available cores.

Profiling a new GPU platform

ROSS's data plane is trained from sparse per-platform profiles collected under collector/. To target a new GPU, run the provided profiling scripts on that platform (~3–4 wall-clock hours on an 8-GPU node) and retrain the stage-wise regressor. Then set modeling_dir in your config to the directory containing the resulting XGBoost model. The control plane requires no changes because it is reused from the native serving framework.

Step-by-step instructions for profiling a new platform are in docs/profiling.md.

Stage-level discrepancy analysis

Because ROSS decomposes each iteration into pre/forward/post stages, comparing simulated and measured stage times isolates the stage in which a real system deviates from its predicted behavior. This was used to localize a batch-boundary bottleneck in SGLang's TokenizerManager, whose structural fix reduced end-to-end latency by 36% at high concurrency.
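
A minimal version of that comparison is sketched below. It assumes per-stage times are available as plain dictionaries keyed by pre, forward, and post; the function names and the sample numbers are hypothetical and do not reflect ROSS's trace format.

```python
from typing import Dict

STAGES = ("pre", "forward", "post")


def stage_discrepancy(simulated: Dict[str, float], measured: Dict[str, float]) -> Dict[str, float]:
    """Relative deviation of measured from simulated time, per stage."""
    return {s: (measured[s] - simulated[s]) / simulated[s] for s in STAGES}


def worst_stage(simulated: Dict[str, float], measured: Dict[str, float]) -> str:
    """Stage with the largest relative deviation in either direction."""
    gaps = stage_discrepancy(simulated, measured)
    return max(gaps, key=lambda s: abs(gaps[s]))


# Hypothetical per-iteration stage times in milliseconds.
simulated = {"pre": 1.0, "forward": 20.0, "post": 0.5}
measured = {"pre": 1.1, "forward": 20.4, "post": 1.5}

for stage, gap in stage_discrepancy(simulated, measured).items():
    print(f"{stage}: {gap:+.0%}")
print("largest deviation:", worst_stage(simulated, measured))
```

In this made-up example the forward stage matches the simulation closely while the post stage takes three times as long as predicted, which is the kind of signal that points at overhead outside the model's forward pass.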

Supported Models

ROSS's stage-wise regressor takes model configuration features as input rather than per-model kernel calibration, so new models within a supported family typically work out of the box. Validated models include:

Family	Variants
Llama-3.1	8B, 70B
Qwen2.5	72B-Instruct
Qwen3	32B, 30B-A3B (MoE), 235B-A22B (MoE), QwQ 32B
DeepSeek-V3	671B (MoE)
gpt-oss	20b, 120b

Backends: vLLM, SGLang
Deployment modes: colocated, PD-disaggregated
Hardware: NVIDIA H200, B200 (extensible via the collector/ workflow)
Datasets: ShareGPT, RepoQA, AIME

Models and platforms outside this list can generally be added by:

Placing the HF-format model under your configured model_search_paths, and
Profiling the target GPU with the collector/ scripts if it is not H200/B200.
