Description
What would you like to be added:
A comprehensive and well-structured input and output format for benchmarking tools.
Why is this needed:
A consistent data format which unambiguously describes a benchmarking experiment and results has multiple benefits:
- Having all relevant parameters describing benchmarking inputs and the specific environment they were executed in will make it clear to anyone examining the data exactly what was measured.
- Experiments can be easily repeated by others; if done correctly, anyone repeating an experiment should get the same result, within a reasonable margin.
- Tools utilizing benchmarking data will have a stable format they can rely on, one that can be trusted to contain all the data needed to draw a result or conclusion.
- Combining benchmarking results from multiple sources to perform analysis will be just as easy as analyzing data from a single source.
- Rather than debating what information should go into each benchmark, if all available useful data is consistently captured (the community will have to decide what this is), then less time and energy will be spent repeating similar experiments just to capture the specific pieces of data that happened to be wanted at the time of a particular experiment.
Draft report proposal (needs feedback)
Below is a draft output format for aggregate benchmarking results (a "report"). This is distinct from the traces captured from a single inference execution. Statistics from multiple traces of identical scenarios are combined to create a report.
The report has three top-level fields: `version`, `scenario`, and `metrics`.
While each of these fields is required, some subfields may be optional or not apply to the specific benchmarking configuration being performed. For example, some metrics may not be captured or supported by a certain benchmarking toolset.
The `version` field is used to track the specific data format. Should the schema change with future revisions, this field will identify the specific format used (with, for example, a corresponding JSON Schema).
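As a rough illustration of how this field could be consumed (not part of the proposal itself), a tool might select a schema based on `version` and validate a report against it; the schema registry and file paths below are hypothetical:

```python
# Hypothetical sketch: pick a JSON Schema by the report's "version" field and
# validate the report against it. Schema file names are illustrative only.
import json
from jsonschema import validate  # pip install jsonschema

SCHEMAS = {"0.1": "schemas/benchmark-report-0.1.json"}  # assumed registry

def validate_report(report: dict) -> None:
    with open(SCHEMAS[report["version"]]) as f:
        schema = json.load(f)
    validate(instance=report, schema=schema)  # raises ValidationError on mismatch
```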
The `scenario` field describes precisely what was measured. This includes details about the inference platform (the full stack, including versions and the important runtime arguments), the cluster configuration (such as the GPUs and parallelisms used), and the workload. The content in this field should be detailed enough to launch a repeat benchmarking experiment that yields similar results (within some reasonable bound of variability).
The `scenario` field could also be used as an input format for benchmarking. If it were adopted as a standard input format, it would need to be expanded to describe which parameters to sweep. For example, to benchmark at 1, 2, 5, and 10 QPS request rates, each request rate would be a distinct scenario and should come with its own report metrics.
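To make the sweep idea concrete, here is a hedged sketch; the `sweep` mapping and the dotted-path convention are hypothetical and not part of the draft format:

```python
# Hypothetical sketch: expand a "sweep" specification into one concrete
# scenario per parameter value. The sweep format shown is an assumption.
import copy
import itertools

def expand_sweep(base_scenario: dict, sweep: dict) -> list[dict]:
    """sweep maps a dotted path (e.g. 'load.args.qps_values') to a list of values."""
    keys, value_lists = zip(*sweep.items())
    scenarios = []
    for combo in itertools.product(*value_lists):
        scenario = copy.deepcopy(base_scenario)
        for dotted_key, value in zip(keys, combo):
            node = scenario
            *parents, leaf = dotted_key.split(".")
            for part in parents:
                node = node.setdefault(part, {})
            node[leaf] = value
        scenarios.append(scenario)
    return scenarios

# Four request rates become four distinct scenarios, each yielding its own report.
variants = expand_sweep({"load": {"args": {}}}, {"load.args.qps_values": [1, 2, 5, 10]})
```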
The `metrics` field contains all of the results for the report. It does not include individual trace details; rather, it holds statistics across all runs captured in benchmarking for a particular scenario. This includes request-level performance metrics (like latencies and throughput), details about the inference service (like request queue lengths and KV cache size), and hardware metrics (such as GPU compute and memory utilization).
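For illustration only, a minimal sketch of how a tool might reduce per-request samples from many traces into the statistics blocks used in the example below (the sample shape and helper name are assumptions):

```python
# Hypothetical sketch: collapse per-request samples into the stats block
# (units/mean/stddev/min/p10/p50/p90/max) used throughout "metrics".
import numpy as np

def summarize(samples: list[float], units: str) -> dict:
    arr = np.asarray(samples, dtype=float)
    p10, p50, p90 = np.percentile(arr, [10, 50, 90])
    return {
        "units": units,
        "mean": float(arr.mean()),
        "stddev": float(arr.std(ddof=1)),  # sample standard deviation
        "min": float(arr.min()),
        "p10": float(p10),
        "p50": float(p50),
        "p90": float(p90),
        "max": float(arr.max()),
    }

# e.g. metrics["latency"]["time_to_first_token"] = summarize(ttft_samples, "ms")
```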
Items needing work
While the overall proposed format should be scrutinized, these areas in particular need work:
- How should the workload be described (the `scenario.load` section)? There are many different workload generators, and ideally this section should be flexible enough to support a wide array of them. We need to create a structure here that can properly support them while still maintaining consistency.
- How do we deal with heterogeneous accelerator setups? Do we split them into homogeneous groups and analyze each of these groups separately?
Data format
- JSON/YAML allows for an easily readable and descriptive format that can be extended without issue (adding new fields or metadata won't break things). JSON may be preferred, but YAML is used here to allow for comments in the example. Collections of results could be a JSONL file (one JSON artifact per line in a single file, separated by `\n`) or separate files; see the sketch after this list.
- CSV/TSV with a header would be more compact, but the format is too rigid: inserting or removing columns would break things.
- Binary would be most efficient (in size and in the ability to quickly parse large datasets), but more difficult to work with. The ease of use of JSON/YAML likely outweighs the benefits of binary for storing results after benchmarking (performance and size are not critical there). More efficient binary formats may be useful when analyzing large collections of benchmarking results, and that conversion can be done later.
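A minimal sketch of the JSONL option mentioned above (the file path and report contents are placeholders):

```python
# Hypothetical sketch: store a collection of reports as JSONL, one JSON
# artifact per line, and read them back for analysis.
import json

def append_report(path: str, report: dict) -> None:
    with open(path, "a") as f:
        f.write(json.dumps(report) + "\n")

def load_reports(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```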
Example
This example shows the proposed structure and fields for a report. The values themselves are simply placeholders and may not be meaningful or consistent across the example.
```yaml
version: '0.1' # Apply a version that updates with schema changes
scenario: # This section provides the specific environment and workload
  description: This is a heterogeneous accelerator setup of DeepSeek with two lora adapters
  host:
    accelerator: # A list of more than 1 here gives a heterogeneous setup; not sure if this is clear or expressive enough
      - model: H100
        memory: 80
        count: 32
        parallelism:
          dp: 1
          tp: 8
          pp: 4
          ep: 8
      - model: L40
        memory: 80
        count: 32
        parallelism:
          dp: 1
          tp: 8
          pp: 4
          ep: 8
  platform:
    engine:
      name: vllm
      version: 0.9.0.1
      args:
        "--dtype": fp16
        "--tensor-parallel-size": 8
        "--pipeline-parallel-size": 4
        "--enable-expert-parallel": true
    model:
      name: deepseek-ai/DeepSeek-R1-0528
      quantization: fp16
      adapters:
        - lora: sql_adapter
        - lora: golang_adapter
  load: # Unsure about best format here... in principle this should contain enough information to execute a load generator
    name: lmbenchmark
    type: long-input
    args:
      qps_values: 1.34
      num_users_warmup: 20
      num_users: 15
      num_rounds: 20
      system_prompt: 1000
      chat_history: 20000
      answer_len: 100
      test_duration: 100
      use_chat_completions: false
  experiment:
    duration: 16.531641244888306
    start: 1749570583.5714512 # UTC seconds from epoch
    stop: 1749570580.1030924
metrics: # These are the aggregate results from benchmarking
  requests:
    total: 32
    failures: 0
    input_length:
      units: count
      mean: 628.606060606061
      stddev: 19.8353456345
      min: 4
      p10: 11
      p50: 364
      p90: 2427
      max: 3836
    output_length:
      units: count
      mean: 31.7878787878788
      stddev: 19.8353456345
      min: 30
      p10: 31
      p50: 32
      p90: 32
      max: 32
  latency:
    request_latency:
      units: ms
      mean: 3.31325431142327
      stddev: 0.00198353456345
      min: 1.62129471905064
      p10: 1.67609986825846
      p50: 2.11507539497688
      p90: 5.94717199734878
      max: 6.30658466403838
    normalized_time_per_output_token:
      units: ms/token
      mean: 0.104340420636009
      stddev: 0.00198353456345
      min: 0.0506654599703325
      p10: 0.0523781208830769
      p50: 0.0670631669655753
      p90: 0.189047570470012
      max: 0.20343821496898
    time_per_output_token:
      units: ms/token
      mean: 0.0836929455635872
      stddev: 0.00198353456345
      min: 0.0517028436646797
      p10: 0.0530815053513894
      p50: 0.0611870964678625
      p90: 0.152292036800645
      max: 0.17837208439984
    time_to_first_token:
      units: ms
      mean: 0.800974442732916
      stddev: 0.00198353456345
      min: 0.0625283779809251
      p10: 0.072068731742911
      p50: 0.203539535985328
      p90: 2.26959549135063
      max: 4.46773961000145
    inter_token_latency:
      units: ms
      mean: 0.0836929455635872
      stddev: 0.00198353456345
      min: 7.129972800612e-06
      p10: 0.0534287681337446
      p50: 0.0591336835059337
      p90: 0.084046097996179
      max: 0.614475268055685
    e2e:
      units: s
      mean: 40.56
      stddev: 12.15
      min: 20.3
      p10: 24.4
      p50: 31.3
      p90: 47.1
      max: 49.2
  throughput:
    input_tokens_per_sec: 643.576644186323
    output_tokens_per_sec: 32.544923821416
    total_tokens_per_sec: 676.121568007739
    requests_per_sec: 1.0238155253639
  service: # These are metrics about the inference service
    batch_size:
      units: count
      mean: 234.23049
      stddev: 34.12342
      min: 123
      p10: 143
      p50: 533
      p90: 625
      max: 753
    queue_size:
      units: count
      mean: 234.12451
      stddev: 34.56737
      min: 123
      p10: 143
      p50: 533
      p90: 625
      max: 753
    kv_cache_size:
      units: MiB
      mean: 2194993.253
      stddev: 2342.3456
      min: 1194345
      p10: 1394456
      p50: 2404751
      p90: 2534437
      max: 2554393
  resources: # These are hardware level metrics
    accelerator: # This list correlates 1:1 with the heterogeneous groups defined in scenario.host.accelerator
      - memory: # 1st group of GPU types here
          consumption:
            units: MiB
            mean: 2194993.2346
            stddev: 2342.4568
            min: 1194345
            p10: 1394456
            p50: 2404751
            p90: 2534437
            max: 2554393
          utilization:
            units: percent
            mean: 80.235
            stddev: 32.1
            min: 40.3
            p10: 44.4
            p50: 71.3
            p90: 97.1
            max: 99.2
          bandwidth:
            units: MiB/s
            mean: 21993.2346
            stddev: 22.4568
            min: 19445.2347
            p10: 13456.5367
            p50: 24051.2456
            p90: 24437.4582
            max: 25543.3457
        compute:
          utilization:
            units: percent
            mean: 40.56
            stddev: 12.15
            min: 20.3
            p10: 24.4
            p50: 31.3
            p90: 47.1
            max: 49.2
        power:
          units: watts
          mean: 410.02
          stddev: 170.1
          min: 201.3
          p10: 243.4
          p50: 314.3
          p90: 475.1
          max: 497.2
      - memory: # 2nd group of GPU types here
          consumption:
            units: MiB
            mean: 2194993.2346
            stddev: 2342.4568
            min: 1194345
            p10: 1394456
            p50: 2404751
            p90: 2534437
            max: 2554393
          utilization:
            units: percent
            mean: 80.235
            stddev: 32.1
            min: 40.3
            p10: 44.4
            p50: 71.3
            p90: 97.1
            max: 99.2
          bandwidth:
            units: MiB/s
            mean: 21993.2346
            stddev: 22.4568
            min: 19445.2347
            p10: 13456.5367
            p50: 24051.2456
            p90: 24437.4582
            max: 25543.3457
        compute:
          utilization:
            units: percent
            mean: 40.56
            stddev: 12.15
            min: 20.3
            p10: 24.4
            p50: 31.3
            p90: 47.1
            max: 49.2
        power:
          units: watts
          mean: 410.02
          stddev: 170.1
          min: 201.3
          p10: 243.4
          p50: 314.3
          p90: 475.1
          max: 497.2
```
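Finally, as an illustration of the multi-source analysis benefit described earlier, a hedged sketch that compares a single metric across reports produced by different tools or teams (file names are placeholders; the key path follows the example layout above):

```python
# Hypothetical sketch: because every report shares the same structure, results
# from different sources can be compared directly.
import yaml  # pip install pyyaml

def p90_ttft(path: str) -> float:
    with open(path) as f:
        report = yaml.safe_load(f)
    return report["metrics"]["latency"]["time_to_first_token"]["p90"]

for path in ["team_a_report.yaml", "team_b_report.yaml"]:
    print(path, p90_ttft(path))
```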