Description
What would you like to be added:
A comprehensive and well-structured input and output format for benchmarking tools.
Why is this needed:
A consistent data format which unambiguously describes a benchmarking experiment and results has multiple benefits:
- Having all relevant parameters describing benchmarking inputs and the specific environment they were executed in will make it clear to anyone examining the data exactly what was measured.
- Experiments can be easily repeated by others; if done correctly, anyone repeating an experiment should get the same result, within a reasonable margin.
- Tools utilizing benchmarking data will have a stable format they can rely on, one that can be trusted to contain all the data needed to draw a result or conclusion.
- Combining benchmarking results from multiple sources to perform analysis will be just as easy as analyzing data from a single source.
- Rather than debating what information should go into each benchmark, if all available useful data is consistently captured (the community will have to decide what this is), then less time and energy will be spent repeating similar experiments just to capture the specific pieces of data that happened to be wanted at the time of a particular experiment.
Draft report proposal (needs feedback)
Below is a draft output format for aggregate benchmarking results (a "report"). This is distinct from the traces captured from a single inference execution. Statistics from multiple traces of identical scenarios are combined to create a report.
The report has three top-level fields: `version`, `scenario`, and `metrics`.
While each of these fields is required, some subfields may be optional or not apply to the specific benchmarking configuration being performed. For example, some metrics may not be captured or supported by a certain benchmarking toolset.
The `version` field is used to track the specific data format. Should the schema change with future revisions, this field will identify the specific format used (with, for example, a corresponding JSON Schema).
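As a rough illustration of how this field could be consumed (not part of the proposal itself), a tool might select a schema based on `version` and validate a report against it; the schema registry and file paths below are hypothetical:

```python
# Hypothetical sketch: pick a JSON Schema by the report's "version" field and
# validate the report against it. Schema file names are illustrative only.
import json
from jsonschema import validate  # pip install jsonschema

SCHEMAS = {"0.1": "schemas/benchmark-report-0.1.json"}  # assumed registry

def validate_report(report: dict) -> None:
    with open(SCHEMAS[report["version"]]) as f:
        schema = json.load(f)
    validate(instance=report, schema=schema)  # raises ValidationError on mismatch
```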
The `scenario` field describes precisely what was measured. This includes details about the inference platform (the full stack, including versions and the important runtime arguments), the cluster configuration (such as the GPUs and parallelisms used), and the workload. The content in this field should be detailed enough to launch a repeat benchmarking experiment that yields similar results (within some reasonable bound of variability).
The `scenario` field could also be used as an input format for benchmarking. If it were adopted as a standard input format, it would need to be expanded to describe which parameters to sweep. For example, to benchmark at 1, 2, 5, and 10 QPS request rates, each request rate would be a distinct scenario and should come with its own report metrics.
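To make the sweep idea concrete, here is a hedged sketch; the `sweep` mapping and the dotted-path convention are hypothetical and not part of the draft format:

```python
# Hypothetical sketch: expand a "sweep" specification into one concrete
# scenario per parameter value. The sweep format shown is an assumption.
import copy
import itertools

def expand_sweep(base_scenario: dict, sweep: dict) -> list[dict]:
    """sweep maps a dotted path (e.g. 'load.args.qps_values') to a list of values."""
    keys, value_lists = zip(*sweep.items())
    scenarios = []
    for combo in itertools.product(*value_lists):
        scenario = copy.deepcopy(base_scenario)
        for dotted_key, value in zip(keys, combo):
            node = scenario
            *parents, leaf = dotted_key.split(".")
            for part in parents:
                node = node.setdefault(part, {})
            node[leaf] = value
        scenarios.append(scenario)
    return scenarios

# Four request rates become four distinct scenarios, each yielding its own report.
variants = expand_sweep({"load": {"args": {}}}, {"load.args.qps_values": [1, 2, 5, 10]})
```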
The `metrics` field contains all of the results for the report. It does not include individual trace details; rather, it holds statistics across all runs captured in benchmarking for a particular scenario. This includes request-level performance metrics (like latencies and throughput), details about the inference service (like request queue lengths and KV cache size), and hardware metrics (such as GPU compute and memory utilization).
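For illustration only, a minimal sketch of how a tool might reduce per-request samples from many traces into the statistics blocks used in the example below (the sample shape and helper name are assumptions):

```python
# Hypothetical sketch: collapse per-request samples into the stats block
# (units/mean/stddev/min/p10/p50/p90/max) used throughout "metrics".
import numpy as np

def summarize(samples: list[float], units: str) -> dict:
    arr = np.asarray(samples, dtype=float)
    p10, p50, p90 = np.percentile(arr, [10, 50, 90])
    return {
        "units": units,
        "mean": float(arr.mean()),
        "stddev": float(arr.std(ddof=1)),  # sample standard deviation
        "min": float(arr.min()),
        "p10": float(p10),
        "p50": float(p50),
        "p90": float(p90),
        "max": float(arr.max()),
    }

# e.g. metrics["latency"]["time_to_first_token"] = summarize(ttft_samples, "ms")
```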
Items needing work
While the overall proposed format should be scrutinized, these areas in particular need work:
- How should the workload be described (the `scenario.load` section)? There are many different workload generators, and ideally this section should be flexible enough to support a wide array of them. We need to create a structure here that can properly support them while still maintaining consistency.
- How do we deal with heterogeneous accelerator setups? Do we split them into homogeneous groups and analyze each of these groups separately?
Data format
- JSON/YAML allows for an easily readable and descriptive format that can be extended without issue (adding new fields or metadata won't break things). JSON may be preferred, but YAML is used here to allow for comments in the example. Collections of results could be a JSONL file (one JSON artifact per line in a single file, separated by `\n`) or separate files; see the sketch after this list.
- CSV/TSV with a header would be more compact, but the format is too rigid: inserting or removing columns would break things.
- Binary would be most efficient (in size and in the ability to quickly parse large datasets), but more difficult to work with. The ease of use of JSON/YAML likely outweighs the benefits of binary for storing results after benchmarking (performance and size are not critical there). More efficient binary formats may be useful when analyzing large collections of benchmarking results, and that conversion can be done later.
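A minimal sketch of the JSONL option mentioned above (the file path and report contents are placeholders):

```python
# Hypothetical sketch: store a collection of reports as JSONL, one JSON
# artifact per line, and read them back for analysis.
import json

def append_report(path: str, report: dict) -> None:
    with open(path, "a") as f:
        f.write(json.dumps(report) + "\n")

def load_reports(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```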
Example
This example shows the proposed structure and fields for a report. The values themselves are simply placeholders and may not be meaningful or consistent across the example.
```yaml
version: '0.1' # Apply a version that updates with schema changes
scenario: # This section provides the specific environment and workload
  description: This is a heterogeneous accelerator setup of DeepSeek with two lora adapters
  host:
    accelerator: # A list of more than 1 here gives a heterogeneous setup; not sure if this is clear or expressive enough
      - model: H100
        memory: 80
        count: 32
        parallelism:
          dp: 1
          tp: 8
          pp: 4
          ep: 8
      - model: L40
        memory: 80
        count: 32
        parallelism:
          dp: 1
          tp: 8
          pp: 4
          ep: 8
  platform:
    engine:
      name: vllm
      version: 0.9.0.1
      args:
        "--dtype": fp16
        "--tensor-parallel-size": 8
        "--pipeline-parallel-size": 4
        "--enable-expert-parallel": true
    model:
      name: deepseek-ai/DeepSeek-R1-0528
      quantization: fp16
      adapters:
        - lora: sql_adapter
        - lora: golang_adapter
  load: # Unsure about best format here... in principle this should contain enough information to execute a load generator
    name: lmbenchmark
    type: long-input
    args:
      qps_values: 1.34
      num_users_warmup: 20
      num_users: 15
      num_rounds: 20
      system_prompt: 1000
      chat_history: 20000
      answer_len: 100
      test_duration: 100
      use_chat_completions: false
  experiment:
    duration: 16.531641244888306
    start: 1749570583.5714512 # UTC seconds from epoch
    stop: 1749570580.1030924
metrics: # These are the aggregate results from benchmarking
  requests:
    total: 32
    failures: 0
    input_length:
      units: count
      mean: 628.606060606061
      stddev: 19.8353456345
      min: 4
      p10: 11
      p50: 364
      p90: 2427
      max: 3836
    output_length:
      units: count
      mean: 31.7878787878788
      stddev: 19.8353456345
      min: 30
      p10: 31
      p50: 32
      p90: 32
      max: 32
  latency:
    request_latency:
      units: ms
      mean: 3.31325431142327
      stddev: 0.00198353456345
      min: 1.62129471905064
      p10: 1.67609986825846
      p50: 2.11507539497688
      p90: 5.94717199734878
      max: 6.30658466403838
    normalized_time_per_output_token:
      units: ms/token
      mean: 0.104340420636009
      stddev: 0.00198353456345
      min: 0.0506654599703325
      p10: 0.0523781208830769
      p50: 0.0670631669655753
      p90: 0.189047570470012
      max: 0.20343821496898
    time_per_output_token:
      units: ms/token
      mean: 0.0836929455635872
      stddev: 0.00198353456345
      min: 0.0517028436646797
      p10: 0.0530815053513894
      p50: 0.0611870964678625
      p90: 0.152292036800645
      max: 0.17837208439984
    time_to_first_token:
      units: ms
      mean: 0.800974442732916
      stddev: 0.00198353456345
      min: 0.0625283779809251
      p10: 0.072068731742911
      p50: 0.203539535985328
      p90: 2.26959549135063
      max: 4.46773961000145
    inter_token_latency:
      units: ms
      mean: 0.0836929455635872
      stddev: 0.00198353456345
      min: 7.129972800612e-06
      p10: 0.0534287681337446
      p50: 0.0591336835059337
      p90: 0.084046097996179
      max: 0.614475268055685
    e2e:
      units: s
      mean: 40.56
      stddev: 12.15
      min: 20.3
      p10: 24.4
      p50: 31.3
      p90: 47.1
      max: 49.2
  throughput:
    input_tokens_per_sec: 643.576644186323
    output_tokens_per_sec: 32.544923821416
    total_tokens_per_sec: 676.121568007739
    requests_per_sec: 1.0238155253639
  service: # These are metrics about the inference service
    batch_size:
      units: count
      mean: 234.23049
      stddev: 34.12342
      min: 123
      p10: 143
      p50: 533
      p90: 625
      max: 753
    queue_size:
      units: count
      mean: 234.12451
      stddev: 34.56737
      min: 123
      p10: 143
      p50: 533
      p90: 625
      max: 753
    kv_cache_size:
      units: MiB
      mean: 2194993.253
      stddev: 2342.3456
      min: 1194345
      p10: 1394456
      p50: 2404751
      p90: 2534437
      max: 2554393
  resources: # These are hardware level metrics
    accelerator: # This list correlates 1:1 with the heterogeneous groups defined in scenario.host.accelerator
      - memory: # 1st group of GPU types here
          consumption:
            units: MiB
            mean: 2194993.2346
            stddev: 2342.4568
            min: 1194345
            p10: 1394456
            p50: 2404751
            p90: 2534437
            max: 2554393
          utilization:
            units: percent
            mean: 80.235
            stddev: 32.1
            min: 40.3
            p10: 44.4
            p50: 71.3
            p90: 97.1
            max: 99.2
          bandwidth:
            units: MiB/s
            mean: 21993.2346
            stddev: 22.4568
            min: 19445.2347
            p10: 13456.5367
            p50: 24051.2456
            p90: 24437.4582
            max: 25543.3457
        compute:
          utilization:
            units: percent
            mean: 40.56
            stddev: 12.15
            min: 20.3
            p10: 24.4
            p50: 31.3
            p90: 47.1
            max: 49.2
        power:
          units: watts
          mean: 410.02
          stddev: 170.1
          min: 201.3
          p10: 243.4
          p50: 314.3
          p90: 475.1
          max: 497.2
      - memory: # 2nd group of GPU types here
          consumption:
            units: MiB
            mean: 2194993.2346
            stddev: 2342.4568
            min: 1194345
            p10: 1394456
            p50: 2404751
            p90: 2534437
            max: 2554393
          utilization:
            units: percent
            mean: 80.235
            stddev: 32.1
            min: 40.3
            p10: 44.4
            p50: 71.3
            p90: 97.1
            max: 99.2
          bandwidth:
            units: MiB/s
            mean: 21993.2346
            stddev: 22.4568
            min: 19445.2347
            p10: 13456.5367
            p50: 24051.2456
            p90: 24437.4582
            max: 25543.3457
        compute:
          utilization:
            units: percent
            mean: 40.56
            stddev: 12.15
            min: 20.3
            p10: 24.4
            p50: 31.3
            p90: 47.1
            max: 49.2
        power:
          units: watts
          mean: 410.02
          stddev: 170.1
          min: 201.3
          p10: 243.4
          p50: 314.3
          p90: 475.1
          max: 497.2
```
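Finally, as an illustration of the multi-source analysis benefit described earlier, a hedged sketch that compares a single metric across reports produced by different tools or teams (file names are placeholders; the key path follows the example layout above):

```python
# Hypothetical sketch: because every report shares the same structure, results
# from different sources can be compared directly.
import yaml  # pip install pyyaml

def p90_ttft(path: str) -> float:
    with open(path) as f:
        report = yaml.safe_load(f)
    return report["metrics"]["latency"]["time_to_first_token"]["p90"]

for path in ["team_a_report.yaml", "team_b_report.yaml"]:
    print(path, p90_ttft(path))
```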