sdif-format/sdif-benchmarks

SDIF Benchmarks

Evidence-first benchmarks measuring SDIF against JSON, YAML, XML, CSV Bundle
and other formats from the perspective of AI and LLM developers.

Tracks · Quick start · Latest results · Corpus model · Result model · Environment

Evidence first · Shared canonical fixtures · Deterministic


Every compared representation is derived from the same canonical JSON source. Claims must name the tokenizer and document coverage that produced them. Optional external tools degrade gracefully.
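One way to read "degrade gracefully" for an optional tokenizer is sketched below. This is an illustration, not the repository's code: `count_tokens` is a hypothetical helper and the four-characters-per-token fallback is an assumed heuristic. Only `tiktoken.get_encoding` and `Encoding.encode` are real library calls. Returning the method name alongside the count keeps any resulting claim tied to the tokenizer that produced it.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> tuple[int, str]:
    """Return (token_count, method). Hypothetical helper, not the repo's API."""
    try:
        import tiktoken  # optional dependency; may not be installed

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text)), f"tiktoken:{encoding_name}"
    except Exception:
        # tiktoken is missing or the encoding cannot be loaded (for example
        # when offline): fall back to a rough estimate of about four
        # characters per token, rounded up. The heuristic is an assumption.
        return (len(text) + 3) // 4, "estimate"
```

A caller can then report the count together with the method string instead of failing when the optional tool is absent.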



Benchmark tracks

Token efficiency

Byte and token reduction across shared semantic fixtures. Ranks all formats against JSON Compact as the stable baseline.

Context packing

How many copies of a document fit inside fixed token budgets (4K, 8K, 32K, 128K). Reports fit rate and median copies per budget.

Round-trip fidelity

How well a JSON→format→JSON conversion preserves the original. Scores value, type and structure fidelity. Not applicable to SDIF AI and TOON.

Delta compactness

Token overhead of re-sending a mutated document. Applies a deterministic mutation to the first 10% of leaf values.

Retrieval accuracy

LLM answer quality by format, scored by deterministic validators with no LLM judge. Opt-in: requires ANTHROPIC_API_KEY.

Semantic quality

Checks that SDIF preserves relations, rules, schema validation, canonicalization and reversible AI projection boundaries.
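The context packing track comes down to simple arithmetic, sketched here under stated assumptions. `packing_stats` is a hypothetical name, the budgets are assumed to be multiples of 1,024 tokens, and "fit rate" is assumed to mean the share of documents that fit at least once. The actual runner may define any of these differently.

```python
from statistics import median

# Assumption: "4K" means 4 * 1024 tokens, and so on for the other budgets.
BUDGETS = {"4K": 4 * 1024, "8K": 8 * 1024, "32K": 32 * 1024, "128K": 128 * 1024}


def packing_stats(doc_tokens: list[int], budget: int) -> dict:
    """Summarize how many whole copies of each document fit in one budget.

    doc_tokens holds one positive token count per document, all for the
    same format and tokenizer.
    """
    copies = [budget // tokens for tokens in doc_tokens]
    fitting = sum(1 for c in copies if c >= 1)
    return {
        "fit_rate": fitting / len(doc_tokens),
        "median_copies": median(copies),
    }
```

Calling `packing_stats(counts, BUDGETS["8K"])` once per format gives numbers that can be compared across formats for the same corpus and tokenizer.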

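For the delta compactness track, a deterministic mutation of the first 10% of leaf values could look like the sketch below. All names here are hypothetical, and several details are assumptions: leaves are visited in document order, the 10% is rounded up, and each value is changed without changing its type (null values are left as they are).

```python
import copy


def leaf_paths(node, path=()):
    """Yield the path to every scalar leaf, in document order."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_paths(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from leaf_paths(value, path + (index,))
    else:
        yield path


def mutate_value(value):
    """Change a scalar in a fixed, type-preserving way (assumed rule)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return not value
    if isinstance(value, (int, float)):
        return value + 1
    if isinstance(value, str):
        return value + "_x"
    return value  # None is left unchanged


def mutate_first_leaves(doc, percent: int = 10):
    """Return (mutated_copy, leaves_touched) for a dict or list document."""
    mutated = copy.deepcopy(doc)
    paths = [p for p in leaf_paths(mutated) if p]
    count = (len(paths) * percent + 99) // 100  # integer ceiling of percent
    for path in paths[:count]:
        parent = mutated
        for step in path[:-1]:
            parent = parent[step]
        parent[path[-1]] = mutate_value(parent[path[-1]])
    return mutated, count
```

Because the same input always yields the same mutated document, the token cost of re-sending it can be compared across formats without random noise.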

Quick start

This repository expects access to the core SDIF repository. By default it looks for it at ../sdif; override this with SDIF_CORE_REPO.

# Token reduction across formats
make benchmark-token

# Context-window fit rate by budget
make benchmark-packing

# JSON→format→JSON round-trip fidelity
make benchmark-roundtrip

# Mutation sensitivity (re-send overhead)
make benchmark-delta

# LLM retrieval accuracy by format — opt-in
SDIF_BENCHMARK_RETRIEVAL=1 ANTHROPIC_API_KEY=<key> make benchmark-retrieval

# Semantic quality checks
make benchmark-quality


Latest results

Results from the most recent token efficiency run across 21 documents and 3 tokenizers (Estimate, TokenX, tiktoken), which gives 63 document-tokenizer pairs.

Format         Consensus avg rank   Median ratio vs JSON Compact   Wins (63 pairs)
SDIF AI        1.10                 56.8%                          57
SDIF           2.60                 59.5%                          2
CSV Bundle     2.70                 61.2%                          4
TOON           3.60                 63.2%                          0
YAML           5.35                 95.3%                          0
JSON Compact   5.65                 100.0%                         0
JSON Pretty    7.00                 137.3%                         0
XML            8.00                 171.7%                         0

Tokenizer-specific winners:

Tokenizer   Winning format   Wins
Estimate    SDIF AI          19/21
TokenX      SDIF AI          20/21
tiktoken    SDIF AI          18/21

These results are corpus-dependent. Results for Claude and Llama3 tokenizers require separate opt-in. Full per-document breakdowns live in results/token_efficiency/.
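The "median ratio" and "wins" columns can be read as in the sketch below. It assumes a ratio is a format's token count divided by the JSON Compact count for the same document and tokenizer, and that a win goes to the format with the fewest tokens in a pair. `summarize` is a hypothetical name, the first format listed wins a tie, and the consensus rank calculation is not shown because its exact definition is not given here.

```python
from collections import Counter
from statistics import median


def summarize(
    pairs: dict[tuple[str, str], dict[str, int]],
    baseline: str = "JSON Compact",
) -> dict[str, dict]:
    """Aggregate per-pair token counts into per-format summary figures.

    pairs maps (document, tokenizer) to {format_name: token_count}.
    """
    ratios: dict[str, list[float]] = {}
    wins: Counter = Counter()
    for counts in pairs.values():
        base = counts[baseline]
        for fmt, tokens in counts.items():
            ratios.setdefault(fmt, []).append(tokens / base)
        # The smallest token count wins the pair.
        wins[min(counts, key=counts.get)] += 1
    return {
        fmt: {"median_ratio": median(values), "wins": wins[fmt]}
        for fmt, values in ratios.items()
    }
```

Under this reading, a median ratio below 100% means the format typically needs fewer tokens than JSON Compact for the same document.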



Corpus model

The canonical semantic corpus lives in the core repo's examples/golden/ directory and is not duplicated here. This avoids drift between parser fixtures and benchmark fixtures.

Each fixture contains:

../sdif/examples/golden/<fixture>/
├── equivalent.json     # canonical semantic source (benchmark input)
├── source.sdif         # hand-authored or generated SDIF source
├── canonical.sdif      # canonical SDIF form
└── canonical.sha256    # canonical hash evidence

The benchmark path defaults to ../sdif/examples/golden/ and can be overridden with SDIF_BENCHMARK_GOLDEN_DIR.
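The path resolution described above might be sketched as follows. `golden_dir` and `list_fixtures` are hypothetical helpers, not functions from this repository. The sketch assumes that SDIF_BENCHMARK_GOLDEN_DIR takes precedence over SDIF_CORE_REPO and that a usable fixture is any directory containing equivalent.json.

```python
import os
from pathlib import Path


def golden_dir(environ=None) -> Path:
    """Resolve the corpus directory from the documented environment variables."""
    env = os.environ if environ is None else environ
    override = env.get("SDIF_BENCHMARK_GOLDEN_DIR")
    if override:
        return Path(override)
    core_repo = Path(env.get("SDIF_CORE_REPO") or "../sdif")
    return core_repo / "examples" / "golden"


def list_fixtures(root: Path) -> list[Path]:
    """Return fixture directories that contain equivalent.json, sorted by name."""
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if (p / "equivalent.json").is_file())
```

Returning an empty list for a missing directory lets a caller print a clear message about the missing core repository instead of raising deep inside a run.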



Result model

Each benchmark run writes scratch output to tmp/<track>/ while running and promotes it to results/<track>/ on success. Failed runs leave tmp/<track>/ for diagnosis without touching the last clean result.
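The scratch-then-promote pattern can be sketched like this. `run_track` and the `produce` callback are hypothetical names, and the real runners may promote results differently. Note that the replace step here is not atomic.

```python
import shutil
from pathlib import Path


def run_track(track: str, produce, root: Path = Path(".")) -> Path:
    """Run produce(scratch_dir) and promote its output only on success."""
    scratch = root / "tmp" / track
    final = root / "results" / track

    shutil.rmtree(scratch, ignore_errors=True)  # start from a clean scratch dir
    scratch.mkdir(parents=True)

    # If produce raises, nothing below runs: the scratch directory stays in
    # place for diagnosis and the last clean result is untouched.
    produce(scratch)

    shutil.rmtree(final, ignore_errors=True)
    final.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(scratch), str(final))
    return final
```

The key design choice is that results/<track>/ is only touched after the whole run has finished without an exception.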

results/<track>/
├── comparison.log       # console output
├── comparison.md        # per-document detail
├── summary.md           # key findings
├── summary.json         # machine-readable summary
├── summary.sdif         # SDIF encoding
├── summary.sdif.ai      # compact AI projection
├── dashboard.html       # self-contained HTML dashboard
└── corpus/              # exact format files measured
    └── <document>/
        ├── json_compact.json
        ├── json_pretty.json
        ├── yaml.yaml
        ├── xml.xml
        ├── csv_bundle.csv
        ├── sdif.sdif
        ├── sdif_ai.sdif.ai
        └── toon.toon    # when TOON is enabled


Environment

Common switches (all tracks):

SDIF_BENCHMARK_OUTPUT_DIR=/tmp/sdif-benchmarks   # redirect all output
SDIF_CORE_REPO=../sdif                            # path to core repo
SDIF_BENCHMARK_GOLDEN_DIR=/tmp/golden-fixtures    # use a custom corpus
SDIF_BENCHMARK_TOON=0                             # disable TOON comparison
SDIF_BENCHMARK_VERBOSE=1                          # print optional-tool diagnostics
SDIF_ENV_OVERRIDE=0                               # keep existing env vars; skip .env

Token efficiency additional switches:

SDIF_TIKTOKEN_ENCODING=cl100k_base    # tiktoken encoding (default)
SDIF_BENCHMARK_TOKENX=0               # disable TokenX estimation
SDIF_BENCHMARK_LLAMA=0                # disable Llama tokenizer
SDIF_BENCHMARK_CLAUDE=1               # enable Claude counting; needs ANTHROPIC_API_KEY

Retrieval accuracy:

SDIF_BENCHMARK_RETRIEVAL=1    # opt-in
ANTHROPIC_API_KEY=<key>       # required

All scripts load .env from the repository root when present, unless SDIF_ENV_OVERRIDE=0.
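A minimal sketch of the .env behavior follows. `load_dotenv_file` is a hypothetical helper, and the sketch assumes simple KEY=VALUE lines. It also assumes that values from .env replace existing variables unless SDIF_ENV_OVERRIDE=0, in which case the file is skipped entirely. The scripts' actual parsing rules may differ.

```python
import os
from pathlib import Path


def load_dotenv_file(path: Path, environ=None) -> list[str]:
    """Load KEY=VALUE lines into environ and return the keys that were set."""
    env = os.environ if environ is None else environ
    if env.get("SDIF_ENV_OVERRIDE") == "0" or not path.is_file():
        return []  # keep the existing environment untouched

    loaded = []
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments and malformed lines
        key, _, value = line.partition("=")
        key = key.strip()
        env[key] = value.strip().strip('"').strip("'")
        loaded.append(key)
    return loaded
```

Passing a plain dict as `environ` makes the behavior easy to check without modifying the real process environment.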



Project structure

sdif-benchmarks/
├── scripts/       # executable benchmark runners (one per track)
├── src/           # reusable helpers shared across tracks
├── results/       # completed benchmark output (committed evidence)
└── tmp/           # in-progress output (gitignored)


Organization contract

  • Executable benchmark runners belong in scripts/.
  • Reusable helpers belong in src/ — code shared by two or more tracks.
  • Each track writes scratch output to tmp/<track>/; completed evidence goes to results/<track>/.
  • Canonical semantic sources belong in the core repo's examples/golden/, unless SDIF_BENCHMARK_GOLDEN_DIR overrides.
  • Optional external tools (TOON, tiktoken) must degrade gracefully.
  • Claims must name the tokenizer and model coverage that produced them.
  • Retrieval accuracy must use deterministic validators, not subjective LLM judging.
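To illustrate the last rule, a deterministic validator compares a model's answer with a known expected value using fixed normalization, so the same answer always receives the same verdict. The functions below are hypothetical examples of that idea, not the validators this repository ships.

```python
import re


def normalize(answer: str) -> str:
    """Collapse whitespace and ignore case so formatting noise does not matter."""
    return re.sub(r"\s+", " ", answer).strip().casefold()


def validate_exact(answer: str, expected: str) -> bool:
    """Pass when the normalized answer equals the normalized expected text."""
    return normalize(answer) == normalize(expected)


def validate_number(answer: str, expected: float, tolerance: float = 0.0) -> bool:
    """Pass when the first number in the answer is within tolerance of expected."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    return match is not None and abs(float(match.group()) - expected) <= tolerance
```

Because no second model scores the answers, repeated runs over the same responses produce identical accuracy figures.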


Related

  • sdif — Core format, specification, parser and CLI
  • tree-sitter-sdif — Grammar and editor tooling

About

Reproducible benchmarks for SDIF size, token efficiency, latency and format comparison.