SDIF Benchmarks
Evidence-first benchmarks measuring SDIF against JSON, YAML, XML, CSV Bundle
and other formats from the perspective of AI and LLM developers.
Tracks · Quick start · Latest results · Corpus model · Result model · Environment
Every compared representation is derived from the same canonical JSON source. Claims must name the tokenizer and document coverage that produced them. Optional external tools degrade gracefully.
| Track | What it measures |
|---|---|
| Token efficiency | Byte and token reduction across shared semantic fixtures. Ranks all formats against JSON Compact as the stable baseline. |
| Context packing | How many document copies fit inside fixed token budgets (4K, 8K, 32K, 128K). Reports fit rate and median copies per budget. |
| Round-trip fidelity | JSON→format→JSON preservation. Scores value, type and structure fidelity. N/A for SDIF AI and TOON. |
| Delta compactness | Token overhead of re-sending a mutated document. Applies a deterministic mutation to the first 10% of leaf values. |
| Retrieval accuracy | LLM answer quality by format, scored by deterministic validators with no LLM judge. Opt-in: requires ANTHROPIC_API_KEY. |
| Semantic quality | Guards that SDIF preserves relations, rules, schema validation, canonicalization and reversible AI projection boundaries. |
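To make the context packing metric concrete, here is a minimal sketch of how copies per budget, fit rate and median copies could be computed from per-document token counts. The function names, the reading of 4K as 4,096 tokens, and the definition of fit rate as the share of documents where at least one copy fits are assumptions for illustration, not the repository's implementation.

```python
from statistics import median

# Assumed budget sizes: the track lists 4K, 8K, 32K, 128K; powers of two are a guess.
BUDGETS = {"4K": 4096, "8K": 8192, "32K": 32768, "128K": 131072}


def copies_that_fit(doc_tokens: int, budget: int) -> int:
    """Number of whole copies of one document that fit in a token budget."""
    if doc_tokens <= 0:
        raise ValueError("doc_tokens must be positive")
    return budget // doc_tokens


def packing_summary(token_counts: list[int], budget: int) -> dict:
    """Fit rate and median copies for one format under one budget."""
    copies = [copies_that_fit(t, budget) for t in token_counts]
    # Assumption: a document "fits" when at least one full copy fits.
    fit_rate = sum(1 for c in copies if c >= 1) / len(copies)
    return {"fit_rate": fit_rate, "median_copies": median(copies)}
```

Calling `packing_summary(counts, BUDGETS["8K"])` once per format would give directly comparable numbers, because every format is derived from the same source documents.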
This repository expects access to the core SDIF repository. By default it looks for it at ../sdif; override this with SDIF_CORE_REPO.
# Token reduction across formats
make benchmark-token
# Context-window fit rate by budget
make benchmark-packing
# JSON→format→JSON round-trip fidelity
make benchmark-roundtrip
# Mutation sensitivity (re-send overhead)
make benchmark-delta
# LLM retrieval accuracy by format — opt-in
SDIF_BENCHMARK_RETRIEVAL=1 ANTHROPIC_API_KEY=<key> make benchmark-retrieval
# Semantic quality checks
make benchmark-quality

Results from the most recent token efficiency run across 21 documents and 3 tokenizers (Estimate, TokenX, tiktoken).
| Format | Consensus avg rank | Median ratio vs JSON Compact | Wins (63 pairs) |
|---|---|---|---|
| SDIF AI | 1.10 | 56.8% | 57 |
| SDIF | 2.60 | 59.5% | 2 |
| CSV Bundle | 2.70 | 61.2% | 4 |
| TOON | 3.60 | 63.2% | 0 |
| YAML | 5.35 | 95.3% | 0 |
| JSON Compact | 5.65 | 100.0% | 0 |
| JSON Pretty | 7.00 | 137.3% | 0 |
| XML | 8.00 | 171.7% | 0 |
Tokenizer-specific winners:
| Tokenizer | Winning format | Wins |
|---|---|---|
| Estimate | SDIF AI | 19/21 |
| TokenX | SDIF AI | 20/21 |
| tiktoken | SDIF AI | 18/21 |
These results are corpus-dependent. Results for Claude and Llama3 tokenizers require separate opt-in. Full per-document breakdowns live in results/token_efficiency/.
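As a rough illustration of how the "median ratio vs JSON Compact" and "wins" columns can be derived, the sketch below works on a nested mapping of token counts for a single tokenizer. The data shape and function names are hypothetical, and ties are broken by dictionary order, which may not match the real runner.

```python
from statistics import median


def ratio_vs_baseline(counts: dict, baseline: str = "JSON Compact") -> dict:
    """Median per-document token ratio of each format against the baseline.

    counts has the assumed shape {document: {format: token_count}}.
    """
    formats = next(iter(counts.values())).keys()
    return {
        fmt: median(doc[fmt] / doc[baseline] for doc in counts.values())
        for fmt in formats
    }


def wins(counts: dict) -> dict:
    """How many documents each format wins by having the fewest tokens."""
    tally: dict = {}
    for doc in counts.values():
        winner = min(doc, key=doc.get)  # first format wins on a tie
        tally[winner] = tally.get(winner, 0) + 1
    return tally
```

Running both functions once per tokenizer and summing the win tallies would give a pair count like the 63 (21 documents × 3 tokenizers) used in the table above.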
The canonical semantic corpus lives in the core repo's examples/golden/ directory, not duplicated here. This avoids drift between parser fixtures and benchmark fixtures.
Each fixture contains:
../sdif/examples/golden/<fixture>/
├── equivalent.json # canonical semantic source (benchmark input)
├── source.sdif # hand-authored or generated SDIF source
├── canonical.sdif # canonical SDIF form
└── canonical.sha256 # canonical hash evidence
The benchmark path defaults to ../sdif/examples/golden/ and can be overridden with SDIF_BENCHMARK_GOLDEN_DIR.
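The lookup described above might be sketched as follows. The precedence (SDIF_BENCHMARK_GOLDEN_DIR first, then SDIF_CORE_REPO, then ../sdif) and both function names are assumptions; only the variable names and default path come from this README.

```python
import os
from pathlib import Path


def golden_dir(env=os.environ) -> Path:
    """Resolve the corpus directory from the environment."""
    override = env.get("SDIF_BENCHMARK_GOLDEN_DIR")
    if override:
        return Path(override)
    # Assumed fallback: derive the corpus path from the core repo location.
    core = env.get("SDIF_CORE_REPO", "../sdif")
    return Path(core) / "examples" / "golden"


def fixture_names(root: Path) -> list[str]:
    """Fixtures are the subdirectories that contain equivalent.json."""
    return sorted(p.parent.name for p in root.glob("*/equivalent.json"))
```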
Each benchmark run writes scratch output to tmp/<track>/ while running and promotes it to results/<track>/ on success. Failed runs leave tmp/<track>/ for diagnosis without touching the last clean result.
results/<track>/
├── comparison.log # console output
├── comparison.md # per-document detail
├── summary.md # key findings
├── summary.json # machine-readable summary
├── summary.sdif # SDIF encoding
├── summary.sdif.ai # compact AI projection
├── dashboard.html # self-contained HTML dashboard
└── corpus/ # exact format files measured
    └── <document>/
        ├── json_compact.json
        ├── json_pretty.json
        ├── yaml.yaml
        ├── xml.xml
        ├── csv_bundle.csv
        ├── sdif.sdif
        ├── sdif_ai.sdif.ai
        └── toon.toon # when TOON is enabled
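The scratch-then-promote behaviour described for tmp/<track>/ and results/<track>/ could look roughly like the sketch below. `run_track` and its `run` callback are hypothetical names; the real runners may promote files differently.

```python
import shutil
from pathlib import Path


def run_track(track: str, run, root: Path = Path(".")) -> Path:
    """Run one benchmark into tmp/<track>/ and promote it on success."""
    scratch = root / "tmp" / track
    final = root / "results" / track

    # Start from a clean scratch directory.
    shutil.rmtree(scratch, ignore_errors=True)
    scratch.mkdir(parents=True)

    # If run() raises, scratch is left behind and results/ is untouched.
    run(scratch)

    # Success: replace the previous clean result with the new one.
    shutil.rmtree(final, ignore_errors=True)
    final.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(scratch), str(final))
    return final
```

Because the previous result is only removed after `run` returns, a crash mid-run never leaves results/<track>/ half-written.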
Common switches (all tracks):
SDIF_BENCHMARK_OUTPUT_DIR=/tmp/sdif-benchmarks # redirect all output
SDIF_CORE_REPO=../sdif # path to core repo
SDIF_BENCHMARK_GOLDEN_DIR=/tmp/golden-fixtures # use a custom corpus
SDIF_BENCHMARK_TOON=0 # disable TOON comparison
SDIF_BENCHMARK_VERBOSE=1 # print optional-tool diagnostics
SDIF_ENV_OVERRIDE=0 # keep existing env vars; skip .env

Token efficiency additional switches:
SDIF_TIKTOKEN_ENCODING=cl100k_base # tiktoken encoding (default)
SDIF_BENCHMARK_TOKENX=0 # disable TokenX estimation
SDIF_BENCHMARK_LLAMA=0 # disable Llama tokenizer
SDIF_BENCHMARK_CLAUDE=1 # enable Claude counting; needs ANTHROPIC_API_KEY

Retrieval accuracy:
SDIF_BENCHMARK_RETRIEVAL=1 # opt-in
ANTHROPIC_API_KEY=<key> # required

All scripts load .env from the repository root when present, unless SDIF_ENV_OVERRIDE=0.
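One plausible reading of the .env rule is sketched below: values from .env overwrite the current environment unless SDIF_ENV_OVERRIDE=0, in which case the file is skipped entirely. The helper name and the simple KEY=VALUE parsing are assumptions; the actual scripts may parse the file differently.

```python
from pathlib import Path


def load_env_file(path: Path, env: dict) -> dict:
    """Merge KEY=VALUE lines from a .env file into env, in place."""
    # Skip the file when overriding is disabled or the file is absent.
    if env.get("SDIF_ENV_OVERRIDE") == "0" or not path.is_file():
        return env
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # ignore blanks, comments and malformed lines
        key, _, value = line.partition("=")
        # Assumed behaviour: .env values replace existing ones.
        env[key.strip()] = value.strip().strip("'\"")
    return env
```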
sdif-benchmarks/
├── scripts/ # executable benchmark runners (one per track)
├── src/ # reusable helpers shared across tracks
├── results/ # completed benchmark output (committed evidence)
└── tmp/ # in-progress output (gitignored)
- Executable benchmark runners belong in scripts/.
- Reusable helpers belong in src/: code shared by two or more tracks.
- Each track writes scratch output to tmp/<track>/; completed evidence goes to results/<track>/.
- Canonical semantic sources belong in the core repo's examples/golden/, unless SDIF_BENCHMARK_GOLDEN_DIR overrides.
- Optional external tools (TOON, tiktoken) must degrade gracefully.
- Claims must name the tokenizer and model coverage that produced them.
- Retrieval accuracy must use deterministic validators, not subjective LLM judging.
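To illustrate what a deterministic validator means in practice, the sketch below scores an answer with plain string and number comparison, so the same answer always receives the same verdict. The function names and normalization rules are hypothetical and only show the idea; they are not the validators shipped in this repository.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case before comparing."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def validate_exact(answer: str, expected: str) -> bool:
    """Pass when the normalized answer equals the expected string."""
    return normalize(answer) == normalize(expected)


def validate_number(answer: str, expected: float, tol: float = 1e-9) -> bool:
    """Pass when the first number in the answer matches within tol."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    return match is not None and abs(float(match.group()) - expected) <= tol
```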
- sdif — Core format, specification, parser and CLI
- tree-sitter-sdif — Grammar and editor tooling
