GitHub - sdif-format/sdif-benchmarks: Reproducible benchmarks for SDIF size, token efficiency, latency and format comparison.

SDIF Benchmarks

Evidence-first benchmarks measuring SDIF against JSON, YAML, XML, CSV Bundle
and other formats from the perspective of AI and LLM developers.


Every compared representation is derived from the same canonical JSON source. Claims must name the tokenizer and document coverage that produced them. Optional external tools degrade gracefully.

Benchmark tracks

Token efficiency: Byte and token reduction across shared semantic fixtures. Ranks all formats against JSON Compact as the stable baseline.
Context packing: How many document copies fit inside fixed token budgets (4K, 8K, 32K, 128K). Reports fit rate and median copies per budget.
Round-trip fidelity: JSON→format→JSON preservation. Scores value, type and structure fidelity. N/A for SDIF AI and TOON.
Delta compactness: Token overhead of re-sending a mutated document. Applies a deterministic mutation to the first 10% of leaf values.
Retrieval accuracy: LLM answer quality by format. Uses deterministic validators, not an LLM judge. Opt-in: requires `ANTHROPIC_API_KEY`.
Semantic quality: Guards that SDIF preserves relations, rules, schema validation, canonicalization and reversible AI projection boundaries.
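
As a rough illustration of the context packing metrics, the sketch below computes copies per budget, fit rate and median copies from a list of per-document token counts. It is not taken from the repository's scripts. The function name `packing_stats`, the reading of "4K" as 4096 tokens, and the definition of fit rate as the share of documents that fit at least once are all assumptions.

```python
from statistics import median

# Assumption: budgets are powers of two (4K = 4096); the README only says "4K, 8K, 32K, 128K".
BUDGETS = {"4K": 4096, "8K": 8192, "32K": 32768, "128K": 131072}


def packing_stats(token_counts, budget):
    """Return (fit_rate, median_copies) for one format under one token budget.

    token_counts: positive token counts, one per document.
    fit_rate is assumed to mean the fraction of documents that fit at least once.
    """
    copies = [budget // n for n in token_counts]  # whole copies that fit
    fit_rate = sum(c >= 1 for c in copies) / len(copies)
    return fit_rate, median(copies)


if __name__ == "__main__":
    counts = [1000, 3000, 5000]  # hypothetical token counts, not measured data
    for label, budget in BUDGETS.items():
        print(label, packing_stats(counts, budget))
```

Integer division keeps the count to whole copies, so a document larger than the budget contributes zero copies and lowers the fit rate.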

Quick start

This repository expects access to the core SDIF repository. By default it looks for it at ../sdif; override this with SDIF_CORE_REPO.

# Token reduction across formats
make benchmark-token

# Context-window fit rate by budget
make benchmark-packing

# JSON→format→JSON round-trip fidelity
make benchmark-roundtrip

# Mutation sensitivity (re-send overhead)
make benchmark-delta

# LLM retrieval accuracy by format — opt-in
SDIF_BENCHMARK_RETRIEVAL=1 ANTHROPIC_API_KEY=<key> make benchmark-retrieval

# Semantic quality checks
make benchmark-quality

Latest results

Results from the most recent token efficiency run across 21 documents and 3 tokenizers (Estimate, TokenX, tiktoken).

Format	Consensus avg rank	Median ratio vs JSON Compact	Wins (63 pairs)
SDIF AI	1.10	56.8%	57
SDIF	2.60	59.5%	2
CSV Bundle	2.70	61.2%	4
TOON	3.60	63.2%	0
YAML	5.35	95.3%	0
JSON Compact	5.65	100.0%	0
JSON Pretty	7.00	137.3%	0
XML	8.00	171.7%	0

Tokenizer-specific winners:

Tokenizer	Winning format	Wins
Estimate	SDIF AI	19/21
TokenX	SDIF AI	20/21
tiktoken	SDIF AI	18/21

These results are corpus-dependent. Results for the Claude and Llama3 tokenizers require a separate opt-in. Full per-document breakdowns live in results/token_efficiency/.
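
The 63 pairs in the first table are the 21 documents multiplied by the 3 tokenizers. The sketch below shows one plausible way to derive the "median ratio vs JSON Compact" and "wins" columns from raw token counts for a single tokenizer. The data layout, the format keys and the function names are hypothetical. The consensus average rank is not reproduced, because this README does not define how it is computed.

```python
from statistics import median


def median_ratio(counts, fmt, baseline="json_compact"):
    """Median of per-document token ratios fmt / baseline.

    counts: list of dicts mapping format name -> token count, one dict per document.
    """
    return median(doc[fmt] / doc[baseline] for doc in counts)


def count_wins(counts):
    """Count, per format, the documents where that format has the fewest tokens.

    Ties go to whichever format appears first in the dict (an arbitrary choice).
    """
    wins = {}
    for doc in counts:
        best = min(doc, key=doc.get)
        wins[best] = wins.get(best, 0) + 1
    return wins


if __name__ == "__main__":
    # Hypothetical token counts for three documents; not measured data.
    sample = [
        {"json_compact": 100, "sdif_ai": 50, "yaml": 90},
        {"json_compact": 200, "sdif_ai": 120, "yaml": 190},
        {"json_compact": 100, "sdif_ai": 70, "yaml": 60},
    ]
    print(f"{median_ratio(sample, 'sdif_ai'):.1%}")
    print(count_wins(sample))
```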

Corpus model

The canonical semantic corpus lives in the core repo's examples/golden/ directory, not duplicated here. This avoids drift between parser fixtures and benchmark fixtures.

Each fixture contains:

../sdif/examples/golden/<fixture>/
├── equivalent.json     # canonical semantic source (benchmark input)
├── source.sdif         # hand-authored or generated SDIF source
├── canonical.sdif      # canonical SDIF form
└── canonical.sha256    # canonical hash evidence

The benchmark path defaults to ../sdif/examples/golden/ and can be overridden with SDIF_BENCHMARK_GOLDEN_DIR.
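
A minimal sketch of how the corpus location might be resolved, assuming that SDIF_BENCHMARK_GOLDEN_DIR takes precedence and that the default is derived from SDIF_CORE_REPO. The helper names `golden_dir` and `fixture_inputs` are illustrative and are not the repository's API.

```python
import os
from pathlib import Path


def golden_dir(env=None):
    """Resolve the fixture directory from environment-style settings.

    Assumption: an explicit SDIF_BENCHMARK_GOLDEN_DIR wins; otherwise the path
    is <SDIF_CORE_REPO>/examples/golden with SDIF_CORE_REPO defaulting to ../sdif.
    """
    env = os.environ if env is None else env
    override = env.get("SDIF_BENCHMARK_GOLDEN_DIR")
    if override:
        return Path(override)
    return Path(env.get("SDIF_CORE_REPO", "../sdif")) / "examples" / "golden"


def fixture_inputs(root):
    """List the equivalent.json benchmark inputs, one per fixture directory."""
    return sorted(root.glob("*/equivalent.json"))


if __name__ == "__main__":
    root = golden_dir()
    print(root, len(fixture_inputs(root)))
```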

Result model

Each benchmark run writes scratch output to tmp/<track>/ while running and promotes it to results/<track>/ on success. Failed runs leave tmp/<track>/ for diagnosis without touching the last clean result.

results/<track>/
├── comparison.log       # console output
├── comparison.md        # per-document detail
├── summary.md           # key findings
├── summary.json         # machine-readable summary
├── summary.sdif         # SDIF encoding
├── summary.sdif.ai      # compact AI projection
├── dashboard.html       # self-contained HTML dashboard
└── corpus/              # exact format files measured
    └── <document>/
        ├── json_compact.json
        ├── json_pretty.json
        ├── yaml.yaml
        ├── xml.xml
        ├── csv_bundle.csv
        ├── sdif.sdif
        ├── sdif_ai.sdif.ai
        └── toon.toon    # when TOON is enabled

Environment

Common switches (all tracks):

SDIF_BENCHMARK_OUTPUT_DIR=/tmp/sdif-benchmarks   # redirect all output
SDIF_CORE_REPO=../sdif                            # path to core repo
SDIF_BENCHMARK_GOLDEN_DIR=/tmp/golden-fixtures    # use a custom corpus
SDIF_BENCHMARK_TOON=0                             # disable TOON comparison
SDIF_BENCHMARK_VERBOSE=1                          # print optional-tool diagnostics
SDIF_ENV_OVERRIDE=0                               # keep existing env vars; skip .env

Token efficiency additional switches:

SDIF_TIKTOKEN_ENCODING=cl100k_base    # tiktoken encoding (default)
SDIF_BENCHMARK_TOKENX=0               # disable TokenX estimation
SDIF_BENCHMARK_LLAMA=0                # disable Llama tokenizer
SDIF_BENCHMARK_CLAUDE=1               # enable Claude counting; needs ANTHROPIC_API_KEY

Retrieval accuracy:

SDIF_BENCHMARK_RETRIEVAL=1    # opt-in
ANTHROPIC_API_KEY=<key>       # required

All scripts load .env from the repository root when present, unless SDIF_ENV_OVERRIDE=0.
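
For illustration only, a small loader with this switch might be written as below. The function name is hypothetical, the parser handles only plain KEY=VALUE lines (no quoting or `export`), and the assumption that loaded values overwrite existing ones is inferred from the variable's name rather than confirmed by this README.

```python
from pathlib import Path


def load_env_file(path, env):
    """Merge KEY=VALUE lines from a .env file into env (a dict-like mapping).

    Skips loading entirely when SDIF_ENV_OVERRIDE is "0" or the file is absent.
    Assumption: values from the file replace values already present in env.
    """
    if env.get("SDIF_ENV_OVERRIDE") == "0":
        return env
    path = Path(path)
    if not path.is_file():
        return env
    for raw in path.read_text().splitlines():
        line = raw.strip()
        # Ignore blank lines, comments and anything without an "=".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env
```

A script would typically call this once at startup, for example `load_env_file(".env", os.environ)` after `import os`.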

Project structure

sdif-benchmarks/
├── scripts/       # executable benchmark runners (one per track)
├── src/           # reusable helpers shared across tracks
├── results/       # completed benchmark output (committed evidence)
└── tmp/           # in-progress output (gitignored)

Organization contract

Executable benchmark runners belong in scripts/.
Reusable helpers belong in src/ — code shared by two or more tracks.
Each track writes scratch output to tmp/<track>/; completed evidence goes to results/<track>/.
Canonical semantic sources belong in the core repo's examples/golden/, unless SDIF_BENCHMARK_GOLDEN_DIR overrides.
Optional external tools (TOON, tiktoken) must degrade gracefully.
Claims must name the tokenizer and model coverage that produced them.
Retrieval accuracy must use deterministic validators, not subjective LLM judging.

