Semantic Data Interchange Format
Compact, semantic and canonicalizable structured data
for AI agents, deterministic workflows and human-auditable records.
What is SDIF? · Quick start · Format at a glance · Token efficiency · Ecosystem · Documentation
| Property | Description |
|---|---|
| Compact | Shape declared once. No repeated field names. |
| Semantic | Tables, relations, metadata and intent. |
| Canonical | Stable output for hashing, signing and comparison. |
| Auditable | Designed to be read, reviewed and trusted. |
SDIF (Semantic Data Interchange Format) is a compact, canonicalizable and AI-friendly data format for structured information that needs to move cleanly between humans, tools, agents and deterministic workflows.
It is designed for cases where data should be:
- small enough to be efficient in AI context windows;
- structured enough for machines to parse and validate;
- readable enough for humans to review;
- deterministic enough for hashing, signing and reproducible workflows;
- semantic enough to express tables, relations, metadata and intent.
SDIF also includes an AI projection surface, .sdif.ai, designed for token-dense agent exchange while remaining reversible back into canonical SDIF when the projection contract is respected.
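The reversibility condition can be illustrated with a small sketch. It assumes the projection only renames keys through an alias map (as the `--alias kind=k` style flags suggest) and that the "contract" amounts to aliases being distinct and not colliding with existing keys. The functions `project_keys` and `restore_keys` are hypothetical helpers for illustration, not part of the SDIF library.

```python
def project_keys(record: dict, aliases: dict) -> dict:
    """Rename keys using aliases (full name -> short name).

    Hypothetical helper: refuses alias maps that could not be reversed.
    """
    short_names = list(aliases.values())
    if len(set(short_names)) != len(short_names):
        raise ValueError("aliases must map to distinct short names")
    # A short name that already exists as an unaliased key would be ambiguous on restore.
    clashes = set(short_names) & (set(record) - set(aliases))
    if clashes:
        raise ValueError(f"alias collides with existing key: {sorted(clashes)}")
    return {aliases.get(key, key): value for key, value in record.items()}


def restore_keys(projected: dict, aliases: dict) -> dict:
    """Invert the alias map to recover the original key names."""
    inverse = {short: full for full, short in aliases.items()}
    return {inverse.get(key, key): value for key, value in projected.items()}


aliases = {"kind": "k", "status": "st"}
record = {"kind": "Plan", "id": "release.v1", "status": "open"}

projected = project_keys(record, aliases)
assert restore_keys(projected, aliases) == record  # round trip holds
```

If two aliases shared a short name, or a short name shadowed a real key, the inverse map would be ambiguous; that is the kind of violation a projection contract presumably rules out.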
pip install -e '.[dev]'
sdif parse examples/plan.sdif
sdif canon examples/plan.sdif
sdif canon examples/plan.sdif --schema examples/schema.sdif
sdif hash examples/plan.sdif
sdif validate examples/plan.sdif --schema examples/schema.sdif
sdif tokens examples/plan.sdif
sdif to-json examples/plan.sdif
sdif from-json document.json
sdif ai examples/plan.sdif --alias kind=k --alias status=st

`sdif tokens` reports byte size, tokenizer identity and token count. It uses tiktoken/cl100k_base when available and falls back to a deterministic 4-bytes-per-token estimate.
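The fallback behaviour described above might look roughly like the following. This is a sketch, not the CLI's actual code: the function names, the label string for the estimate and the choice to round up are all assumptions.

```python
import math


def estimate_tokens(text: str) -> int:
    """Deterministic fallback: one token per 4 UTF-8 bytes.

    Rounding up is an assumption; the real CLI may round differently.
    """
    return math.ceil(len(text.encode("utf-8")) / 4)


def count_tokens(text: str) -> tuple[str, int]:
    """Return (tokenizer identity, token count), preferring tiktoken."""
    try:
        import tiktoken  # optional dependency

        encoding = tiktoken.get_encoding("cl100k_base")
    except Exception:
        # tiktoken missing or its vocabulary unavailable: use the estimate.
        return "estimate/4-bytes-per-token", estimate_tokens(text)
    return "tiktoken/cl100k_base", len(encoding.encode(text))
```

Reporting the tokenizer identity alongside the count matters because an estimated count and a real tokenizer count are not comparable.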
JSON repeats field names across every record:
[
{ "id": "R1", "status": "done", "owner": "build", "evidence": "reports/build.md" },
{ "id": "R2", "status": "open", "owner": "qa", "evidence": "reports/tests.md" },
{ "id": "R3", "status": "done", "owner": "security", "evidence": "reports/audit.md" }
]
SDIF declares the shape once and uses literal tabs between cells. Editors must preserve tabs; this is a deliberate tradeoff for compactness:
@sdif 1.0
kind Plan
id release.v1
title "Release readiness plan"
items[id,status,owner,evidence]:
R1	done	build	reports/build.md
R2	open	qa	reports/tests.md
R3	done	security	reports/audit.md
rel:
release.v1 validated_by R1
release.v1 blocked_by R2
release.v1 governed_by R3
Semantic relationships are first-class, not embedded strings.
Structured information closer to a document,
while still behaving like a contract.
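To make the "shape declared once" idea concrete, here is a minimal sketch that expands a table block like `items[...]` above back into one record per row. It is not the SDIF parser: `parse_table` is a hypothetical helper, and it assumes unindented rows, literal tab separators and no quoting or escaping.

```python
import re

# Matches a header such as: items[id,status,owner,evidence]:
HEADER = re.compile(r"^(?P<name>\w+)\[(?P<fields>[^\]]*)\]:$")


def parse_table(lines: list[str]) -> tuple[str, list[dict]]:
    """Expand a header line plus tab-separated rows into a list of dicts."""
    match = HEADER.match(lines[0].strip())
    if match is None:
        raise ValueError(f"not a table header: {lines[0]!r}")
    fields = match.group("fields").split(",")
    records = []
    for row in lines[1:]:
        cells = row.rstrip("\r\n").split("\t")
        if len(cells) != len(fields):
            # Typically means an editor converted tabs to spaces.
            raise ValueError(f"expected {len(fields)} cells, got {len(cells)}: {row!r}")
        records.append(dict(zip(fields, cells)))
    return match.group("name"), records


block = [
    "items[id,status,owner,evidence]:",
    "R1\tdone\tbuild\treports/build.md",
    "R2\topen\tqa\treports/tests.md",
    "R3\tdone\tsecurity\treports/audit.md",
]
name, records = parse_table(block)
```

The cell-count check shows why tab preservation matters: a row whose tabs were replaced by spaces no longer splits into the declared number of fields.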
The benchmark derives every compared format from the same canonical JSON source in examples/golden/. Results below are from the most recent run across 21 documents and 3 tokenizers.
| Format | Consensus avg rank | Median ratio vs JSON Compact |
|---|---|---|
| SDIF AI | 1.10 | 56.8% |
| SDIF | 2.60 | 59.5% |
| CSV Bundle | 2.70 | 61.2% |
| YAML | 5.35 | 95.3% |
| JSON Compact | 5.65 | 100.0% |
| JSON Pretty | 7.00 | 137.3% |
| XML | 8.00 | 171.7% |
SDIF AI wins 57 of 63 tokenizer/document pairs. SDIF canonical wins 2.
The benchmark repository contains the exact corpus model, generated artifacts and methodology needed to reproduce these numbers.
These results are corpus-dependent. Not every data shape benefits equally from tabular projection. Claude and Llama tokenizers require separate opt-in before claiming results for those models.
For full methodology, corpus model and per-document breakdowns, see sdif-benchmarks.
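As a rough illustration of what "ratio vs JSON Compact" measures, the sketch below derives a compact JSON string and a tabular rendering from the same three records shown earlier and compares their sizes. It compares UTF-8 bytes rather than tokens, uses hypothetical helper names and is not the benchmark's methodology, so its output should not be read against the table above.

```python
import json

records = [
    {"id": "R1", "status": "done", "owner": "build", "evidence": "reports/build.md"},
    {"id": "R2", "status": "open", "owner": "qa", "evidence": "reports/tests.md"},
    {"id": "R3", "status": "done", "owner": "security", "evidence": "reports/audit.md"},
]


def json_compact(rows: list[dict]) -> str:
    """Baseline: JSON with no insignificant whitespace."""
    return json.dumps(rows, separators=(",", ":"))


def tabular(name: str, rows: list[dict]) -> str:
    """Header declared once, then one tab-separated line per record."""
    fields = list(rows[0])
    header = f"{name}[{','.join(fields)}]:"
    body = ["\t".join(str(row[field]) for field in fields) for row in rows]
    return "\n".join([header, *body])


def size_ratio(candidate: str, baseline: str) -> float:
    """Candidate size as a fraction of the baseline size, in UTF-8 bytes."""
    return len(candidate.encode("utf-8")) / len(baseline.encode("utf-8"))


ratio = size_ratio(tabular("items", records), json_compact(records))
print(f"{ratio:.1%} of JSON Compact by bytes")
```

The saving comes from field names appearing once in the header instead of once per record, which is also why documents without repeated record shapes benefit less.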
| Role | Description |
|---|---|
| CORE FORMAT | Specification, parser, canonicalizer and CLI. This repository. |
| MEASUREMENT | Reproducible benchmark datasets and reports comparing SDIF with existing formats across token efficiency, context packing, round-trip fidelity and retrieval accuracy. |
| SYNTAX TOOLING | Tree-sitter grammar foundation for syntax highlighting and editor integrations. Registers both |
SDIF does not try to replace JSON, YAML, CSV, Markdown, XML, Parquet or Protocol Buffers. Those formats are useful and battle-tested.
| Format | Trade-off |
|---|---|
| JSON | Universal and reliable, but noisy when repeated records dominate. |
| YAML | Readable, but too permissive for deterministic workflows. |
| CSV | Compact, but loses structure, relations and meaning quickly. |
| Markdown | Great for humans, not enough when data must be parsed and verified. |
SDIF focuses on a narrower problem:
compact, semantic, canonicalizable structured data
that can move cleanly between humans, machines and AI systems.
| Document | Description |
|---|---|
| docs/spec.md | Full v1.0.0 specification |
| docs/canonicalization.md | Canonicalization contract |
| docs/comparison.md | Format comparison |
| docs/semantic-quality.md | Semantic quality methodology |
| docs/ai-speed-profile.md | AI speed profile contract |
| examples/ | Annotated examples |
| conformance/ | Shared conformance fixtures |
SDIF 1.0 is the stable core contract. Current benchmark results are promising, but should be read with these boundaries:
- results are corpus-dependent;
- not every data shape benefits equally from tabular projection;
- editors and tools must preserve literal tabs in table rows;
- `.sdif.ai` is an agent projection surface, not the canonical signing surface;
- Claude and Llama3 token counting must be enabled separately before claiming results for those tokenizers.
MIT. See LICENSE.
