Semantic Data Interchange Format
Compact, semantic and canonicalizable structured data
for AI agents, deterministic workflows and human-auditable records.
| Compact | Semantic | Canonical | Auditable |
|---|---|---|---|
| Shape declared once. No repeated field names. | Tables, relations, metadata and intent. | Stable output for hashing, signing and comparison. | Designed to be read, reviewed and trusted. |
SDIF (Semantic Data Interchange Format) is a compact, canonicalizable and AI-friendly data format for structured information that needs to move cleanly between humans, tools, agents and deterministic workflows.
It is designed for cases where data should be:
- small enough to be efficient in AI context windows;
- structured enough for machines to parse and validate;
- readable enough for humans to review;
- deterministic enough for hashing, signing and reproducible workflows;
- semantic enough to express tables, relations, metadata and intent.
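To illustrate why a deterministic form matters for hashing and signing, here is a minimal sketch. It does not implement SDIF canonicalization (those rules live in the specification); it uses canonical JSON (sorted keys, fixed separators) as a stand-in technique, and the helper names `canonical_bytes` and `digest` are hypothetical.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Serialize with sorted keys and fixed separators so equal data yields equal bytes.

    Stand-in for a real canonicalizer: this is canonical JSON, not SDIF.
    """
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def digest(obj) -> str:
    """SHA-256 hex digest of the canonical byte form."""
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()


a = {"id": "R1", "status": "done"}
b = {"status": "done", "id": "R1"}  # same data, different key order

# Without a canonical form the two serializations could differ byte for byte;
# with one, the digests agree and can be compared or signed.
print(digest(a) == digest(b))
```

The same idea is what makes a stable hash of a document meaningful: two files that carry the same data should produce the same digest regardless of incidental layout.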
SDIF also includes an AI projection surface, .sdif.ai, designed for token-dense agent exchange while remaining reversible back into canonical SDIF when the projection contract is respected.
pip install -e '.[dev]'
sdif parse examples/plan.sdif
sdif canon examples/plan.sdif
sdif canon examples/plan.sdif --schema examples/schema.sdif
sdif hash examples/plan.sdif
sdif validate examples/plan.sdif
sdif validate examples/plan.sdif --schema examples/schema.sdif
sdif tokens examples/plan.sdif
sdif to-json examples/plan.sdif
sdif from-json document.json
sdif ai examples/plan.sdif --alias kind=k --alias status=st

sdif tokens reports byte size, tokenizer identity and token count. It uses tiktoken/cl100k_base when available and falls back to a deterministic 4-bytes-per-token estimate.
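The tokenizer fallback described above can be sketched roughly as follows. This is not the code behind sdif tokens; it is a minimal illustration that assumes the estimate rounds up and that the fallback is labeled `estimate/4-bytes-per-token` (both are assumptions, as are the function names).

```python
import math


def fallback_estimate(data: bytes) -> int:
    """Deterministic estimate: one token per 4 bytes.

    Rounding up is an assumption; the real CLI may round differently.
    """
    return math.ceil(len(data) / 4)


def count_tokens(text: str) -> tuple[str, int]:
    """Return (tokenizer identity, token count), preferring tiktoken when usable."""
    try:
        import tiktoken  # optional dependency

        encoding = tiktoken.get_encoding("cl100k_base")
        return "tiktoken/cl100k_base", len(encoding.encode(text))
    except Exception:
        # tiktoken is missing, its vocabulary could not be loaded, or the text
        # could not be encoded: fall back to the byte-based estimate.
        # The identity string below is a hypothetical label.
        return "estimate/4-bytes-per-token", fallback_estimate(text.encode("utf-8"))


identity, tokens = count_tokens("R1\tdone\tbuild\treports/build.md")
print(identity, tokens)
```

Reporting the tokenizer identity next to the count keeps an estimated number from being mistaken for a measured one.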
sdif validate <file> [--schema <schema>] [--json] [--quiet]

Validates an SDIF document for syntactic correctness. If --schema is provided, the document is also validated against that schema.
Exit codes:
| Code | Meaning |
|---|---|
| 0 | The document is valid SDIF. |
| 1 | The document is invalid SDIF. |
| 2 | The command failed for an operational reason (file not found, policy denial). |
Output modes:
| Flag | Behaviour |
|---|---|
| (default) | Human-readable result and diagnostics on stdout. |
| --json | Machine-readable JSON {"valid": bool, "diagnostics": [...]} on stdout. |
| --quiet | No stdout; result communicated only through the exit code. Takes precedence over --json. |
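For scripted use, the exit codes and the --json output mode above can be combined in a small wrapper. The sketch below assumes the sdif command is on PATH and treats each diagnostic as an opaque value, since its shape is not described here; `interpret` and `validate` are hypothetical helper names.

```python
import json
import subprocess
from typing import Optional


def interpret(returncode: int, stdout: str) -> dict:
    """Map the documented exit codes and the --json payload to one result dict."""
    if returncode == 2:
        # Operational failure (file not found, policy denial): no verdict on validity.
        return {"status": "error", "valid": None, "diagnostics": []}
    try:
        payload = json.loads(stdout) if stdout.strip() else {}
    except json.JSONDecodeError:
        payload = {}
    return {
        "status": "valid" if returncode == 0 else "invalid",
        "valid": payload.get("valid", returncode == 0),
        "diagnostics": payload.get("diagnostics", []),
    }


def validate(path: str, schema: Optional[str] = None) -> dict:
    """Run `sdif validate --json` and interpret the outcome."""
    cmd = ["sdif", "validate", path, "--json"]
    if schema is not None:
        cmd += ["--schema", schema]
    # --quiet is deliberately not passed: it takes precedence over --json
    # and would suppress the diagnostics.
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return interpret(proc.returncode, proc.stdout)


print(interpret(1, '{"valid": false, "diagnostics": ["example diagnostic"]}'))
```

Keeping exit code 2 separate from exit code 1 lets a pipeline distinguish "the document is wrong" from "the check could not run".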
JSON repeats field names across every record:
[
{ "id": "R1", "status": "done", "owner": "build", "evidence": "reports/build.md" },
{ "id": "R2", "status": "open", "owner": "qa", "evidence": "reports/tests.md" },
{ "id": "R3", "status": "done", "owner": "security", "evidence": "reports/audit.md" }
]

SDIF declares the shape once and uses literal tabs between cells. Editors must preserve tabs; this is a deliberate tradeoff for compactness:
@sdif 1.0
kind Plan
id release.v1
title "Release readiness plan"
items[id,status,owner,evidence]:
R1	done	build	reports/build.md
R2	open	qa	reports/tests.md
R3	done	security	reports/audit.md
rel:
release.v1 validated_by R1
release.v1 blocked_by R2
release.v1 governed_by R3
Semantic relationships are first-class, not embedded strings. The result is structured information that reads closer to a document while still behaving like a contract.
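The "shape declared once" idea from the JSON and SDIF examples above can be sketched in a few lines. This is a simplified illustration, not the reference serializer: it assumes every value is a string, that rows are not indented, and that no cell contains a tab or newline (quoting and escaping rules are left to the specification). `to_table` and `from_table` are hypothetical names.

```python
def to_table(name: str, records: list[dict]) -> str:
    """Render records as one header line followed by tab-separated rows."""
    fields = list(records[0])  # field order is taken from the first record
    header = f"{name}[{','.join(fields)}]:"
    rows = ["\t".join(str(record[field]) for field in fields) for record in records]
    return "\n".join([header, *rows])


def from_table(block: str) -> list[dict]:
    """Inverse of to_table for this simplified layout."""
    header, *rows = block.split("\n")
    fields = header[header.index("[") + 1 : header.index("]")].split(",")
    return [dict(zip(fields, row.split("\t"))) for row in rows]


records = [
    {"id": "R1", "status": "done", "owner": "build", "evidence": "reports/build.md"},
    {"id": "R2", "status": "open", "owner": "qa", "evidence": "reports/tests.md"},
]

block = to_table("items", records)
print(block)

# Field names appear once in the header, yet the records are recoverable.
assert from_table(block) == records
```

Because cells are split on the tab character, an editor that converts tabs to spaces would break the round trip, which is why tabs must be preserved.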
The benchmark derives every compared format from the same canonical JSON source in examples/golden/. Results below are from the most recent run across 21 documents and 3 tokenizers.
| Format | Consensus avg rank | Median ratio vs JSON Compact |
|---|---|---|
| SDIF AI | 1.10 | 56.8% |
| SDIF | 2.60 | 59.5% |
| CSV Bundle | 2.70 | 61.2% |
| YAML | 5.35 | 95.3% |
| JSON Compact | 5.65 | 100.0% |
| JSON Pretty | 7.00 | 137.3% |
| XML | 8.00 | 171.7% |
SDIF AI wins 57 of 63 tokenizer/document pairs. SDIF canonical wins 2.
The benchmark repository contains the exact corpus model, generated artifacts and methodology needed to reproduce these numbers.
These results are corpus-dependent. Not every data shape benefits equally from tabular projection. Claude and Llama tokenizers require separate opt-in before claiming results for those models.
For full methodology, corpus model and per-document breakdowns, see sdif-benchmarks.
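As a rough guide to reading the "Median ratio vs JSON Compact" column, the sketch below shows one plausible way such a ratio could be computed. The token counts are made-up toy values, not benchmark data, and the actual methodology in sdif-benchmarks may differ; `median_ratio` and `format_a` are hypothetical names.

```python
from statistics import median

# Toy per-document token counts. These are hypothetical values for
# illustration only and are NOT taken from the SDIF benchmark.
counts = {
    "json_compact": [120, 200, 80],
    "format_a": [60, 130, 48],
}


def median_ratio(candidate: list[int], baseline: list[int]) -> float:
    """Median of per-document token ratios against the baseline format."""
    ratios = [c / b for c, b in zip(candidate, baseline)]
    return median(ratios)


ratio = median_ratio(counts["format_a"], counts["json_compact"])
print(f"{ratio:.1%}")

# The pair count quoted above follows from the run size stated in the text.
documents, tokenizers = 21, 3
print(documents * tokenizers)
```

A median is less sensitive than a mean to a single document whose shape suits one format unusually well or badly, which matters when results are corpus-dependent.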
This GitHub organization hosts the official SDIF ecosystem: the core format, reference tooling, benchmarks, examples, libraries, and editor extensions.
- PYTHON CLIENT & CLI: Specification, parser, canonicalizer, and CLI.
- SPECIFICATION (SSOT): Official format specification, canonicalization rules,
- BENCHMARKS: Reproducible benchmark datasets and reports comparing SDIF with JSON, YAML, XML, and CSV.
- RUST IMPLEMENTATION: Pure Rust parser implementation with a span-annotated AST designed for editor tooling.
- LANGUAGE SERVER (LSP): LSP language server binary (tower-lsp) providing real-time diagnostics and IDE features.
- EDITOR INTEGRATION: VS Code extension client providing syntax highlighting, diagnostics, and LSP configuration.
- GRAMMAR FOUNDATION: Tree-sitter grammar foundation for syntax highlighting and incremental parsing.
- DOCUMENTATION: Official documentation website containing specification guides, tutorials, and examples.
- ORGANIZATION META: Organization profile, assets, and shared community configuration files.
Repository map
| Repository | Purpose |
|---|---|
| sdif-py | Core Python parser, validator, canonicalizer, and CLI |
| sdif-spec | Official format specification and conformance test suite (SSOT) |
| sdif-benchmarks | Benchmark datasets, reports, and comparison tooling |
| sdif-rs | Rust parser crate with span-annotated AST |
| sdif-lsp | LSP language server binary |
| tree-sitter-sdif | Tree-sitter grammar foundation for syntax highlighting |
| vscode-sdif | VS Code extension client for SDIF |
| sdif-format.github.io | Public documentation website (Docusaurus) |
| .github | Organization profile, assets, and shared GitHub community files |
SDIF does not try to replace JSON, YAML, CSV, Markdown, XML, Parquet or Protocol Buffers. Those formats are useful and battle-tested.
| JSON | YAML | CSV | Markdown |
|---|---|---|---|
| Universal and reliable, but noisy when repeated records dominate. | Readable, but too permissive for deterministic workflows. | Compact, but loses structure, relations and meaning quickly. | Great for humans, not enough when data must be parsed and verified. |
SDIF focuses on a narrower problem:
compact, semantic, canonicalizable structured data
that can move cleanly between humans, machines and AI systems.
The official specifications and conformance suite are hosted in the sibling sdif-spec repository:
| Document | Description |
|---|---|
| spec.md | Full v1.0.0 specification |
| canonicalization.md | Canonicalization contract |
| comparison.md | Format comparison |
| semantic-quality.md | Semantic quality methodology |
| ai-speed-profile.md | AI speed profile contract |
| conformance/ | Shared conformance fixtures |
| examples/ | Annotated examples (local in this repo) |
SDIF 1.0 is the stable core contract. Current benchmark results are promising, but should be read with these boundaries:
- results are corpus-dependent;
- not every data shape benefits equally from tabular projection;
- editors and tools must preserve literal tabs in table rows;
- .sdif.ai is an agent projection surface, not the canonical signing surface;
- Claude and Llama3 token counting must be enabled separately before claiming results for those tokenizers.
MIT. See LICENSE.
