GitHub - sdif-format/sdif-py: Semantic Data Interchange Format. Compact, canonicalizable structured data for AI and deterministic workflows.

Semantic Data Interchange Format

Compact, semantic and canonicalizable structured data
for AI agents, deterministic workflows and human-auditable records.

What is SDIF? · Quick start · Format at a glance · Token efficiency · Ecosystem · Documentation

Compact

Shape declared once.
No repeated field names.

Semantic

Tables, relations,
metadata and intent.

Canonical

Stable output for hashing,
signing and comparison.

Auditable

Designed to be read,
reviewed and trusted.

What is SDIF?

SDIF (Semantic Data Interchange Format) is a compact, canonicalizable, AI-friendly data format for structured information that needs to move cleanly between humans, tools, agents and deterministic workflows.

It is designed for cases where data should be:

small enough to be efficient in AI context windows;
structured enough for machines to parse and validate;
readable enough for humans to review;
deterministic enough for hashing, signing and reproducible workflows;
semantic enough to express tables, relations, metadata and intent.
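
To illustrate why determinism matters for hashing and signing, here is a minimal sketch. It uses JSON with sorted keys as a stand-in canonical form; SDIF's actual canonicalization rules are defined in canonicalization.md and are not reproduced here. The name `stable_digest` is hypothetical.

```python
import hashlib
import json


def stable_digest(record: dict) -> str:
    """Hash a stand-in canonical form: sorted keys, fixed separators, UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = {"id": "R1", "status": "done"}
b = {"status": "done", "id": "R1"}  # same data, different key order

# Without a canonical form the serialized bytes differ, so the hashes differ too.
naive_a = hashlib.sha256(json.dumps(a).encode("utf-8")).hexdigest()
naive_b = hashlib.sha256(json.dumps(b).encode("utf-8")).hexdigest()
```

With a canonical form, `stable_digest(a)` and `stable_digest(b)` agree even though the naive digests do not. A canonicalizer exists to provide that property.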

SDIF also includes an AI projection surface, .sdif.ai, designed for token-dense agent exchange while remaining reversible back into canonical SDIF when the projection contract is respected.
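
As a rough illustration of what "reversible" can mean for a key-aliasing projection, consider the sketch below. It is an assumption about one necessary condition (aliases must be unique and must not collide with fields that are left unaliased), not the .sdif.ai projection contract itself. The functions `project` and `restore` are hypothetical and operate on plain dictionaries.

```python
def project(fields: dict, aliases: dict) -> dict:
    """Rename keys (long -> short), refusing alias maps that could not be inverted."""
    shorts = list(aliases.values())
    if len(set(shorts)) != len(shorts):
        raise ValueError("two fields share one alias; the projection is not reversible")
    untouched = set(fields) - set(aliases)
    clash = untouched & set(shorts)
    if clash:
        raise ValueError(f"alias collides with an existing field: {sorted(clash)}")
    return {aliases.get(key, key): value for key, value in fields.items()}


def restore(projected: dict, aliases: dict) -> dict:
    """Invert the alias map to recover the original field names."""
    inverse = {short: long for long, short in aliases.items()}
    return {inverse.get(key, key): value for key, value in projected.items()}


fields = {"kind": "Plan", "status": "open", "id": "release.v1"}
aliases = {"kind": "k", "status": "st"}
projected = project(fields, aliases)
```

Round-tripping through `restore(projected, aliases)` returns the original field names, which is the behaviour a reversible projection needs.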

Quick start

pip install -e '.[dev]'

sdif parse    examples/plan.sdif
sdif canon    examples/plan.sdif
sdif canon    examples/plan.sdif --schema examples/schema.sdif
sdif hash     examples/plan.sdif
sdif validate examples/plan.sdif
sdif validate examples/plan.sdif --schema examples/schema.sdif
sdif tokens   examples/plan.sdif
sdif to-json  examples/plan.sdif
sdif from-json document.json
sdif ai       examples/plan.sdif --alias kind=k --alias status=st

`sdif tokens` reports byte size, tokenizer identity and token count. It uses tiktoken/cl100k_base when available and otherwise falls back to a deterministic 4-bytes-per-token estimate.
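
The fallback can be pictured with a minimal sketch. This is not the code used by the CLI: rounding up, the function names and the `estimate/4-bytes-per-token` identity label are all assumptions made for illustration.

```python
import math


def estimate_tokens(data: bytes, bytes_per_token: int = 4) -> int:
    """Deterministic estimate: one token per 4 bytes, rounded up (rounding is assumed)."""
    return math.ceil(len(data) / bytes_per_token)


def count_tokens(text: str) -> tuple[str, int]:
    """Return (tokenizer identity, token count), preferring tiktoken when usable."""
    try:
        import tiktoken

        encoding = tiktoken.get_encoding("cl100k_base")
        return "tiktoken/cl100k_base", len(encoding.encode(text))
    except Exception:
        # tiktoken is missing or could not load its vocabulary: use the estimate.
        return "estimate/4-bytes-per-token", estimate_tokens(text.encode("utf-8"))
```

Because the estimate depends only on the byte length, it gives the same answer on every machine, which keeps reports comparable when tiktoken is not installed.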

`sdif validate`

sdif validate <file> [--schema <schema>] [--json] [--quiet]

Validates an SDIF document for syntactic correctness. If --schema is provided, the document is also validated against that schema.

Exit codes:

Code	Meaning
`0`	The document is valid SDIF.
`1`	The document is invalid SDIF.
`2`	The command failed for an operational reason (file not found, policy denial).

Output modes:

Flag	Behaviour
(default)	Human-readable result and diagnostics on stdout.
`--json`	Machine-readable JSON `{"valid": bool, "diagnostics": [...]}` on stdout.
`--quiet`	No stdout; result communicated only through the exit code. Takes precedence over `--json`.
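
A script can combine the exit code and the `--json` output as sketched below. The wrapper names `interpret` and `validate` are hypothetical, and the sketch assumes the JSON report is printed for both exit code 0 and exit code 1; the shape of individual diagnostics is not assumed.

```python
import json
import subprocess
from typing import Optional


def interpret(returncode: int, stdout: str) -> dict:
    """Map an `sdif validate --json` result onto the documented exit codes."""
    if returncode == 2:
        # Operational failure: there may be no JSON report to read.
        return {"status": "error", "diagnostics": []}
    if returncode not in (0, 1):
        raise RuntimeError(f"undocumented exit code: {returncode}")
    report = json.loads(stdout)
    return {
        "status": "valid" if returncode == 0 else "invalid",
        "diagnostics": report.get("diagnostics", []),
    }


def validate(path: str, schema: Optional[str] = None) -> dict:
    """Run the CLI and interpret its result; requires `sdif` on PATH."""
    command = ["sdif", "validate", path]
    if schema is not None:
        command += ["--schema", schema]
    command.append("--json")
    completed = subprocess.run(command, capture_output=True, text=True)
    return interpret(completed.returncode, completed.stdout)
```

Keeping `interpret` separate from the subprocess call makes the exit-code handling testable without the CLI installed.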

Format at a glance

JSON repeats field names across every record:

[
  { "id": "R1", "status": "done",    "owner": "build",    "evidence": "reports/build.md"  },
  { "id": "R2", "status": "open",    "owner": "qa",       "evidence": "reports/tests.md"  },
  { "id": "R3", "status": "done",    "owner": "security", "evidence": "reports/audit.md"  }
]

SDIF declares the shape once and separates cells with literal tabs. Editors must preserve these tabs; this is a deliberate tradeoff for compactness:

@sdif 1.0

kind Plan
id   release.v1
title "Release readiness plan"

items[id,status,owner,evidence]:
  R1	done	build	reports/build.md
  R2	open	qa	reports/tests.md
  R3	done	security	reports/audit.md

rel:
  release.v1  validated_by  R1
  release.v1  blocked_by    R2
  release.v1  governed_by   R3

Semantic relationships are first-class, not embedded strings.
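
One way to see what first-class relations buy you is to load the `rel` rows as subject/predicate/object triples. The sketch below is not the sdif-py parser: `parse_rel_rows` is a hypothetical helper, and splitting on any whitespace is a simplification of the real cell syntax.

```python
from collections import defaultdict


def parse_rel_rows(rows):
    """Split each row into a (subject, predicate, object) triple."""
    triples = []
    for row in rows:
        parts = row.split()
        if len(parts) != 3:
            raise ValueError(f"expected 3 cells, got {len(parts)}: {row!r}")
        triples.append(tuple(parts))
    return triples


rows = [
    "release.v1  validated_by  R1",
    "release.v1  blocked_by    R2",
    "release.v1  governed_by   R3",
]
triples = parse_rel_rows(rows)

# Index by predicate so "what blocks the release?" becomes a dictionary lookup.
by_predicate = defaultdict(list)
for subject, predicate, obj in triples:
    by_predicate[predicate].append((subject, obj))
```

Here `by_predicate["blocked_by"]` holds the single pair linking `release.v1` to `R2`, with no string parsing of field values required.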

Structured information closer to a document,
while still behaving like a contract.
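
Returning to the `items` table above, the "shape declared once" idea can be sketched as a conversion from JSON-style records to a header plus tab-separated rows. This is an illustration only: `to_table` is a hypothetical helper, it does no quoting or escaping, and the two-space indent simply mirrors the example rather than any rule from the specification.

```python
import json


def to_table(name, records):
    """Emit the column names once, then one tab-separated row per record."""
    columns = list(records[0])
    for record in records:
        if list(record) != columns:
            raise ValueError("all records must share the same fields in the same order")
    header = f"{name}[{','.join(columns)}]:"
    rows = ["  " + "\t".join(str(record[c]) for c in columns) for record in records]
    return "\n".join([header, *rows])


records = [
    {"id": "R1", "status": "done", "owner": "build", "evidence": "reports/build.md"},
    {"id": "R2", "status": "open", "owner": "qa", "evidence": "reports/tests.md"},
    {"id": "R3", "status": "done", "owner": "security", "evidence": "reports/audit.md"},
]
table = to_table("items", records)
compact_json = json.dumps(records, separators=(",", ":"))
```

For these three records the table text is shorter than the compact JSON because the four field names appear once instead of three times.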

Token efficiency

The benchmark derives every compared format from the same canonical JSON source in examples/golden/. Results below are from the most recent run across 21 documents and 3 tokenizers.

Format	Consensus avg rank	Median ratio vs JSON Compact
SDIF AI	1.10	56.8%
SDIF	2.60	59.5%
CSV Bundle	2.70	61.2%
YAML	5.35	95.3%
JSON Compact	5.65	100.0%
JSON Pretty	7.00	137.3%
XML	8.00	171.7%

SDIF AI wins 57 of the 63 tokenizer/document pairs (21 documents × 3 tokenizers). Canonical SDIF wins 2.

The benchmark repository contains the exact corpus model, generated artifacts and methodology needed to reproduce these numbers.

These results are corpus-dependent. Not every data shape benefits equally from tabular projection. Token counting for the Claude and Llama tokenizers is opt-in and must be enabled before claiming results for those models.

For full methodology, corpus model and per-document breakdowns, see sdif-benchmarks.
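
For readers who want a feel for the two table columns, the sketch below computes a median ratio against a baseline and an average rank. The token counts are made up for illustration, only one tokenizer is modelled, and the metric definitions are a plausible reading of the column names rather than the benchmark's actual code; sdif-benchmarks remains the reference.

```python
from statistics import mean, median

# Made-up token counts for illustration only: {document: {format: tokens}}.
counts = {
    "doc-a": {"JSON Compact": 200, "SDIF": 120, "YAML": 190},
    "doc-b": {"JSON Compact": 100, "SDIF": 70, "YAML": 100},
    "doc-c": {"JSON Compact": 400, "SDIF": 220, "YAML": 380},
}


def median_ratio(counts, fmt, baseline="JSON Compact"):
    """Median over documents of tokens(fmt) / tokens(baseline)."""
    return median(doc[fmt] / doc[baseline] for doc in counts.values())


def average_rank(counts, fmt):
    """Mean rank across documents; 1 means fewest tokens, ties share the best rank."""
    ranks = []
    for doc in counts.values():
        ranks.append(1 + sum(1 for other in doc.values() if other < doc[fmt]))
    return mean(ranks)
```

A lower average rank and a lower median ratio both indicate fewer tokens for the same source data.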

Ecosystem

This GitHub organization hosts the official SDIF ecosystem: the core format, reference tooling, benchmarks, examples, libraries, and editor extensions.

PYTHON CLIENT & CLI

sdif-py

Parser, validator, canonicalizer, and CLI.
The normative reference implementation.

Explore sdif-py →

SPECIFICATION (SSOT)

sdif-spec

Official format specification, canonicalization rules,
and portable conformance test suite.

View specification →

BENCHMARKS

sdif-benchmarks

Reproducible benchmark datasets and reports comparing SDIF with JSON, YAML, XML, and CSV.

View benchmarks →

RUST IMPLEMENTATION

sdif-rs

Pure Rust parser implementation with a span-annotated AST designed for editor tooling.

Explore sdif-rs →

LANGUAGE SERVER (LSP)

sdif-lsp

LSP language server binary (tower-lsp) providing real-time diagnostics and IDE features.

View sdif-lsp →

EDITOR INTEGRATION

vscode-sdif

VS Code extension client providing syntax highlighting, diagnostics, and LSP configuration.

Open extension →

GRAMMAR FOUNDATION

tree-sitter-sdif

Tree-sitter grammar foundation for syntax highlighting and incremental parsing.

Open grammar →

DOCUMENTATION

sdif-format.github.io

Official documentation website containing specification guides, tutorials, and examples.

Read docs →

ORGANIZATION META

.github

Organization profile, assets, and shared community configuration files.

View profile →

Repository map

Repository	Purpose
`sdif-py`	Core Python parser, validator, canonicalizer, and CLI
`sdif-spec`	Official format specification and conformance test suite (SSOT)
`sdif-benchmarks`	Benchmark datasets, reports, and comparison tooling
`sdif-rs`	Rust parser crate with span-annotated AST
`sdif-lsp`	LSP language server binary
`tree-sitter-sdif`	Tree-sitter grammar foundation for syntax highlighting
`vscode-sdif`	VS Code extension client for SDIF
`sdif-format.github.io`	Public documentation website (Docusaurus)
`.github`	Organization profile, assets, and shared GitHub community files

What SDIF is not

SDIF does not try to replace JSON, YAML, CSV, Markdown, XML, Parquet or Protocol Buffers. Those formats are useful and battle-tested.

JSON

Universal and reliable, but noisy when repeated records dominate.

YAML

Readable, but too permissive for deterministic workflows.

CSV

Compact, but loses structure, relations and meaning quickly.

Markdown

Great for humans, not enough when data must be parsed and verified.

SDIF focuses on a narrower problem:

compact, semantic, canonicalizable structured data
that can move cleanly between humans, machines and AI systems.

Documentation

The official specifications and conformance suite are hosted in the sibling sdif-spec repository:

Document	Description
spec.md	Full v1.0.0 specification
canonicalization.md	Canonicalization contract
comparison.md	Format comparison
semantic-quality.md	Semantic quality methodology
ai-speed-profile.md	AI speed profile contract
conformance/	Shared conformance fixtures
examples/	Annotated examples (local in this repo)

Limitations

SDIF 1.0 is the stable core contract. Current benchmark results are promising, but should be read with these boundaries:

results are corpus-dependent;
not every data shape benefits equally from tabular projection;
editors and tools must preserve literal tabs in table rows;
.sdif.ai is an agent projection surface, not the canonical signing surface;
Claude and Llama3 token counting must be enabled separately before claiming results for those tokenizers.

License

MIT. See LICENSE.
