Rust native implementation of OriginBlame — record- and token-level data provenance for AI training datasets.
Paper + benchmark artifact:
Repository:
The Python package's pure-Python fallback scans lines in ~580ms. The Rust native implementation does it in <0.2ms using mmap, rayon parallelism, and a binary index — up to 2,900× faster on show/purge at scale. When the Rust binary is available (built via cargo build --release or on PATH), the Python package automatically delegates to it.
Requires Rust 1.85+ (edition 2024).
cargo build --release
./target/release/ob version

A Python package with an optional Rust backend is available in python/:
cd python
pip install . # Core package (pure Python + Rust backend if binary found)
pip install -e ".[dev]" # Development with pytest + ruff

When the Rust `ob` binary is available (built via `cargo build --release` or on PATH), the Python package automatically delegates performance-critical operations to it. The pure-Python fallback is used when the binary is not found.
Delegation architecture: The Python package uses a 3-tier fallback chain:
- `_ob_native` (PyO3 cdylib): fastest path, compiled from `src/python.rs` with 38 PyO3 bindings
- `ob` binary (subprocess): fallback if the PyO3 module is not built
- Pure Python: last resort, no native dependencies
All CLI commands now delegate to the PyO3 bindings directly, avoiding subprocess overhead.
Build with PyO3 support: maturin develop --release (requires pip install maturin).
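The fallback chain can be pictured as a simple probe order. The following is a minimal sketch, not the package's actual import logic: `select_backend` is a hypothetical helper, and it assumes the PyO3 module is importable under the top-level name `_ob_native` and that the binary is called `ob`, as named in this README.

```python
import importlib.util
import shutil


def select_backend(module_name: str = "_ob_native", binary_name: str = "ob") -> str:
    """Return the tier that would serve a call: 'pyo3', 'subprocess', or 'python'.

    Hypothetical helper for illustration; the names are assumptions.
    """
    # Tier 1: the compiled PyO3 extension module is importable.
    if importlib.util.find_spec(module_name) is not None:
        return "pyo3"
    # Tier 2: the Rust binary is discoverable on PATH.
    if shutil.which(binary_name) is not None:
        return "subprocess"
    # Tier 3: nothing native is available, so stay in pure Python.
    return "python"


if __name__ == "__main__":
    print(select_backend())
```

Probing with `find_spec` and `shutil.which` avoids importing the module or spawning a process just to decide which tier to use.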
from ob import init, author_add, register_section, source, track
init()
author_add("Wikimedia", "wikimedia@example.com")
register_section("raw/wiki.xml", ["wikimedia@example.com"], "CC-BY-SA-4.0", "2024")
# Use source.append with optional section filtering
source.append("raw/wiki.xml") # all sections for this file
source.append("raw/wiki.xml", section="abc123...") # only one section
# Use track with source= parameter
track(data, "data.jsonl", source="raw/wiki.xml") # explicit file path
track(data, "data.jsonl", source=["abc123..."]) # explicit section hashes
track(data, "data.jsonl") # use source stack (backward compatible)

Optional utilities (parsers, embedding reconciliation):
pip install ./python/packages/ob-util
pip install "./python/packages/ob-util[reconcile]" # with torch + sentence-transformers

ob init [PATH] Initialize .ob/ tracking directory
ob author.add NAME EMAIL Register an author
ob register.add --path P --authors A Register a section
--license LICENSE --year YEAR
ob blame FILE LINE Show which section(s) a line belongs to
ob show [--author NAME] [--email EMAIL] Show provenance metadata
[--section HASH] [--license LICENSE]
[--revoked] [--index] [--tokenizer T]
ob revoke --author NAME Revoke at author level (all lines/sections)
ob revoke --section HASH Revoke at section level
ob revoke --line-hash HASH --file FILE Revoke at line level
[--email EMAIL] [--tokenizer T]
ob purge FILE [--index] [--author NAME] [--dry-run] Physically delete revoked data (requires prior revoke)
ob status [--tokenizer TOKENIZER] Show summary statistics
ob clean [--split] Merge PID files, archive revoked records
ob merge --absorb PATH Absorb provenance from another repository (auto-rebuilds index)
ob generate-set --tokenizer T -o FILE Generate binary forget set (bitmask)
ob log [--op OP] [--since TIMESTAMP] Query operation audit log (Python path)
ob index build Build binary index (OBIDXF02)
ob reconcile FILE [-m MODEL] Reconcile provenance after data edits
[-t THRESHOLD] [-e EMBEDDING_API]
[--compute-all-embeddings]
ob export-copyright [-o FILE] Export DEP-5 copyright file
[--data-file FILE]
ob parse --parser FORMAT --file FILE Parse structured data (MediaWiki XML)
[--license LICENSE] [--split]
ob version Show version
Add `--tokenizer NAME` to `show`, `revoke`, or `status` to query token-index entries:
ob show --author InternetArchiveBot --tokenizer gpt2
# → 144480919 tokens across 43332 documents by InternetArchiveBot (tokenizer: gpt2)
ob revoke --author InternetArchiveBot --tokenizer gpt2
ob generate-set --tokenizer gpt2 -o forget.bin

The token-index operates independently of the document-index. No data.jsonl is required: provenance is recorded during the tokenization/pack stage.
.ob/
authors/ who: name, email, id = sha256(name+email)
sections/ what: file path + authors + license, sharded by sha256
document-index/ which line came from where: (line_hash, file, sources)
token-index.gpt2/ per-document token counts with sources (per tokenizer)
index/ binary index: OBIDXF02 with type-tagged IndexRef
log operation audit trail
Three layers: authors ← sections ← document-index. Extended with an independent token-index layer for streaming pipelines.
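The content addressing in the layout above can be illustrated with a short sketch. It assumes plain string concatenation of name and email, UTF-8 encoding, a hex digest, and a two-character shard prefix; the README only states `id = sha256(name+email)` and "sharded by sha256", so the remaining details and the function names are hypothetical.

```python
import hashlib


def author_id(name: str, email: str) -> str:
    # Assumption: name and email are concatenated with no separator,
    # encoded as UTF-8, and the digest is stored as lowercase hex.
    return hashlib.sha256((name + email).encode("utf-8")).hexdigest()


def shard_prefix(digest: str, width: int = 2) -> str:
    # Assumption: shards are keyed by the leading hex characters of the digest.
    return digest[:width]


if __name__ == "__main__":
    aid = author_id("Wikimedia", "wikimedia@example.com")
    print(aid, shard_prefix(aid))
```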
All operations are logged to .ob/log (JSONL audit trail). Use `ob log` to query it; the Rust binary delegates this command to the Python package.
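Filtering a JSONL audit trail by operation and start time, in the spirit of `ob log --op` and `--since`, might look like the sketch below. The record keys `op` and `timestamp` and the ISO 8601 timestamp format are assumptions for illustration; the actual log schema is not described here.

```python
import json
from datetime import datetime


def query_log(lines, op=None, since=None):
    """Yield parsed log records, optionally filtered by operation and time.

    Assumes each record has hypothetical "op" and "timestamp" keys.
    """
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        rec = json.loads(raw)
        if op is not None and rec.get("op") != op:
            continue
        if since is not None and datetime.fromisoformat(rec["timestamp"]) < since:
            continue
        yield rec


sample = [
    '{"op": "revoke", "timestamp": "2024-05-01T12:00:00"}',
    '{"op": "purge", "timestamp": "2024-05-02T09:30:00"}',
    '{"op": "revoke", "timestamp": "2024-05-03T08:00:00"}',
]
recent_revokes = list(query_log(sample, op="revoke", since=datetime(2024, 5, 2)))
```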
Type-tagged references in the index:
| Tag | Type | Content |
|---|---|---|
| `0x00` | `DocumentIndexShard` | bucket prefix byte |
| `0x01` | `TokenIndexRange` | tokenizer, file number, byte offset, length |
A single section can reference both document-index shards and token-index ranges across multiple tokenizers.
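To make the tagged-reference idea concrete, here is a rough sketch of encoding and decoding the two variants. Only the tag values and the field names come from the table above; the byte layout (little-endian, length-prefixed tokenizer name, 32-bit file number, 64-bit offset and length) is an assumption and does not describe the real OBIDXF02 format.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentIndexShard:
    bucket: int  # bucket prefix byte, 0..255


@dataclass(frozen=True)
class TokenIndexRange:
    tokenizer: str
    file_number: int
    offset: int
    length: int


def encode(ref) -> bytes:
    """Serialize a reference as a tag byte followed by its payload."""
    if isinstance(ref, DocumentIndexShard):
        return struct.pack("<BB", 0x00, ref.bucket)
    if isinstance(ref, TokenIndexRange):
        name = ref.tokenizer.encode("utf-8")
        header = struct.pack("<BB", 0x01, len(name))  # assumed 1-byte name length
        body = struct.pack("<IQQ", ref.file_number, ref.offset, ref.length)
        return header + name + body
    raise TypeError(f"unsupported reference type: {type(ref).__name__}")


def decode(buf: bytes):
    """Dispatch on the leading tag byte and rebuild the reference."""
    tag = buf[0]
    if tag == 0x00:
        return DocumentIndexShard(buf[1])
    if tag == 0x01:
        n = buf[1]
        name = buf[2:2 + n].decode("utf-8")
        file_number, offset, length = struct.unpack_from("<IQQ", buf, 2 + n)
        return TokenIndexRange(name, file_number, offset, length)
    raise ValueError(f"unknown tag: {tag:#04x}")
```

Because the tag travels with every reference, one section's list can mix document-index shards and token-index ranges for several tokenizers.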
{"token_count": 5, "sources": ["a3f7b2..."], "tokenizer": "gpt2", "revoked": false}

- `token_count`: tokens produced from one document (or one chunk if > max_context_tokens)
- `sources`: section hashes linking to the author chain
- `tokenizer`: identifier string (e.g., "gpt2", "llama3")
- `revoked`: boolean, marks entry as revoked
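As an illustration of how records with these fields could be aggregated into a "tokens across documents" summary, consider the sketch below. The section hashes are made-up placeholders, `summarize` is a hypothetical helper, and skipping revoked entries is an assumption rather than documented `ob show` behaviour.

```python
import json


def summarize(lines, section_hash):
    """Return (token_total, document_count) for active entries citing a section."""
    tokens = 0
    documents = 0
    for raw in lines:
        rec = json.loads(raw)
        # Assumption: revoked entries are excluded from the summary.
        if rec["revoked"] or section_hash not in rec["sources"]:
            continue
        tokens += rec["token_count"]
        documents += 1
    return tokens, documents


records = [
    '{"token_count": 5, "sources": ["a3f7"], "tokenizer": "gpt2", "revoked": false}',
    '{"token_count": 7, "sources": ["a3f7", "9c1e"], "tokenizer": "gpt2", "revoked": false}',
    '{"token_count": 3, "sources": ["a3f7"], "tokenizer": "gpt2", "revoked": true}',
    '{"token_count": 4, "sources": ["9c1e"], "tokenizer": "gpt2", "revoked": false}',
]
print(summarize(records, "a3f7"))  # (12, 2)
```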
ob generate-set produces a binary bitmask where each bit corresponds to one token-index entry (1 = revoked, 0 = active). Directly usable by unlearning algorithms (NPO, exact retraining, etc.).
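A bitmask of this kind can be sketched in a few lines. The bit order within each byte (least significant bit first) and the absence of any file header are assumptions; the actual on-disk format written by `ob generate-set` may differ, and both function names are hypothetical.

```python
def build_forget_mask(revoked_flags):
    """Pack one bit per token-index entry: 1 = revoked, 0 = active."""
    mask = bytearray((len(revoked_flags) + 7) // 8)  # round up to whole bytes
    for i, revoked in enumerate(revoked_flags):
        if revoked:
            mask[i // 8] |= 1 << (i % 8)  # assumed LSB-first bit order
    return bytes(mask)


def is_revoked(mask, i):
    """Read back bit i from a mask produced by build_forget_mask."""
    return bool((mask[i // 8] >> (i % 8)) & 1)


flags = [False, True, False, False, True, False, False, False, True]
mask = build_forget_mask(flags)
```

A consumer such as an unlearning pipeline would only need the entry index to test membership in the forget set.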
| File | Purpose |
|---|---|
| `show.rs` | Show provenance (line + token level, revoked, --section, --license, --revoked, --index) |
| `token_index.rs` | Token-index storage (PID files, merge, query) |
| `token_bin_index.rs` | OBIDXTI01 token binary index |
| `main.rs` | CLI entry point, command dispatch |
| `merge.rs` | Merge/absorb from another repository with auto index rebuild |
| `binary_index.rs` | OBIDXF02 binary index with type-tagged IndexRef |
| `python.rs` | PyO3 bindings (_ob_native module) |
| `purge.rs` | Physically delete revoked data |
| `revoke.rs` | Revoke at 3 levels (author, section, line) |
| `index.rs` | Index building, scanning token-index files |
| `clean.rs` | Merge PID files, archive revoked records |
| `track.rs` | Track data lines with provenance |
| `generate_set.rs` | Binary forget set generation |
| `export.rs` | DEP-5 copyright export |
| `reconcile.rs` | Two-phase reconcile (hash + embedding) |
| `authors.rs` | Author CRUD and query operations |
| `mmap_lines.rs` | mmap-based file reading and line iteration |
| `storage.rs` | JSONL read/append/mmap utilities |
| `indexer.rs` | Document-index CRUD (lookup, index) |
| `lib.rs` | Library root (test utilities, shared config) |
| `register.rs` | Section registration logic |
| `embeddings.rs` | Embedding storage/retrieval |
| `hash.rs` | SHA-256 content addressing |
| `oplog.rs` | Operation audit log |
| `blame.rs` | Line-level blame lookup |
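The two-phase reconcile listed for `reconcile.rs` (exact hash first, then a similarity fallback) can be sketched as follows. This sketch swaps the embedding comparison for `difflib.SequenceMatcher` from the standard library purely to stay self-contained; the function names, the 0.8 threshold, and the return shape are all hypothetical.

```python
import hashlib
from difflib import SequenceMatcher


def line_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def reconcile(old_lines, new_lines, threshold=0.8):
    """Map each new line index to (old index or None, how it was matched)."""
    by_hash = {line_hash(t): i for i, t in enumerate(old_lines)}
    result = {}
    for j, new in enumerate(new_lines):
        # Phase 1: an exact content-hash match keeps provenance unchanged.
        i = by_hash.get(line_hash(new))
        if i is not None:
            result[j] = (i, "hash")
            continue
        # Phase 2: nearest old line by similarity (stand-in for embeddings).
        best_i, best = None, 0.0
        for i, old in enumerate(old_lines):
            score = SequenceMatcher(None, old, new).ratio()
            if score > best:
                best_i, best = i, score
        if best >= threshold:
            result[j] = (best_i, "similar")
        else:
            result[j] = (None, "unmatched")
    return result


old = ["The cat sat on the mat.", "Rust is fast."]
new = ["The cat sat on the mat.", "Rust is very fast.", "zzzz qqqq 0000"]
matches = reconcile(old, new)
```

Running the cheap hash phase first means the more expensive similarity pass only touches lines that were actually edited.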
See the OriginBlame README for full benchmark results including pipeline throughput, scalability, reconcile recovery, and machine unlearning evaluation.
rust-originblame/
  src/                      Rust native implementation (25 files)
    show.rs                 Show provenance
    token_index.rs          Token-index storage
    token_bin_index.rs      OBIDXTI01 token binary index
    main.rs                 CLI entry point
    merge.rs                Merge/absorb
    binary_index.rs         OBIDXF02 binary index
    python.rs               PyO3 bindings
    purge.rs                Delete revoked data
    revoke.rs               Revoke at 3 levels
    index.rs                Binary index builder
    clean.rs                Merge PIDs, archive
    track.rs                Track data lines
    generate_set.rs         Binary forget set
    export.rs               DEP-5 copyright export
    reconcile.rs            Hash + embedding reconcile
    authors.rs              Author CRUD
    mmap_lines.rs           mmap-based line iteration
    storage.rs              JSONL read/append
    indexer.rs              Document-index CRUD
    lib.rs                  Library root
    register.rs             Section registration
    embeddings.rs           Embedding storage
    hash.rs                 SHA-256 hashing
    oplog.rs                Operation log
    blame.rs                Blame lookup
  python/                   Python package with optional Rust backend
    src/ob/                 Core Python package (CLI, API, all modules)
      api.py                High-level API (3-tier delegation)
      _track.py             Track data with provenance
      authors.py            Author management
      cli/                  CLI commands (typer-based)
      embeddings.py         Embedding storage
      exceptions.py         Error types
      indexer.py            Document-index management
      oplog.py              Operation log
      register.py           Section registration
      rust.py               Rust binary delegation
      source.py             Source context manager
      storage.py            JSONL utilities
      track.py              Track implementation
      util.py               Shared utilities
    packages/
      ob-util/              Optional utilities (parsers, embeddings, copyright)
  Cargo.toml                Rust package manifest (edition 2024)
cargo test # 71 Rust tests
# Python tests (ob-util package only; core ob delegates to Rust)
cd python && pytest packages/ob-util/tests/ # 76 tests

The core `ob` Python package has no standalone tests: all performance-critical operations delegate to the Rust binary, which has its own test suite. The ob-util package (parsers, embeddings, copyright export) has its own test suite.
MIT