rust-originblame

Rust native implementation of OriginBlame — record- and token-level data provenance for AI training datasets.

Paper + benchmark artifact: Repository:

Why Rust

The Python package's pure-Python fallback scans lines in ~580 ms. The Rust native implementation does the same scan in <0.2 ms using mmap, rayon parallelism, and a binary index, which makes show/purge up to 2,900× faster at scale. When the Rust binary is available (built via cargo build --release or found on PATH), the Python package automatically delegates to it.

Build

Requires Rust 1.85+ (edition 2024).

cargo build --release
./target/release/ob version

Python Package

A Python package with an optional Rust backend is available in python/:

cd python
pip install .                        # Core package (pure Python + Rust backend if binary found)
pip install -e ".[dev]"              # Development with pytest + ruff

When the Rust ob binary is available (built via cargo build --release or on PATH), the Python package automatically delegates performance-critical operations to it. Pure-Python fallback is used when the binary is not found.

Delegation architecture: The Python package uses a 3-tier fallback chain, tried in this order:

1. _ob_native (PyO3 cdylib): fastest path, compiled from src/python.rs with 38 PyO3 bindings
2. ob binary (subprocess): fallback if the PyO3 module is not built
3. Pure Python: last resort, no native dependencies

All CLI commands now delegate to the PyO3 bindings directly, avoiding subprocess overhead.
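The selection order of the fallback chain can be sketched as follows. This is a minimal illustration, not the package's actual code: the function names pick_backend and module_available are hypothetical, and importing the extension under the top-level name _ob_native is an assumption (the real module may live inside the ob package).

```python
import importlib
import shutil
from typing import Callable, Optional


def pick_backend(
    find_module: Callable[[str], bool],
    find_binary: Callable[[str], Optional[str]],
) -> str:
    """Return the first available tier, checked in priority order."""
    if find_module("_ob_native"):      # tier 1: PyO3 extension (import name assumed)
        return "pyo3"
    if find_binary("ob") is not None:  # tier 2: ob binary found on PATH
        return "subprocess"
    return "pure-python"               # tier 3: no native dependencies


def module_available(name: str) -> bool:
    """True if `name` can be imported in this interpreter."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False
```

Calling pick_backend(module_available, shutil.which) probes the current environment. Passing the two probes in as arguments keeps the ordering logic testable without a built binary.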

Build with PyO3 support: maturin develop --release (requires pip install maturin).

from ob import init, author_add, register_section, source, track

init()
author_add("Wikimedia", "wikimedia@example.com")
register_section("raw/wiki.xml", ["wikimedia@example.com"], "CC-BY-SA-4.0", "2024")

# Use source.append with optional section filtering
source.append("raw/wiki.xml")  # all sections for this file
source.append("raw/wiki.xml", section="abc123...")  # only one section

# Use track with source= parameter
track(data, "data.jsonl", source="raw/wiki.xml")  # explicit file path
track(data, "data.jsonl", source=["abc123..."])  # explicit section hashes
track(data, "data.jsonl")  # use source stack (backward compatible)

Optional utilities (parsers, embedding reconciliation):

pip install ./python/packages/ob-util
pip install "./python/packages/ob-util[reconcile]"   # with torch + sentence-transformers (quoted so the shell does not expand the brackets)

CLI Reference

ob init [PATH]                          Initialize .ob/ tracking directory
ob author.add NAME EMAIL                Register an author
ob register.add --path P --authors A     Register a section
  --license LICENSE --year YEAR

ob blame FILE LINE                      Show which section(s) a line belongs to
ob show [--author NAME] [--email EMAIL]  Show provenance metadata
  [--section HASH] [--license LICENSE]
  [--revoked] [--index] [--tokenizer T]
ob revoke --author NAME                  Revoke at author level (all lines/sections)
ob revoke --section HASH                 Revoke at section level
ob revoke --line-hash HASH --file FILE   Revoke at line level
  [--email EMAIL] [--tokenizer T]
ob purge FILE [--index] [--author NAME] [--dry-run]  Physically delete revoked data (requires prior revoke)
ob status [--tokenizer TOKENIZER]        Show summary statistics
ob clean [--split]                       Merge PID files, archive revoked records
ob merge --absorb PATH                   Absorb provenance from another repository (auto-rebuilds index)
ob generate-set --tokenizer T -o FILE   Generate binary forget set (bitmask)
ob log [--op OP] [--since TIMESTAMP]   Query operation audit log (Python path)

ob index build                          Build binary index (OBIDXF02)

ob reconcile FILE [-m MODEL]            Reconcile provenance after data edits
  [-t THRESHOLD] [-e EMBEDDING_API]
  [--compute-all-embeddings]
ob export-copyright [-o FILE]           Export DEP-5 copyright file
  [--data-file FILE]
ob parse --parser FORMAT --file FILE    Parse structured data (MediaWiki XML)
  [--license LICENSE] [--split]
ob version                              Show version

Token-Level Provenance

Add --tokenizer NAME to show, revoke, or status to query token-index entries instead of document-index entries:

ob show --author InternetArchiveBot --tokenizer gpt2
# → 144480919 tokens across 43332 documents by InternetArchiveBot (tokenizer: gpt2)

ob revoke --author InternetArchiveBot --tokenizer gpt2
ob generate-set --tokenizer gpt2 -o forget.bin

The token-index operates independently of the document-index. No data.jsonl is required, because provenance is recorded during the tokenization/pack stage.

Architecture

.ob/
  authors/              who: name, email, id = sha256(name+email)
  sections/            what: file path + authors + license, sharded by sha256
  document-index/      which line came from where: (line_hash, file, sources)
  token-index.gpt2/     per-document token counts with sources (per tokenizer)
  index/                binary index: OBIDXF02 with type-tagged IndexRef
  log                   operation audit trail

Three layers: authors ← sections ← document-index. Extended with an independent token-index layer for streaming pipelines.
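The content-addressed ids in the layout above can be illustrated with a short sketch. The layout only states id = sha256(name+email); plain concatenation with no separator, UTF-8 encoding, and a hex digest are assumptions here, and bucket_byte is a hypothetical helper showing how a leading hash byte could serve as a shard prefix.

```python
import hashlib


def author_id(name: str, email: str) -> str:
    """Hex SHA-256 of name + email (concatenation and UTF-8 are assumptions)."""
    return hashlib.sha256((name + email).encode("utf-8")).hexdigest()


def bucket_byte(hex_hash: str) -> int:
    """First byte of a hex digest, usable as a shard or bucket prefix (0..255)."""
    return int(hex_hash[:2], 16)
```

For example, author_id("Wikimedia", "wikimedia@example.com") yields a 64-character hex string, and bucket_byte of that string selects one of 256 shards.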

All operations are logged to .ob/log (a JSONL audit trail). Use ob log to query it; this command runs on the Python path, so the Rust binary delegates it to the Python package.
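Filtering a JSONL audit trail by operation and start time might look like the sketch below. The record field names "op" and "timestamp" are hypothetical (the log schema is not documented here), and the comparison assumes ISO-8601 UTC timestamps in a single consistent format.

```python
import json
from typing import Iterable, Iterator, Optional


def query_log(
    lines: Iterable[str],
    op: Optional[str] = None,
    since: Optional[str] = None,
) -> Iterator[dict]:
    """Yield log records matching an optional op and an optional lower time bound."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if op is not None and record.get("op") != op:
            continue
        # Same-format ISO-8601 UTC strings sort chronologically as plain text.
        if since is not None and record.get("timestamp", "") < since:
            continue
        yield record
```

With a real repository, the lines argument could be an open file handle on .ob/log.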

Binary Index (OBIDXF02)

Type-tagged references in the index:

Tag	Type	Content
`0x00`	DocumentIndexShard	bucket prefix byte
`0x01`	TokenIndexRange	tokenizer, file number, byte offset, length

A single section can reference both document-index shards and token-index ranges across multiple tokenizers.
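To make the idea of type-tagged references concrete, here is a small encode/decode sketch. Only the two tag values and the field lists come from the table above. The byte layout (little-endian, one-byte length-prefixed tokenizer name, 32-bit file number, 64-bit offset and length) is purely illustrative and is not the OBIDXF02 on-disk format.

```python
import struct
from dataclasses import dataclass
from typing import Union

TAG_DOCUMENT_INDEX_SHARD = 0x00
TAG_TOKEN_INDEX_RANGE = 0x01


@dataclass(frozen=True)
class DocumentIndexShard:
    bucket: int  # bucket prefix byte, 0..255


@dataclass(frozen=True)
class TokenIndexRange:
    tokenizer: str
    file_number: int
    offset: int
    length: int


IndexRef = Union[DocumentIndexShard, TokenIndexRange]


def encode(ref: IndexRef) -> bytes:
    """Serialize a reference as a tag byte followed by its payload (layout assumed)."""
    if isinstance(ref, DocumentIndexShard):
        return struct.pack("<BB", TAG_DOCUMENT_INDEX_SHARD, ref.bucket)
    name = ref.tokenizer.encode("utf-8")
    header = struct.pack("<BB", TAG_TOKEN_INDEX_RANGE, len(name))
    return header + name + struct.pack("<IQQ", ref.file_number, ref.offset, ref.length)


def decode(buf: bytes) -> IndexRef:
    """Dispatch on the leading tag byte and rebuild the matching reference."""
    tag = buf[0]
    if tag == TAG_DOCUMENT_INDEX_SHARD:
        return DocumentIndexShard(bucket=buf[1])
    if tag == TAG_TOKEN_INDEX_RANGE:
        name_len = buf[1]
        name = buf[2:2 + name_len].decode("utf-8")
        file_number, offset, length = struct.unpack_from("<IQQ", buf, 2 + name_len)
        return TokenIndexRange(name, file_number, offset, length)
    raise ValueError(f"unknown tag: {tag:#04x}")
```

Because every reference starts with its tag, a section's reference list can mix document-index shards and token-index ranges for several tokenizers in one sequence.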

Token-Index Entry

{"token_count": 5, "sources": ["a3f7b2..."], "tokenizer": "gpt2", "revoked": false}

token_count: tokens produced from one document (or one chunk if > max_context_tokens)
sources: section hashes linking to the author chain
tokenizer: identifier string (e.g., "gpt2", "llama3")
revoked: boolean, marks entry as revoked
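Entries with this shape are easy to aggregate. The sketch below sums token counts per source section, assuming one JSON object per line (JSONL) and the four fields shown above; tokens_per_source is a hypothetical helper, not part of the ob API.

```python
import json
from collections import Counter
from typing import Iterable


def tokens_per_source(lines: Iterable[str], include_revoked: bool = False) -> Counter:
    """Sum token_count per section hash, skipping revoked entries by default."""
    totals: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        if entry["revoked"] and not include_revoked:
            continue
        # An entry with several sources credits its full count to each of them.
        for section_hash in entry["sources"]:
            totals[section_hash] += entry["token_count"]
    return totals
```

Crediting the full count to every listed source is a simplification chosen for this sketch; a real report might split or deduplicate shared documents differently.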

Forget Set Generation

ob generate-set produces a binary bitmask in which each bit corresponds to one token-index entry (1 = revoked, 0 = active). The output is directly usable by unlearning algorithms (NPO, exact retraining, etc.).
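The one-bit-per-entry idea can be sketched in a few lines. Bit order within a byte (least significant bit first) and the absence of any file header are assumptions made for illustration; the actual forget.bin layout may differ.

```python
from typing import Iterable


def build_bitmask(revoked_flags: Iterable[bool]) -> bytes:
    """Pack one bit per entry: 1 = revoked, 0 = active (LSB-first is assumed)."""
    flags = list(revoked_flags)
    mask = bytearray((len(flags) + 7) // 8)  # round up to whole bytes
    for i, revoked in enumerate(flags):
        if revoked:
            mask[i // 8] |= 1 << (i % 8)
    return bytes(mask)


def is_revoked(mask: bytes, index: int) -> bool:
    """Read back the bit for entry `index` using the same bit order."""
    return bool((mask[index // 8] >> (index % 8)) & 1)
```

A training loop could use is_revoked(mask, i) to decide whether entry i belongs to the forget set or the retain set.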

Modules

File	Purpose
`show.rs`	Show provenance (line + token level; supports --section, --license, --revoked, --index)
`token_index.rs`	Token-index storage (PID files, merge, query)
`token_bin_index.rs`	OBIDXTI01 token binary index
`main.rs`	CLI entry point, command dispatch
`merge.rs`	Merge/absorb from another repository with auto index rebuild
`binary_index.rs`	OBIDXF02 binary index with type-tagged IndexRef
`python.rs`	PyO3 bindings (_ob_native module)
`purge.rs`	Physically delete revoked data
`revoke.rs`	Revoke at 3 levels (author, section, line)
`index.rs`	Index building, scanning token-index files
`clean.rs`	Merge PID files, archive revoked records
`track.rs`	Track data lines with provenance
`generate_set.rs`	Binary forget set generation
`export.rs`	DEP-5 copyright export
`reconcile.rs`	Two-phase reconcile (hash + embedding)
`authors.rs`	Author CRUD and query operations
`mmap_lines.rs`	mmap-based file reading and line iteration
`storage.rs`	JSONL read/append/mmap utilities
`indexer.rs`	Document-index CRUD (lookup, index)
`lib.rs`	Library root (test utilities, shared config)
`register.rs`	Section registration logic
`embeddings.rs`	Embedding storage/retrieval
`hash.rs`	SHA-256 content addressing
`oplog.rs`	Operation audit log
`blame.rs`	Line-level blame lookup

Performance

See the OriginBlame README for full benchmark results including pipeline throughput, scalability, reconcile recovery, and machine unlearning evaluation.

Repository Structure

rust-originblame/
  src/                  Rust native implementation (25 files)
    show.rs             Show provenance
    token_index.rs      Token-index storage
    token_bin_index.rs  OBIDXTI01 token binary index
    main.rs             CLI entry point
    merge.rs            Merge/absorb
    binary_index.rs     OBIDXF02 binary index
    python.rs           PyO3 bindings
    purge.rs            Delete revoked data
    revoke.rs           Revoke at 3 levels
    index.rs            Binary index builder
    clean.rs            Merge PIDs, archive
    track.rs            Track data lines
    generate_set.rs     Binary forget set
    export.rs           DEP-5 copyright export
    reconcile.rs        Hash + embedding reconcile
    authors.rs          Author CRUD
    mmap_lines.rs       mmap-based line iteration
    storage.rs          JSONL read/append
    indexer.rs          Document-index CRUD
    lib.rs              Library root
    register.rs         Section registration
    embeddings.rs       Embedding storage
    hash.rs             SHA-256 hashing
    oplog.rs            Operation log
    blame.rs            Blame lookup
  python/               Python package with optional Rust backend
    src/ob/             Core Python package (CLI, API, all modules)
      api.py            High-level API (3-tier delegation)
      _track.py         Track data with provenance
      authors.py        Author management
      cli/              CLI commands (typer-based)
      embeddings.py     Embedding storage
      exceptions.py     Error types
      indexer.py        Document-index management
      oplog.py          Operation log
      register.py       Section registration
      rust.py           Rust binary delegation
      source.py         Source context manager
      storage.py        JSONL utilities
      track.py          Track implementation
      util.py           Shared utilities
    packages/
      ob-util/          Optional utilities (parsers, embeddings, copyright)
  Cargo.toml            Rust package manifest (edition 2024)

Tests

cargo test    # 71 Rust tests

# Python tests (ob-util package only; core ob delegates to Rust)
cd python && pytest packages/ob-util/tests/    # 76 tests

The core ob Python package has no standalone tests — all performance-critical operations delegate to the Rust binary which has its own test suite. The ob-util package (parsers, embeddings, copyright export) has its own test suite.

License

MIT
