tzbkk/rust-originblame

rust-originblame

Rust native implementation of OriginBlame — record- and token-level data provenance for AI training datasets.

Paper + benchmark artifact: DOI Repository: DOI

Why Rust

The Python package's pure-Python fallback scans lines in ~580 ms. The Rust native implementation does the same work in <0.2 ms using mmap, rayon parallelism, and a binary index, which makes it up to 2,900× faster on show/purge at scale. When the Rust binary is available (built via cargo build --release or on PATH), the Python package automatically delegates to it.

Build

Requires Rust 1.85+ (edition 2024).

cargo build --release
./target/release/ob version

Python Package

A Python package with an optional Rust backend is available in python/:

cd python
pip install .                        # Core package (pure Python + Rust backend if binary found)
pip install -e ".[dev]"              # Development with pytest + ruff

When the Rust ob binary is available (built via cargo build --release or on PATH), the Python package automatically delegates performance-critical operations to it. The pure-Python fallback is used only when the binary is not found.

Delegation architecture: The Python package uses a 3-tier fallback chain:

  1. _ob_native (PyO3 cdylib) — fastest path, compiled from src/python.rs with 38 PyO3 bindings
  2. ob binary (subprocess) — fallback if PyO3 module not built
  3. Pure Python — last resort, no native dependencies

All CLI commands now delegate to the PyO3 bindings directly, avoiding subprocess overhead.
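
The fallback chain above can be sketched as a small selection function. This is only an illustration: the function name select_backend, its return labels, and the assumption that the PyO3 module is importable as a top-level _ob_native are hypothetical. The package's real logic (in api.py and rust.py) may differ.

```python
import importlib.util
import shutil


def select_backend(module_name: str = "_ob_native", binary_name: str = "ob") -> str:
    """Return which tier would be used: "pyo3", "subprocess", or "python"."""
    # Tier 1: the compiled PyO3 extension module, if it can be found for import.
    # Assumption: the module is importable under a top-level name.
    if importlib.util.find_spec(module_name) is not None:
        return "pyo3"
    # Tier 2: the standalone binary, if it is on PATH.
    if shutil.which(binary_name) is not None:
        return "subprocess"
    # Tier 3: pure Python, no native dependencies.
    return "python"


if __name__ == "__main__":
    print(select_backend())
```

Probing for availability up front, rather than catching failures at call time, keeps the choice of tier in one place and makes it easy to report which backend is active.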

Build with PyO3 support: maturin develop --release (requires pip install maturin).

from ob import init, author_add, register_section, source, track

init()
author_add("Wikimedia", "wikimedia@example.com")
register_section("raw/wiki.xml", ["wikimedia@example.com"], "CC-BY-SA-4.0", "2024")

# Use source.append with optional section filtering
source.append("raw/wiki.xml")  # all sections for this file
source.append("raw/wiki.xml", section="abc123...")  # only one section

# Use track with source= parameter
track(data, "data.jsonl", source="raw/wiki.xml")  # explicit file path
track(data, "data.jsonl", source=["abc123..."])  # explicit section hashes
track(data, "data.jsonl")  # use source stack (backward compatible)

Optional utilities (parsers, embedding reconciliation):

pip install ./python/packages/ob-util
pip install "./python/packages/ob-util[reconcile]"   # with torch + sentence-transformers (quoted so shells like zsh do not expand the brackets)

CLI Reference

ob init [PATH]                          Initialize .ob/ tracking directory
ob author.add NAME EMAIL                Register an author
ob register.add --path P --authors A     Register a section
  --license LICENSE --year YEAR

ob blame FILE LINE                      Show which section(s) a line belongs to
ob show [--author NAME] [--email EMAIL]  Show provenance metadata
  [--section HASH] [--license LICENSE]
  [--revoked] [--index] [--tokenizer T]
ob revoke --author NAME                  Revoke at author level (all lines/sections)
ob revoke --section HASH                 Revoke at section level
ob revoke --line-hash HASH --file FILE   Revoke at line level
  [--email EMAIL] [--tokenizer T]
ob purge FILE [--index] [--author NAME] [--dry-run]  Physically delete revoked data (requires prior revoke)
ob status [--tokenizer TOKENIZER]        Show summary statistics
ob clean [--split]                       Merge PID files, archive revoked records
ob merge --absorb PATH                   Absorb provenance from another repository (auto-rebuilds index)
ob generate-set --tokenizer T -o FILE   Generate binary forget set (bitmask)
ob log [--op OP] [--since TIMESTAMP]   Query operation audit log (Python path)

ob index build                          Build binary index (OBIDXF02)

ob reconcile FILE [-m MODEL]            Reconcile provenance after data edits
  [-t THRESHOLD] [-e EMBEDDING_API]
  [--compute-all-embeddings]
ob export-copyright [-o FILE]           Export DEP-5 copyright file
  [--data-file FILE]
ob parse --parser FORMAT --file FILE    Parse structured data (MediaWiki XML)
  [--license LICENSE] [--split]
ob version                              Show version

Token-Level Provenance

Add --tokenizer NAME to show, revoke, or status to operate on token-index entries instead of document-index entries:

ob show --author InternetArchiveBot --tokenizer gpt2
# → 144480919 tokens across 43332 documents by InternetArchiveBot (tokenizer: gpt2)

ob revoke --author InternetArchiveBot --tokenizer gpt2
ob generate-set --tokenizer gpt2 -o forget.bin

The token-index operates independently of the document-index. No data.jsonl required — provenance is recorded during the tokenization/pack stage.

Architecture

.ob/
  authors/              who: name, email, id = sha256(name+email)
  sections/             what: file path + authors + license, sharded by sha256
  document-index/       which line came from where: (line_hash, file, sources)
  token-index.gpt2/     per-document token counts with sources (per tokenizer)
  index/                binary index: OBIDXF02 with type-tagged IndexRef
  log                   operation audit trail

Three layers: authors ← sections ← document-index. Extended with an independent token-index layer for streaming pipelines.
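
As a rough illustration of how the three layers chain together, the sketch below computes an author id as sha256(name+email) and resolves a data line back to its authors. Only the id formula comes from the layout above. The byte encoding (UTF-8, hex digest), the record shapes, the section key, and the helper names (sha256_hex, authors_for_line) are assumptions made for this example.

```python
import hashlib


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def author_id(name: str, email: str) -> str:
    # From the layout above: id = sha256(name+email).
    # Assumption: plain concatenation, UTF-8 bytes, hex digest.
    return sha256_hex(name + email)


# In-memory stand-ins for the three directories (hypothetical record shapes).
authors = {}
sections = {}
document_index = {}

wikimedia = author_id("Wikimedia", "wikimedia@example.com")
authors[wikimedia] = {"name": "Wikimedia", "email": "wikimedia@example.com"}

# Assumption: the section key here is simply a hash of its path.
wiki_section = sha256_hex("raw/wiki.xml")
sections[wiki_section] = {
    "path": "raw/wiki.xml",
    "authors": [wikimedia],
    "license": "CC-BY-SA-4.0",
}

line = "Anarchism is a political philosophy."
document_index[sha256_hex(line)] = {"file": "data.jsonl", "sources": [wiki_section]}


def authors_for_line(text: str) -> list[str]:
    """Walk document-index -> sections -> authors for one line of data."""
    record = document_index.get(sha256_hex(text))
    if record is None:
        return []
    names = []
    for section_hash in record["sources"]:
        for aid in sections[section_hash]["authors"]:
            names.append(authors[aid]["name"])
    return names


if __name__ == "__main__":
    print(authors_for_line(line))
```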

All operations are logged to .ob/log, a JSONL audit trail. Use ob log to query it; the command is implemented in the Python package, and the Rust binary delegates to it.
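
To give a feel for what querying a JSONL audit trail involves, here is a minimal filter in the spirit of ob log --op and --since. The field names "op" and "timestamp" and the ISO 8601 UTC timestamp format are assumptions for illustration only; the actual record schema of .ob/log is not documented here.

```python
import json


def query_log(lines, op=None, since=None):
    """Filter JSONL audit records by operation name and minimum timestamp."""
    results = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        record = json.loads(raw)
        # Hypothetical field name: "op".
        if op is not None and record.get("op") != op:
            continue
        # Hypothetical field name: "timestamp". ISO 8601 strings in the same
        # timezone and format compare correctly as plain strings.
        if since is not None and record.get("timestamp", "") < since:
            continue
        results.append(record)
    return results


sample_log = [
    '{"op": "revoke", "timestamp": "2024-05-01T10:00:00Z"}',
    '{"op": "track", "timestamp": "2024-05-02T09:00:00Z"}',
    '{"op": "revoke", "timestamp": "2024-05-03T08:00:00Z"}',
]

if __name__ == "__main__":
    for rec in query_log(sample_log, op="revoke", since="2024-05-02T00:00:00Z"):
        print(rec)
```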

Binary Index (OBIDXF02)

Type-tagged references in the index:

Tag   Type                Content
0x00  DocumentIndexShard  bucket prefix byte
0x01  TokenIndexRange     tokenizer, file number, byte offset, length

A single section can reference both document-index shards and token-index ranges across multiple tokenizers.
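
One way to picture a type-tagged reference is as a tag byte followed by a payload whose shape depends on the tag. The sketch below models the two reference types from the table and round-trips them through bytes. The tag values and field names come from the table. The byte layout (little-endian, length-prefixed tokenizer name, field widths) is invented for this example and is not the OBIDXF02 on-disk format.

```python
import struct
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class DocumentIndexShard:
    bucket: int  # bucket prefix byte, 0..255


@dataclass(frozen=True)
class TokenIndexRange:
    tokenizer: str
    file_number: int
    byte_offset: int
    length: int


IndexRef = Union[DocumentIndexShard, TokenIndexRange]


def encode_ref(ref: IndexRef) -> bytes:
    """Serialize a reference as tag byte + payload (hypothetical layout)."""
    if isinstance(ref, DocumentIndexShard):
        return struct.pack("<BB", 0x00, ref.bucket)
    name = ref.tokenizer.encode("utf-8")
    header = struct.pack("<BH", 0x01, len(name))
    body = struct.pack("<IQQ", ref.file_number, ref.byte_offset, ref.length)
    return header + name + body


def decode_ref(data: bytes) -> IndexRef:
    """Inverse of encode_ref: dispatch on the leading tag byte."""
    tag = data[0]
    if tag == 0x00:
        return DocumentIndexShard(bucket=data[1])
    if tag == 0x01:
        (name_len,) = struct.unpack_from("<H", data, 1)
        name = data[3:3 + name_len].decode("utf-8")
        file_number, byte_offset, length = struct.unpack_from("<IQQ", data, 3 + name_len)
        return TokenIndexRange(name, file_number, byte_offset, length)
    raise ValueError(f"unknown IndexRef tag: {tag:#04x}")


if __name__ == "__main__":
    refs = [DocumentIndexShard(0xA3), TokenIndexRange("gpt2", 7, 4096, 512)]
    for ref in refs:
        print(decode_ref(encode_ref(ref)))
```

Because each reference carries its own tag, a section's reference list can mix document-index shards and token-index ranges for several tokenizers without a separate list per type.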

Token-Index Entry

{"token_count": 5, "sources": ["a3f7b2..."], "tokenizer": "gpt2", "revoked": false}
  • token_count: tokens produced from one document (or one chunk if > max_context_tokens)
  • sources: section hashes linking to the author chain
  • tokenizer: identifier string (e.g., "gpt2", "llama3")
  • revoked: boolean, marks entry as revoked
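
As a small worked example of consuming these entries, the sketch below totals active (non-revoked) tokens per section hash for one tokenizer. The field names match the entry shown above. The function name and the choice to credit an entry's full token_count to each of its sources are simplifying assumptions for illustration.

```python
import json
from collections import Counter


def active_tokens_by_source(jsonl_lines, tokenizer):
    """Sum token_count per section hash, skipping revoked entries."""
    totals = Counter()
    for raw in jsonl_lines:
        if not raw.strip():
            continue
        entry = json.loads(raw)
        if entry["tokenizer"] != tokenizer or entry["revoked"]:
            continue
        # Assumption: an entry with several sources credits each one in full.
        for section_hash in entry["sources"]:
            totals[section_hash] += entry["token_count"]
    return dict(totals)


sample_entries = [
    '{"token_count": 5, "sources": ["a3f7b2"], "tokenizer": "gpt2", "revoked": false}',
    '{"token_count": 8, "sources": ["a3f7b2", "c91d04"], "tokenizer": "gpt2", "revoked": false}',
    '{"token_count": 3, "sources": ["c91d04"], "tokenizer": "gpt2", "revoked": true}',
    '{"token_count": 9, "sources": ["a3f7b2"], "tokenizer": "llama3", "revoked": false}',
]

if __name__ == "__main__":
    print(active_tokens_by_source(sample_entries, "gpt2"))
```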

Forget Set Generation

ob generate-set produces a binary bitmask in which each bit corresponds to one token-index entry (1 = revoked, 0 = active). The bitmask is directly usable by unlearning algorithms (NPO, exact retraining, etc.).
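
The idea of one bit per token-index entry can be sketched in a few lines. The bit order within each byte (least significant bit first) and the absence of any file header are assumptions made for this example. The actual layout written by ob generate-set may differ, so check the real format before reading forget.bin this way.

```python
def build_forget_mask(revoked_flags):
    """Pack booleans into bytes: bit i is 1 if entry i is revoked."""
    mask = bytearray((len(revoked_flags) + 7) // 8)
    for i, revoked in enumerate(revoked_flags):
        if revoked:
            # Assumption: least significant bit first within each byte.
            mask[i // 8] |= 1 << (i % 8)
    return bytes(mask)


def is_revoked(mask: bytes, index: int) -> bool:
    """Read bit `index` back out of a mask built by build_forget_mask."""
    return bool(mask[index // 8] & (1 << (index % 8)))


if __name__ == "__main__":
    flags = [True, False, False, True]
    mask = build_forget_mask(flags)
    print(mask.hex(), [is_revoked(mask, i) for i in range(len(flags))])
```

A bitmask keeps the forget set compact (one byte per eight entries) and lets a training or unlearning loop test membership by index without parsing JSON.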

Modules

File                Purpose
show.rs             Show provenance (line + token level, revoked, --section, --license, --revoked, --index)
token_index.rs      Token-index storage (PID files, merge, query)
token_bin_index.rs  OBIDXTI01 token binary index
main.rs             CLI entry point, command dispatch
merge.rs            Merge/absorb from another repository with auto index rebuild
binary_index.rs     OBIDXF02 binary index with type-tagged IndexRef
python.rs           PyO3 bindings (_ob_native module)
purge.rs            Physically delete revoked data
revoke.rs           Revoke at 3 levels (author, section, line)
index.rs            Index building, scanning token-index files
clean.rs            Merge PID files, archive revoked records
track.rs            Track data lines with provenance
generate_set.rs     Binary forget set generation
export.rs           DEP-5 copyright export
reconcile.rs        Two-phase reconcile (hash + embedding)
authors.rs          Author CRUD and query operations
mmap_lines.rs       mmap-based file reading and line iteration
storage.rs          JSONL read/append/mmap utilities
indexer.rs          Document-index CRUD (lookup, index)
lib.rs              Library root (test utilities, shared config)
register.rs         Section registration logic
embeddings.rs       Embedding storage/retrieval
hash.rs             SHA-256 content addressing
oplog.rs            Operation audit log
blame.rs            Line-level blame lookup

Performance

See the OriginBlame README for full benchmark results including pipeline throughput, scalability, reconcile recovery, and machine unlearning evaluation.

Repository Structure

rust-originblame/
  src/                  Rust native implementation (25 files)
    show.rs             Show provenance
    token_index.rs      Token-index storage
    token_bin_index.rs  OBIDXTI01 token binary index
    main.rs             CLI entry point
    merge.rs            Merge/absorb
    binary_index.rs     OBIDXF02 binary index
    python.rs           PyO3 bindings
    purge.rs            Delete revoked data
    revoke.rs           Revoke at 3 levels
    index.rs            Binary index builder
    clean.rs            Merge PIDs, archive
    track.rs            Track data lines
    generate_set.rs     Binary forget set
    export.rs           DEP-5 copyright export
    reconcile.rs        Hash + embedding reconcile
    authors.rs          Author CRUD
    mmap_lines.rs       mmap-based line iteration
    storage.rs          JSONL read/append
    indexer.rs          Document-index CRUD
    lib.rs              Library root
    register.rs         Section registration
    embeddings.rs       Embedding storage
    hash.rs             SHA-256 hashing
    oplog.rs            Operation log
    blame.rs            Blame lookup
  python/               Python package with optional Rust backend
    src/ob/             Core Python package (CLI, API, all modules)
      api.py            High-level API (3-tier delegation)
      _track.py         Track data with provenance
      authors.py        Author management
      cli/              CLI commands (typer-based)
      embeddings.py     Embedding storage
      exceptions.py     Error types
      indexer.py        Document-index management
      oplog.py          Operation log
      register.py       Section registration
      rust.py           Rust binary delegation
      source.py         Source context manager
      storage.py        JSONL utilities
      track.py          Track implementation
      util.py           Shared utilities
    packages/
      ob-util/          Optional utilities (parsers, embeddings, copyright)
  Cargo.toml            Rust package manifest (edition 2024)

Tests

cargo test    # 71 Rust tests

# Python tests (ob-util package only; core ob delegates to Rust)
cd python && pytest packages/ob-util/tests/    # 76 tests

The core ob Python package has no standalone tests: all performance-critical operations delegate to the Rust binary, which has its own test suite. The ob-util package (parsers, embeddings, copyright export) is tested separately.

License

MIT
