tzbkk/originblame


OriginBlame

Record- and token-level data provenance for AI training datasets.


When a data contributor requests removal, model trainers face a practical gap: unlearning algorithms require a forget set, yet no existing tool can locate which training records belong to a given author. Existing provenance systems operate at file or dataset level, which forces severe over-deletion. We present ob, a record- and token-level data provenance system that propagates author identity through data processing pipelines and resolves revocation requests into precise forget sets via deterministic queries. Evaluation on 219,555 Wikipedia pages shows that record-level provenance eliminates the over-deletion incurred by dataset-level removal (1.3× to 101×, depending on the author's share of the data), while integration adds 1.3–4.0% throughput overhead with HuggingFace and 2.1–19.0% with Datatrove on wiki data. On a 1.7B model, provenance-based forget sets improve unlearning by 42% over random baselines.

Install

Source code lives in the rust-originblame repository, which contains both the Rust native implementation and the Python package.

# Rust binary (recommended for performance)
cd rust-originblame && cargo build --release

# Python package (with optional Rust backend)
cd rust-originblame/python && pip install .

The CLI requires Python >= 3.12 and typer. When the Rust binary is available, the Python package automatically delegates performance-critical operations to it.

5-Minute Quickstart

# 1. Initialize tracking in your dataset directory
ob init

# 2. Register an author (e.g., a data source you scraped from)
ob author.add "Wikimedia" "wikimedia@example.com"

# 3. Register a section (a file + author + license combination)
ob register.add \
  --path raw/wiki_en.xml \
  --authors wikimedia@example.com \
  --license CC-BY-SA-4.0 \
  --year 2024

# 4. Track data lines as you process them (Python API)
python << 'EOF'
import sys
sys.path.insert(0, "path/to/rust-originblame/python/src")
from ob import source, track

source.append("raw/wiki_en.xml")  # activate the section
with open("data.jsonl") as f:     # close the file when done
    for line in f:
        record = {"text": line.strip(), "lang": "en"}
        track(record, file="data.jsonl")
source.pop()  # deactivate the section
EOF

# 5. Check provenance for a specific line
ob blame data.jsonl 42

That's it. ob blame tells you exactly which author/section a line came from.

How It Works

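OriginBlame tracks provenance at data collection time, not retroactively. It stores metadata in .ob/ inside your repository, mostly as plain JSONL files. Three core layers (authors, sections, document index) are complemented by a per-tokenizer token index, a binary lookup index, and an audit log: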

.ob/
  authors/           # who: name, email, id = sha256(name+email)
  sections/          # what: file path + authors + license, sharded by sha256
  document-index/    # which line came from where: (line_hash, file, sources)
  token-index.gpt2/  # how many tokens each document contributed (per tokenizer)
  index/             # binary index: id → refs[] (bucket prefixes + token ranges)
  log                # operation audit trail
  • Content-addressable: every record is indexed by SHA-256 hash. No IDs to manage, no central database.
  • Decentralized: metadata lives in your repo. No server, no config files, no external state.
  • Zero ML dependencies: the core ob package has no ML imports. Optional ob-util adds parsers and embedding-based reconciliation.
  • Reconcile after edits: when data files change, ob reconcile uses hash matching (Pass 1) and optional embedding similarity (Pass 2) to re-link provenance to modified lines.
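The content-addressable layout described above can be illustrated with Python's standard hashlib. This is a minimal sketch, not ob's actual code: the directory listing only states id = sha256(name+email), so the plain UTF-8 concatenation without a separator, the line normalization, and the two-character shard prefix are all assumptions, and the helper names are hypothetical.

```python
import hashlib


def author_id(name: str, email: str) -> str:
    """Hypothetical helper: SHA-256 over name + email.

    Assumes plain concatenation encoded as UTF-8; the real separator and
    encoding used by ob are not documented here.
    """
    return hashlib.sha256((name + email).encode("utf-8")).hexdigest()


def line_hash(line: str) -> str:
    """Hypothetical helper: SHA-256 over a data line without its trailing newline."""
    return hashlib.sha256(line.rstrip("\n").encode("utf-8")).hexdigest()


def shard_prefix(digest: str, width: int = 2) -> str:
    """Leading hex characters of a digest, usable as a shard directory name.

    The width of 2 is an assumption chosen for illustration.
    """
    return digest[:width]


wikimedia = author_id("Wikimedia", "wikimedia@example.com")
print(wikimedia, shard_prefix(wikimedia))
```

Because the identifier is derived purely from the content, two machines that register the same author or track the same line arrive at the same key without coordinating, which is what makes the "no central database" property possible.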

CLI Reference

ob init                           Initialize .ob/ tracking directory
ob author.add NAME EMAIL          Register an author
ob register.add --path PATH --authors EMAIL --license LICENSE --year YEAR
                                  Register a section (path + authors + license + year)
ob blame FILE LINE                Show which section(s) a specific line belongs to
ob show [--author NAME] [--email EMAIL] [--section HASH] [--license NAME] [--revoked] [--index]
                                  Show provenance metadata by various dimensions
ob revoke (--author NAME | --email EMAIL | --section HASH | --line-hash HASH --file FILE)
                                  Revoke author/section claims at multiple granularities
ob purge --file FILE [--author EMAIL] [--dry-run] [--index]
                                  Physically delete revoked data from tracked files
ob clean                          Merge PID files, archive revoked records, rotate log
ob merge absorb PATH              Absorb provenance from another repository
ob reconcile FILE                 Reconcile provenance after data edits (hash + optional embedding)
  --model/-m MODEL                Embedding model name (enables Pass 2 semantic matching)
  --threshold/-t FLOAT            Cosine similarity threshold (default: 0.85)
  --embedding-api/-e URL          OpenAI-compatible embedding API URL
  --compute-all-embeddings        Compute and store embeddings for ALL lines
ob index build                    Build provenance index for fast lookups
ob status                         Show summary statistics (authors, sections, records)
ob log                            Show operation audit trail
ob version                        Show ob version

# Token-level provenance (add --tokenizer to show/revoke/status)
ob show --author Alice --tokenizer gpt2     Show token counts by author
ob revoke --author Alice --tokenizer gpt2   Revoke author's token entries
ob status --tokenizer gpt2                  Show token-index statistics
ob generate-set --tokenizer gpt2 -o forget.bin  Generate binary forget set (bitmask)
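To make the idea of a binary forget set concrete, the sketch below turns revoked token ranges into a bitmask with one bit per token. It is an illustration under stated assumptions only: the real layout of forget.bin is not described in this README, so the half-open ranges, the LSB-first bit order, and the function names are hypothetical.

```python
def build_bitmask(total_tokens: int, revoked_ranges: list[tuple[int, int]]) -> bytes:
    """Return a bitmask with one bit per token; bit i is 1 if token i is revoked.

    Ranges are half-open [start, end). Bit order within a byte is LSB-first,
    which is an assumption of this sketch, not the documented file format.
    """
    mask = bytearray((total_tokens + 7) // 8)
    for start, end in revoked_ranges:
        if not 0 <= start <= end <= total_tokens:
            raise ValueError(f"range ({start}, {end}) outside 0..{total_tokens}")
        for pos in range(start, end):
            mask[pos // 8] |= 1 << (pos % 8)
    return bytes(mask)


def is_revoked(mask: bytes, pos: int) -> bool:
    """Check whether the token at position pos is marked in the mask."""
    return bool((mask[pos // 8] >> (pos % 8)) & 1)


mask = build_bitmask(20, [(2, 5), (16, 18)])
revoked_positions = [i for i in range(20) if is_revoked(mask, i)]
print(revoked_positions)
```

A bitmask of this kind costs one bit per token regardless of how many authors are revoked, so a training loop could consult it with a constant-time lookup per position.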

Python API

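import json
import sys
sys.path.insert(0, "path/to/rust-originblame/python/src")
from ob import init, author_add, register_section, source, track


def read_jsonl(path):
    """Local helper (not part of ob): yield one dict per non-empty JSONL line."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)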

init()

# Register provenance metadata
author_add("Wikimedia", "wikimedia@example.com")
register_section("raw/wiki.xml", ["wikimedia@example.com"], "CC-BY-SA-4.0", "2024")

# Track data lines (uses source stack for attribution)
source.append("raw/wiki.xml")
for record in read_jsonl("data.jsonl"):
    track(record, file="data.jsonl")
source.pop()

# Section filtering (optional: activate a specific section by hash).
# The hash is a placeholder; `record` is any record you want to track.
source.append("raw/wiki.xml", section="abcd1234efgh5678")
track(record, file="data.jsonl")
source.pop()

# Scoped tracking with context manager
with source.sources("raw/wiki.xml"):
    for record in read_jsonl("data.jsonl"):
        track(record, file="data.jsonl")

# Query provenance via CLI (delegates to native Rust binary)
# ob blame data.jsonl 42
# ob show --author Wikimedia
# ob revoke --author Wikimedia --tokenizer gpt2
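The source stack used for attribution in the API above can be pictured with a small standalone model. This is a sketch of the general pattern, not ob's implementation: the class SourceStack, the function track_demo, and the choice to attribute a record to every source currently on the stack are assumptions made for illustration.

```python
from contextlib import contextmanager


class SourceStack:
    """Illustrative model of a source stack (hypothetical, not ob's code)."""

    def __init__(self):
        self._stack = []

    def append(self, path):
        self._stack.append(path)

    def pop(self):
        return self._stack.pop()

    def active(self):
        """Return a copy of the sources that are currently active."""
        return list(self._stack)

    @contextmanager
    def sources(self, *paths):
        """Activate paths for the duration of a with-block, then restore the stack."""
        for path in paths:
            self.append(path)
        try:
            yield self
        finally:
            for _ in paths:
                self.pop()


stack = SourceStack()
tracked = []


def track_demo(record, file):
    """Record which sources were active when the record was tracked."""
    tracked.append((file, record["text"], stack.active()))


with stack.sources("raw/wiki.xml"):
    track_demo({"text": "first line"}, file="data.jsonl")
    with stack.sources("raw/extra.xml"):
        track_demo({"text": "second line"}, file="data.jsonl")

print(tracked)
print(stack.active())
```

The context-manager form guarantees that sources are popped even if processing raises an exception, which is the main reason to prefer it over manual append and pop calls.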

ob-util (Optional)

The ob-util package lives in rust-originblame/python/packages/ob-util/. It adds parsers, embedding reconciliation, and copyright export:

pip install ./rust-originblame/python/packages/ob-util
pip install "./rust-originblame/python/packages/ob-util[reconcile]"   # with torch + sentence-transformers (quoted so the shell does not expand the brackets)

ob parse --format mediawiki --input raw/wiki.xml --output parsed.jsonl

# Reconcile with local embedding model
ob reconcile data.jsonl --model all-MiniLM-L6-v2

# Reconcile via embedding API (e.g. LM Studio, vLLM)
ob reconcile data.jsonl -m nomic-embed-text-v1.5 -e http://localhost:1234/v1

# Compute embeddings for all lines (prepare for future reconcile)
ob reconcile data.jsonl -m nomic-embed-text-v1.5 -e http://localhost:1234/v1 --compute-all-embeddings

ob export-copyright --format dep5 --output debian/copyright
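The two reconcile passes (exact hash matching, then embedding similarity against a cosine threshold) can be sketched in a few lines. This is a toy model under stated assumptions: a bag-of-words vector stands in for a real embedding model, the function names are hypothetical, and the matching policy (each new line takes its best-scoring unmatched old line) is an illustrative choice rather than ob's documented behavior. Only the 0.85 default threshold comes from the CLI reference.

```python
import hashlib
import math
from collections import Counter


def line_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def reconcile(old_lines, new_lines, threshold=0.85):
    """Map each new line index to (old line index, match kind)."""
    old_by_hash = {}
    for i, line in enumerate(old_lines):
        old_by_hash.setdefault(line_hash(line), i)

    # Pass 1: exact hash matches.
    links = {}
    for j, line in enumerate(new_lines):
        i = old_by_hash.get(line_hash(line))
        if i is not None:
            links[j] = (i, "hash")

    # Pass 2: similarity search among old lines that Pass 1 left unmatched.
    matched_old = {i for i, _ in links.values()}
    unmatched_old = [i for i in range(len(old_lines)) if i not in matched_old]
    for j, line in enumerate(new_lines):
        if j in links:
            continue
        best_i, best_score = None, threshold
        for i in unmatched_old:
            score = cosine(toy_embed(line), toy_embed(old_lines[i]))
            if score >= best_score:
                best_i, best_score = i, score
        if best_i is not None:
            links[j] = (best_i, "semantic")
    return links


old = [
    "the quick brown fox jumps over the lazy dog",
    "hello world",
    "provenance tracks where data came from",
]
new = [
    "hello world",
    "the quick brown fox jumped over the lazy dog",
    "entirely new sentence here",
]
links = reconcile(old, new)
print(links)
```

In this example the unchanged line is re-linked by hash, the lightly edited line is re-linked by similarity, and the inserted line stays unlinked because nothing scores above the threshold.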

Paper

OriginBlame: Record- and Token-Level Data Provenance for AI Training Datasets

See paper/originblame.tex for the full paper (LaTeX source). Artifact at Zenodo. Repository archived at Zenodo.

Reproducibility

Paper compiles with pdflatex originblame.tex && bibtex originblame && pdflatex originblame.tex && pdflatex originblame.tex. Benchmark reproduction requires:

| Resource | Size | Source |
|---|---|---|
| zhwiki XML dump | ~2 GB | Wikimedia dumps (8 .7z files; URLs in benchmarks/README.md) |
| Qwen3-1.7B model | 3.8 GB | huggingface-cli download Qwen/Qwen3-1.7B --local-dir benchmarks/models/qwen3-1.7b |
| Linux kernel | ~4 GB | git clone https://mirrors.ustc.edu.cn/linux.git && git checkout e75a43c7cec459a07d91ed17de4de13ede2b7758 |
| Zhipu API key | - | Required for QA data generation in MU experiments (set ZHIPU_API_KEY in .env) |
| Embedding API | - | Required for semantic reconcile only: OpenAI-compatible API at http://localhost:1234/v1 (LM Studio / vLLM with nomic-embed-text-v1.5) |

All machine unlearning (MU) pipeline results are fully deterministic given the same QA data, seed (42), and model weights. Hash-only reconcile and all query benchmarks require no API keys. See benchmarks/README.md for full setup instructions.

Key Results

Evaluated on a Chinese Wikipedia dump (219,555 pages, 482,543 contributors) at four scales (1k–220k pages):

Revocation Precision (10k scale) — Line-level provenance eliminates over-deletion:

| Revoking Author | Share | Lines Removed (ob) | Over-deletion (dataset-level) |
|---|---|---|---|
| InternetArchiveBot | 79.5% | 7,953 | 1.3× |
| Walter Grassroot | 17.1% | 1,712 | 5.8× |
| KLBot2 | 5.0% | 499 | 20.0× |
| HuangQQ | 1.0% | 99 | 101.0× |
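The over-deletion factors in this table are consistent with a simple ratio: the number of lines a dataset-level deletion would remove divided by the number of lines the author actually owns. The sketch below reproduces them under the assumption that the dataset at the 10k scale holds 10,000 lines; that total is inferred from the scale label and is not stated explicitly in the table.

```python
TOTAL_LINES = 10_000  # assumed dataset size at the "10k" scale

# Lines attributed to each author, taken from the table above.
lines_removed = {
    "InternetArchiveBot": 7953,
    "Walter Grassroot": 1712,
    "KLBot2": 499,
    "HuangQQ": 99,
}


def over_deletion(total_lines: int, owned_lines: int) -> float:
    """Factor by which deleting the whole dataset exceeds the author's own lines."""
    return total_lines / owned_lines


factors = {author: round(over_deletion(TOTAL_LINES, owned), 1)
           for author, owned in lines_removed.items()}

for author, factor in factors.items():
    print(f"{author}: {factor:.1f}x")
```

The factor grows as the author's share shrinks, which is why dataset-level deletion is most wasteful for small contributors.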

Reconcile Recovery (after 10% edit + 5% delete + 5% insert mutation):

| Scale | Hash Match | Semantic Match | Recovery |
|---|---|---|---|
| 1k | 865 | 103 | 96.3% |
| 10k | 8,479 | 1,294 | 98.1% |
| 100k | 84,821 | 13,222 | 98.2% |
| 100k | 84,821 | - | 84.9%† |

†Hash-only (Pass 1). Semantic matching was not measured at these scales due to embedding API throughput constraints.

Scalability (3-run avg., ms; native implementation with mmap, rayon, binary index):

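| Scale | blame | show | show_idx | revoke | purge | purge_idx |
|---|---|---|---|---|---|---|
| 1k | 1 | 3 | 3 | <1 | 0.6 | 3 |
| 10k | 1 | 9 | 10 | <1 | 0.7 | 41 |
| 100k | 1 | 33 | 34 | <1 | 5.8 | 106 |
| 220k | 3 | 80 | 78 | <1 | 12 | 190 |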

†Synthetic benchmark. At 220k lines, blame, show, show_idx, revoke, and purge all complete in under 100 ms; purge_idx is the exception at 190 ms.

Storage overhead: decreases with scale from 0.32× at 1k lines to 0.22× at 220k lines. Line coverage: 100% at all scales.

Token-Level Streaming Benchmark — Real gpt2 tokenization on zhwiki data, no JSONL produced:

| Pages | Tokens | Datatrove Drop | HF Drop | Storage (Datatrove) | Query (ms) |
|---|---|---|---|---|---|
| 1k | 2.8M | −13.8% | −2.0% | 1.33× | 3 |
| 10k | 25.9M | −19.0% | −2.5% | 1.29× | 9 |
| 100k | 302.4M | −13.4% | −1.3% | 1.24× | 33 |
| 219,555 | 712.4M | −2.1% | −4.0% | 1.23× | 69 |

Machine Unlearning Evaluation: 8 unlearning experiments (2 forget set types × 2 algorithms × 2 authors) test whether ob's line-level provenance produces better forget sets than a random baseline. RMU is included as a known-limitation baseline (it is incompatible with QLoRA; see benchmarks/README.md for details). For NPO, line-level forget sets outperform the random baseline on all four metrics (PPL, ROUGE-L), which indicates that provenance-based localization directly affects unlearning quality.

Cross-Domain Generalization — Linux kernel source code with git blame attribution (3 scales):

Files Authors Datatrove Drop HF Drop Storage (Datatrove) Over-deletion (file vs record)
1,000 671 −25.7% −0.2% 1.06×
10,000 5,285 −40.9% −1.0% 1.02×
44,222 6,964 −2.5% 1.01× 1.3×

Attribution uses git blame (line-level authorship) on the top N C/H files from a deep clone of the Linux kernel repository, not git log commit authors. File-level deletion remains wasteful even with accurate attribution: at the smallest scale, revoking Linus Torvalds at file granularity would delete 9× more lines than necessary.

Development

# Rust implementation
cd rust-originblame && cargo test    # 71 tests

# Python tests (ob-util package only; core ob delegates to Rust)
cd python && pytest packages/ob-util/tests/    # 76 tests
ruff check src/

Repository Structure

This repository (originblame) contains the paper and benchmarks:

  • paper/ — LaTeX source for the paper
  • benchmarks/ — evaluation scripts and results

Source code lives in rust-originblame:

  • Rust native implementation (src/)
  • Python package (python/src/ob/) with optional Rust backend
  • Optional utilities (python/packages/ob-util/)

Roadmap

  • Rust native implementation — completed: independent rust-originblame repository with mmap, rayon parallelism, and binary index. All queries sub-100ms at 220k lines.
  • Token-level provenance — completed: independent token-index layer with streaming mode, binary forget-set generation, and framework integration (Datatrove, HuggingFace) with provenance tracking overhead as low as 1.3% at 100k scale.
  • Multi-level revocation — completed: three-level revocation model (author, section, line-hash) with lazy cascade and reversible tags.
  • Cross-domain evaluation — completed: Linux kernel source code with git blame attribution, demonstrating generalization beyond wiki-style text.
  • Machine unlearning validation — completed: experimental validation showing that line-level provenance produces better forget sets than random baselines (42% improvement in forgetting, 23% in utility preservation).

Target venue: EDBT 2027 (CCF-B).
