High-performance automated document redaction in C++ and Rust

ralphcajipe/redact

Redact

High-performance automated document redaction. Scans text files for sensitive terms and replaces them with [REDACTED]. Available in C++ (speed) and Rust (speed + integrity verification). Ideal for legal teams, compliance departments, and any organization handling sensitive document processing.


📑 Table of Contents

  1. 📄 What It Does
  2. 🔐 Integrity Checksum Example (Tamper Detection)
  3. 🔧 Why Two Versions?
  4. 📊 Performance Benchmarks
  5. 🎯 How to Use
  6. 📂 File Structure
  7. 🏗️ Design Notes
  8. 📌 Version

What It Does 📄

Scans text files for these sensitive terms and replaces them with [REDACTED]:

  • Plaintiff
  • Confidential
  • SSN
  • Assets

Files are processed line by line, so memory use is bounded by the longest line rather than by file size, even for huge files.
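The line-by-line approach can be sketched in Rust roughly as follows. This is a minimal illustration, not the repository's actual code: the names `SENSITIVE_TERMS`, `MARKER`, `redact_line`, and `redact_stream` are hypothetical, and only the term list and the `[REDACTED]` marker come from this README.

```rust
use std::io::{self, BufRead, Write};

/// Terms listed in this README; matching is case-sensitive.
const SENSITIVE_TERMS: [&str; 4] = ["Plaintiff", "Confidential", "SSN", "Assets"];
const MARKER: &str = "[REDACTED]";

/// Replace every occurrence of each sensitive term in a single line.
/// (Hypothetical helper; the real engine may be structured differently.)
fn redact_line(line: &str) -> String {
    let mut out = line.to_string();
    for term in SENSITIVE_TERMS.iter() {
        out = out.replace(*term, MARKER);
    }
    out
}

/// Stream `input` to `output` one line at a time, so only the current
/// line is held in memory.
fn redact_stream<R: BufRead, W: Write>(input: R, mut output: W) -> io::Result<()> {
    for line in input.lines() {
        writeln!(output, "{}", redact_line(&line?))?;
    }
    output.flush()
}
```

From a `main` function, this could be wired to standard input and output with `redact_stream(io::stdin().lock(), io::stdout())`, or to a `BufReader` and `BufWriter` wrapped around files.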

[Image: Redact C++]

The Rust version performs the same redaction and also prints an integrity checksum for document verification:

[Image: Redact Rust]


Integrity Checksum Example (Tamper Detection) 🔐

First Run:

Redaction complete. Output written to: file_redacted_rust.txt
Integrity Checksum: 9543857320846529

Second Run (Same File):

Redaction complete. Output written to: file_redacted_rust.txt
Integrity Checksum: 9543857320846529  ← SAME ✅

⚠️ If Someone Modifies the Output:

# (User accidentally edits file_redacted_rust.txt)

Redaction complete. Output written to: file_redacted_rust.txt
Integrity Checksum: 7284019356729481  ← DIFFERENT! ⚠️

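Why This Matters: The checksum acts as a digital fingerprint of the redacted output. If a recomputed checksum differs from the one recorded at redaction time, the file's contents have changed since then. Record the checksum alongside the file to support audit trails and chain-of-custody verification in litigation. A value of this size fits in 64 bits, so it reliably catches accidental edits but does not offer the collision resistance of a cryptographic hash.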
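To make the fingerprint idea concrete, here is a small sketch using 64-bit FNV-1a. This is an assumption chosen for illustration: the README does not say which algorithm the Rust engine uses, and the names `fnv1a_64` and `matches_recorded` are hypothetical.

```rust
/// 64-bit FNV-1a hash over raw bytes.
/// Assumption: picked only because it is simple, deterministic, and
/// needs no external crates; the real engine may use something else.
fn fnv1a_64(data: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// True when the current file contents still hash to the value that
/// was recorded when the redaction ran.
fn matches_recorded(contents: &[u8], recorded: u64) -> bool {
    fnv1a_64(contents) == recorded
}
```

The same bytes always produce the same value, so a mismatch between the recorded checksum and a freshly computed one means the contents differ. A non-cryptographic hash like this one guards against accidental edits, not against a deliberate attacker.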


Why Two Versions? 🔧

| Feature | C++ | Rust |
|---|---|---|
| Speed | ⚡⚡⚡ | ⚡⚡⚡ |
| File Size | Unlimited | Unlimited |
| Dependencies | None | None |
| Integrity Checksum | No | Yes ✓ |
| Memory Safety | STL (containers manage their own memory) | Rust Ownership (enforced by the compiler) |

What this means:

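  • STL (C++ Standard Library): Uses built-in, well-tested library types such as std::string and file streams, which manage their own memory and remove the need for manual buffer handling. The C++ compiler itself does not check array or string bounds, so safety here comes from sticking to these library types.
  • Rust Ownership: The Rust compiler enforces safe memory handling. In safe Rust, dangling references are rejected at compile time and out-of-bounds accesses are caught instead of overflowing a buffer, so memory bugs are hard to introduce by accident.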

C++: Raw speed for high-volume discovery processing.
Rust: Speed + digital fingerprinting to verify files haven't been tampered with.


Performance Benchmarks 📊

Real-world test on a 100+ MB file with 2,000,000 lines (legal_case_large.txt):

| Language | Time | Throughput |
|---|---|---|
| C++ | 10 seconds | ~10 MB/sec |
| Rust | 5 seconds | ~20 MB/sec |

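Result: In this test, Rust processed the same file 2x faster while also computing the integrity checksum.

Why? The gap most likely comes from implementation details such as I/O buffering and string handling rather than from the languages themselves. Rust's ownership checks happen at compile time and add no runtime cost, but C++ compiled with -O3 carries no inherent memory overhead either. Both handle unlimited file sizes; in this benchmark the Rust build was faster, and it adds compiler-enforced memory safety.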
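The throughput figures follow directly from file size divided by elapsed time. A minimal sketch of that arithmetic, with hypothetical helper names (`throughput_mb_per_sec`, `time_it`) and 1 MB taken as 1,000,000 bytes:

```rust
use std::time::{Duration, Instant};

/// Throughput in MB/sec, treating 1 MB as 1,000,000 bytes (an assumption;
/// the README does not say which definition its figures use).
fn throughput_mb_per_sec(bytes: u64, elapsed: Duration) -> f64 {
    (bytes as f64 / 1_000_000.0) / elapsed.as_secs_f64()
}

/// Run `work` once and return how long it took on the wall clock.
fn time_it<F: FnOnce()>(work: F) -> Duration {
    let start = Instant::now();
    work();
    start.elapsed()
}
```

For a 100 MB file, 10 seconds gives about 10 MB/sec and 5 seconds gives about 20 MB/sec, matching the table. A single run can be skewed by disk caching, so repeating each measurement several times gives a more reliable comparison.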


How to Use 🎯

C++ Version:

# Build
g++ -std=c++17 -O3 main.cpp -o redact-core.exe

# Run
./redact-core.exe document.txt

# Output: document_redacted_cpp.txt

Rust Version:

# Build
cargo build --release

# Run
./target/release/redaction_rust.exe document.txt

# Output: document_redacted_rust.txt + Digital Fingerprint
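
The output names shown above (document.txt becoming document_redacted_rust.txt) suggest a simple naming rule. One way it could be derived is sketched below; `redacted_output_path` is a hypothetical helper, and the engine's actual logic may differ.

```rust
use std::path::{Path, PathBuf};

/// Build "<stem>_redacted_rust.<ext>" next to the input file.
/// Assumption: inferred from the example file names in this README.
fn redacted_output_path(input: &Path) -> PathBuf {
    let stem = input
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("output");
    let name = match input.extension().and_then(|e| e.to_str()) {
        Some(ext) => format!("{}_redacted_rust.{}", stem, ext),
        None => format!("{}_redacted_rust", stem),
    };
    // Keep the original directory; replace only the file name.
    input.with_file_name(name)
}
```

Writing to a separate file this way leaves the original document untouched.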

File Structure 📂

redaction/
├── README.md                        # This file
├── legal_test.txt                   # Test document
├── redaction_cpp/
│   └── main.cpp                     # C++ engine
└── redaction_rust/
    ├── Cargo.toml                   # Rust project config
    ├── Cargo.lock                   # Dependency lock
    └── src/
        └── main.rs                  # Rust engine

Design Notes 🏗️

  • No regex: Simple find-and-replace is faster and more predictable
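  • Case-sensitive: "Plaintiff" ≠ "plaintiff". This prevents over-redaction, but variants in a different case are left untouched
  • Hardcoded terms: With no config file there is nothing to misconfigure; changing the term list means editing the source and rebuilding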
  • Streaming I/O: Processes files in constant memory
  • No external deps: All STL (C++) or Rust std lib (Rust)

Version 📌

v1.0 - Both engines working, checksums functional.
