A privacy-focused document redaction tool that removes sensitive information from PDFs, CSVs, Excel files, and text documents. Protects payer anonymity in bank statements while preserving financial amounts.
# Install script
curl -fsSL https://raw.githubusercontent.com/sethsaler/redactor/main/install.sh | bash
# Or install with pip
pip install git+https://github.com/sethsaler/redactor.git
# Or install from source
git clone https://github.com/sethsaler/redactor.git
cd redactor
pip install -r requirements.txt

Note: For OCR support (scanned PDFs), install Tesseract:
- macOS: `brew install tesseract`
- Ubuntu/Debian: `sudo apt-get install tesseract-ocr`
- Windows: Download from GitHub releases
Run without arguments to launch the graphical interface:
python redact_ssns.py
The GUI provides:
- File/folder selection with multi-select
- Three redaction modes with descriptions
- Progress bar and real-time log
- Output directory selection
- Cancel button to stop processing
- Remembers your last settings
Use command-line arguments for scripting and automation:
# SSN Mode (Default)
python redact_ssns.py statement.pdf
python redact_ssns.py data.csv -v
# Bank Statement Mode - preserves $ amounts
python redact_ssns.py bank_statement.pdf --mode bank
python redact_ssns.py transactions.csv --mode bank -o ./redacted/
# Comprehensive Mode
python redact_ssns.py document.pdf --mode all
python redact_ssns.py ./documents/ --mode all --recursive
# Excel and batch processing
python redact_ssns.py report.xlsx --mode bank
python redact_ssns.py *.csv --mode all -v

| Feature | Description |
|---|---|
| Multiple Formats | PDF, CSV, TXT, XLSX support |
| GUI Interface | tkinter-based desktop app (default when run without args) |
| Three Redaction Modes | ssn, bank, all |
| OCR Support | Handles scanned/image-based PDFs via Tesseract |
| Currency Preservation | Dollar amounts like $1,234.56 are kept in bank mode |
| Multi-Currency | Supports $, €, £, ¥, ₹, ₩, ¢ |
| Batch Processing | Process entire directories recursively |
| Output Options | Custom output directory or auto-naming |
| Progress Tracking | Real-time progress bar and log in GUI mode |
SSN mode (`--mode ssn`):
- Social Security Numbers: `123-45-6789` and `123456789`
- Validates against SSA rules (invalid area/group/serial numbers)
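The SSN check can be sketched as follows. This is a minimal illustration, assuming the well-known SSA structural rules (area `000`, `666`, and `900`-`999` are never issued; group `00` and serial `0000` are invalid). The names `SSN_PATTERN`, `is_plausible_ssn`, and `redact_ssns` are hypothetical, and the tool's actual pattern and rule set may differ.

```python
import re

# Hypothetical pattern: 123-45-6789 or 123456789, with consistent separators
# and not embedded in a longer run of digits or dashes.
SSN_PATTERN = re.compile(
    r"(?<![\d-])(?P<area>\d{3})(?P<sep>-?)(?P<group>\d{2})(?P=sep)(?P<serial>\d{4})(?![\d-])"
)


def is_plausible_ssn(area: str, group: str, serial: str) -> bool:
    """Return False for numbers the SSA structural rules rule out."""
    if area in ("000", "666") or int(area) >= 900:
        return False
    if group == "00" or serial == "0000":
        return False
    return True


def redact_ssns(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace plausible SSNs with the placeholder; leave invalid candidates alone."""

    def replace(match: re.Match) -> str:
        if is_plausible_ssn(match["area"], match["group"], match["serial"]):
            return placeholder
        return match.group(0)

    return SSN_PATTERN.sub(replace, text)
```

With this sketch, `redact_ssns("SSN: 123-45-6789")` returns `"SSN: [REDACTED]"`, while a candidate such as `000-12-3456` is left unchanged because its area number is invalid.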
Bank mode (`--mode bank`):
- Numbers: Account numbers, routing numbers, check numbers, card numbers (4+ digits)
- Emails: All email addresses
- Phone Numbers: US, international, and local formats
- Names: Capitalized person names (heuristic detection)
- Addresses: Street addresses, ZIP codes
Excluded from redaction:
- Currency amounts preceded by `$`, `€`, `£`, `¥`, etc.
- Dates
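One way to keep currency amounts and dates while redacting other long numbers is to match all three in a single pass and replace only the plain numbers. The sketch below is an assumption about how such a check could work, not the tool's actual implementation. The regexes, the two date formats (`YYYY-MM-DD` and `M/D/YYYY`), and the name `redact_numbers` are hypothetical.

```python
import re

# Currency symbols listed in the feature table.
CURRENCY_SYMBOLS = "$€£¥₹₩¢"

# Assumed shapes for the tokens that must survive redaction.
AMOUNT = rf"[{re.escape(CURRENCY_SYMBOLS)}]\s?\d[\d,]*(?:\.\d+)?"
DATE = r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{2,4}\b"
# Any standalone run of four or more digits.
NUMBER = r"(?<!\d)\d{4,}(?!\d)"

# Amounts and dates are tried first, so they consume their digits
# before the generic number rule can see them.
TOKEN = re.compile(f"(?P<amount>{AMOUNT})|(?P<date>{DATE})|(?P<number>{NUMBER})")


def redact_numbers(text: str, placeholder: str = "[REDACTED]") -> str:
    """Redact 4+ digit numbers, keeping currency amounts and dates intact."""

    def replace(match: re.Match) -> str:
        if match.lastgroup == "number":
            return placeholder
        return match.group(0)  # amount or date: keep as-is

    return TOKEN.sub(replace, text)
```

For example, `redact_numbers("Acct 12345678 paid $1,234.56 on 2024-01-15")` returns `"Acct [REDACTED] paid $1,234.56 on 2024-01-15"`.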
All mode (`--mode all`) combines both SSN and bank pattern redaction.
usage: redact_ssns.py [-h] [-r] [-v] [-o OUTPUT_DIR] [-m {ssn,bank,all}] [--gui] [path]

Redact sensitive information from documents (PDF, CSV, TXT, XLSX).

positional arguments:
  path                  File or directory to process (PDF, CSV, TXT, XLSX)
                        (optional - launches GUI if not provided)

options:
  -h, --help            show this help message and exit
  -r, --recursive       Recurse into subdirectories
  -v, --verbose         Log each item found
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory
  -m {ssn,bank,all}, --mode {ssn,bank,all}
                        Redaction mode (default: ssn)
  --gui                 Launch GUI mode (default when run without args)

Modes:
  ssn   - Redact Social Security Numbers only (default)
  bank  - Redact numbers (except currency amounts), emails, phones, names, addresses
  all   - Redact both SSNs and bank patterns

Notes:
  - Running without arguments launches the GUI
  - Use --gui flag to force GUI mode even with other arguments
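
The help text above is consistent with an `argparse` parser along the following lines. This is a reconstruction from the usage output only, not the tool's source; `build_parser` and `wants_gui` are hypothetical names, and the real script may structure its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the options shown in the help text."""
    parser = argparse.ArgumentParser(
        prog="redact_ssns.py",
        description="Redact sensitive information from documents (PDF, CSV, TXT, XLSX).",
    )
    # Optional positional: omitting it is what triggers GUI mode.
    parser.add_argument("path", nargs="?", help="File or directory to process")
    parser.add_argument("-r", "--recursive", action="store_true",
                        help="Recurse into subdirectories")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Log each item found")
    parser.add_argument("-o", "--output-dir", help="Output directory")
    parser.add_argument("-m", "--mode", choices=["ssn", "bank", "all"],
                        default="ssn", help="Redaction mode (default: ssn)")
    parser.add_argument("--gui", action="store_true", help="Launch GUI mode")
    return parser


def wants_gui(args: argparse.Namespace) -> bool:
    """GUI mode applies when --gui is given or no path was supplied."""
    return args.gui or args.path is None
```

Under this sketch, `build_parser().parse_args([])` yields a namespace for which `wants_gui` is true, matching the note that running without arguments launches the GUI.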
# Single file
python redact_ssns.py tax_return.pdf
# Verbose output
python redact_ssns.py form.pdf -v
# Output to specific directory
python redact_ssns.py *.pdf -o ./redacted/
# Redact account numbers but keep $ amounts
python redact_ssns.py statement.pdf --mode bank
# Process all CSVs in directory
python redact_ssns.py ./exports/ --mode bank -r
# Redact Excel workbook
python redact_ssns.py report.xlsx --mode bank
# Batch process
python redact_ssns.py ./spreadsheets/ --mode all -r -v

Redacted files are saved with a `_redacted` suffix:
- `statement.pdf` → `statement_redacted.pdf`
- `data.csv` → `data_redacted.csv`
- `report.xlsx` → `report_redacted.xlsx`
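
The naming rule above can be expressed with `pathlib`. This is a small sketch: `redacted_path` is a hypothetical helper, and it assumes the `_redacted` suffix is also applied when a custom output directory is given, which the examples above do not confirm.

```python
from pathlib import Path
from typing import Optional


def redacted_path(source: Path, output_dir: Optional[Path] = None) -> Path:
    """Return the output path: same name with `_redacted` before the extension."""
    # Default to writing next to the original file.
    target_dir = output_dir if output_dir is not None else source.parent
    return target_dir / f"{source.stem}_redacted{source.suffix}"
```

For instance, `redacted_path(Path("statement.pdf"))` gives `Path("statement_redacted.pdf")`.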
| Package | Purpose |
|---|---|
| PyMuPDF | PDF text extraction and redaction |
| Pillow | Image processing |
| pytesseract | OCR for scanned PDFs |
| pdf2image | PDF to image conversion |
| openpyxl | Excel (.xlsx) support |
- File Detection: Determines file type from extension
- Pattern Matching: Uses regex to identify sensitive data
- Currency Check: For bank mode, checks if numbers are preceded by currency symbols
- Redaction:
  - PDFs: Black rectangle annotations
  - Text/CSV: `[REDACTED]` replacement
  - Excel: Cell value replacement
- Output: Saves redacted copy preserving original
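
The steps above can be sketched for the text/CSV path using only the standard library. This is an illustrative outline, not the tool's code: `SUPPORTED`, `detect_kind`, `redact_text`, and `redact_text_file` are hypothetical names, and the PDF and Excel paths (which rely on PyMuPDF and openpyxl) are omitted.

```python
import re
from pathlib import Path
from typing import Iterable

# Assumed mapping from file extension to handler kind.
SUPPORTED = {".pdf": "pdf", ".csv": "text", ".txt": "text", ".xlsx": "excel"}


def detect_kind(path: Path) -> str:
    """File detection: choose a handler from the (case-insensitive) extension."""
    kind = SUPPORTED.get(path.suffix.lower())
    if kind is None:
        raise ValueError(f"Unsupported file type: {path.suffix!r}")
    return kind


def redact_text(text: str, patterns: Iterable[re.Pattern],
                placeholder: str = "[REDACTED]") -> str:
    """Pattern matching and redaction: replace every match with the placeholder."""
    for pattern in patterns:
        text = pattern.sub(placeholder, text)
    return text


def redact_text_file(source: Path, target: Path,
                     patterns: Iterable[re.Pattern]) -> None:
    """Output: write a redacted copy to `target`; `source` is only read."""
    original = source.read_text(encoding="utf-8")
    target.write_text(redact_text(original, patterns), encoding="utf-8")
```

Writing to a separate `target` path is what keeps the original file untouched.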
- Original files are never modified
- No data is transmitted or stored externally
- All processing is local
- No cloud dependencies
MIT License - see LICENSE file for details.
Issues and pull requests welcome! Please ensure your code follows the existing patterns and includes appropriate tests.
- Tesseract OCR for scanned document support
- PyMuPDF for PDF handling
- openpyxl for Excel support