rskrny/docproc-pdfcompressor

docproc — Local-First Document & Image Processing Toolkit

A modular, privacy-first toolkit for everyday document and image tasks. Compress, merge, split, rotate, convert — all offline, no uploads, no subscriptions.

Now with AI chat assistant — describe what you want in plain English and the system builds + executes a processing plan automatically. Runs locally on your machine, opens in your default browser. Your files never leave your computer.

Web UI (GUI)

# Install with GUI support
pip install -e ".[gui]"

# Launch — opens in your browser automatically
docproc-gui

# Or with options
docproc-gui --port 8080        # custom port
docproc-gui --no-browser       # don't auto-open browser
python -m docproc.web.app      # alternative launch

The web UI provides drag-and-drop file upload, automatic parameter forms for every pipeline, real-time processing with progress feedback, and instant download. Your files never leave your computer — everything runs on localhost.
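The architecture listing further down describes the launcher as doing a "port scan" before opening the browser. A minimal sketch of that idea, assuming a loopback bind test; `find_free_port` is a hypothetical helper, and using 5111 as the starting port is an assumption taken from the curl examples in this README, not from the launcher source:

```python
import socket


def find_free_port(start: int = 5111, attempts: int = 20, host: str = "127.0.0.1") -> int:
    """Return the first port in [start, start + attempts) that can be bound on host.

    Hypothetical helper: the real launcher may scan differently.
    """
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is already in use, try the next one
            return port
    raise RuntimeError(f"no free port in {start}..{start + attempts - 1}")
```

A launcher could then pass the result to its web server and, unless `--no-browser` is given, call the standard library's `webbrowser.open(f"http://127.0.0.1:{port}")`.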

What this does

PDF operations:

  • Compress — shrink scanned PDFs to target sizes while maximizing quality
  • Merge — combine multiple PDFs into one document
  • Split — extract pages, split every N pages, or split each page separately
  • Rotate — rotate pages by 90°, 180°, or 270°
  • Pages — delete, keep, reorder, or reverse pages
  • Protect — add or remove password protection (AES-256)
  • Metadata — view, edit, or strip PDF metadata fields
  • Stamp — overlay text or image stamps on pages
  • Page numbers — add page numbers in multiple formats
  • PDF → Images — export pages as PNG or JPEG
  • Images → PDF — combine images into a single PDF

Image operations:

  • Resize — scale, fit, crop, or pad images to target dimensions
  • Convert — convert between PNG, JPEG, and WebP with quality control
  • Compress — reduce file size without changing dimensions (target size mode)
  • Crop — crop by pixel coordinates, percentage margins, or auto-detect content
  • EXIF — view or strip EXIF/GPS metadata from images
  • Background removal — AI model-based portrait segmentation (rembg)
  • Watermark removal — brightness-threshold detection + OpenCV inpainting

Infrastructure:

  • Recipe system — chain pipelines into named multi-step workflows (YAML/JSON)
  • RAG chat assistant — describe tasks in natural language, auto-generates execution plans
  • RAG document engine — upload any docs (PDF, DOCX, HTML, MD, CSV, images), get precise retrieval
  • GitHub device-flow OAuth for AI features (GitHub Models API)
  • Structured job logging for every run (JSONL)
  • Zero-boilerplate extensibility — new pipelines auto-register into CLI
  • 248 automated tests with pytest
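As an illustration of the JSONL job log mentioned above, here is a minimal sketch that appends one JSON object per line and reads them back. The field names (`ts`, `pipeline`, `input`, `output_bytes`, `ok`) and both function names are assumptions for illustration; the actual schema lives in `job_log.py` and may differ.

```python
import json
import time
from pathlib import Path


def append_job(log_path: Path, pipeline: str, input_file: str, output_bytes: int, ok: bool) -> dict:
    """Append one job record as a single JSON line (hypothetical schema)."""
    record = {
        "ts": time.time(),
        "pipeline": pipeline,
        "input": input_file,
        "output_bytes": output_bytes,
        "ok": ok,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def read_jobs(log_path: Path) -> list[dict]:
    """Load every record from the log, skipping blank lines."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each line is an independent JSON document, the log can be appended to without rewriting the file, and commands such as `docproc history` or `docproc stats` could be built on a reader like `read_jobs`.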

Quick start

# Install (editable, with all extras)
pip install -e ".[all,dev]"

# --- PDF operations ---
docproc compress report.pdf                          # compress to 1MB default
docproc compress report.pdf --target 2MB             # custom target size
docproc merge doc1.pdf doc2.pdf doc3.pdf -o combined.pdf
docproc split report.pdf --pages 1-5                 # extract pages 1-5
docproc split report.pdf --each                      # each page → own file
docproc split report.pdf --every 10                  # split every 10 pages
docproc rotate scan.pdf --angle 90                   # rotate all pages 90°
docproc rotate scan.pdf --angle 180 --pages 3,5      # rotate specific pages
docproc pages report.pdf --delete 1,5                # delete pages 1 and 5
docproc pages report.pdf --keep 2-4                  # keep only pages 2-4
docproc pages report.pdf --reverse                   # reverse page order
docproc pdf2img report.pdf                           # export as PNGs
docproc pdf2img report.pdf --format jpg --dpi 300    # high-res JPEGs
docproc img2pdf photo1.png photo2.jpg photo3.png     # combine into PDF

# --- New PDF operations ---
docproc protect doc.pdf --user-pass secret           # encrypt with password
docproc protect doc.pdf --unlock --user-pass secret  # remove protection
docproc metadata doc.pdf --view                      # show PDF metadata
docproc metadata doc.pdf --title "My Doc" --author "Me"
docproc stamp doc.pdf --text "DRAFT" --position center --opacity 0.3
docproc stamp doc.pdf --image logo.png --position bottom-right
docproc pagenums doc.pdf --format page-n-of-m        # "Page 1 of 10"

# --- Image operations ---
docproc resize photo.png --width 800
docproc resize photo.png --width 600 --height 600 --fit cover

# Convert formats
docproc convert photo.png --format webp --quality 85

# Compress images (keeps dimensions, reduces file size)
docproc imgcompress photo.jpg --quality 70
docproc imgcompress photo.jpg --target 500KB

# Crop images
docproc crop photo.png --box "100,50,900,700"        # pixel coordinates
docproc crop photo.png --margin 10                    # trim 10% from each edge
docproc crop photo.png --auto                         # auto-detect content bounds

# View/strip EXIF metadata
docproc exif photo.jpg --view                         # show all EXIF tags
docproc exif photo.jpg --strip                        # remove all metadata
docproc exif photo.jpg --gps                          # GPS coordinates only

# Remove watermarks
docproc watermark photo.png
docproc watermark photo.png --preview --corner all

# Remove background (requires the Python 3.10 environment, .venv310)
docproc background portrait.jpg

# Run a recipe (chained pipelines)
docproc recipe web-optimize photo.png
docproc recipe passport-photo portrait.jpg
docproc recipe email-ready-pdf report.pdf
docproc recipe social-media-image photo.png

# Introspection
docproc pipelines          # list all pipelines + recipes
docproc recipes            # list recipes with step details
docproc history            # recent job log
docproc stats              # aggregate statistics
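The `docproc recipe` commands above run several pipelines in sequence. The sketch below shows the general shape of such chaining: each step's output becomes the next step's input. Everything in it is hypothetical: the recipe name and schema, the stand-in pipeline functions, and `run_recipe` itself. The real recipe format is defined by `recipes.py`.

```python
import json
from typing import Callable

# Hypothetical recipe document; not one of the shipped presets.
RECIPE_JSON = """
{
  "name": "example-recipe",
  "steps": [
    {"pipeline": "resize", "params": {"width": 800}},
    {"pipeline": "convert", "params": {"format": "webp"}}
  ]
}
"""

Step = Callable[[str, dict], str]


def fake_resize(path: str, params: dict) -> str:
    # Stand-in for a real pipeline: only derives the output filename.
    stem, ext = path.rsplit(".", 1)
    return f"{stem}_w{params['width']}.{ext}"


def fake_convert(path: str, params: dict) -> str:
    # Stand-in for a real pipeline: only swaps the file extension.
    stem, _ = path.rsplit(".", 1)
    return f"{stem}.{params['format']}"


def run_recipe(recipe: dict, pipelines: dict[str, Step], input_path: str) -> str:
    """Run each step in order, feeding the previous output into the next step."""
    current = input_path
    for step in recipe["steps"]:
        name = step["pipeline"]
        if name not in pipelines:
            raise KeyError(f"unknown pipeline: {name}")
        current = pipelines[name](current, step.get("params", {}))
    return current
```

With the stand-ins registered as `{"resize": fake_resize, "convert": fake_convert}`, running the example recipe on `photo.png` yields `photo_w800.webp`.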

Architecture

docproc/                         # Python package v0.5.0
    __init__.py                  # Version string
    __main__.py                  # python -m docproc support
    cli.py                       # Auto-generated CLI from Param descriptors
    registry.py                  # @register decorator + get_pipeline()/list_pipelines()
    recipes.py                   # YAML/JSON multi-step workflow loader
    job_log.py                   # Structured JSONL job log
    exceptions.py                # Custom exception hierarchy
    utils.py                     # format_size, parse_size, parse_page_range, etc.

    pipelines/                   # 18 registered pipelines
        base.py                  # Pipeline ABC + PipelineResult + Param descriptor
        pdf_compress.py          # DPI/quality ladder compression
        pdf_merge.py             # Multi-input PDF merge
        pdf_split.py             # Split by pages/every-N/each
        pdf_to_images.py         # PDF → PNG/JPEG
        images_to_pdf.py         # Multiple images → single PDF
        pdf_rotate.py            # Rotate pages 90°/180°/270°
        pdf_pages.py             # Delete/keep/reorder/reverse pages
        pdf_protect.py           # PDF password encrypt/unlock
        pdf_metadata.py          # View/edit/strip PDF metadata
        pdf_stamp.py             # Text or image stamp overlay
        pdf_pagenums.py          # Add page numbers
        resize.py                # Scale/fit/crop/pad
        convert.py               # Format conversion (PNG/JPEG/WebP/BMP/TIFF/GIF)
        image_compress.py        # Quality reduction + target-size mode
        crop.py                  # Crop by coords/margins/auto-detect
        exif.py                  # View/strip EXIF metadata
        watermark.py             # Brightness detection + OpenCV inpainting
        background.py            # rembg AI segmentation

    rag/                         # RAG engine + chat system
        parsers.py               # Parse PDF, DOCX, HTML, MD, CSV, TXT, images (OCR)
        chunking.py              # Fixed / semantic / hybrid chunking strategies
        storage.py               # SQLite backend: datasets, documents, chunks, BM25 index
        engine.py                # Orchestrator: ingest, query, dataset CRUD
        eval.py                  # IR evaluation: Precision@K, Recall@K, MRR, nDCG
        retriever.py             # BM25 + TF-IDF hybrid retriever, RRF, MMR
        embeddings.py            # API-based embeddings with disk cache
        knowledge.py             # Auto-gen pipeline docs + practices + workflows
        llm.py                   # GitHub Models API client (stdlib urllib)
        planner.py               # Parse → validate → execute plans + quality retry
        chat.py                  # ChatEngine: retrieve → augment → LLM → execute
        context.py               # Query intent classification + context assembly
        entities.py              # Zero-LLM entity extraction + conflict detection
        auth.py                  # Token storage + GitHub device-flow OAuth

    web/                         # Flask SPA
        app.py                   # Desktop launcher (port scan, browser open)
        api.py                   # 28 REST endpoints
        static/
            index.html           # Three-mode SPA: Tools wizard + Chat + Knowledge
            style.css            # Responsive design system
            app.js               # Client-side state management

tests/                           # 248 passing tests (pytest)
    conftest.py                  # Temp-dir fixtures, sample generators
    test_registry.py             # Registration + discovery (6)
    test_pdf_pipelines.py        # PDF pipelines (17)
    test_image_pipelines.py      # Image pipelines (13)
    test_utils.py                # Utility functions (24)
    test_rag.py                  # RAG retrieval system (48)
    test_rag_engine.py           # RAG engine: parsers, chunking, storage, engine, eval (51)
    test_new_pipelines.py        # New pipelines + tech debt (46)
    test_context_engineering.py  # Context engineering: intents, entities, conflicts (63)

recipes/                         # 8 named workflow presets
docs/                            # Design docs and development notes
pyproject.toml                   # Package metadata + entry points
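The listing names `parse_size` and `parse_page_range` in `utils.py` without showing them. A rough sketch of what such helpers could look like follows; the signatures and error handling are assumptions. Binary units (1 KB = 1024 bytes) are chosen because the plan example in this README maps "500KB" to 512000 bytes.

```python
import re

_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a string such as '2MB' or '500kb' to a byte count (binary units)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]?B)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a 1-based spec such as '1-5' or '3,5' into a list of page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"page range out of bounds: {part!r}")
        pages.extend(range(start, end + 1))
    return pages
```

Helpers like these would back flags such as `--target 2MB` and `--pages 3,5` from the quick start.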

AI Chat Assistant

The Chat tab provides a conversational interface powered by RAG (Retrieval-Augmented Generation):

  1. Sign in — Settings → "Sign in with GitHub" (device flow) or paste a PAT
  2. Ask — "Compress my PDF to under 1MB" or "Convert these images to WebP"
  3. Upload — Drag files onto the chat or click the attach button
  4. Download — Processed files appear as inline download links

The system retrieves relevant pipeline docs via hybrid BM25 + TF-IDF search (with optional API embeddings), augments the LLM prompt, generates a JSON execution plan, validates it against the pipeline registry (checking names, params, choices, dependencies), and executes deterministically. If a size target isn't met, it binary-searches the quality parameter automatically.
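The binary search over the quality parameter can be sketched as follows. This is an illustration under stated assumptions, not the planner's actual code: `search_quality` and `fake_encode` are hypothetical names, the quality bounds 10 to 95 are arbitrary, and the search assumes output size never shrinks as quality rises.

```python
from typing import Callable, Optional, Tuple


def search_quality(
    encode: Callable[[int], bytes],
    max_bytes: int,
    lo: int = 10,
    hi: int = 95,
) -> Optional[Tuple[int, bytes]]:
    """Return (quality, data) for the highest quality that fits max_bytes, or None.

    Assumes len(encode(q)) is non-decreasing in q.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        data = encode(mid)
        if len(data) <= max_bytes:
            best = (mid, data)  # fits: remember it and try a higher quality
            lo = mid + 1
        else:
            hi = mid - 1  # too large: try a lower quality
    return best


def fake_encode(quality: int) -> bytes:
    # Stand-in encoder whose output grows linearly with quality.
    return b"x" * (quality * 10_000)
```

With the stand-in encoder and a 512000-byte budget, the search settles on quality 51 after a handful of encodes rather than trying every value. A real encoder would be any function that re-renders the file at a given quality.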

RAG Document Engine

Upload any documents (PDF, DOCX, HTML, Markdown, CSV, plain text, or images with OCR) and query them with high-precision hybrid retrieval:

# Create a dataset and ingest documents via API
curl -X POST localhost:5111/api/rag/datasets -H 'Content-Type: application/json' -d '{"name": "my-docs"}'
curl -X POST localhost:5111/api/rag/datasets/<id>/ingest -F file=@guide.pdf -F file=@faq.md
curl -X POST localhost:5111/api/rag/datasets/<id>/query -H 'Content-Type: application/json' -d '{"query": "how to reset password"}'

Documents are parsed, chunked (semantic section-aware splitting), and indexed for BM25 + TF-IDF hybrid retrieval. All data persists in SQLite at ~/.docproc/rag/. Duplicate files are automatically detected by content hash.
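The architecture listing mentions RRF (reciprocal rank fusion) in `retriever.py` as part of the hybrid retriever. One common way to merge a BM25 ranking with a TF-IDF ranking is sketched below; the function name and the constant `k = 60` are assumptions, and the real retriever may weight or combine results differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, fusing `["a", "b", "c"]` with `["b", "c", "a"]` ranks `b` first, because it sits near the top of both lists, followed by `a` and then `c`. Fusion works on ranks only, so the two retrievers' raw scores never need to be put on a common scale.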

Example of the chat assistant's flow for a single request:

User: "Compress this to under 500KB"
 → TF-IDF retrieves: compress pipeline docs + email compression tips
 → LLM generates: {"plan": [{"pipeline": "compress", "params": {"target": "500KB"}, "verify": {"max_size_bytes": 512000}}]}
 → Planner validates: pipeline exists ✓, params valid ✓, deps installed ✓
 → Executor runs: compress pipeline → checks size → binary-search retry if needed
 → User gets: download link + size metrics
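The validation step in the flow above can be illustrated with a small checker that accepts the JSON plan shown and rejects unknown pipelines or parameters. This is a sketch only: `REGISTRY` is a hypothetical stand-in for the real pipeline registry, listing just a few flags that appear in this README, and the real planner also checks choices and dependencies.

```python
import json

# Hypothetical registry: pipeline name -> allowed parameter names.
REGISTRY = {
    "compress": {"target"},
    "convert": {"format", "quality"},
}


def validate_plan(raw: str, registry: dict[str, set[str]]) -> list[str]:
    """Return a list of problems; an empty list means the plan passed."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    steps = doc.get("plan") if isinstance(doc, dict) else None
    if not isinstance(steps, list) or not steps:
        return ["plan must be a non-empty list under the 'plan' key"]
    problems: list[str] = []
    for index, step in enumerate(steps):
        name = step.get("pipeline") if isinstance(step, dict) else None
        if not isinstance(name, str) or name not in registry:
            problems.append(f"step {index}: unknown pipeline {name!r}")
            continue
        unknown = set(step.get("params", {})) - registry[name]
        for param in sorted(unknown):
            problems.append(f"step {index}: unknown param {param!r} for {name}")
    return problems
```

Validating before execution means a malformed or hallucinated plan is rejected with a readable message instead of failing partway through a run.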

Current pipelines

Pipeline           | CLI command         | Category | Use case
-------------------|---------------------|----------|-----------------------------------------------
PDF compress       | docproc compress    | pdf      | Scanned PDF compression via DPI/quality ladder
PDF merge          | docproc merge       | pdf      | Combine multiple PDFs into one
PDF split          | docproc split       | pdf      | Extract pages or split into chunks
PDF rotate         | docproc rotate      | pdf      | Rotate pages by 90°/180°/270°
PDF pages          | docproc pages       | pdf      | Delete, keep, reorder, reverse pages
PDF protect        | docproc protect     | pdf      | Add/remove password protection
PDF metadata       | docproc metadata    | pdf      | View, edit, or strip PDF metadata
PDF stamp          | docproc stamp       | pdf      | Overlay text or image stamps
PDF page numbers   | docproc pagenums    | pdf      | Add page numbers (arabic, roman)
PDF → Images       | docproc pdf2img     | pdf      | Export pages as PNG or JPEG
Images → PDF       | docproc img2pdf     | pdf      | Combine images into a PDF
Image resize       | docproc resize      | image    | Scale, fit, crop, or pad images
Image convert      | docproc convert     | image    | Convert between PNG, JPEG, WebP
Image compress     | docproc imgcompress | image    | Reduce file size (target size mode)
Image crop         | docproc crop        | image    | Crop by coords, margins, or auto-detect
Image EXIF         | docproc exif        | image    | View or strip EXIF metadata
Background removal | docproc background  | image    | AI portrait segmentation (Python 3.10)
Watermark removal  | docproc watermark   | image    | Brightness detection + inpainting

Adding a new pipeline

  1. Create docproc/pipelines/my_pipeline.py extending Pipeline with @register
  2. Declare params = {...} using Param descriptors — the CLI auto-generates flags
  3. Import in docproc/pipelines/__init__.py

That's it. No CLI code to write — the Param metadata drives --help, argparse, and recipe validation automatically.

How to request work

  1. Input file name(s)
  2. Desired output (format + quality + constraints like max file size)
  3. Delivery preference (single final file vs multiple candidates)

Requirements

  • Python 3.10 or newer (background removal via rembg/ONNX requires Python 3.10 specifically)
  • See pyproject.toml for full dependency list

Quality standards

  • Prefer model-based segmentation over color-key heuristics
  • Preserve foreground colors (no jacket/skin recoloring)
  • Keep edges/hair natural, then composite onto pure white when requested
  • Produce deterministic output filenames and keep originals untouched
  • For PDFs: prefer color over grayscale, maximize DPI and JPEG quality within size budget

Operating principle

Root-cause first: when output quality fails, fix dependencies/runtime/model choice before tuning thresholds.

Documentation

License

MIT
