RoroCantCode/TransmutePipeline

PDF Transmutation Pipeline (Proof of Concept)

End-to-end demo pipeline that preprocesses PDFs, classifies them, and extracts structured JSON, optionally storing the results in MongoDB.

Categories: academic research papers (arXiv) and legal contracts/licenses (public PDFs).

Quick Start

cd "PDF Processing Pipeline Demo"
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Optional: OCR for scanned PDFs (macOS)
# brew install tesseract poppler

cp .env.example .env
# Start MongoDB locally, or use Atlas URI in .env

python run.py --synthetic-only   # offline smoke test
python run.py                    # scrape + pipeline → data/output/transmuted.jsonl
python run.py --mongo            # also persist to MongoDB (requires a running MongoDB, e.g. docker compose up -d)
python run.py --no-scrape        # reuse data/raw/
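
Once a run has finished, the output file can be inspected with a few lines of Python. The sketch below assumes that data/output/transmuted.jsonl holds one JSON object per line (as the .jsonl extension suggests) and that each record has a `category` key; that key name is an assumption, so adjust it to the actual schema.

```python
import json
from collections import Counter
from pathlib import Path


def load_records(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def count_by_category(records, field="category"):
    """Count records per category; `field` is an assumed key name."""
    return Counter(record.get(field, "unknown") for record in records)


if __name__ == "__main__":
    output = Path("data/output/transmuted.jsonl")
    if output.exists():
        print(count_by_category(load_records(output)))
    else:
        print(f"{output} not found; run the pipeline first")
```

Records that lack the assumed key are counted under "unknown" rather than raising an error.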

Project Layout

config/                 # Paths, thresholds, category IDs
scrape/                 # arXiv + legal PDF dataset scraper
layer1_preprocess/      # Format detection + A4 normalization
layer2_classify/        # YAML keyword templates + scorer
layer3_transmute/       # Schemas, extractors, PII scrub, MongoDB
pipeline/               # Orchestrator
docs/ARCHITECTURE.md    # Deep technical design notes

Layer Summary

Layer   Module                Output
1       layer1_preprocess/    data/canonical/*_canonical.pdf, data/text/*.txt
2       layer2_classify/      Category + confidence + score breakdown
3       layer3_transmute/     Category-specific JSON → MongoDB

See docs/ARCHITECTURE.md for algorithms, complexity, limitations, and production scaling notes.
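
As a rough illustration of what layer 2 produces (category, confidence, score breakdown), here is a minimal keyword-scoring sketch. It is not the repository's scorer: the keyword weights, the `classify` function, the 0.6 threshold, and the "tie"/"low" flag values are all hypothetical, and the templates are inlined as a dict instead of being loaded from YAML. It expects at least two categories.

```python
import re

# Hypothetical keyword weights; the real templates live in
# layer2_classify/templates/*.yaml and may look different.
TEMPLATES = {
    "academic": {"abstract": 2.0, "references": 1.5, "arxiv": 3.0},
    "legal": {"license": 2.0, "warranty": 1.5, "licensor": 3.0},
}


def classify(text, templates=TEMPLATES, low_threshold=0.6):
    """Return category, confidence, per-category scores, and a flag."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Score = sum over keywords of (weight * number of occurrences).
    scores = {
        category: sum(weight * words.count(keyword)
                      for keyword, weight in keywords.items())
        for category, keywords in templates.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up) = ranked[0], ranked[1]
    total = sum(scores.values())
    # Confidence = share of the total score held by the winner.
    confidence = best_score / total if total else 0.0
    if best_score == runner_up:
        flag = "tie"
    elif confidence < low_threshold:
        flag = "low"
    else:
        flag = None
    return {
        "category": None if flag == "tie" else best,
        "confidence": confidence,
        "scores": scores,
        "flag": flag,
    }


if __name__ == "__main__":
    print(classify("Abstract. We study ... References. arXiv preprint."))
```

Returning the full score dictionary alongside the winner keeps tie and low-confidence cases inspectable instead of hiding them behind a single label.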

Configuration

  • Scoring templates: layer2_classify/templates/*.yaml
  • JSON schemas: layer3_transmute/schemas/*.json
  • Extractors: layer3_transmute/extractors/

Dashboard (visualization + upload)

Start the local web UI and API:

source .venv/bin/activate
pip install -r requirements.txt
python scripts/serve_dashboard.py

Open http://127.0.0.1:8000 in your browser.

Re-running serve_dashboard.py automatically stops any previous instance on port 8000. To stop manually:

python scripts/stop_dashboard.py
# or: lsof -ti :8000 | xargs kill
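
To check from Python whether anything is still listening on the dashboard port, a small socket probe is enough. This is a generic sketch, not the logic used by serve_dashboard.py or stop_dashboard.py; the `port_in_use` helper is a hypothetical name.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    state = "in use" if port_in_use(8000) else "free"
    print(f"Port 8000 is {state}")
```
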

Feature         Description
Results view    Summary counts + card grid (academic vs legal colors, tie/low flagged)
Refresh         Loads data/output/last_run.json + transmuted.jsonl
Import JSON     Paste in CLI pipeline output JSON
Upload          PDF or image (PNG/JPEG/etc.) → convert → classify → show inline
Download        Transmuted PDF (redacted per schema) on every card; JSON/canonical for uploads
MongoDB         Optional checkbox (requires docker compose up -d)

MongoDB

Default database: pdf_transmutation_demo, collection: documents.

docker compose up -d
python run.py --mongo          # batch pipeline → MongoDB
# or use the dashboard checkbox when uploading
mongosh "mongodb://localhost:27017/pdf_transmutation_demo" --eval 'db.documents.find().pretty()'
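
The same collection can be queried from Python. The sketch below assumes pymongo is installed (it may or may not be listed in requirements.txt) and that stored documents carry `category` and `confidence` fields; both field names and both helper functions are assumptions for illustration.

```python
def build_filter(category=None, min_confidence=None):
    """Build a MongoDB filter dict; the field names are assumptions."""
    query = {}
    if category is not None:
        query["category"] = category
    if min_confidence is not None:
        query["confidence"] = {"$gte": min_confidence}
    return query


def fetch_documents(uri="mongodb://localhost:27017", **criteria):
    """Query the demo collection; needs pymongo and a running MongoDB."""
    # Imported lazily so build_filter stays usable without pymongo.
    from pymongo import MongoClient

    client = MongoClient(uri, serverSelectionTimeoutMS=2000)
    try:
        collection = client["pdf_transmutation_demo"]["documents"]
        return list(collection.find(build_filter(**criteria)))
    finally:
        client.close()


if __name__ == "__main__":
    print(build_filter(category="legal", min_confidence=0.8))
```

With MongoDB running, `fetch_documents(category="legal")` would return the matching documents as a list of dicts.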

License & Data Sources

  • arXiv papers: arXiv API Terms
  • Legal samples: public license PDFs (Apache 2.0, W3C); respect each host's terms when scraping
