End-to-end demo pipeline that preprocesses, classifies, and extracts structured JSON from PDFs, storing results in MongoDB.
Categories: academic research papers (arXiv) and legal contracts/licenses (public PDFs).
cd "PDF Processing Pipeline Demo"
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Optional: OCR for scanned PDFs (macOS)
# brew install tesseract poppler
cp .env.example .env
# Start MongoDB locally, or use Atlas URI in .env
python run.py --synthetic-only # offline smoke test
python run.py # scrape + pipeline → data/output/transmuted.jsonl
python run.py --mongo # also persist to MongoDB (requires docker compose up)
python run.py --no-scrape # reuse data/raw/

config/ # Paths, thresholds, category IDs
scrape/ # arXiv + legal PDF dataset scraper
layer1_preprocess/ # Format detection + A4 normalization
layer2_classify/ # YAML keyword templates + scorer
layer3_transmute/ # Schemas, extractors, PII scrub, MongoDB
pipeline/ # Orchestrator
docs/ARCHITECTURE.md # Deep technical design notes
| Layer | Module | Output |
|---|---|---|
| 1 | layer1_preprocess/ | data/canonical/*_canonical.pdf, data/text/*.txt |
| 2 | layer2_classify/ | Category + confidence + score breakdown |
| 3 | layer3_transmute/ | Category-specific JSON → MongoDB |
See docs/ARCHITECTURE.md for algorithms, complexity, limitations, and production scaling notes.
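
To make the Layer 2 output (category, confidence, score breakdown) more concrete, here is a minimal sketch of a keyword scorer. It is not the repository's implementation: the `TEMPLATES` dict, the keyword weights, the `classify` function, the `low_threshold` value, and the "tie"/"low" flag rules are all assumptions for illustration. The real templates live in YAML files, which this sketch replaces with a plain dict to stay within the standard library.

```python
from dataclasses import dataclass

# Hypothetical keyword weights; the real ones would come from
# layer2_classify/templates/*.yaml.
TEMPLATES = {
    "academic": {"abstract": 2.0, "references": 1.5, "arxiv": 3.0},
    "legal": {"license": 2.5, "warranty": 2.0, "hereby": 1.0},
}


@dataclass
class Classification:
    category: str
    confidence: float
    scores: dict
    flag: str  # "ok", "tie", or "low" (assumed labels)


def classify(text, templates=TEMPLATES, low_threshold=0.6):
    """Score text against each template by weighted keyword counts."""
    lowered = text.lower()
    scores = {
        name: sum(weight * lowered.count(kw) for kw, weight in keywords.items())
        for name, keywords in templates.items()
    }
    total = sum(scores.values())
    if total == 0:
        # No keyword matched at all: nothing to base a decision on.
        return Classification("unknown", 0.0, scores, "low")

    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best, best_score = ranked[0]
    confidence = best_score / total  # share of the total score won by the top category

    if len(ranked) > 1 and ranked[1][1] == best_score:
        flag = "tie"
    elif confidence < low_threshold:
        flag = "low"
    else:
        flag = "ok"
    return Classification(best, confidence, scores, flag)
```

Calling `classify("Abstract ... arXiv ... References")` on this toy data yields the `academic` category with the per-category scores available in `.scores`, which mirrors the "score breakdown" column above.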
- Scoring templates: layer2_classify/templates/*.yaml
- JSON schemas: layer3_transmute/schemas/*.json
- Extractors: layer3_transmute/extractors/
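
When adding a schema and extractor for a new category, a quick check that the extractor fills every required field can catch mismatches early. The sketch below assumes the schema files are standard JSON Schema documents with a top-level `required` array; the `missing_required` helper and the example field names are hypothetical and not part of the repository.

```python
def missing_required(record, schema):
    """Return the schema's required top-level keys that are absent or None in record."""
    return [key for key in schema.get("required", []) if record.get(key) is None]


# Hypothetical schema and extractor output, for illustration only.
example_schema = {
    "type": "object",
    "required": ["title", "authors", "abstract"],
    "properties": {
        "title": {"type": "string"},
        "authors": {"type": "array"},
        "abstract": {"type": "string"},
    },
}
example_record = {"title": "A Sample Paper", "authors": ["A. Author"], "abstract": None}

print(missing_required(example_record, example_schema))
```

This only checks for presence. Full validation (types, nested objects, formats) would need a proper JSON Schema validator.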
Start the local web UI and API:
source .venv/bin/activate
pip install -r requirements.txt
python scripts/serve_dashboard.py

Open http://127.0.0.1:8000 in your browser.
Re-running serve_dashboard.py automatically stops any previous instance on port 8000. To stop manually:
python scripts/stop_dashboard.py
# or: lsof -ti :8000 | xargs kill

| Feature | Description |
|---|---|
| Results view | Summary counts + card grid (academic vs legal colors, tie/low flagged) |
| Refresh | Loads data/output/last_run.json + transmuted.jsonl |
| Import JSON | Paste JSON output from the CLI pipeline |
| Upload | PDF or image (PNG/JPEG/etc.) → convert → classify → show inline |
| Download | Transmuted PDF (redacted per schema) on every card; JSON/canonical for uploads |
| MongoDB | Optional checkbox (requires docker compose up -d) |
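
The summary counts in the results view can be approximated outside the dashboard by reading the JSONL output directly. This is a rough sketch, not the dashboard's code: it assumes each line of `transmuted.jsonl` is one JSON object, and the `category` and `flag` field names (with `"tie"` and `"low"` as flag values) are guesses that may need adjusting to the actual record layout.

```python
import json
from collections import Counter
from pathlib import Path


def summarize(jsonl_path):
    """Count records per category and how many are flagged as tie/low."""
    counts = Counter()
    flagged = 0
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # "category" and "flag" are assumed field names.
            counts[record.get("category", "unknown")] += 1
            if record.get("flag") in {"tie", "low"}:
                flagged += 1
    return {
        "total": sum(counts.values()),
        "by_category": dict(counts),
        "flagged": flagged,
    }
```

For example, `summarize("data/output/transmuted.jsonl")` after a batch run returns a dict with the total, a per-category breakdown, and the number of flagged records.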
Default database: pdf_transmutation_demo, collection: documents.
docker compose up -d
python run.py --mongo # batch pipeline → MongoDB
# or use the dashboard checkbox when uploading
mongosh "mongodb://localhost:27017/pdf_transmutation_demo" --eval 'db.documents.find().pretty()'

- arXiv papers: arXiv API Terms
- Legal samples: public license PDFs (Apache 2.0, W3C); respect each host's terms when scraping