PDF Transmutation Pipeline (Proof of Concept)

End-to-end demo pipeline that preprocesses PDFs, classifies them, and extracts structured JSON, optionally storing the results in MongoDB.

Categories: academic research papers (arXiv) and legal contracts/licenses (public PDFs).

Quick Start

cd "PDF Processing Pipeline Demo"
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Optional: OCR for scanned PDFs (macOS)
# brew install tesseract poppler

cp .env.example .env
# Start MongoDB locally, or use Atlas URI in .env

python run.py --synthetic-only   # offline smoke test
python run.py                    # scrape + pipeline → data/output/transmuted.jsonl
python run.py --mongo            # also persist to MongoDB (requires a running MongoDB, e.g. docker compose up -d)
python run.py --no-scrape        # reuse data/raw/
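
Once a run finishes, the JSONL output can be inspected with a few lines of Python. This is a minimal sketch, assuming only that each non-empty line of data/output/transmuted.jsonl holds one JSON object. The helper name load_jsonl is hypothetical and not part of the repository, and no record field names are assumed.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Return one parsed JSON object per non-empty line of a JSONL file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    output = Path("data/output/transmuted.jsonl")  # path from the Quick Start
    if output.exists():
        docs = load_jsonl(output)
        print(f"{len(docs)} records in {output}")
    else:
        print(f"{output} not found; run the pipeline first")
```

Reading line by line keeps the helper usable on large runs, since a malformed record raises at the offending line rather than after the whole file is loaded.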

Project Layout

config/                 # Paths, thresholds, category IDs
scrape/                 # arXiv + legal PDF dataset scraper
layer1_preprocess/      # Format detection + A4 normalization
layer2_classify/        # YAML keyword templates + scorer
layer3_transmute/       # Schemas, extractors, PII scrub, MongoDB
pipeline/               # Orchestrator
docs/ARCHITECTURE.md    # Deep technical design notes

Layer Summary

Layer	Module	Output
1	`layer1_preprocess/`	`data/canonical/_canonical.pdf`, `data/text/.txt`
2	`layer2_classify/`	Category + confidence + score breakdown
3	`layer3_transmute/`	Category-specific JSON → MongoDB

See docs/ARCHITECTURE.md for algorithms, complexity, limitations, and production scaling notes.
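
To make the Layer 2 output (category, confidence, score breakdown) concrete, here is a minimal sketch of weighted keyword scoring. It is an illustration under stated assumptions, not the repository's scorer: the keywords, weights, category IDs, the classify function, and the confidence formula (winning score divided by the sum of all scores) are hypothetical. The real templates are YAML files; a plain dict stands in here so the example runs without extra dependencies.

```python
import re
from collections import Counter

# Hypothetical keyword weights per category (the real ones live in YAML templates).
TEMPLATES = {
    "academic": {"abstract": 2.0, "references": 1.5, "arxiv": 3.0},
    "legal": {"license": 2.0, "warranty": 1.5, "licensor": 3.0},
}


def classify(text, templates=TEMPLATES):
    """Score text against each template; return (category, confidence, breakdown)."""
    # Count lowercase alphabetic tokens once, then look keywords up in the counter.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    breakdown = {
        category: sum(weight * counts[word] for word, weight in keywords.items())
        for category, keywords in templates.items()
    }
    total = sum(breakdown.values())
    if total == 0:
        return None, 0.0, breakdown  # no keyword matched any template
    best = max(breakdown, key=breakdown.get)
    # Assumed confidence: the winner's share of the total score.
    return best, breakdown[best] / total, breakdown


if __name__ == "__main__":
    print(classify("Abstract: we study arXiv papers. References follow."))
```

A share-of-total confidence makes near-ties easy to spot (values close to 0.5 with two categories), which is one plausible way to drive the tie and low-confidence flags shown in the dashboard.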

Configuration

Scoring templates: layer2_classify/templates/*.yaml
JSON schemas: layer3_transmute/schemas/*.json
Extractors: layer3_transmute/extractors/

Dashboard (visualization + upload)

Start the local web UI and API:

source .venv/bin/activate
pip install -r requirements.txt
python scripts/serve_dashboard.py

Open http://127.0.0.1:8000 in your browser.

Re-running serve_dashboard.py automatically stops any previous instance on port 8000. To stop manually:

python scripts/stop_dashboard.py
# or: lsof -ti :8000 | xargs kill

Feature	Description
Results view	Summary counts + card grid (academic and legal cards color-coded; ties and low-confidence results flagged)
Refresh	Loads `data/output/last_run.json` + `transmuted.jsonl`
Import JSON	Paste JSON output from the CLI pipeline
Upload	PDF or image (PNG/JPEG/etc.) → convert → classify → show inline
Download	Transmuted PDF (redacted per schema) on every card; JSON/canonical for uploads
MongoDB	Optional checkbox (requires `docker compose up -d`)

MongoDB

Default database: pdf_transmutation_demo, collection: documents.

docker compose up -d
python run.py --mongo          # batch pipeline → MongoDB
# or use the dashboard checkbox when uploading
mongosh "mongodb://localhost:27017/pdf_transmutation_demo" --eval 'db.documents.find().pretty()'
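
The same lookup can be done from Python. This is a minimal sketch, assuming the pymongo driver is installed (whether requirements.txt pins it is not confirmed here) and MongoDB is listening on the default local port. The helper names fetch_documents and main are hypothetical.

```python
def fetch_documents(collection, limit=5):
    """Return up to `limit` documents from a pymongo-style collection."""
    return list(collection.find().limit(limit))


def main(uri="mongodb://localhost:27017"):
    """Connect to MongoDB and print a few documents stored by the pipeline."""
    from pymongo import MongoClient  # third-party driver, imported lazily

    # Fail fast (2 s) instead of hanging if no server is reachable.
    client = MongoClient(uri, serverSelectionTimeoutMS=2000)
    collection = client["pdf_transmutation_demo"]["documents"]
    print("stored documents:", collection.count_documents({}))
    for doc in fetch_documents(collection):
        print(doc)
```

Calling main() after `docker compose up -d` and a `python run.py --mongo` run should print whatever the pipeline stored; pass an Atlas URI as the argument to query a hosted cluster instead.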

License & Data Sources

arXiv papers: subject to the arXiv API Terms of Use
Legal samples: public license PDFs (Apache 2.0, W3C); respect each host's terms when scraping
