PDF Classification & Extraction Platform (Flask)

A clean Flask starter architecture for a PDF intelligence pipeline based on your flowchart:

PDF upload endpoint
Pre-processing (PyMuPDF + EasyOCR fallback placeholder)
Main bucket classifier (loads externally trained checkpoints; heuristics fallback)
Sub-bucket classifier
Key info extractor (spaCy NER + regex placeholder)
Storage/grouping engine
Polished web UI
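
To make the classifier fallback and extractor stages above more concrete, here is a minimal sketch of a keyword-based heuristic classifier and a regex extractor. It uses only the standard library. The bucket names, keywords, regex patterns, and function names are all hypothetical placeholders for illustration, not the labels or code in app/services/.

```python
import re
from collections import Counter

# Hypothetical buckets and keywords; the real labels come from the trained checkpoints.
BUCKET_KEYWORDS = {
    "invoice": ("invoice", "amount due", "bill to"),
    "contract": ("agreement", "party", "hereinafter"),
}

# Hypothetical patterns: ISO dates and amounts with two decimal places.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT_RE = re.compile(r"\b\d+(?:,\d{3})*\.\d{2}\b")


def classify_by_keywords(text: str, default: str = "unknown") -> str:
    """Return the bucket whose keywords occur most often, or `default` if none match."""
    lowered = text.lower()
    scores = Counter()
    for bucket, keywords in BUCKET_KEYWORDS.items():
        scores[bucket] = sum(lowered.count(keyword) for keyword in keywords)
    bucket, score = scores.most_common(1)[0]
    return bucket if score > 0 else default


def extract_key_info(text: str) -> dict:
    """Collect every date and amount the placeholder patterns can find."""
    return {
        "dates": DATE_RE.findall(text),
        "amounts": AMOUNT_RE.findall(text),
    }


if __name__ == "__main__":
    sample = "Invoice #12\nBill to: ACME\nDate: 2024-03-01\nAmount due: 1,250.00"
    print(classify_by_keywords(sample))
    print(extract_key_info(sample))
```

A fallback like this is only meant to keep the pipeline usable when no checkpoint is available; a trained model would normally take precedence whenever it loads successfully.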

Project Structure

my_second_python/
├── app/
│   ├── __init__.py
│   ├── config.py
│   ├── models/
│   │   └── document.py
│   ├── routes/
│   │   └── main.py
│   ├── services/
│   │   ├── classify.py
│   │   ├── extract.py
│   │   ├── grouping.py
│   │   ├── mlops.py
│   │   └── preprocess.py
│   ├── static/
│   │   ├── css/style.css
│   │   └── js/app.js
│   └── templates/
│       ├── base.html
│       ├── index.html
│       └── result.html
├── data/
│   ├── processed/
│   └── uploads/
├── requirements.txt
└── run.py

Run Locally

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
python run.py

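Then open http://127.0.0.1:5000 in a browser. The activate command above is for Windows; on macOS or Linux, run source .venv/bin/activate instead.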

External training (e.g. Kaggle)

Fine-tune distilbert-base-multilingual-cased on data/training/train.jsonl and validation.jsonl, then place the resulting checkpoints under models/main_bucket_classifier/ and models/sub_bucket_classifier/. See docs/kaggle-training.md for the labels, directory layout, and wiring.
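
Before uploading the data for training, it can help to sanity-check the JSONL files. The sketch below assumes each line is a JSON object with "text" and "label" fields; those field names are an assumption for illustration, and docs/kaggle-training.md remains the authority on the actual schema.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path, required=("text", "label")):
    """Read one JSON object per line and check that the assumed fields are present."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [key for key in required if key not in record]
            if missing:
                raise ValueError(f"{path}:{line_no} is missing fields {missing}")
            records.append(record)
    return records


def label_counts(records, label_key="label"):
    """Count examples per label to spot class imbalance early."""
    return Counter(record[label_key] for record in records)


if __name__ == "__main__":
    train_path = Path("data/training/train.jsonl")
    if train_path.exists():
        print(label_counts(load_jsonl(train_path)))
```

Checking label counts up front makes a heavily skewed training set visible before any training time is spent on it.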

Next Up (Production)

Plug real models into app/services/classify.py (ONNX Runtime, INT8)
Add PyMuPDF + EasyOCR logic in app/services/preprocess.py
Replace in-memory grouping with SQLite persistence
Add a feedback endpoint that sends correction events to a retraining queue
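
As one possible starting point for the SQLite item above, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and function names are hypothetical and are not taken from app/services/grouping.py.

```python
import json
import sqlite3


def init_db(path=":memory:"):
    """Open the database and create the (hypothetical) documents table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            filename TEXT NOT NULL,
            main_bucket TEXT NOT NULL,
            sub_bucket TEXT,
            fields_json TEXT NOT NULL
        )
        """
    )
    return conn


def save_document(conn, filename, main_bucket, sub_bucket, fields):
    """Insert one processed document; extracted fields are stored as JSON text."""
    with conn:  # commits on success, rolls back if the insert raises
        cursor = conn.execute(
            "INSERT INTO documents (filename, main_bucket, sub_bucket, fields_json) "
            "VALUES (?, ?, ?, ?)",
            (filename, main_bucket, sub_bucket, json.dumps(fields)),
        )
    return cursor.lastrowid


def group_by_bucket(conn):
    """Return {main_bucket: [filename, ...]} in insertion order within each bucket."""
    rows = conn.execute(
        "SELECT main_bucket, filename FROM documents ORDER BY main_bucket, id"
    )
    groups = {}
    for bucket, filename in rows:
        groups.setdefault(bucket, []).append(filename)
    return groups


if __name__ == "__main__":
    conn = init_db()  # in-memory here; pass a file path to persist across restarts
    save_document(conn, "a.pdf", "invoice", None, {"dates": ["2024-03-01"]})
    save_document(conn, "b.pdf", "contract", None, {})
    print(group_by_bucket(conn))
```

Storing the extracted fields as a JSON string keeps the schema small while the set of extracted keys is still changing; dedicated columns can be added once it stabilizes.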

