Extract structured JSON data from PDFs and other documents using text analysis, with optional OCR for scanned files.
- PDF text extraction
- Table detection and extraction
- Metadata extraction (author, creation date, etc.)
- Page-by-page content parsing
- OCR support for scanned documents (optional)
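
As a rough illustration of how the features above could map onto pdfplumber, here is a minimal sketch. The function names `build_result` and `parse_pdf` are hypothetical and not taken from this project's source; only `pdfplumber.open`, `pdf.pages`, `page.extract_text()`, `page.extract_tables()`, and `pdf.metadata` are real pdfplumber calls. OCR is not covered here.

```python
import os
from typing import Any, Dict, List, Optional


def build_result(
    filename: str,
    page_texts: List[str],
    metadata: Optional[Dict[str, Any]] = None,
    tables: Optional[List[Any]] = None,
) -> Dict[str, Any]:
    """Assemble a response-shaped dict from already-extracted pieces (hypothetical helper)."""
    pages = [{"page": i, "text": text} for i, text in enumerate(page_texts, start=1)]
    return {
        "filename": filename,
        "pages": len(pages),
        "content": {"text": "\n".join(page_texts), "pages": pages},
        "metadata": metadata or {},
        "tables": tables or [],
    }


def parse_pdf(
    path: str, extract_tables: bool = False, include_metadata: bool = False
) -> Dict[str, Any]:
    """Extract text, and optionally tables and metadata, from a PDF (hypothetical helper)."""
    import pdfplumber  # imported lazily so build_result is usable without it

    page_texts: List[str] = []
    tables: List[Any] = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # extract_text() can return None for pages without a text layer
            page_texts.append(page.extract_text() or "")
            if extract_tables:
                tables.extend(page.extract_tables())
        # pdfplumber exposes raw PDF keys such as "Author" and "CreationDate";
        # mapping them to "author" / "created" is left out of this sketch.
        metadata = dict(pdf.metadata or {}) if include_metadata else {}
    return build_result(os.path.basename(path), page_texts, metadata, tables)
```

Keeping the response assembly separate from the pdfplumber calls makes the output shape easy to check without a real PDF on hand.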
# Install dependencies
pip install -r requirements.txt
# Run the API
uvicorn src.main:app --reload

# Parse a PDF file
curl -X POST "http://localhost:8000/parse" \
-F "file=@document.pdf"
# Parse with specific options
curl -X POST "http://localhost:8000/parse" \
-F "file=@document.pdf" \
-F "extract_tables=true" \
-F "include_metadata=true"

{
  "filename": "document.pdf",
  "pages": 3,
  "content": {
    "text": "Full extracted text...",
    "pages": [
      {"page": 1, "text": "Page 1 content..."},
      {"page": 2, "text": "Page 2 content..."}
    ]
  },
  "metadata": {
    "author": "John Doe",
    "created": "2024-01-15",
    "title": "Document Title"
  },
  "tables": []
}

- Python 3.9+
- FastAPI
- pdfplumber (PDF extraction)
- python-multipart (file uploads)
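
For client code, a response shaped like the sample shown earlier can be read with the standard library alone. The sketch below is illustrative only: `summarize_response` is a hypothetical helper, and `SAMPLE_RESPONSE` simply repeats the sample values from this README rather than real API output.

```python
import json
from typing import Any, Dict

# Copy of the sample response from this README, used as stand-in input.
SAMPLE_RESPONSE = """
{
  "filename": "document.pdf",
  "pages": 3,
  "content": {
    "text": "Full extracted text...",
    "pages": [
      {"page": 1, "text": "Page 1 content..."},
      {"page": 2, "text": "Page 2 content..."}
    ]
  },
  "metadata": {"author": "John Doe", "created": "2024-01-15", "title": "Document Title"},
  "tables": []
}
"""


def summarize_response(response_text: str) -> Dict[str, Any]:
    """Pull a few commonly needed fields out of a /parse response body."""
    data = json.loads(response_text)
    metadata = data.get("metadata", {})
    return {
        "filename": data["filename"],
        "page_count": data["pages"],
        "author": metadata.get("author"),
        "title": metadata.get("title"),
        "table_count": len(data.get("tables", [])),
        # Map each page number to its extracted text for direct lookup
        "text_by_page": {p["page"]: p["text"] for p in data["content"]["pages"]},
    }


if __name__ == "__main__":
    summary = summarize_response(SAMPLE_RESPONSE)
    print(summary["filename"], summary["page_count"], summary["author"])
```

Using `.get()` for `metadata` and `tables` is a defensive choice, on the assumption that those keys may be absent when the corresponding options are not requested.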
License: MIT