santhoshworks/document-parser-api

Document Parser API

Extract structured JSON from PDFs and documents using text analysis, with optional OCR for scanned files.

Features

  • PDF text extraction
  • Table detection and extraction
  • Metadata extraction (author, creation date, title)
  • Page-by-page content parsing
  • OCR support for scanned documents (optional)
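
The features above map fairly directly onto pdfplumber calls. The following is a minimal sketch, not the repository's actual code: `build_result` and `parse_pdf` are hypothetical names, and the per-table shape (`page` plus `rows`) is an assumption, since the README only shows an empty `tables` list. OCR is left out because the README does not say which OCR library is used.

```python
import os
from typing import Any, Dict, List, Optional


def build_result(
    filename: str,
    page_texts: List[str],
    metadata: Optional[Dict[str, Any]] = None,
    tables: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Assemble a response-shaped dict from already-extracted pieces."""
    pages = [{"page": i, "text": text} for i, text in enumerate(page_texts, start=1)]
    result: Dict[str, Any] = {
        "filename": filename,
        "pages": len(pages),
        "content": {"text": "\n".join(page_texts), "pages": pages},
    }
    # Optional sections are only added when they were requested.
    if metadata is not None:
        result["metadata"] = metadata
    if tables is not None:
        result["tables"] = tables
    return result


def parse_pdf(
    path: str, extract_tables: bool = False, include_metadata: bool = False
) -> Dict[str, Any]:
    """Extract page text, and optionally tables and metadata, with pdfplumber."""
    # Third-party dependency, imported here so build_result works without it.
    import pdfplumber

    page_texts: List[str] = []
    tables: List[Dict[str, Any]] = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Image-only pages may yield no text; normalise to an empty string.
            page_texts.append(page.extract_text() or "")
            if extract_tables:
                for rows in page.extract_tables():
                    # Assumed table shape: page number plus raw rows.
                    tables.append({"page": page.page_number, "rows": rows})
        metadata = dict(pdf.metadata) if include_metadata else None
    return build_result(
        os.path.basename(path),
        page_texts,
        metadata,
        tables if extract_tables else None,
    )
```

Keeping the response assembly separate from the pdfplumber calls makes the output shape easy to unit-test without a real PDF on disk.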

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run the API (uvicorn serves on http://localhost:8000 by default)
uvicorn src.main:app --reload

API Usage

# Parse a PDF file
curl -X POST "http://localhost:8000/parse" \
  -F "file=@document.pdf"

# Parse with table extraction and metadata enabled
curl -X POST "http://localhost:8000/parse" \
  -F "file=@document.pdf" \
  -F "extract_tables=true" \
  -F "include_metadata=true"
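
The same requests can be made from Python. This is a sketch assuming the third-party `requests` package is installed (it is not listed in the Tech Stack) and that the server reads booleans as the strings `true`/`false`, as in the curl examples. `form_fields` and `parse_document` are hypothetical helper names.

```python
from pathlib import Path
from typing import Any, Dict


def form_fields(**options: bool) -> Dict[str, str]:
    """Turn keyword options into form fields, mirroring the curl -F flags.

    Only options that are passed explicitly are sent, so calling this with
    no arguments matches the plain upload in the first curl example.
    """
    return {name: str(value).lower() for name, value in options.items()}


def parse_document(
    path: str, base_url: str = "http://localhost:8000", **options: bool
) -> Dict[str, Any]:
    """Upload a file to POST /parse and return the decoded JSON response."""
    # Third-party dependency (pip install requests), imported lazily so
    # form_fields stays usable without it.
    import requests

    pdf_path = Path(path)
    with pdf_path.open("rb") as fh:
        response = requests.post(
            f"{base_url}/parse",
            files={"file": (pdf_path.name, fh, "application/pdf")},
            data=form_fields(**options),
            timeout=60,
        )
    # Raise on 4xx/5xx instead of trying to decode an error body.
    response.raise_for_status()
    return response.json()
```

With the API running locally, `parse_document("document.pdf", extract_tables=True, include_metadata=True)` would correspond to the second curl command.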

Response

{
  "filename": "document.pdf",
  "pages": 3,
  "content": {
    "text": "Full extracted text...",
    "pages": [
      {"page": 1, "text": "Page 1 content..."},
      {"page": 2, "text": "Page 2 content..."},
      {"page": 3, "text": "Page 3 content..."}
    ]
  },
  "metadata": {
    "author": "John Doe",
    "created": "2024-01-15",
    "title": "Document Title"
  },
  "tables": []
}
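
On the client side, a response of this shape can be loaded into a small typed structure. This is a sketch only: `ParsedDocument` and `load_parsed_document` are hypothetical names, and treating `metadata` and `tables` as optional is an assumption, since the README does not say whether they are omitted when the corresponding option is off.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ParsedDocument:
    filename: str
    page_count: int
    text: str
    # Maps page number to that page's extracted text.
    page_texts: Dict[int, str] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)
    tables: List[Any] = field(default_factory=list)


def load_parsed_document(payload: str) -> ParsedDocument:
    """Decode a /parse JSON response body into a ParsedDocument."""
    data = json.loads(payload)
    content = data.get("content", {})
    return ParsedDocument(
        filename=data["filename"],
        page_count=data["pages"],
        text=content.get("text", ""),
        page_texts={p["page"]: p["text"] for p in content.get("pages", [])},
        # Assumed optional: fall back to empty containers when absent.
        metadata=data.get("metadata", {}),
        tables=data.get("tables", []),
    )
```

Indexing pages by number lets callers look up a single page directly, for example `doc.page_texts[2]`, without scanning the list.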

Tech Stack

  • Python 3.9+
  • FastAPI
  • pdfplumber (PDF extraction)
  • python-multipart (file uploads)

License

MIT
