High-performance Python library for text extraction from documents. Extract text from PDFs, images, office documents, and more with both async and sync APIs.
- 🚀 Fastest Performance: Benchmarked as the fastest text extraction library
- 💾 Memory Efficient: Up to 14x smaller footprint than alternatives (71MB vs 1GB+ for Docling)
- ⚡ Dual APIs: Only library with both sync and async support
- 🔧 Zero Configuration: Works out of the box with sane defaults
- 🏠 Local Processing: No cloud dependencies or external API calls
- 📦 Rich Format Support: PDFs, images, Office docs, HTML, and more
- 🔍 Multiple OCR Engines: Tesseract, EasyOCR, and PaddleOCR support
- 🐳 Production Ready: CLI, REST API, and Docker images included
# Basic installation
pip install kreuzberg
# With optional features
pip install "kreuzberg[cli,api]" # CLI + REST API
pip install "kreuzberg[easyocr,gmft]" # EasyOCR + table extraction
pip install "kreuzberg[all]" # Everything
# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc
# macOS
brew install tesseract pandoc
# Windows
choco install tesseract pandoc
import asyncio
from kreuzberg import extract_file
async def main():
    # Extract from any document type
    result = await extract_file("document.pdf")
    print(result.content)
    print(result.metadata)
asyncio.run(main())
# Run API server
docker run -p 8000:8000 goldziher/kreuzberg:3.4.0
# Extract files
curl -X POST http://localhost:8000/extract -F "data=@document.pdf"
Available variants: 3.4.0, 3.4.0-easyocr, 3.4.0-paddle, 3.4.0-gmft, 3.4.0-all
# Install and run
pip install "kreuzberg[api]"
litestar --app kreuzberg._api.main:app run
# Health check
curl http://localhost:8000/health
# Extract files
curl -X POST http://localhost:8000/extract -F "data=@file.pdf"
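The same upload can be sketched from Python using only the standard library. The endpoint and the `data` form field come from the curl command above. The helper names `build_multipart` and `extract_via_api` are hypothetical, and the assumption that the server replies with JSON is not confirmed here.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field, filename, payload):
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract_via_api(path, base_url="http://localhost:8000"):
    """POST a file to /extract, mirroring `curl -F "data=@file.pdf"`."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("data", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the response body is JSON; its schema is not shown here.
        return json.loads(response.read())
```

Calling `extract_via_api("file.pdf")` requires the API server from the previous step to be running on port 8000.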
# Install CLI
pip install "kreuzberg[cli]"
# Extract to stdout
kreuzberg extract document.pdf
# JSON output with metadata
kreuzberg extract document.pdf --output-format json --show-metadata
# Batch processing
kreuzberg extract *.pdf --output-dir ./extracted/
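To drive the CLI from a script, one option is a thin `subprocess` wrapper. The sketch below uses only the subcommand and flags shown above. The function names `build_extract_command` and `run_extract` are hypothetical, and it assumes the `kreuzberg` executable is on `PATH`.

```python
import subprocess


def build_extract_command(path, output_format=None, show_metadata=False):
    """Build the argv list for `kreuzberg extract` from the flags shown above."""
    command = ["kreuzberg", "extract", path]
    if output_format is not None:
        command += ["--output-format", output_format]
    if show_metadata:
        command.append("--show-metadata")
    return command


def run_extract(path, **options):
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(
        build_extract_command(path, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.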
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, RTF, TXT, EPUB |
| Images | JPG, PNG, TIFF, BMP, GIF, WEBP |
| Spreadsheets | XLSX, XLS, CSV, ODS |
| Presentations | PPTX, PPT, ODP |
| Web | HTML, XML, MHTML |
| Archives | Support via extraction |
Fastest extraction speeds with minimal resource usage:
| Library | Speed | Memory | Size | Success Rate |
|---|---|---|---|---|
| Kreuzberg | ⚡ Fastest | 💾 Lowest | 📦 71MB | ✅ 100% |
| Unstructured | 2-3x slower | 2x higher | 146MB | 95% |
| MarkItDown | 3-4x slower | 3x higher | 251MB | 90% |
| Docling | 4-5x slower | 10x higher | 1,032MB | 85% |
Rule of thumb: Use the async API for complex documents and batch processing (up to 4.5x faster than sync).
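One way to apply this rule of thumb is to run several extractions concurrently with a bounded number in flight. The sketch below is a generic `asyncio` pattern, not Kreuzberg's own batch implementation. `extract_many` and `fake_extract` are hypothetical names, and `fake_extract` is a stand-in so the example runs without any documents.

```python
import asyncio


async def extract_many(paths, extract, max_concurrency=4):
    """Await `extract(path)` for every path, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path):
        async with semaphore:
            return await extract(path)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(guarded(p) for p in paths))


async def fake_extract(path):
    """Hypothetical stand-in for an async extractor such as extract_file."""
    await asyncio.sleep(0.01)
    return f"text of {path}"


if __name__ == "__main__":
    print(asyncio.run(extract_many(["a.pdf", "b.docx", "c.png"], fake_extract)))
```

In real use, the `extract` argument would be an async extractor such as `extract_file` from the quick start, and each result would be the object whose `content` and `metadata` are printed there.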
- Installation Guide - Setup and dependencies
- User Guide - Comprehensive usage guide
- API Reference - Complete API documentation
- Docker Guide - Container deployment
- REST API - HTTP endpoints
- CLI Guide - Command-line usage
- OCR Configuration - OCR engine setup
- 📊 Table Extraction: Extract tables from PDFs with GMFT
- 🧩 Content Chunking: Split documents for RAG applications
- 🎯 Custom Extractors: Extend with your own document handlers
- 🔧 Configuration: Flexible TOML-based configuration
- 🪝 Hooks: Pre/post-processing customization
- 🌍 Multi-language OCR: 100+ languages supported
- ⚙️ Metadata Extraction: Rich document metadata
- 🔄 Batch Processing: Efficient bulk document processing
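To illustrate what content chunking means for RAG pipelines, here is a minimal character-window splitter with overlap. It is a generic sketch, not Kreuzberg's chunking implementation. The function name `chunk_text` and its parameters are hypothetical.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split text into windows of up to max_chars, each sharing `overlap` chars with the previous one."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in [0, max_chars)")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once a window reaches the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks
```

The overlap keeps sentences that straddle a boundary visible in both neighboring chunks, which tends to help retrieval.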
MIT License - see LICENSE for details.
Documentation • PyPI • Docker Hub • Discord
Made with ❤️ by the Kreuzberg contributors