Goldziher/kreuzberg

Kreuzberg

High-performance Python library for text extraction from documents. Extract text from PDFs, images, office documents, and more with both async and sync APIs.

📖 Complete Documentation

Why Kreuzberg?

  • 🚀 Fastest Performance: Fastest of the libraries benchmarked (see Performance)
  • 💾 Memory Efficient: Up to 14x smaller than alternatives (71MB vs 1,032MB for Docling)
  • ⚡ Dual APIs: Only library with both sync and async support
  • 🔧 Zero Configuration: Works out of the box with sane defaults
  • 🏠 Local Processing: No cloud dependencies or external API calls
  • 📦 Rich Format Support: PDFs, images, Office docs, HTML, and more
  • 🔍 Multiple OCR Engines: Tesseract, EasyOCR, and PaddleOCR support
  • 🐳 Production Ready: CLI, REST API, and Docker images included

Quick Start

Installation

# Basic installation
pip install kreuzberg

# With optional features
pip install "kreuzberg[cli,api]"        # CLI + REST API
pip install "kreuzberg[easyocr,gmft]"   # EasyOCR + table extraction
pip install "kreuzberg[all]"            # Everything

System Dependencies

# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc

# macOS
brew install tesseract pandoc

# Windows
choco install tesseract pandoc

Basic Usage

import asyncio
from kreuzberg import extract_file

async def main():
    # Extract from any document type
    result = await extract_file("document.pdf")
    print(result.content)
    print(result.metadata)

asyncio.run(main())
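
Since the library advertises a synchronous API alongside the async one, a blocking variant may be more convenient in short scripts. The sketch below assumes the synchronous counterpart is exported as extract_file_sync; that name does not appear in this README, so check it against the documentation. read_text and its extract parameter are hypothetical, added only so the function can be tried without Kreuzberg installed.

```python
from typing import Any, Callable, Optional


def read_text(path: str, extract: Optional[Callable[[str], Any]] = None) -> str:
    """Return the extracted text of one document using a blocking call."""
    if extract is None:
        # Assumption: kreuzberg exposes a sync counterpart under this name.
        from kreuzberg import extract_file_sync

        extract = extract_file_sync
    # The result is expected to carry .content, as in the async example above.
    result = extract(path)
    return result.content
```

With Kreuzberg installed, read_text("document.pdf") would print the same text as result.content in the async example.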

Deployment Options

🐳 Docker (Recommended)

# Run API server
docker run -p 8000:8000 goldziher/kreuzberg:3.4.0

# Extract files
curl -X POST http://localhost:8000/extract -F "data=@document.pdf"

Available variants: 3.4.0, 3.4.0-easyocr, 3.4.0-paddle, 3.4.0-gmft, 3.4.0-all

🌐 REST API

# Install and run
pip install "kreuzberg[api]"
litestar --app kreuzberg._api.main:app run

# Health check
curl http://localhost:8000/health

# Extract files
curl -X POST http://localhost:8000/extract -F "data=@file.pdf"
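
The same request can be sent from Python with the standard library alone. This is a minimal sketch, assuming the server started above listens on http://localhost:8000 and accepts the file in a multipart field named data, as in the curl example. The helper names build_multipart and post_file are hypothetical, and the response is returned as raw bytes because its schema is not described here.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body (hypothetical helper).

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def post_file(path: str, url: str = "http://localhost:8000/extract") -> bytes:
    """POST a local file to the /extract endpoint and return the raw response."""
    file = Path(path)
    # "data" is the form field name used by the curl example.
    body, content_type = build_multipart("data", file.name, file.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling post_file("file.pdf") while the server is running returns the response body as bytes.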

💻 Command Line

# Install CLI
pip install "kreuzberg[cli]"

# Extract to stdout
kreuzberg extract document.pdf

# JSON output with metadata
kreuzberg extract document.pdf --output-format json --show-metadata

# Batch processing
kreuzberg extract *.pdf --output-dir ./extracted/

Supported Formats

Category       Formats
Documents      PDF, DOCX, DOC, RTF, TXT, EPUB
Images         JPG, PNG, TIFF, BMP, GIF, WEBP
Spreadsheets   XLSX, XLS, CSV, ODS
Presentations  PPTX, PPT, ODP
Web            HTML, XML, MHTML
Archives       Support via extraction

Performance

Fastest extraction speeds with minimal resource usage:

Library       Speed        Memory      Size     Success Rate
Kreuzberg     Fastest      Lowest      71MB     100%
Unstructured  2-3x slower  2x higher   146MB    95%
MarkItDown    3-4x slower  3x higher   251MB    90%
Docling       4-5x slower  10x higher  1,032MB  85%

Rule of thumb: Use the async API for complex documents and batch processing (up to 4.5x faster).
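
The batch-processing case can be sketched with asyncio.gather. This is a sketch under stated assumptions, not the library's own batching code: extract_many and its max_concurrency parameter are hypothetical names, and the extractor is passed in as an argument so that extract_file from Basic Usage (or any other coroutine function) can be supplied.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def extract_many(
    paths: Sequence[str],
    extract: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 4,
) -> list[Any]:
    """Run `extract` over all paths concurrently; results keep input order."""
    # The semaphore caps how many extractions are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> Any:
        async with semaphore:
            return await extract(path)

    return await asyncio.gather(*(worker(p) for p in paths))
```

With Kreuzberg installed, asyncio.run(extract_many(["a.pdf", "b.pdf"], extract_file)) would return one result per path, in input order. The concurrency cap is a design choice to keep memory use bounded when the file list is long.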

Documentation

Advanced Features

  • 📊 Table Extraction: Extract tables from PDFs with GMFT
  • 🧩 Content Chunking: Split documents for RAG applications
  • 🎯 Custom Extractors: Extend with your own document handlers
  • 🔧 Configuration: Flexible TOML-based configuration
  • 🪝 Hooks: Pre/post-processing customization
  • 🌍 Multi-language OCR: 100+ languages supported
  • ⚙️ Metadata Extraction: Rich document metadata
  • 🔄 Batch Processing: Efficient bulk document processing

License

MIT License - see LICENSE for details.


Made with ❤️ by the Kreuzberg contributors
