Kreuzberg

High-performance Python library for text extraction from documents. Extract text from PDFs, images, office documents, and more with both async and sync APIs.

📖 Complete Documentation

Why Kreuzberg?

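🚀 Fastest Performance: Benchmarked as the fastest of the text extraction libraries compared in the Performance section
💾 Lightweight: Install size up to 14x smaller than alternatives (71MB vs 1,032MB for Docling), with the lowest memory use in the same benchmarks
⚡ Dual APIs: Both sync and async entry points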
🔧 Zero Configuration: Works out of the box with sane defaults
🏠 Local Processing: No cloud dependencies or external API calls
📦 Rich Format Support: PDFs, images, Office docs, HTML, and more
🔍 Multiple OCR Engines: Tesseract, EasyOCR, and PaddleOCR support
🐳 Production Ready: CLI, REST API, and Docker images included

Quick Start

Installation

# Basic installation
pip install kreuzberg

# With optional features
pip install "kreuzberg[cli,api]"        # CLI + REST API
pip install "kreuzberg[easyocr,gmft]"   # EasyOCR + table extraction
pip install "kreuzberg[all]"            # Everything

System Dependencies

# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc

# macOS
brew install tesseract pandoc

# Windows
choco install tesseract pandoc

Basic Usage

import asyncio
from kreuzberg import extract_file

async def main():
    # Extract from any document type
    result = await extract_file("document.pdf")
    print(result.content)
    print(result.metadata)

asyncio.run(main())
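
The example above uses the async API. For scripts that do not run an event loop, a blocking call is often simpler. The following is a minimal sketch, assuming the synchronous counterpart is exported as `extract_file_sync` (that name does not appear in this README, so check the API Reference). The `extract` parameter is a hypothetical hook added here so the helper can be exercised without a real document.

```python
from pathlib import Path
from typing import Any, Callable, Optional


def extract_text_sync(path: str, extract: Optional[Callable[[Path], Any]] = None) -> str:
    """Return the extracted text of one document using a blocking call."""
    if extract is None:
        # Assumption: kreuzberg exports a synchronous `extract_file_sync`.
        from kreuzberg import extract_file_sync

        extract = extract_file_sync
    result = extract(Path(path))
    # `content` is the same attribute printed in the async example.
    return result.content
```

Calling `extract_text_sync("document.pdf")` would then return the text that the async example prints via `result.content`.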

Deployment Options

🐳 Docker (Recommended)

# Run API server
docker run -p 8000:8000 goldziher/kreuzberg:3.4.0

# Extract files
curl -X POST http://localhost:8000/extract -F "data=@document.pdf"

Available variants: 3.4.0, 3.4.0-easyocr, 3.4.0-paddle, 3.4.0-gmft, 3.4.0-all

🌐 REST API

# Install and run
pip install "kreuzberg[api]"
litestar --app kreuzberg._api.main:app run

# Health check
curl http://localhost:8000/health

# Extract files
curl -X POST http://localhost:8000/extract -F "data=@file.pdf"
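
The same upload can be made from Python. The sketch below uses only the standard library and mirrors the curl call: one multipart form field named `data` posted to `/extract`. The helper names `build_multipart` and `post_document` are illustrative, and the response is assumed to be JSON; its schema is not described in this README.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def post_document(path: str, base_url: str = "http://localhost:8000"):
    """POST a file to /extract using the same `data` field as the curl example."""
    file_path = Path(path)
    body, content_type = build_multipart("data", file_path.name, file_path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the server replies with JSON.
        return json.loads(response.read())
```

With the server running, `post_document("file.pdf")` sends the same request as the curl command above.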

💻 Command Line

# Install CLI
pip install "kreuzberg[cli]"

# Extract to stdout
kreuzberg extract document.pdf

# JSON output with metadata
kreuzberg extract document.pdf --output-format json --show-metadata

# Batch processing
kreuzberg extract *.pdf --output-dir ./extracted/
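
When the CLI has to be driven from another Python program, the commands above can be wrapped with `subprocess`. This is a rough sketch that uses only the flags shown in this section; `build_extract_command` and `run_extract` are hypothetical helper names, not part of Kreuzberg.

```python
import subprocess
from typing import List


def build_extract_command(path: str, as_json: bool = False) -> List[str]:
    """Build the argument list for `kreuzberg extract`."""
    command = ["kreuzberg", "extract", path]
    if as_json:
        # Flags copied from the JSON example in this section.
        command += ["--output-format", "json", "--show-metadata"]
    return command


def run_extract(path: str, as_json: bool = False) -> str:
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(
        build_extract_command(path, as_json),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Because no shell is involved, a glob such as `*.pdf` is not expanded; expand it in Python first (for example with `pathlib.Path.glob`) and call `run_extract` once per file.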

Supported Formats

Category	Formats
Documents	PDF, DOCX, DOC, RTF, TXT, EPUB
Images	JPG, PNG, TIFF, BMP, GIF, WEBP
Spreadsheets	XLSX, XLS, CSV, ODS
Presentations	PPTX, PPT, ODP
Web	HTML, XML, MHTML
Archives	Support via extraction

Performance

Fastest extraction speeds with minimal resource usage:

Library	Speed	Memory	Size	Success Rate
Kreuzberg	⚡ Fastest	💾 Lowest	📦 71MB	✅ 100%
Unstructured	2-3x slower	2x higher	146MB	95%
MarkItDown	3-4x slower	3x higher	251MB	90%
Docling	4-5x slower	10x higher	1,032MB	85%

Rule of thumb: Use async API for complex documents and batch processing (up to 4.5x faster)
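
One way to apply this rule of thumb to a batch is to run several `extract_file` calls concurrently with a bounded number in flight. The sketch below is illustrative only: `extract_many` and `max_concurrency` are hypothetical names, the library may ship its own batch helpers (see the API Reference), and the actual speedup depends on the documents and hardware.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Optional, Sequence


async def extract_many(
    paths: Sequence[str],
    extract: Optional[Callable[[str], Awaitable[Any]]] = None,
    max_concurrency: int = 4,
) -> List[Any]:
    """Extract several documents concurrently; results keep the input order."""
    if extract is None:
        # Same async function as in Basic Usage.
        from kreuzberg import extract_file

        extract = extract_file
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> Any:
        # The semaphore caps how many extractions run at the same time.
        async with semaphore:
            return await extract(path)

    return await asyncio.gather(*(worker(path) for path in paths))
```

A call such as `asyncio.run(extract_many(["a.pdf", "b.pdf"]))` returns one result per input path, in the same order as the inputs.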

Documentation

Quick Links

Installation Guide - Setup and dependencies
User Guide - Comprehensive usage guide
API Reference - Complete API documentation
Docker Guide - Container deployment
REST API - HTTP endpoints
CLI Guide - Command-line usage
OCR Configuration - OCR engine setup

Advanced Features

📊 Table Extraction: Extract tables from PDFs with GMFT
🧩 Content Chunking: Split documents for RAG applications
🎯 Custom Extractors: Extend with your own document handlers
🔧 Configuration: Flexible TOML-based configuration
🪝 Hooks: Pre/post-processing customization
🌍 Multi-language OCR: 100+ languages supported
⚙️ Metadata Extraction: Rich document metadata
🔄 Batch Processing: Efficient bulk document processing
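
To make the content chunking item concrete, here is a generic fixed-size chunker with overlap, the kind of splitting commonly used before embedding text for RAG. It is not Kreuzberg's implementation or configuration; `chunk_text`, `max_chars`, and `overlap` are illustrative names, and the library's own chunking options are described in its documentation.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into windows of at most max_chars that share `overlap` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + max_chars >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text.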

License

MIT License - see LICENSE for details.

Documentation • PyPI • Docker Hub • Discord

Made with ❤️ by the Kreuzberg contributors

