A powerful CLI tool for extracting text from documents using DeepSeek OCR and generating high-quality datasets with LLM assistance.
- 📄 Multi-Format Support: PDF, EPUB, and images
- 🏠 Self-Hosted OCR: Local transformers for DeepSeek-OCR / DeepSeek-OCR-2 (no API costs)
- 🔄 OCR Version Choice: Switch between DeepSeek-OCR-1 and OCR-2 via
--ocr-version - 🤖 Parallel Generation: Multiple LLM threads explore documents simultaneously
- 🎯 Smart Distribution: Control thread starting positions
- 💾 SQLite Storage: Real-time dataset storage with flexible export
- 📊 Multiple Formats: JSONL, Parquet, CSV, JSON
- 🌐 Flexible Modes: API or self-hosted for both stages
- 📈 Progress Tracking: Real-time progress bars
pip install bookdatamakergit clone https://github.com/yourusername/bookdatamaker.git
cd bookdatamaker
pip install -r requirements.txt
pip install -e .# For self-hosted OCR and LLM generation
pip install bookdatamaker[local] # From PyPI
# OR
pip install -e ".[local]" # From source - installs transformers==4.46.3, torch, flash-attn, etc.Note: The project requires transformers==4.46.3 for optimal compatibility with DeepSeek-OCR and DeepSeek-OCR-2. A warning will be displayed if a different version is detected.
For API Mode:
- Python 3.10+
- API keys (OpenAI, DeepSeek, etc.)
For Local Mode:
- Python 3.10-3.12 (3.13 not supported due to vLLM compatibility)
- NVIDIA GPU with CUDA support (or CPU, though slower)
- 16GB+ VRAM recommended for GPU
- transformers==4.46.3
- Linux or WSL2 (recommended)
# Set API keys (choose one based on your mode)
export OPENAI_API_KEY=your_openai_key # For API mode
export DEEPSEEK_API_KEY=your_deepseek_key # For API OCR mode# 1. Install
pip install bookdatamaker
# 2. Extract (uses DeepSeek-OCR-2 by default) → Generate → Export
bookdatamaker extract book.pdf -o ./extracted
bookdatamaker generate ./extracted -d dataset.db --distribution "10,10,20,30,20,10"
bookdatamaker export-dataset dataset.db -o output.parquet
# Use OCR-1 instead
bookdatamaker extract book.pdf --ocr-version 1 -o ./extracted# 1. Install with local dependencies
pip install bookdatamaker[local]
# 2. Extract with local OCR
bookdatamaker extract book.pdf --mode local --batch-size 8 -o ./extracted
# 3. Generate with vLLM
bookdatamaker generate ./extracted \
--mode vllm \
--vllm-model-path meta-llama/Llama-3-8B-Instruct \
--distribution "25,25,25,25" \
-d dataset.db
# 4. Export
bookdatamaker export-dataset dataset.db -o output.parquetFor API Mode:
- Python 3.10+
- API keys (OpenAI, DeepSeek, etc.)
For Local Mode:
- Python 3.10-3.12 (3.13 not supported due to vLLM compatibility)
- NVIDIA GPU with CUDA support (or CPU, though slower)
- 16GB+ VRAM recommended for GPU
- transformers==4.46.3
- Linux or WSL2 (recommended)
Extract text from documents using DeepSeek OCR.
By default, DeepSeek-OCR-2 is used. You can switch to OCR-1 with --ocr-version 1:
# OCR-2 (default) — better accuracy, image_size=768
bookdatamaker extract book.pdf -o ./extracted
# OCR-1 — legacy, image_size=640
bookdatamaker extract book.pdf --ocr-version 1 -o ./extracted| Feature | OCR-1 | OCR-2 |
|---|---|---|
| Model | deepseek-ai/DeepSeek-OCR |
deepseek-ai/DeepSeek-OCR-2 |
| image_size | 640 | 768 |
| flash_attention_2 | No | Yes |
| API ngram_size | 30 | 20 |
- PDF: Text extraction or OCR from rendered pages
- EPUB: E-book text extraction
- Images: JPG, PNG, BMP, TIFF, WebP
Note: DeepSeek does not provide an official OCR API. You need to self-host DeepSeek-OCR using vLLM.
Follow the vLLM DeepSeek-OCR recipe to set up your server
Once your vLLM server is running:
# Basic usage (default: http://localhost:8000/v1)
bookdatamaker extract book.pdf -o ./extracted
# Custom vLLM endpoint
bookdatamaker extract book.pdf \
--deepseek-api-url http://your-server:8000/v1 \
-o ./extracted
# Adjust concurrency for faster processing
bookdatamaker extract book.pdf \
--api-concurrency 8 \
-o ./extractedPerformance Options:
--api-concurrency N: Number of concurrent API requests (default: 4)- Higher values = faster processing (if your server can handle it)
- Adjust based on your vLLM server capacity and network bandwidth
- Example: 8-16 for powerful servers, 2-4 for smaller setups
Use local transformers model for OCR (DeepSeek-OCR, no API calls):
# Basic usage - uses transformers AutoModel with flash_attention_2
bookdatamaker extract book.pdf --mode local -o ./extracted
# With custom batch size (adjust based on GPU memory)
bookdatamaker extract book.pdf --mode local --batch-size 12 -o ./extracted
# Use CPU instead of GPU
bookdatamaker extract book.pdf --mode local --device cpu -o ./extracted
# Use specific GPU
bookdatamaker extract book.pdf --mode local --device cuda:1 -o ./extracted
# Process directory of images
bookdatamaker extract ./images/ --mode local -o ./extractedPerformance Options:
--batch-size N: Number of pages processed per local OCR batch before flushing results to disk (default: 8)- Higher values = higher throughput potential but more GPU/host memory pressure
- Lower values reduce memory spikes on long PDFs because pages are processed and persisted batch by batch
- Example: 4 for 8GB VRAM, 8-16 for 24GB+ VRAM
Resume Behavior:
- Local and document extraction now persist progress in the output directory using
.extraction_progress.json - Re-running the same command against the same output directory automatically resumes from unfinished pages
- Existing
page_XXX/result.mmdfiles are treated as completed pages and skipped - If source path or key extraction parameters change, the run is rejected to avoid mixing incompatible outputs
Device Options:
cuda(default): Use default CUDA GPUcuda:0,cuda:1, etc.: Use specific GPUcpu: Use CPU (slower, no GPU required)xpu: Use Intel XPU
For PDF with embedded text, skip OCR and extract text directly (much faster):
# Extract plain text from PDF without OCR
bookdatamaker extract book.pdf --plain-text -o ./extractedNote: EPUB files are automatically extracted as plain text (no OCR needed, no --plain-text flag required):
# EPUB always uses plain text extraction
bookdatamaker extract book.epub -o ./extractedWhen to use --plain-text (for PDF):
- ✅ PDF with embedded text (e.g., born-digital documents)
- ✅ Fast extraction without GPU/API requirements
- ✅ Text-only documents
When NOT to use --plain-text:
- ❌ Scanned PDFs (images of text)
- ❌ PDFs with complex layouts requiring OCR
- ❌ Documents where text extraction quality is poor
./extracted/
├── page_001/
│ ├── page_001.png # Page image
│ └── result.mmd # Extracted text in markdown
├── page_002/
│ ├── page_002.png
│ └── result.mmd
└── ...
Note: Each page is stored in its own subdirectory with the extracted text in result.mmd format.
Generate Q&A datasets using parallel LLM threads with page-based navigation.
The system uses page navigation:
- LLM threads navigate through document pages
- Tools available:
get_current_page,next_page,previous_page,jump_to_page,get_page_context - Each thread starts at a specific page based on distribution
- Threads can move forward/backward through pages to explore content
The generation process automatically saves checkpoints to the database:
- Thread state is saved after each successful Q&A submission
- If interrupted (Ctrl+C, crash, etc.), simply rerun the same command
- You'll be prompted to resume from checkpoint or start fresh
# First run (interrupted at 50%)
bookdatamaker generate ./extracted -d dataset.db --distribution "25,25,25,25"
# ^C (interrupted)
# Resume from checkpoint
bookdatamaker generate ./extracted -d dataset.db --distribution "25,25,25,25"
# ⚠️ Found 4 incomplete thread(s) in database:
# Thread 0: 8/20 pairs, last updated 2024-01-15 10:30:45
# Thread 1: 10/20 pairs, last updated 2024-01-15 10:30:48
# Thread 2: 12/20 pairs, last updated 2024-01-15 10:30:50
# Thread 3: 7/20 pairs, last updated 2024-01-15 10:30:43
#
# Do you want to resume from checkpoint? [Y/n]: y
# ✓ Resuming from checkpoint...Features:
- 💾 Automatic checkpoint after each Q&A pair submission
- 🔄 Resume from last position in document
- 💬 Preserves conversation history
- 🎯 Tracks progress per thread
# 6 threads (from distribution), 20 Q&A pairs per thread
bookdatamaker generate ./extracted \
-d dataset.db \
--distribution "10,10,20,30,20,10" \
--datasets-per-thread 20Key Concept: Thread count is determined by the number of comma-separated values in --distribution.
# OpenAI/Azure
bookdatamaker generate ./extracted \
-d dataset.db \
--openai-api-url https://api.openai.com/v1 \
--model gpt-4 \
--distribution "10,10,20,30,20,10"
# Custom API endpoint
bookdatamaker generate ./extracted \
--openai-api-url http://localhost:8000/v1 \
--model your-model-name \
--distribution "25,25,25,25"Use vLLM directly without API server:
# Single GPU
bookdatamaker generate ./extracted \
--mode vllm \
--vllm-model-path meta-llama/Llama-3-8B-Instruct \
--distribution "25,25,25,25" \
-d dataset.db
# Multi-GPU (4 GPUs, 6 threads)
bookdatamaker generate ./extracted \
--mode vllm \
--vllm-model-path meta-llama/Llama-3-70B-Instruct \
--tensor-parallel-size 4 \
--distribution "10,10,20,30,20,10" \
-d dataset.dbAdd specific instructions to guide LLM behavior:
# Language specification
bookdatamaker generate ./extracted \
--custom-prompt "Generate all Q&A in Chinese with simplified characters"
# Format specification
bookdatamaker generate ./extracted \
--custom-prompt "Questions should be multiple-choice with 4 options"
# Multiple requirements
bookdatamaker generate ./extracted \
--custom-prompt "Requirements:
1. Generate questions in English
2. Focus on practical applications
3. Include code examples
4. Answer length: 50-150 words
5. Difficulty: intermediate"Control conversation history to prevent token overflow:
# Limit conversation to 50 messages (keeps system prompt + last 10 when exceeded)
bookdatamaker generate ./extracted \
--max-messages 50 \
-d dataset.db
# For models with limited context windows
bookdatamaker generate ./extracted \
--max-messages 30 \
--model gpt-3.5-turboHow it works:
- When message count exceeds
--max-messages, history is pruned automatically - System prompt is always preserved
- Last 10 messages are kept for continuity
- Prevents token overflow errors during long generation sessions
- Useful for models with limited context windows (e.g., 4K, 8K tokens)
Export from SQLite database to your preferred format:
# Parquet (recommended for data analysis, default: zstd compression)
bookdatamaker export-dataset dataset.db -o output.parquet
# Parquet with different compression methods
bookdatamaker export-dataset dataset.db -o output.parquet -c snappy # Faster, larger files
bookdatamaker export-dataset dataset.db -o output.parquet -c gzip # Smaller, slower
bookdatamaker export-dataset dataset.db -o output.parquet -c brotli # Best compression
bookdatamaker export-dataset dataset.db -o output.parquet -c none # No compression
# JSON Lines (easy to stream)
bookdatamaker export-dataset dataset.db -o output.jsonl -f jsonl
# CSV (Excel-friendly)
bookdatamaker export-dataset dataset.db -o output.csv -f csv
# JSON with metadata
bookdatamaker export-dataset dataset.db -o output.json -f json --include-metadataFor Parquet files:
| Method | Speed | Size | Use Case |
|---|---|---|---|
zstd (default) |
Fast | Small | Best balance, recommended |
snappy |
Fastest | Larger | Real-time processing |
gzip |
Medium | Smaller | Network transfer |
brotli |
Slowest | Smallest | Archival storage |
none |
Instant | Largest | Debug/testing only |
Control where threads start in the document using distribution percentages.
Document: 100 pages
Distribution: "10,10,20,30,20,10" (6 threads)
Thread 0: Start at 0% → Page 1
Thread 1: Start at 10% → Page 10
Thread 2: Start at 20% → Page 20
Thread 3: Start at 50% → Page 50
Thread 4: Start at 70% → Page 70
Thread 5: Start at 80% → Page 80
# Even distribution (4 threads)
--distribution "25,25,25,25"
# Start at: 0%, 25%, 50%, 75%
# Front-heavy (4 threads) - focus on beginning
--distribution "40,30,20,10"
# Start at: 0%, 40%, 70%, 90%
# Middle-heavy (5 threads) - focus on middle
--distribution "10,20,40,20,10"
# Start at: 0%, 10%, 30%, 70%, 90%
# Dense sampling (10 threads) - fine-grained coverage
--distribution "10,10,10,10,10,10,10,10,10,10"- Small documents (<50 pages): 2-4 threads
- Medium documents (50-200 pages): 4-8 threads
- Large documents (>200 pages): 8-16 threads
Optimize extraction and generation speeds based on your hardware and requirements.
API Mode (vLLM):
# Increase concurrent requests (default: 4)
bookdatamaker extract book.pdf --api-concurrency 8
# Guidelines:
# - 2-4: Small vLLM server (1-2 GPUs)
# - 4-8: Medium server (2-4 GPUs)
# - 8-16: Large server (4+ GPUs)
# - Monitor server load and adjust accordinglyLocal Mode (Transformers):
# Increase batch size (default: 8)
bookdatamaker extract book.pdf --mode local --batch-size 16
# Guidelines based on GPU VRAM:
# - 8GB VRAM: batch-size 2-4
# - 16GB VRAM: batch-size 4-8
# - 24GB VRAM: batch-size 8-12
# - 40GB+ VRAM: batch-size 12-16Thread Count:
# More threads = faster generation (if LLM server can handle it)
bookdatamaker generate ./extracted \
--distribution "10,10,10,10,10,10,10,10,10,10" \
--threads 10
# Guidelines:
# - API mode: 4-16 threads (based on rate limits)
# - vLLM mode: 4-8 threads (based on GPU capacity)
# - Local mode: 2-4 threads (memory intensive)Message History Management:
# Limit conversation history to prevent memory issues
bookdatamaker generate ./extracted \
--max-messages 20 \
-d dataset.db
# Default: 20 messages (system message + last 10 exchanges)
# Lower values = less memory, potentially less context
# Higher values = more memory, better context retentionDuplicate Detection:
- Automatically enabled with 95% similarity threshold
- Uses rapidfuzz for efficient fuzzy matching
- Prevents redundant Q&A pairs in the dataset
- Start Small: Test with small concurrency/batch sizes first
- Monitor Resources: Watch GPU memory, CPU usage, and network
- Balance Quality vs Speed: Higher concurrency may reduce quality
- Network Bandwidth: API mode performance depends on network speed
- vLLM Configuration: Use tensor parallelism for multi-GPU setups
Chat with an LLM that can access your document through MCP tools. Perfect for exploring documents interactively or testing Q&A generation.
# Basic chat with GPT-4
bookdatamaker chat ./extracted
# With vLLM server
bookdatamaker chat ./extracted \
--openai-api-url http://localhost:8000/v1 \
--model Qwen/Qwen3-4B-Thinking-2507
# With custom database
bookdatamaker chat ./extracted --db my_dataset.dbSet environment variable for verbose logging:
export LOG_LEVEL=DEBUG
bookdatamaker generate ./extracted -d dataset.dbbookdatamaker/
├── src/bookdatamaker/
│ ├── cli.py # CLI interface
│ ├── ocr/
│ │ ├── extractor.py # OCR extraction
│ │ └── document_parser.py # Document parsing
│ ├── mcp/
│ │ └── server.py # MCP server
│ ├── llm/
│ │ └── parallel_generator.py # Parallel generation
│ ├── dataset/
│ │ ├── builder.py # Dataset building
│ │ └── dataset_manager.py # SQLite management
│ └── utils/
│ ├── page_manager.py # Page navigation
│ └── status.py # Progress indicators
└── tests/ # Test files
# Clone repository
git clone https://github.com/yourusername/bookdatamaker.git
cd bookdatamaker
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Code formatting
black src/
ruff check src/
# Type checking
mypy src/Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new features
- Ensure all tests pass
- Submit a pull request
# Run all tests
pytest
# Run specific test file
pytest tests/test_ocr.py
# Run with coverage
pytest --cov=bookdatamaker tests/MIT License - see LICENSE file for details.