refactor(rag): Replace OCR with OpenAI Vision API to reduce Docker image size #49
📝 Walkthrough
This PR introduces a Vision API-based document processing system as a replacement for legacy OCR dependencies. It removes tesseract and poppler from the Dockerfile and requirements.txt, adds Vision configuration fields (model, concurrency limits, DPI settings, prompts) to the Settings class, and creates a new […]
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Actionable comments posted: 7
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro (Legacy)
📒 Files selected for processing (9)
services/rag/Dockerfile
services/rag/app/config.py
services/rag/app/services/cognee/service.py
services/rag/app/services/vision/__init__.py
services/rag/app/services/vision/image_extractor.py
services/rag/app/services/vision/openai_client.py
services/rag/app/services/vision/pdf_extractor.py
services/rag/app/services/vision/processor.py
services/rag/requirements.txt
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-19T04:29:46.183Z
Learnt from: larryro
Repo: tale-project/tale PR: 26
File: services/rag/Dockerfile:10-20
Timestamp: 2025-12-19T04:29:46.183Z
Learning: Do not pin apt package versions in Dockerfiles within the tale-project/tale repository (e.g., services/rag/Dockerfile). Rely on regularly updated base images (like python:3.11-slim) and unpinned apt packages (curl, build-essential, libpq-dev) so that security updates and compatibility are handled via base image refresh and CI/CD caching. This reduces maintenance burden; verify through CI pipelines and ensure reproducibility comes from image rebuilds rather than manual pinning.
Applied to files:
services/rag/Dockerfile
🧬 Code graph analysis (6)
services/rag/app/services/vision/image_extractor.py (1)
services/rag/app/services/vision/openai_client.py (2)
ocr_image (50-108), describe_image (110-163)
services/rag/app/services/vision/openai_client.py (1)
services/rag/app/config.py (2)
get_llm_config (100-157), get_vision_model (159-171)
services/rag/app/services/cognee/service.py (1)
services/rag/app/services/vision/processor.py (2)
extract_text_from_document (48-104), is_vision_supported (22-32)
services/rag/app/services/vision/__init__.py (1)
services/rag/app/services/vision/processor.py (4)
extract_text_from_bytes (107-155), extract_text_from_document (48-104), is_passthrough_type (35-45), is_vision_supported (22-32)
services/rag/app/services/vision/pdf_extractor.py (1)
services/rag/app/services/vision/openai_client.py (2)
ocr_image (50-108), describe_image (110-163)
services/rag/app/services/vision/processor.py (2)
services/rag/app/services/vision/image_extractor.py (2)
extract_text_from_image (23-46), extract_text_from_image_bytes (49-92)
services/rag/app/services/vision/pdf_extractor.py (2)
extract_text_from_pdf (29-127), extract_text_from_pdf_bytes (225-311)
🪛 Ruff (0.14.10)
services/rag/app/services/vision/image_extractor.py
39-39: Avoid specifying long messages outside the exception class
(TRY003)
76-76: Do not catch blind exception: Exception
(BLE001)
services/rag/app/services/vision/openai_client.py
104-104: Consider moving this statement to an else block
(TRY300)
services/rag/app/services/vision/__init__.py
19-24: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
services/rag/app/services/vision/pdf_extractor.py
52-52: Avoid specifying long messages outside the exception class
(TRY003)
113-113: Do not catch blind exception: Exception
(BLE001)
156-156: Do not catch blind exception: Exception
(BLE001)
212-212: Do not catch blind exception: Exception
(BLE001)
219-219: Do not catch blind exception: Exception
(BLE001)
297-297: Do not catch blind exception: Exception
(BLE001)
services/rag/app/services/vision/processor.py
79-79: Avoid specifying long messages outside the exception class
(TRY003)
104-104: Avoid specifying long messages outside the exception class
(TRY003)
155-155: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (10)
services/rag/Dockerfile (1)
9-20: Excellent minimal dependencies approach. PyMuPDF handles PDF-to-image rendering without requiring poppler or tesseract.
The removal of OCR system dependencies and retention of only essential packages aligns with the Vision API integration. PyMuPDF's `get_pixmap()` with DPI matrix scaling (150 DPI per config) produces PNG images purely in Python without external tools. Comments are clear and well-documented. No version pinning is correct per repository practice.
services/rag/requirements.txt (1)
17-22: LGTM!
The dependency changes align well with the PR objectives. Removing the `unstructured[pdf]` extra and adding `PyMuPDF>=1.24.0` is a sound approach for reducing Docker image size while maintaining PDF processing capabilities through the Vision API.
services/rag/app/config.py (2)
69-79: LGTM!
The Vision API configuration fields are well-documented and follow the existing patterns in the file. The defaults (5 concurrent pages, 150 DPI) are sensible for balancing API costs and quality.
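For context on the 150 DPI default: PDF user space is defined at 72 units per inch, so a target DPI corresponds to a zoom factor of `dpi / 72`. The sketch below shows that arithmetic and how it could feed PyMuPDF's `Page.get_pixmap(matrix=...)`. The helper names `dpi_to_zoom` and `render_page_png` are hypothetical and not taken from this PR; the actual rendering code may be structured differently.

```python
def dpi_to_zoom(dpi: int) -> float:
    """Convert a target DPI into a scale factor relative to PDF's 72 units per inch."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / 72.0


def render_page_png(pdf_path: str, page_index: int = 0, dpi: int = 150) -> bytes:
    """Render one PDF page to PNG bytes (hypothetical helper, assumes PyMuPDF is installed)."""
    # Imported lazily so dpi_to_zoom stays usable without PyMuPDF.
    import fitz  # PyMuPDF

    zoom = dpi_to_zoom(dpi)
    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_index)
        # The same zoom is applied on both axes to keep the aspect ratio.
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        return pix.tobytes("png")
```

At 150 DPI the zoom factor is roughly 2.08, so each rendered page has about 4.3 times as many pixels as a 72 DPI render. That is the main lever trading OCR legibility against image payload size.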
159-171: LGTM!
The `get_vision_model()` method follows the same priority pattern as `get_llm_config()`, providing consistent configuration resolution across the codebase.
services/rag/app/services/vision/__init__.py (1)
1-24: LGTM!
Clean module initialization with a well-documented public API. The hybrid processing approach is clearly explained in the docstring.
The `__all__` list could be sorted alphabetically per linting suggestion (RUF022), but the current logical grouping is also reasonable.
services/rag/app/services/cognee/service.py (1)
220-229: LGTM!
Good use of a `finally` block to ensure temp file cleanup regardless of success or failure. The logging levels (debug for success, warning for failure) are appropriate.
services/rag/app/services/vision/openai_client.py (1)
32-48: LGTM!
Good use of lazy initialization for the OpenAI client. The 120-second timeout is appropriate for Vision API calls, which can be slower than text completion.
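To illustrate the lazy-initialization pattern praised here, the following is a generic sketch rather than the PR's code. `LazyClient` is a hypothetical wrapper; the factory passed to it stands in for whatever constructs the real client.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyClient(Generic[T]):
    """Build an expensive client on first use and reuse it afterwards (hypothetical helper)."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._instance: Optional[T] = None

    def get(self) -> T:
        # Construction is deferred until the first call, so importing the
        # module never requires credentials or network configuration.
        if self._instance is None:
            self._instance = self._factory()
        return self._instance
```

In a setup like this one, the factory might be something along the lines of `lambda: AsyncOpenAI(timeout=120.0)`, assuming the async client from the `openai` package is what the module uses. Note that this sketch is not thread-safe; concurrent first calls from multiple threads could construct the client twice.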
services/rag/app/services/vision/processor.py (3)
16-19: LGTM!
Clear constant definitions for supported file types. The separation between Vision-processed types and passthrough types is well-organized.
48-104: LGTM!
The routing logic is clean with appropriate logging at each branch. File existence validation before processing is good defensive programming.
107-155: LGTM!
Good use of deferred imports to avoid potential circular import issues. The bytes-based processing mirrors the file-based logic appropriately.
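A rough sketch of the kind of extension-based routing these comments describe. The function names `is_vision_supported` and `is_passthrough_type` come from the code graph above, but their signatures, the extension sets, and the `route` helper are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed extension sets; the PR's actual constants are not shown on this page.
VISION_EXTENSIONS = frozenset({".pdf", ".png", ".jpg", ".jpeg"})
PASSTHROUGH_EXTENSIONS = frozenset({".txt", ".md"})


def is_vision_supported(filename: str) -> bool:
    """True if the file should be pre-processed through the vision pipeline."""
    return Path(filename).suffix.lower() in VISION_EXTENSIONS


def is_passthrough_type(filename: str) -> bool:
    """True if the file can be handed to ingestion unchanged."""
    return Path(filename).suffix.lower() in PASSTHROUGH_EXTENSIONS


def route(filename: str) -> str:
    """Pick a processing branch for a file (hypothetical helper)."""
    if is_vision_supported(filename):
        return "vision"
    if is_passthrough_type(filename):
        return "passthrough"
    raise ValueError(f"Unsupported file type: {filename}")
```

Lowercasing the suffix keeps `scan.PDF` and `scan.pdf` on the same branch, which is an easy case to miss when the sets are written in lowercase only.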
…image

Replace self-hosted Tesseract OCR with OpenAI Vision API to significantly reduce Docker image size from ~9GB to ~2GB.

Changes:
- Add vision module with smart hybrid PDF processing:
  - Direct text extraction via PyMuPDF for digital PDFs
  - Vision API OCR for scanned/low-text pages
  - Vision API descriptions for embedded images
- Remove heavy system deps (tesseract, poppler, OpenCV libs)
- Remove unstructured[pdf] extra, add PyMuPDF
- Add Vision API configuration (model, DPI, concurrency)
- Integrate vision pre-processor into Cognee service

Benefits:
- Docker image reduced from 9.37GB to ~2GB
- Lower API costs (only scanned pages use Vision API)
- Faster processing (direct text extraction is instant)
- Simpler maintenance (fewer system dependencies)
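The hybrid approach in this commit message can be sketched as a per-page decision: use the directly extracted text when there is enough of it, and fall back to the Vision API otherwise. The helper names and the 50-character threshold below are assumptions for illustration, not values from the PR.

```python
def needs_vision(direct_text: str, min_chars: int = 50) -> bool:
    """Treat a page as scanned/low-text when direct extraction yields too little text.

    The min_chars default is a hypothetical threshold.
    """
    return len(direct_text.strip()) < min_chars


def plan_pages(page_texts: list[str]) -> list[str]:
    """Label each page with the branch it would take (hypothetical helper)."""
    return ["vision" if needs_vision(text) else "direct" for text in page_texts]
```

Under this scheme a fully digital PDF produces no Vision API calls at all, which is where the cost and latency benefits listed above would come from.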
40904d0 to 63aa276
…eter The docstring claimed the content parameter accepts "file path or text content", but the implementation only handles file paths. All current callers pass file paths after creating temporary files for raw text. Updated docstring to accurately reflect the actual behavior. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replaced magic number 20 with MIN_OCR_TEXT_LENGTH constant for better readability and easier adjustment of the meaningful text threshold. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When both OCR and description fail for an image, the code now returns an empty result instead of propagating the exception. This allows document ingestion to continue even if problematic images are encountered. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
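A minimal sketch of how the threshold and the graceful fallback from these two commits might fit together. `ocr_image` and `describe_image` are names from the code graph, but here they are injected callables with assumed signatures, and the OCR-then-description ordering is an assumption about the flow rather than a confirmed detail.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)

MIN_OCR_TEXT_LENGTH = 20  # value taken from the commit message above

ImageCall = Callable[[bytes], Awaitable[str]]


async def extract_image_text(
    image: bytes, ocr_image: ImageCall, describe_image: ImageCall
) -> str:
    """Try OCR first, then a description, and return "" if both fail."""
    try:
        text = (await ocr_image(image)).strip()
        if len(text) >= MIN_OCR_TEXT_LENGTH:
            return text
    except Exception:  # broad on purpose: one bad image should not stop ingestion
        logger.warning("OCR failed; falling back to description", exc_info=True)

    try:
        return (await describe_image(image)).strip()
    except Exception:
        logger.warning("Description failed; returning empty result", exc_info=True)
        return ""
```

Returning an empty string keeps the caller's contract simple, but it also hides failures from anything that does not read the logs, so a counter or metric on the fallback path may be worth considering.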
Added _detect_mime_type helper function to detect the actual image format from bytes. This ensures correct MIME types are used in data URLs when sending images to the Vision API, improving compatibility with strict OpenAI-compatible providers. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
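One way a helper like `_detect_mime_type` could work is by checking well-known magic bytes. The sketch below is an illustration only: the set of formats, the PNG fallback, and the `to_data_url` helper are assumptions, and the PR's implementation may differ.

```python
import base64


def detect_mime_type(data: bytes, default: str = "image/png") -> str:
    """Guess an image MIME type from its leading bytes (hypothetical sketch)."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return default


def to_data_url(data: bytes) -> str:
    """Build a base64 data URL using the detected MIME type."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{detect_mime_type(data)};base64,{encoded}"
```

Sniffing the bytes rather than trusting a file extension matters for images embedded in PDFs, which arrive without any filename at all.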
…eaks Replaced manual fitz.open()/doc.close() with context manager pattern (with fitz.open() as doc:) in both extract_text_from_pdf and extract_text_from_pdf_bytes functions. This ensures PDF document handles are properly closed even if exceptions occur during processing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changed process_page to return a tuple (text, vision_used) instead of mutating a nonlocal variable. The vision_used flag is now aggregated from all page results after asyncio.gather completes. This makes the pattern more robust and explicit. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
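The pattern described in this commit can be sketched as follows. This is not the PR's code: pages are plain strings here, the Vision call is replaced by a stand-in, and the semaphore limit of 5 simply mirrors the `vision_max_concurrent_pages` default mentioned elsewhere on this page.

```python
import asyncio


async def process_pages(
    pages: list[str], max_concurrent: int = 5
) -> tuple[list[str], bool]:
    """Process pages concurrently and report whether any page needed vision."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_page(raw: str) -> tuple[str, bool]:
        # Each task returns its own flag instead of mutating shared state.
        async with semaphore:
            if raw.strip():
                return raw.strip(), False
            await asyncio.sleep(0)  # stand-in for a Vision API call
            return "[ocr text]", True

    # gather preserves input order, so results line up with pages.
    results = await asyncio.gather(*(process_page(p) for p in pages))
    texts = [text for text, _ in results]
    vision_used = any(used for _, used in results)
    return texts, vision_used
```

Aggregating after `asyncio.gather` keeps each coroutine free of side effects, which makes the page function easier to test in isolation than a version that writes to a `nonlocal` flag.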
Moved the common page-processing logic from extract_text_from_pdf and extract_text_from_pdf_bytes into a shared _process_pdf_document helper. This eliminates code duplication and makes the codebase easier to maintain. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…prefix Models that already have a provider prefix (e.g. 'moonshotai/kimi-k2') were incorrectly getting an additional 'openai/' prefix added. Now checks if model string already contains a '/' before adding the prefix. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
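The check described here is small enough to show in full. The function name and the `default_provider` parameter are hypothetical; only the rule itself (skip the prefix when the model string already contains a `/`) comes from the commit message.

```python
def with_provider_prefix(model: str, default_provider: str = "openai") -> str:
    """Add a provider prefix only when the model name does not already have one."""
    if "/" in model:
        return model
    return f"{default_provider}/{model}"
```

With this rule, `moonshotai/kimi-k2` passes through unchanged, while a bare model name gains the `openai/` prefix.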
Replace timescaledb-ha with timescaledb:2.24.0-pg16 (Alpine-based) for a lighter footprint. Update package manager from apt-get to apk and adjust data directory path accordingly. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tenacity-based retry with exponential backoff (3 attempts, 2-30s wait) for cognee.cognify() to handle transient network/API errors. Safe to retry since incremental_loading=True skips already-processed items. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
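For readers unfamiliar with the pattern, here is a standard-library sketch of retry with exponential backoff using the parameters from this commit (3 attempts, waits bounded between 2 and 30 seconds). It deliberately does not use tenacity and does not reproduce tenacity's exact timing; `retry_async` and its parameters are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    func: Callable[[], Awaitable[T]],
    attempts: int = 3,
    min_wait: float = 2.0,
    max_wait: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call func until it succeeds or attempts run out, doubling the wait each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except Exception:
            # Re-raise the last error so callers still see the real failure.
            if attempt == attempts:
                raise
            delay = min(max_wait, min_wait * 2 ** (attempt - 1))
            await sleep(delay)
    raise AssertionError("unreachable")
```

Retrying is only safe when the wrapped operation is idempotent, which is the point the commit message makes about `incremental_loading=True`. A production version would also narrow the `except` clause to error types known to be transient.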
…age size (#49) Co-authored-by: Claude <noreply@anthropic.com>
Summary
Changes
New Vision Module (services/rag/app/services/vision/)
Dockerfile Optimizations
- Removed: tesseract-ocr, poppler-utils, OpenCV libs (~600MB)
- Kept: curl, build-essential, libpq-dev, libmagic1
Dependency Changes
- Removed: unstructured[pdf] extra (eliminates unstructured-inference, ONNX models ~2-3GB)
- Added: PyMuPDF>=1.24.0 (17-24MB wheel, no system deps)
Configuration
New settings in config.py:
- openai_vision_model - Vision model (default: qwen/qwen3-vl-32b-instruct)
- vision_max_concurrent_pages - Concurrency limit (default: 5)
- vision_pdf_dpi - PDF render quality (default: 150)
- vision_extraction_prompt - Custom OCR prompt (optional)
Test plan
Summary by CodeRabbit
Release Notes
New Features
Chores