refactor(rag): Replace OCR with OpenAI Vision API to reduce Docker image size by larryro · Pull Request #49 · tale-project/tale

larryro · 2025-12-30T10:17:59Z

Summary

- Replace self-hosted Tesseract OCR with OpenAI Vision API for PDF/image processing
- Reduce RAG Docker image from 9.37GB to ~2GB by removing heavy OCR dependencies
- Implement smart hybrid approach that minimizes API costs while handling all document types

Changes

New Vision Module (`services/rag/app/services/vision/`)

- Smart PDF processing: Extract text directly via PyMuPDF for digital PDFs; use the Vision API only for scanned pages (pages yielding fewer than 50 extracted characters)
- Embedded image descriptions: Extract and describe images within PDFs for better indexing
- Direct image file support: Handle PNG, JPG, and other image formats via Vision API
- Concurrent processing: Parallel Vision API calls limited by a semaphore (default: 5)
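
The hybrid routing and the concurrency limit described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `extract_pages`, `ocr_page`, and the constant names are hypothetical, and the page texts are passed in as plain strings instead of being read with PyMuPDF.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

# Assumed names; the values mirror the 50-character threshold and the
# default limit of 5 mentioned in the PR description.
SCANNED_PAGE_THRESHOLD = 50
MAX_CONCURRENT_PAGES = 5


async def extract_pages(
    page_texts: Sequence[str],
    ocr_page: Callable[[int], Awaitable[str]],
    max_concurrent: int = MAX_CONCURRENT_PAGES,
) -> List[str]:
    """Keep directly extracted text; send only low-text pages to OCR."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process(index: int, text: str) -> str:
        # Digital page: enough text was extracted, so no API call is needed.
        if len(text.strip()) >= SCANNED_PAGE_THRESHOLD:
            return text
        # Scanned page: the semaphore caps how many OCR calls run at once.
        async with semaphore:
            return await ocr_page(index)

    tasks = (process(i, t) for i, t in enumerate(page_texts))
    return list(await asyncio.gather(*tasks))
```

Because `asyncio.gather` preserves input order, the returned list lines up with the original page order even when OCR calls finish out of order.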

Dockerfile Optimizations

- Remove heavy system deps: tesseract-ocr, poppler-utils, OpenCV libs (~600MB)
- Keep minimal deps: curl, build-essential, libpq-dev, libmagic1

Dependency Changes

- Remove unstructured[pdf] extra (eliminates unstructured-inference, ONNX models ~2-3GB)
- Add PyMuPDF>=1.24.0 (17-24MB wheel, no system deps)

Configuration

New settings in config.py:

- openai_vision_model - Vision model (default: qwen/qwen3-vl-32b-instruct)
- vision_max_concurrent_pages - Concurrency limit (default: 5)
- vision_pdf_dpi - PDF render quality (default: 150)
- vision_extraction_prompt - Custom OCR prompt (optional)
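
For orientation, the four settings could be modelled like this. The field names and defaults come from the list above; the `VisionSettings` class, the `load_vision_settings` helper, and the upper-case environment variable names are assumptions made for this sketch, and the real `config.py` may be structured differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VisionSettings:
    # Field names and defaults follow the PR description.
    openai_vision_model: str = "qwen/qwen3-vl-32b-instruct"
    vision_max_concurrent_pages: int = 5
    vision_pdf_dpi: int = 150
    vision_extraction_prompt: Optional[str] = None


def load_vision_settings(env: Mapping[str, str] = os.environ) -> VisionSettings:
    """Build settings from a mapping; env var names here are hypothetical."""
    d = VisionSettings()
    return VisionSettings(
        openai_vision_model=env.get("OPENAI_VISION_MODEL", d.openai_vision_model),
        vision_max_concurrent_pages=int(
            env.get("VISION_MAX_CONCURRENT_PAGES", d.vision_max_concurrent_pages)
        ),
        vision_pdf_dpi=int(env.get("VISION_PDF_DPI", d.vision_pdf_dpi)),
        # An empty or missing prompt falls back to None (use the built-in prompt).
        vision_extraction_prompt=env.get("VISION_EXTRACTION_PROMPT") or None,
    )
```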

Test plan

- Build Docker image and verify size is ~2GB
- Test digital PDF ingestion (should use direct extraction, no API calls)
- Test scanned PDF ingestion (should use Vision API for OCR)
- Test PDF with embedded images (should extract and describe images)
- Test direct image file upload (PNG, JPG)
- Test DOCX/PPTX/XLSX passthrough (should still work via unstructured)
- Verify Vision API error handling and temp file cleanup

Summary by CodeRabbit

Release Notes

New Features
- Integrated Vision API for advanced PDF and image text extraction
- Added image description capabilities via Vision API
- Added configurable Vision API settings (model, DPI, concurrent page limits)
Chores
- Optimized dependencies by removing heavy OCR system packages
- Replaced legacy PDF extraction with lightweight, API-driven approach


coderabbitai · 2025-12-30T10:21:38Z

📝 Walkthrough

This PR introduces a Vision API-based document processing system as a replacement for legacy OCR dependencies. It removes tesseract and poppler from the Dockerfile and requirements.txt, adds Vision configuration fields (model, concurrency limits, DPI settings, prompts) to the Settings class, and creates a new vision service module with PDF and image text extraction capabilities using PyMuPDF and OpenAI Vision API. The document ingestion flow in cognee/service.py is updated to pre-process PDFs and images through this Vision pipeline before routing to storage.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

refactor(rag): Replace self-hosted OCR with OpenAI Vision API to reduce Docker image size #43: Directly addresses the implementation of Vision API-based document processing with exact code-level changes including Dockerfile updates, configuration additions, and the vision service module integration.

Possibly related PRs

feat(rag): optimize PDF parsing with OCR and improve chunk settings #34: Modifies the RAG PDF ingestion flow and updates Dockerfile/OCR dependencies in a related refactoring context.
tale-project/poc2#393: Updates RAG service configuration and dependencies including Vision-related settings and package requirements.


coderabbitai

Actionable comments posted: 7

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between d4adc04 and 40904d0.

📒 Files selected for processing (9)

services/rag/Dockerfile
services/rag/app/config.py
services/rag/app/services/cognee/service.py
services/rag/app/services/vision/__init__.py
services/rag/app/services/vision/image_extractor.py
services/rag/app/services/vision/openai_client.py
services/rag/app/services/vision/pdf_extractor.py
services/rag/app/services/vision/processor.py
services/rag/requirements.txt

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-12-19T04:29:46.183Z

Learnt from: larryro
Repo: tale-project/tale PR: 26
File: services/rag/Dockerfile:10-20
Timestamp: 2025-12-19T04:29:46.183Z
Learning: Do not pin apt package versions in Dockerfiles within the tale-project/tale repository (e.g., services/rag/Dockerfile). Rely on regularly updated base images (like python:3.11-slim) and unpinned apt packages (curl, build-essential, libpq-dev) so that security updates and compatibility are handled via base image refresh and CI/CD caching. This reduces maintenance burden; verify through CI pipelines and ensure reproducibility comes from image rebuilds rather than manual pinning.

Applied to files:

services/rag/Dockerfile

🧬 Code graph analysis (6)

services/rag/app/services/vision/image_extractor.py (1)

services/rag/app/services/vision/openai_client.py (2)

ocr_image (50-108)

describe_image (110-163)

services/rag/app/services/vision/openai_client.py (1)

services/rag/app/config.py (2)

get_llm_config (100-157)

get_vision_model (159-171)

services/rag/app/services/cognee/service.py (1)

services/rag/app/services/vision/processor.py (2)

extract_text_from_document (48-104)

is_vision_supported (22-32)

services/rag/app/services/vision/__init__.py (1)

services/rag/app/services/vision/processor.py (4)

extract_text_from_bytes (107-155)

extract_text_from_document (48-104)

is_passthrough_type (35-45)

is_vision_supported (22-32)

services/rag/app/services/vision/pdf_extractor.py (1)

services/rag/app/services/vision/openai_client.py (2)

ocr_image (50-108)

describe_image (110-163)

services/rag/app/services/vision/processor.py (2)

services/rag/app/services/vision/image_extractor.py (2)

extract_text_from_image (23-46)

extract_text_from_image_bytes (49-92)

services/rag/app/services/vision/pdf_extractor.py (2)

extract_text_from_pdf (29-127)

extract_text_from_pdf_bytes (225-311)

🪛 Ruff (0.14.10)

services/rag/app/services/vision/image_extractor.py

39-39: Avoid specifying long messages outside the exception class

(TRY003)

76-76: Do not catch blind exception: Exception

(BLE001)

services/rag/app/services/vision/openai_client.py

104-104: Consider moving this statement to an else block

(TRY300)

services/rag/app/services/vision/__init__.py

19-24: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

services/rag/app/services/vision/pdf_extractor.py

52-52: Avoid specifying long messages outside the exception class

(TRY003)

113-113: Do not catch blind exception: Exception

(BLE001)

156-156: Do not catch blind exception: Exception

(BLE001)

212-212: Do not catch blind exception: Exception

(BLE001)

219-219: Do not catch blind exception: Exception

(BLE001)

297-297: Do not catch blind exception: Exception

(BLE001)

services/rag/app/services/vision/processor.py

79-79: Avoid specifying long messages outside the exception class

(TRY003)

104-104: Avoid specifying long messages outside the exception class

(TRY003)

155-155: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (10)

services/rag/Dockerfile (1)

9-20: Excellent minimal dependencies approach. PyMuPDF handles PDF-to-image rendering without requiring poppler or tesseract.

The removal of OCR system dependencies and retention of only essential packages aligns perfectly with the Vision API integration. PyMuPDF's get_pixmap() with DPI matrix scaling (150 DPI per config) produces PNG images purely in Python without external tools. Comments are clear and well-documented. No version pinning is correct per repository practice.
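
As a rough illustration of the DPI matrix scaling mentioned in this comment: PDF coordinates are in points (72 per inch), so a target DPI maps to a zoom factor of `dpi / 72`. The helper names below are hypothetical; the PyMuPDF calls (`fitz.open`, `fitz.Matrix`, `Page.get_pixmap`, `Pixmap.tobytes`) are standard, but the PR's actual rendering code is not shown here and may differ.

```python
PDF_POINTS_PER_INCH = 72


def dpi_to_zoom(dpi: int) -> float:
    """Convert a target DPI into a PyMuPDF zoom factor (150 DPI -> ~2.08)."""
    return dpi / PDF_POINTS_PER_INCH


def render_page_png(pdf_path: str, page_index: int, dpi: int = 150) -> bytes:
    """Render one PDF page to PNG bytes without poppler or other system tools."""
    import fitz  # PyMuPDF; imported lazily so dpi_to_zoom has no dependency

    zoom = dpi_to_zoom(dpi)
    with fitz.open(pdf_path) as doc:  # context manager closes the document
        page = doc.load_page(page_index)
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        return pix.tobytes("png")
```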

services/rag/requirements.txt (1)

17-22: LGTM!

The dependency changes align well with the PR objectives. Removing the pdf extra from unstructured and adding PyMuPDF>=1.24.0 is a sound approach for reducing Docker image size while maintaining PDF processing capabilities through the Vision API.

services/rag/app/config.py (2)

69-79: LGTM!

The Vision API configuration fields are well-documented and follow the existing patterns in the file. The defaults (5 concurrent pages, 150 DPI) are sensible for balancing API costs and quality.

159-171: LGTM!

The get_vision_model() method follows the same priority pattern as get_llm_config(), providing consistent configuration resolution across the codebase.

services/rag/app/services/vision/__init__.py (1)

1-24: LGTM!

Clean module initialization with a well-documented public API. The hybrid processing approach is clearly explained in the docstring.

The __all__ list could be sorted alphabetically per linting suggestion (RUF022), but the current logical grouping is also reasonable.

services/rag/app/services/cognee/service.py (1)

220-229: LGTM!

Good use of a finally block to ensure temp file cleanup regardless of success or failure. The logging levels (debug for success, warning for failure) are appropriate.
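
The cleanup pattern praised here looks roughly like the following sketch. `process_via_temp_file` and `handler` are hypothetical names for illustration; the real code lives in `cognee/service.py` and is not reproduced here.

```python
import logging
import os
import tempfile
from typing import Callable

logger = logging.getLogger(__name__)


def process_via_temp_file(
    data: bytes, suffix: str, handler: Callable[[str], str]
) -> str:
    """Write data to a temp file, run handler on its path, always clean up."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return handler(path)
    finally:
        # Runs on success and on failure, so the temp file never leaks.
        try:
            os.unlink(path)
            logger.debug("Removed temp file %s", path)
        except OSError as exc:
            logger.warning("Could not remove temp file %s: %s", path, exc)
```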

services/rag/app/services/vision/openai_client.py (1)

32-48: LGTM!

Good use of lazy initialization for the OpenAI client. The 120-second timeout is appropriate for Vision API calls which can be slower than text completion.

services/rag/app/services/vision/processor.py (3)

16-19: LGTM!

Clear constant definitions for supported file types. The separation between Vision-processed types and passthrough types is well-organized.

48-104: LGTM!

The routing logic is clean with appropriate logging at each branch. File existence validation before processing is good defensive programming.
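
A minimal sketch of extension-based routing of this kind is shown below. The function names `is_vision_supported` and `is_passthrough_type` appear in the code graph above, but their bodies, the extension sets, and the `route` helper are assumptions based on the file types named in the PR description (PDF, PNG, JPG; DOCX, PPTX, XLSX).

```python
from pathlib import Path

# Assumed extension sets; the real constants in processor.py may list more types.
VISION_EXTENSIONS = frozenset({".pdf", ".png", ".jpg", ".jpeg"})
PASSTHROUGH_EXTENSIONS = frozenset({".docx", ".pptx", ".xlsx"})


def is_vision_supported(filename: str) -> bool:
    return Path(filename).suffix.lower() in VISION_EXTENSIONS


def is_passthrough_type(filename: str) -> bool:
    return Path(filename).suffix.lower() in PASSTHROUGH_EXTENSIONS


def route(filename: str) -> str:
    """Return which pipeline would handle the file (hypothetical helper)."""
    if is_vision_supported(filename):
        return "vision"
    if is_passthrough_type(filename):
        return "passthrough"
    raise ValueError(f"Unsupported file type: {filename}")
```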

107-155: LGTM!

Good use of deferred imports to avoid potential circular import issues. The bytes-based processing mirrors the file-based logic appropriately.

…image

Replace self-hosted Tesseract OCR with OpenAI Vision API to significantly reduce Docker image size from ~9GB to ~2GB.

Changes:
- Add vision module with smart hybrid PDF processing:
  - Direct text extraction via PyMuPDF for digital PDFs
  - Vision API OCR for scanned/low-text pages
  - Vision API descriptions for embedded images
- Remove heavy system deps (tesseract, poppler, OpenCV libs)
- Remove unstructured[pdf] extra, add PyMuPDF
- Add Vision API configuration (model, DPI, concurrency)
- Integrate vision pre-processor into Cognee service

Benefits:
- Docker image reduced from 9.37GB to ~2GB
- Lower API costs (only scanned pages use Vision API)
- Faster processing (direct text extraction is instant)
- Simpler maintenance (fewer system dependencies)

…eter The docstring claimed the content parameter accepts "file path or text content", but the implementation only handles file paths. All current callers pass file paths after creating temporary files for raw text. Updated docstring to accurately reflect the actual behavior. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replaced magic number 20 with MIN_OCR_TEXT_LENGTH constant for better readability and easier adjustment of the meaningful text threshold. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

When both OCR and description fail for an image, the code now returns an empty result instead of propagating the exception. This allows document ingestion to continue even if problematic images are encountered. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
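
One way the two commits above could fit together is sketched here. The order of operations (OCR first, description as fallback when the OCR text is shorter than `MIN_OCR_TEXT_LENGTH`) is an assumption, and `extract_image_text`, `ocr`, and `describe` are hypothetical stand-ins for the PR's functions.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)

MIN_OCR_TEXT_LENGTH = 20  # value taken from the commit message above

VisionCall = Callable[[bytes], Awaitable[str]]


async def extract_image_text(image: bytes, ocr: VisionCall, describe: VisionCall) -> str:
    """Try OCR, fall back to a description, and never raise."""
    try:
        text = (await ocr(image)).strip()
        if len(text) >= MIN_OCR_TEXT_LENGTH:
            return text
    except Exception as exc:  # broad on purpose: ingestion should continue
        logger.warning("OCR failed: %s", exc)
    try:
        return (await describe(image)).strip()
    except Exception as exc:
        logger.warning("Description failed: %s", exc)
        return ""  # empty result instead of propagating the error
```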

Added _detect_mime_type helper function to detect the actual image format from bytes. This ensures correct MIME types are used in data URLs when sending images to the Vision API, improving compatibility with strict OpenAI-compatible providers. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
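
A helper of this kind can be written by checking well-known file signatures. The sketch below is not the PR's `_detect_mime_type`; the set of formats covered, the `image/png` fallback, and the `to_data_url` helper are assumptions for illustration.

```python
import base64


def detect_mime_type(data: bytes, default: str = "image/png") -> str:
    """Guess an image MIME type from its leading magic bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return default


def to_data_url(data: bytes) -> str:
    """Build a data URL whose MIME type matches the actual image bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{detect_mime_type(data)};base64,{encoded}"
```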

…eaks Replaced manual fitz.open()/doc.close() with context manager pattern (with fitz.open() as doc:) in both extract_text_from_pdf and extract_text_from_pdf_bytes functions. This ensures PDF document handles are properly closed even if exceptions occur during processing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Changed process_page to return a tuple (text, vision_used) instead of mutating a nonlocal variable. The vision_used flag is now aggregated from all page results after asyncio.gather completes. This makes the pattern more robust and explicit. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
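
The shape of that change might look like the following. Only the `(text, vision_used)` return convention and the aggregation after `asyncio.gather` come from the commit message; the function bodies, the `needs_vision` parameter, and `process_document` are hypothetical.

```python
import asyncio
from typing import Sequence, Tuple


async def process_page(text: str, needs_vision: bool) -> Tuple[str, bool]:
    """Return the page text plus a flag saying whether Vision was used."""
    if not needs_vision:
        return text, False
    await asyncio.sleep(0)  # placeholder for the real Vision API call
    return f"[vision] {text}", True


async def process_document(pages: Sequence[Tuple[str, bool]]) -> Tuple[str, bool]:
    results = await asyncio.gather(*(process_page(t, v) for t, v in pages))
    combined = "\n\n".join(text for text, _ in results)
    # Aggregate the flag from the results instead of mutating shared state.
    vision_used = any(used for _, used in results)
    return combined, vision_used
```

Returning the flag keeps each task free of side effects, so the outcome does not depend on how the concurrent tasks interleave.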

Moved the common page-processing logic from extract_text_from_pdf and extract_text_from_pdf_bytes into a shared _process_pdf_document helper. This eliminates code duplication and makes the codebase easier to maintain. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…prefix Models that already have a provider prefix (e.g. 'moonshotai/kimi-k2') were incorrectly getting an additional 'openai/' prefix added. Now checks if model string already contains a '/' before adding the prefix. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
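
The check described in this commit amounts to something like the sketch below. The function name and the `default_provider` parameter are hypothetical, and `gpt-4o` in the usage note is only an example of an unprefixed model name.

```python
def with_provider_prefix(model: str, default_provider: str = "openai") -> str:
    """Add a provider prefix only when the model string has none."""
    if "/" in model:
        # Already prefixed, e.g. "moonshotai/kimi-k2": leave it untouched.
        return model
    return f"{default_provider}/{model}"
```

With this check, `with_provider_prefix("moonshotai/kimi-k2")` is returned unchanged, while a bare name such as `gpt-4o` becomes `openai/gpt-4o`.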

Replace timescaledb-ha with timescaledb:2.24.0-pg16 (Alpine-based) for a lighter footprint. Update package manager from apt-get to apk and adjust data directory path accordingly. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add tenacity-based retry with exponential backoff (3 attempts, 2-30s wait) for cognee.cognify() to handle transient network/API errors. Safe to retry since incremental_loading=True skips already-processed items. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
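
To show the retry behaviour without depending on tenacity, here is a standard-library stand-in. It is not the PR's code: `retry_with_backoff` is a hypothetical helper, and its doubling schedule only approximates the "3 attempts, 2-30s wait" policy named in the commit message.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    min_wait: float = 2.0,
    max_wait: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Retry an async operation, doubling the wait between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            wait = min(max_wait, min_wait * 2 ** (attempt - 1))
            await sleep(wait)
    raise RuntimeError("attempts must be at least 1")
```

Retrying like this is only safe when the operation is idempotent, which the commit message says holds here because `incremental_loading=True` skips already-processed items.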

…age size (#49) Co-authored-by: Claude <noreply@anthropic.com>

coderabbitai Bot requested changes Dec 30, 2025


larryro force-pushed the claude/replace-ocr-vision-api-Gp6Xn branch from 40904d0 to 63aa276 on December 30, 2025 13:55

larryro and others added 7 commits December 30, 2025 21:57

coderabbitai Bot approved these changes Dec 30, 2025


larryro and others added 3 commits December 30, 2025 23:03

larryro merged commit 8bc7ec9 into main Dec 30, 2025
1 check passed

larryro deleted the claude/replace-ocr-vision-api-Gp6Xn branch December 30, 2025 15:54

coderabbitai Bot mentioned this pull request Jan 26, 2026

feat(rag): add multi-tenancy support and vision caching #285

Merged

3 tasks

yannickmonney pushed a commit that referenced this pull request Apr 8, 2026

refactor(rag): Replace OCR with OpenAI Vision API to reduce Docker im…

10c3cf0

…age size (#49) Co-authored-by: Claude <noreply@anthropic.com>

larryro merged 11 commits into main from claude/replace-ocr-vision-api-Gp6Xn


2 participants
