refactor(rag): Replace OCR with OpenAI Vision API to reduce Docker image size #49

Merged: larryro merged 11 commits into main from claude/replace-ocr-vision-api-Gp6Xn on Dec 30, 2025
Conversation

@larryro (Collaborator) commented Dec 30, 2025

Summary

  • Replace self-hosted Tesseract OCR with OpenAI Vision API for PDF/image processing
  • Reduce RAG Docker image from 9.37GB to ~2GB by removing heavy OCR dependencies
  • Implement smart hybrid approach that minimizes API costs while handling all document types

Changes

New Vision Module (services/rag/app/services/vision/)

  • Smart PDF processing: Extract text directly via PyMuPDF for digital PDFs, use Vision API only for scanned pages (< 50 chars threshold)
  • Embedded image descriptions: Extract and describe images within PDFs for better indexing
  • Direct image file support: Handle PNG, JPG, and other image formats via Vision API
  • Concurrent processing: Semaphore-limited (default: 5) parallel Vision API calls
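
The hybrid routing listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: pages are plain strings standing in for text that PyMuPDF would extract, `fake_vision_ocr` is a hypothetical stand-in for the Vision API request, and the names `MIN_TEXT_CHARS`, `MAX_CONCURRENT`, and `process_pages` are invented for the example.

```python
import asyncio

MIN_TEXT_CHARS = 50   # pages with less extracted text are treated as scanned
MAX_CONCURRENT = 5    # mirrors the default concurrency limit in the summary


async def fake_vision_ocr(page_index: int) -> str:
    """Hypothetical stand-in for a Vision API OCR request."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"[OCR text for page {page_index}]"


async def process_pages(extracted: list[str]) -> list[str]:
    """Keep directly extracted text; fall back to OCR for low-text pages."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def process_page(index: int, text: str) -> str:
        if len(text.strip()) >= MIN_TEXT_CHARS:
            return text  # digital page: no API call needed
        async with semaphore:  # at most MAX_CONCURRENT OCR calls in flight
            return await fake_vision_ocr(index)

    results = await asyncio.gather(
        *(process_page(i, t) for i, t in enumerate(extracted))
    )
    return list(results)
```

Only pages under the threshold ever acquire the semaphore, so a fully digital PDF completes without any simulated API call.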

Dockerfile Optimizations

  • Remove heavy system deps: tesseract-ocr, poppler-utils, OpenCV libs (~600MB)
  • Keep minimal deps: curl, build-essential, libpq-dev, libmagic1

Dependency Changes

  • Remove unstructured[pdf] extra (eliminates unstructured-inference, ONNX models ~2-3GB)
  • Add PyMuPDF>=1.24.0 (17-24MB wheel, no system deps)

Configuration

New settings in config.py:

  • openai_vision_model - Vision model (default: qwen/qwen3-vl-32b-instruct)
  • vision_max_concurrent_pages - Concurrency limit (default: 5)
  • vision_pdf_dpi - PDF render quality (default: 150)
  • vision_extraction_prompt - Custom OCR prompt (optional)
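
For illustration, the four settings could be modelled like this. It is a sketch only: the real `config.py` is not shown in this conversation, so the `VisionSettings` class, the `_env_int` helper, and the environment variable names are assumptions. Only the field names and defaults come from the list above.

```python
import os
from dataclasses import dataclass, field
from typing import Optional


def _env_int(name: str, default: int) -> int:
    """Read an integer from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw else default


@dataclass
class VisionSettings:
    # Environment variable names below are hypothetical.
    openai_vision_model: str = field(
        default_factory=lambda: os.environ.get(
            "OPENAI_VISION_MODEL", "qwen/qwen3-vl-32b-instruct"
        )
    )
    vision_max_concurrent_pages: int = field(
        default_factory=lambda: _env_int("VISION_MAX_CONCURRENT_PAGES", 5)
    )
    vision_pdf_dpi: int = field(
        default_factory=lambda: _env_int("VISION_PDF_DPI", 150)
    )
    # Optional custom OCR prompt; None means "use the built-in prompt".
    vision_extraction_prompt: Optional[str] = field(
        default_factory=lambda: os.environ.get("VISION_EXTRACTION_PROMPT")
    )
```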

Test plan

  • Build Docker image and verify size is ~2GB
  • Test digital PDF ingestion (should use direct extraction, no API calls)
  • Test scanned PDF ingestion (should use Vision API for OCR)
  • Test PDF with embedded images (should extract and describe images)
  • Test direct image file upload (PNG, JPG)
  • Test DOCX/PPTX/XLSX passthrough (should still work via unstructured)
  • Verify Vision API error handling and temp file cleanup

Summary by CodeRabbit

Release Notes

  • New Features

    • Integrated Vision API for advanced PDF and image text extraction
    • Added image description capabilities via Vision API
    • Added configurable Vision API settings (model, DPI, concurrent page limits)
  • Chores

    • Optimized dependencies by removing heavy OCR system packages
    • Replaced legacy PDF extraction with lightweight, API-driven approach

coderabbitai Bot commented Dec 30, 2025

Walkthrough

This PR introduces a Vision API-based document processing system as a replacement for legacy OCR dependencies. It removes tesseract and poppler from the Dockerfile and requirements.txt, adds Vision configuration fields (model, concurrency limits, DPI settings, prompts) to the Settings class, and creates a new vision service module with PDF and image text extraction capabilities using PyMuPDF and OpenAI Vision API. The document ingestion flow in cognee/service.py is updated to pre-process PDFs and images through this Vision pipeline before routing to storage.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 7

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between d4adc04 and 40904d0.

📒 Files selected for processing (9)
  • services/rag/Dockerfile
  • services/rag/app/config.py
  • services/rag/app/services/cognee/service.py
  • services/rag/app/services/vision/__init__.py
  • services/rag/app/services/vision/image_extractor.py
  • services/rag/app/services/vision/openai_client.py
  • services/rag/app/services/vision/pdf_extractor.py
  • services/rag/app/services/vision/processor.py
  • services/rag/requirements.txt
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-19T04:29:46.183Z
Learnt from: larryro
Repo: tale-project/tale PR: 26
File: services/rag/Dockerfile:10-20
Timestamp: 2025-12-19T04:29:46.183Z
Learning: Do not pin apt package versions in Dockerfiles within the tale-project/tale repository (e.g., services/rag/Dockerfile). Rely on regularly updated base images (like python:3.11-slim) and unpinned apt packages (curl, build-essential, libpq-dev) so that security updates and compatibility are handled via base image refresh and CI/CD caching. This reduces maintenance burden; verify through CI pipelines and ensure reproducibility comes from image rebuilds rather than manual pinning.

Applied to files:

  • services/rag/Dockerfile
🧬 Code graph analysis (6)
services/rag/app/services/vision/image_extractor.py (1)
services/rag/app/services/vision/openai_client.py (2)
  • ocr_image (50-108)
  • describe_image (110-163)
services/rag/app/services/vision/openai_client.py (1)
services/rag/app/config.py (2)
  • get_llm_config (100-157)
  • get_vision_model (159-171)
services/rag/app/services/cognee/service.py (1)
services/rag/app/services/vision/processor.py (2)
  • extract_text_from_document (48-104)
  • is_vision_supported (22-32)
services/rag/app/services/vision/__init__.py (1)
services/rag/app/services/vision/processor.py (4)
  • extract_text_from_bytes (107-155)
  • extract_text_from_document (48-104)
  • is_passthrough_type (35-45)
  • is_vision_supported (22-32)
services/rag/app/services/vision/pdf_extractor.py (1)
services/rag/app/services/vision/openai_client.py (2)
  • ocr_image (50-108)
  • describe_image (110-163)
services/rag/app/services/vision/processor.py (2)
services/rag/app/services/vision/image_extractor.py (2)
  • extract_text_from_image (23-46)
  • extract_text_from_image_bytes (49-92)
services/rag/app/services/vision/pdf_extractor.py (2)
  • extract_text_from_pdf (29-127)
  • extract_text_from_pdf_bytes (225-311)
🪛 Ruff (0.14.10)
services/rag/app/services/vision/image_extractor.py

39-39: Avoid specifying long messages outside the exception class

(TRY003)


76-76: Do not catch blind exception: Exception

(BLE001)

services/rag/app/services/vision/openai_client.py

104-104: Consider moving this statement to an else block

(TRY300)

services/rag/app/services/vision/__init__.py

19-24: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

services/rag/app/services/vision/pdf_extractor.py

52-52: Avoid specifying long messages outside the exception class

(TRY003)


113-113: Do not catch blind exception: Exception

(BLE001)


156-156: Do not catch blind exception: Exception

(BLE001)


212-212: Do not catch blind exception: Exception

(BLE001)


219-219: Do not catch blind exception: Exception

(BLE001)


297-297: Do not catch blind exception: Exception

(BLE001)

services/rag/app/services/vision/processor.py

79-79: Avoid specifying long messages outside the exception class

(TRY003)


104-104: Avoid specifying long messages outside the exception class

(TRY003)


155-155: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (10)
services/rag/Dockerfile (1)

9-20: Excellent minimal dependencies approach. PyMuPDF handles PDF-to-image rendering without requiring poppler or tesseract.

The removal of OCR system dependencies and retention of only essential packages aligns perfectly with the Vision API integration. PyMuPDF's get_pixmap() with DPI matrix scaling (150 DPI per config) produces PNG images purely in Python without external tools. Comments are clear and well-documented. No version pinning is correct per repository practice.
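
As a rough sketch of the DPI scaling mentioned here: PDF user space is 72 points per inch, so a target DPI maps to a zoom factor of `dpi / 72`. The helper names `dpi_to_zoom` and `render_page_png` are hypothetical, and the PR's actual rendering code may differ. PyMuPDF is imported lazily so the zoom helper can be used without it installed.

```python
def dpi_to_zoom(dpi: int) -> float:
    """Convert a target DPI to a zoom factor (PDF space is 72 points/inch)."""
    return dpi / 72.0


def render_page_png(pdf_path: str, page_index: int = 0, dpi: int = 150) -> bytes:
    """Render one PDF page to PNG bytes using PyMuPDF."""
    import fitz  # PyMuPDF; assumed to be installed when this is called

    zoom = dpi_to_zoom(dpi)
    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_index)
        # Scale both axes equally so the page keeps its aspect ratio.
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        return pix.tobytes("png")
```

At the default of 150 DPI the zoom factor is a little over 2, which roughly quadruples the pixel count compared with rendering at 72 DPI.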

services/rag/requirements.txt (1)

17-22: LGTM!

The dependency changes align well with the PR objectives. Removing the pdf extra from unstructured and adding PyMuPDF>=1.24.0 is a sound approach for reducing Docker image size while maintaining PDF processing capabilities through the Vision API.

services/rag/app/config.py (2)

69-79: LGTM!

The Vision API configuration fields are well-documented and follow the existing patterns in the file. The defaults (5 concurrent pages, 150 DPI) are sensible for balancing API costs and quality.


159-171: LGTM!

The get_vision_model() method follows the same priority pattern as get_llm_config(), providing consistent configuration resolution across the codebase.

services/rag/app/services/vision/__init__.py (1)

1-24: LGTM!

Clean module initialization with a well-documented public API. The hybrid processing approach is clearly explained in the docstring.

The __all__ list could be sorted alphabetically per linting suggestion (RUF022), but the current logical grouping is also reasonable.

services/rag/app/services/cognee/service.py (1)

220-229: LGTM!

Good use of a finally block to ensure temp file cleanup regardless of success or failure. The logging levels (debug for success, warning for failure) are appropriate.
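
The cleanup pattern praised here looks roughly like the sketch below. It is not the code from `service.py`: the function name `process_with_temp_file` and the `handler` callback are assumptions made to keep the example self-contained.

```python
import logging
import os
import tempfile
from typing import Callable

logger = logging.getLogger(__name__)


def process_with_temp_file(
    data: bytes, suffix: str, handler: Callable[[str], str]
) -> str:
    """Write data to a temp file, run handler(path), and always clean up."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        return handler(path)
    finally:
        # Runs on success and on failure, so the file never lingers.
        try:
            os.unlink(path)
            logger.debug("Removed temp file %s", path)
        except OSError as exc:
            logger.warning("Could not remove temp file %s: %s", path, exc)
```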

services/rag/app/services/vision/openai_client.py (1)

32-48: LGTM!

Good use of lazy initialization for the OpenAI client. The 120-second timeout is appropriate for Vision API calls which can be slower than text completion.
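
A minimal sketch of lazy client initialization, for readers unfamiliar with the pattern. The names `get_client`, `_default_factory`, and `VISION_TIMEOUT_SECONDS` are hypothetical, and whether the PR uses `AsyncOpenAI` or the synchronous client is an assumption. The factory parameter exists only so the pattern can be exercised without the `openai` package.

```python
from typing import Any, Callable, Optional

VISION_TIMEOUT_SECONDS = 120.0  # matches the timeout noted in the review

_client: Optional[Any] = None


def _default_factory() -> Any:
    """Build an OpenAI client; imported lazily to keep module import cheap."""
    from openai import AsyncOpenAI  # assumes the `openai` package is installed

    # The API key is read from the OPENAI_API_KEY environment variable.
    return AsyncOpenAI(timeout=VISION_TIMEOUT_SECONDS)


def get_client(factory: Callable[[], Any] = _default_factory) -> Any:
    """Create the client on first use and reuse it afterwards."""
    global _client
    if _client is None:
        _client = factory()
    return _client
```

Deferring construction means importing the module has no side effects and does not fail when credentials are absent, for example in tests that never reach the Vision path.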

services/rag/app/services/vision/processor.py (3)

16-19: LGTM!

Clear constant definitions for supported file types. The separation between Vision-processed types and passthrough types is well-organized.


48-104: LGTM!

The routing logic is clean with appropriate logging at each branch. File existence validation before processing is good defensive programming.


107-155: LGTM!

Good use of deferred imports to avoid potential circular import issues. The bytes-based processing mirrors the file-based logic appropriately.
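
The routing between Vision-processed and passthrough types can be pictured with the sketch below. The function names `is_vision_supported` and `is_passthrough_type` appear in the code graph above, but their signatures, the extension sets, and the `route` helper are assumptions based on the file types named in the test plan.

```python
from pathlib import Path

# Hypothetical extension sets; the PR's actual constants are not shown here.
VISION_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}
PASSTHROUGH_EXTENSIONS = {".docx", ".pptx", ".xlsx"}


def is_vision_supported(filename: str) -> bool:
    """True for files that should go through the Vision pipeline."""
    return Path(filename).suffix.lower() in VISION_EXTENSIONS


def is_passthrough_type(filename: str) -> bool:
    """True for files handed to the existing loader unchanged."""
    return Path(filename).suffix.lower() in PASSTHROUGH_EXTENSIONS


def route(filename: str) -> str:
    """Return which pipeline a file would take in this sketch."""
    if is_vision_supported(filename):
        return "vision"
    if is_passthrough_type(filename):
        return "passthrough"
    raise ValueError(f"Unsupported file type: {filename}")
```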

Comment thread services/rag/app/services/cognee/service.py
Comment thread services/rag/app/services/vision/image_extractor.py
Comment thread services/rag/app/services/vision/image_extractor.py
Comment thread services/rag/app/services/vision/openai_client.py
Comment thread services/rag/app/services/vision/pdf_extractor.py Outdated
Comment thread services/rag/app/services/vision/pdf_extractor.py Outdated
Comment thread services/rag/app/services/vision/pdf_extractor.py Outdated
…image

Replace self-hosted Tesseract OCR with OpenAI Vision API to significantly
reduce Docker image size from ~9GB to ~2GB.

Changes:
- Add vision module with smart hybrid PDF processing:
  - Direct text extraction via PyMuPDF for digital PDFs
  - Vision API OCR for scanned/low-text pages
  - Vision API descriptions for embedded images
- Remove heavy system deps (tesseract, poppler, OpenCV libs)
- Remove unstructured[pdf] extra, add PyMuPDF
- Add Vision API configuration (model, DPI, concurrency)
- Integrate vision pre-processor into Cognee service

Benefits:
- Docker image reduced from 9.37GB to ~2GB
- Lower API costs (only scanned pages use Vision API)
- Faster processing (direct text extraction is instant)
- Simpler maintenance (fewer system dependencies)
@larryro larryro force-pushed the claude/replace-ocr-vision-api-Gp6Xn branch from 40904d0 to 63aa276 on December 30, 2025 13:55
larryro and others added 7 commits December 30, 2025 21:57
…eter

The docstring claimed the content parameter accepts "file path or text
content", but the implementation only handles file paths. All current
callers pass file paths after creating temporary files for raw text.
Updated docstring to accurately reflect the actual behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replaced magic number 20 with MIN_OCR_TEXT_LENGTH constant for better
readability and easier adjustment of the meaningful text threshold.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When both OCR and description fail for an image, the code now returns
an empty result instead of propagating the exception. This allows
document ingestion to continue even if problematic images are
encountered.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
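
The fallback behaviour described in this commit can be sketched as below. This is an illustration under assumptions: the function name `extract_image_text`, the injected `ocr` and `describe` callables, and the reuse of a 20-character `MIN_OCR_TEXT_LENGTH` threshold at this point are guesses, not the PR's code.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)

MIN_OCR_TEXT_LENGTH = 20  # assumed threshold for "meaningful" OCR output

VisionCall = Callable[[bytes], Awaitable[str]]


async def extract_image_text(
    image: bytes, ocr: VisionCall, describe: VisionCall
) -> str:
    """Try OCR, then a description; return "" if both fail."""
    try:
        text = (await ocr(image)).strip()
        if len(text) >= MIN_OCR_TEXT_LENGTH:
            return text
    except Exception as exc:  # broad on purpose: ingestion must continue
        logger.warning("OCR failed: %s", exc)
    try:
        return (await describe(image)).strip()
    except Exception as exc:
        logger.warning("Image description failed: %s", exc)
        return ""
```

Returning an empty string keeps one bad image from aborting the whole document, at the cost of silently dropping that image's content, so the warning logs are the only trace.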
Added _detect_mime_type helper function to detect the actual image
format from bytes. This ensures correct MIME types are used in data
URLs when sending images to the Vision API, improving compatibility
with strict OpenAI-compatible providers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
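
A sketch of how MIME detection from leading bytes might work. The PR's `_detect_mime_type` is not shown, so the formats covered, the PNG default, and the `to_data_url` helper here are assumptions. The magic numbers themselves are the standard file signatures.

```python
import base64


def detect_mime_type(data: bytes, default: str = "image/png") -> str:
    """Guess an image MIME type from its leading magic bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    if data.startswith(b"BM"):
        return "image/bmp"
    return default  # unknown format: fall back to the caller's default


def to_data_url(data: bytes) -> str:
    """Build a base64 data URL with the detected MIME type."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{detect_mime_type(data)};base64,{encoded}"
```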
…eaks

Replaced manual fitz.open()/doc.close() with context manager pattern
(with fitz.open() as doc:) in both extract_text_from_pdf and
extract_text_from_pdf_bytes functions. This ensures PDF document handles
are properly closed even if exceptions occur during processing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changed process_page to return a tuple (text, vision_used) instead of
mutating a nonlocal variable. The vision_used flag is now aggregated
from all page results after asyncio.gather completes. This makes the
pattern more robust and explicit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
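
The tuple-return pattern from this commit, sketched under assumptions: `process_page` is named in the commit message, but its signature, the 50-character threshold, the placeholder OCR result, and the `process_document` wrapper are invented for the example.

```python
import asyncio


async def process_page(text: str) -> tuple[str, bool]:
    """Return (page_text, vision_used) instead of mutating shared state."""
    if len(text.strip()) >= 50:
        return text, False
    await asyncio.sleep(0)  # stands in for a Vision API call
    return "[ocr]", True


async def process_document(pages: list[str]) -> tuple[str, bool]:
    """Process all pages concurrently, then aggregate the flags."""
    results = await asyncio.gather(*(process_page(p) for p in pages))
    texts = [text for text, _ in results]
    vision_used = any(used for _, used in results)
    return "\n\n".join(texts), vision_used
```

Because each coroutine returns its own flag, the aggregate is computed in one place once `asyncio.gather` completes, and no task needs write access to an enclosing variable.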
Moved the common page-processing logic from extract_text_from_pdf and
extract_text_from_pdf_bytes into a shared _process_pdf_document helper.
This eliminates code duplication and makes the codebase easier to
maintain.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
larryro and others added 3 commits December 30, 2025 23:03
…prefix

Models that already have a provider prefix (e.g. 'moonshotai/kimi-k2')
were incorrectly getting an additional 'openai/' prefix added. Now checks
if model string already contains a '/' before adding the prefix.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
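
The prefix check described in this commit amounts to something like the following sketch. The function name `with_provider_prefix` and its `default_provider` parameter are hypothetical.

```python
def with_provider_prefix(model: str, default_provider: str = "openai") -> str:
    """Add a provider prefix only when the model name has none."""
    if "/" in model:
        return model  # already qualified, e.g. "moonshotai/kimi-k2"
    return f"{default_provider}/{model}"
```

For example, `with_provider_prefix("moonshotai/kimi-k2")` leaves the name untouched, while a bare name such as `"some-model"` becomes `"openai/some-model"`.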
Replace timescaledb-ha with timescaledb:2.24.0-pg16 (Alpine-based) for
a lighter footprint. Update package manager from apt-get to apk and
adjust data directory path accordingly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tenacity-based retry with exponential backoff (3 attempts, 2-30s
wait) for cognee.cognify() to handle transient network/API errors.
Safe to retry since incremental_loading=True skips already-processed
items.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
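
The commit uses tenacity; the sketch below is a dependency-free stand-in that shows the same idea (3 attempts, exponential waits clamped to 2-30 seconds), not the PR's implementation. The function name `retry_async` and its parameters are hypothetical, and the `sleep` argument is injectable only so the delays can be observed without waiting.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    func: Callable[[], Awaitable[T]],
    attempts: int = 3,
    min_wait: float = 2.0,
    max_wait: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call func up to `attempts` times with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # 2s, 4s, 8s, ... capped at max_wait
            delay = min(max_wait, min_wait * 2 ** (attempt - 1))
            await sleep(delay)
    raise RuntimeError("attempts must be at least 1")
```

Retrying is only safe when the wrapped call is idempotent or resumable, which is the point the commit makes about `incremental_loading=True`.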
@larryro larryro merged commit 8bc7ec9 into main Dec 30, 2025
1 check passed
@larryro larryro deleted the claude/replace-ocr-vision-api-Gp6Xn branch December 30, 2025 15:54
yannickmonney pushed a commit that referenced this pull request Apr 8, 2026
…age size (#49)

Co-authored-by: Claude <noreply@anthropic.com>