Task 04: OCR Pipeline (DeepFellow easyOCR) #7

@zalun

Summary

Implement the OCR extraction module that sends PDF/image files to DeepFellow's easyOCR API for deterministic text extraction.

Changes

  • src/docproc/ocr.py — Async OCR extraction with retry logic
  • tests/test_ocr.py — ~20 test functions
  • src/docproc/config.py — Add ocr_endpoint field to DeepfellowConfig
  • config-example.yaml — Add ocr_endpoint: "/v1/ocr"
  • tests/test_config.py — Add ocr_endpoint to MINIMAL_CONFIG
  • pyproject.toml — Version bump, add httpx/pytest-asyncio deps, add asyncio_mode = "auto"
  • Version bump 0.1.3 → 0.1.4
  • CHANGELOG.md — Add [0.1.4] entry
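
As a rough illustration of the config change, the sketch below shows one way an `ocr_endpoint` field could sit on `DeepfellowConfig`. Only the class name, the field name, and the `"/v1/ocr"` value come from this issue; the `base_url` field, the use of a dataclass, the default value, and the `ocr_url` helper are assumptions made for the example and may not match the actual `src/docproc/config.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeepfellowConfig:
    """Illustrative stand-in for the real config class (layout is assumed)."""

    base_url: str                    # hypothetical field, not named in the issue
    ocr_endpoint: str = "/v1/ocr"    # value taken from config-example.yaml

    @property
    def ocr_url(self) -> str:
        # Hypothetical helper: join base URL and endpoint without doubling slashes.
        return self.base_url.rstrip("/") + "/" + self.ocr_endpoint.lstrip("/")


config = DeepfellowConfig(base_url="https://deepfellow.example/")
print(config.ocr_url)
```

Keeping the endpoint in config rather than hard-coding it in `ocr.py` would let deployments point at a different OCR route without a code change.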

Acceptance Criteria

  • Can extract text from PDF/image files via DeepFellow easyOCR API
  • Returns structured OCRResult with page-level breakdown
  • Handles API errors with retry logic (exponential backoff)
  • Runs asynchronously so it can execute in parallel with Vision
  • Unsupported file types raise a clear error
  • All quality gates pass (ruff, ty, pytest ≥80% coverage)
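
The criteria above could be met along the lines of the following minimal sketch. It is not the implementation in `src/docproc/ocr.py`. Only `OCRResult` is named in this issue; `OCRPage`, `extract_text`, `UnsupportedFileTypeError`, `OCRRequestError`, the supported suffix set, and the retry parameters are hypothetical. The HTTP call is replaced by an injected `send` coroutine so the example needs only the standard library; the real module presumably issues the request with httpx.

```python
import asyncio
from dataclasses import dataclass, field
from pathlib import Path
from typing import Awaitable, Callable

# Assumed set of accepted file types; the issue only says "PDF/image files".
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


class UnsupportedFileTypeError(ValueError):
    """Raised before any request is made when the file type is not accepted."""


class OCRRequestError(RuntimeError):
    """Hypothetical error a sender raises for a retryable API failure."""


@dataclass
class OCRPage:
    page_number: int
    text: str


@dataclass
class OCRResult:
    source: str
    pages: list[OCRPage] = field(default_factory=list)

    @property
    def text(self) -> str:
        # Full document text, pages joined in order.
        return "\n".join(page.text for page in self.pages)


# A sender takes a file path and returns one string per page.
Sender = Callable[[Path], Awaitable[list[str]]]


async def extract_text(
    path: "str | Path",
    send: Sender,
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> OCRResult:
    """Validate the file type, then call `send` with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    path = Path(path)
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise UnsupportedFileTypeError(
            f"Unsupported file type {path.suffix!r}; "
            f"expected one of {sorted(SUPPORTED_SUFFIXES)}"
        )
    for attempt in range(max_attempts):
        try:
            page_texts = await send(path)
        except OCRRequestError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            await asyncio.sleep(base_delay * 2**attempt)
        else:
            pages = [OCRPage(n, t) for n, t in enumerate(page_texts, start=1)]
            return OCRResult(source=str(path), pages=pages)
    raise AssertionError("unreachable")
```

Because `extract_text` is a coroutine, a caller could schedule it next to the Vision step with `asyncio.gather(extract_text(path, send), vision_task)`, where `vision_task` stands for whatever awaitable that step exposes.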
