Summary
Implement the OCR extraction module that sends PDF/image files to DeepFellow's easyOCR API for deterministic text extraction.
Changes
src/docproc/ocr.py — Async OCR extraction with retry logic
tests/test_ocr.py — ~20 test functions
src/docproc/config.py — Add ocr_endpoint field to DeepfellowConfig
config-example.yaml — Add ocr_endpoint: "/v1/ocr"
tests/test_config.py — Add ocr_endpoint to MINIMAL_CONFIG
pyproject.toml — Version bump (0.1.3 → 0.1.4), add httpx/pytest-asyncio deps, add asyncio_mode = "auto"
CHANGELOG.md — Add [0.1.4] entry
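As a rough illustration of the config change, the sketch below shows how an ocr_endpoint field might sit on DeepfellowConfig. It assumes the config is a plain dataclass; the base_url field and the ocr_url() helper are hypothetical names used only for this example, not taken from the actual module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeepfellowConfig:
    """Illustrative stand-in for the real config class (shape is assumed)."""

    base_url: str      # hypothetical field: root URL of the DeepFellow server
    ocr_endpoint: str  # new field, e.g. "/v1/ocr" as in config-example.yaml

    def ocr_url(self) -> str:
        # Hypothetical helper: join base URL and endpoint with exactly one slash.
        return self.base_url.rstrip("/") + "/" + self.ocr_endpoint.lstrip("/")
```

The field is shown without a default because the PR also adds it to MINIMAL_CONFIG, which suggests it is required; that is an inference, not something the PR states.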
Acceptance Criteria
- Can extract text from PDF/image files via DeepFellow easyOCR API
- Returns structured OCRResult with page-level breakdown
- Handles API errors with retry logic (exponential backoff)
- Runs asynchronously so it can execute in parallel with Vision
- Unsupported file types raise a clear error
- All quality gates pass (ruff, ty, pytest ≥80% coverage)