Add async OCR extraction module (task 04) #8

Merged: zalun merged 3 commits into main from 04-ocr-pipeline on Feb 27, 2026

@zalun (Owner) commented Feb 27, 2026

Summary

  • Add src/docproc/ocr.py — async OCR extraction via DeepFellow easyOCR API with httpx
  • Retry logic with exponential backoff (3 attempts, 1s initial, 2x factor) on 5xx/timeouts; 4xx fails immediately
  • File validation for supported types (PDF, PNG, JPG, JPEG, TIFF, TIF)
  • Add ocr_endpoint field to DeepfellowConfig and config-example.yaml
  • Add httpx>=0.28.0 as explicit dependency, pytest-asyncio>=0.25.0 dev dep with asyncio_mode = "auto"
  • Version bump 0.1.3 → 0.1.4
  • 32 new tests (100% module coverage, 96.48% overall)
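
For readers who want to picture the retry policy described in the summary (3 attempts, 1s initial delay, 2x factor, immediate failure on 4xx), here is a minimal, transport-agnostic sketch. It is not the code in src/docproc/ocr.py: `call_with_backoff`, `RetryableError`, and `FatalError` are hypothetical names, and the mapping from HTTP status codes or httpx exceptions onto these two error types is assumed to happen inside the `operation` callable.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. 5xx, timeouts)."""


class FatalError(Exception):
    """Hypothetical marker for failures that must not be retried (e.g. 4xx)."""


async def call_with_backoff(
    operation: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 3,
    initial_delay: float = 1.0,
    factor: float = 2.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await operation(), retrying RetryableError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except RetryableError:
            # FatalError is deliberately not caught, so it propagates at once.
            if attempt == max_attempts:
                raise
            await sleep(delay)
            delay *= factor
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so that tests can record the delays without actually waiting; with the defaults, a call that keeps failing sleeps 1s and then 2s before the third and final attempt.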

Test plan

  • uv run ruff check src/ tests/ — passes
  • uv run ruff format --check src/ tests/ — passes
  • uv run ty check src/ — passes
  • uv run pytest — 105 tests pass, 96.48% coverage (≥80%)

Closes #7

🤖 Generated with Claude Code

zalun and others added 3 commits February 27, 2026 12:26
Implements the OCR pipeline (task 04) with httpx-based async HTTP client,
exponential backoff retry logic, file validation, and structured response
parsing into OCRResult. Adds ocr_endpoint to config, httpx and
pytest-asyncio dependencies, and 32 tests at 100% module coverage.

Closes #7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Broaden retry catch from TimeoutException to TransportError (ConnectError, ReadError, etc.)
- Validate 'pages' key exists in API response to prevent silent empty results
- Wrap KeyError/TypeError from malformed page entries into OCRError
- Handle response.json() decode failures with clear OCRError
- Read file bytes once before retry loop, wrap OSError into OCRError
- Use is_file() instead of exists() in _validate_file
- Add file context to retry warning logs, ERROR log on final failure
- Extract mock_ocr_client fixture to reduce test boilerplate
- Add tests for ConnectError retry, non-JSON response, malformed API response, missing page keys

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
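
The file-handling points in this commit (is_file() instead of exists(), reading the bytes once before the retry loop, wrapping OSError into OCRError) could look roughly like the sketch below. It is an illustration under assumptions, not the merged code: the `OCRError` class here is a stand-in, `validate_file` and `read_file_bytes` are hypothetical names, and the case-insensitive suffix comparison is an assumption the PR does not state.

```python
from pathlib import Path

# Extensions listed in the PR summary; lowercase comparison is an assumption.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".tif"}


class OCRError(Exception):
    """Stand-in for the module's error type; the real class may differ."""


def validate_file(path: Path) -> None:
    """Reject paths that are not regular files or have an unsupported suffix."""
    # is_file() is False for directories and missing paths, whereas exists()
    # would accept a directory that happens to be named "scan.pdf".
    if not path.is_file():
        raise OCRError(f"Not a file: {path}")
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise OCRError(f"Unsupported file type: {path.suffix or '(none)'}")


def read_file_bytes(path: Path) -> bytes:
    """Validate and read the file once, so retries can reuse the same bytes."""
    validate_file(path)
    try:
        return path.read_bytes()
    except OSError as exc:
        raise OCRError(f"Cannot read {path}: {exc}") from exc
```

Reading the bytes up front means every retry attempt sends identical content, and a file that disappears or becomes unreadable surfaces as a single OCRError rather than as a failure partway through the retry loop.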
- Add pydantic.ValidationError to the caught exception types in
  _parse_response so invalid API values (e.g. page_number=0) are
  wrapped as OCRError instead of escaping as raw ValidationError
- Move OCRResult construction inside the try block for full coverage
- Add test for invalid page_number triggering OCRError

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
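
The response-parsing hardening from the last two commits (require the 'pages' key, wrap KeyError/TypeError and validation failures into OCRError, build the result inside the try block) can be sketched as follows. To keep the example dependency-free, a frozen dataclass whose `__post_init__` raises ValueError stands in for the pydantic model and its ValidationError; the `Page` class, the `text` field, and the exact shape of the payload are assumptions, and only `page_number`, `pages`, `OCRResult`, and `OCRError` are names taken from the PR.

```python
from dataclasses import dataclass
from typing import Any


class OCRError(Exception):
    """Stand-in for the module's error type; the real class may differ."""


@dataclass(frozen=True)
class Page:
    page_number: int
    text: str  # hypothetical field name

    def __post_init__(self) -> None:
        # Stand-in for pydantic validation: page numbers start at 1.
        if not isinstance(self.page_number, int) or self.page_number < 1:
            raise ValueError(f"invalid page_number: {self.page_number!r}")


@dataclass(frozen=True)
class OCRResult:
    pages: list[Page]


def parse_response(payload: Any) -> OCRResult:
    """Turn a decoded JSON payload into an OCRResult or raise OCRError."""
    if not isinstance(payload, dict) or "pages" not in payload:
        # Failing loudly here avoids silently returning an empty result.
        raise OCRError("OCR response has no 'pages' key")
    try:
        pages = [
            Page(page_number=entry["page_number"], text=entry["text"])
            for entry in payload["pages"]
        ]
        # Built inside the try block so its failures are wrapped as well.
        return OCRResult(pages=pages)
    except (KeyError, TypeError, ValueError) as exc:
        raise OCRError(f"Malformed OCR response: {exc!r}") from exc
```

With a real pydantic model, the `except` tuple would name `pydantic.ValidationError`, as the commit message describes, in place of the ValueError used here.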
@zalun zalun merged commit 76d83ec into main Feb 27, 2026
3 checks passed
@zalun zalun deleted the 04-ocr-pipeline branch February 27, 2026 11:41

Development

Successfully merging this pull request may close these issues.

Task 04: OCR Pipeline (DeepFellow easyOCR)
