Add Pydantic data models for processing pipeline#6

Merged
zalun merged 4 commits into main from 03-data-models
Feb 27, 2026

Conversation

@zalun zalun commented Feb 27, 2026

Summary

  • Add 7 Pydantic models for the processing pipeline: ProcessingJob, PageText, OCRResult, VisionResult, ReconciledDocument, Classification, ProcessedDocument
  • Add _parse_date helper supporting ISO dates, ISO datetimes, European (dd/mm/YYYY), and dot-separated (dd.mm.YYYY) formats with whitespace stripping
  • Add Confidence type alias (Annotated[float, Field(ge=0.0, le=1.0)]) shared across models
  • Add validate_assignment=True on models with validators for runtime safety
  • Bump version to 0.1.3
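
To illustrate how the shared Confidence alias and validate_assignment=True interact, here is a minimal sketch assuming Pydantic v2. The model name OCRResultSketch and its fields are hypothetical stand-ins, since the real model definitions are not shown in this description; only the alias itself is taken from the summary.

```python
from typing import Annotated

from pydantic import BaseModel, ConfigDict, Field, ValidationError

# Alias as quoted in the summary: a float constrained to the range [0.0, 1.0].
Confidence = Annotated[float, Field(ge=0.0, le=1.0)]


class OCRResultSketch(BaseModel):
    """Hypothetical stand-in; the real OCRResult fields may differ."""

    # Re-run field validation whenever an attribute is assigned.
    model_config = ConfigDict(validate_assignment=True)

    text: str
    confidence: Confidence


result = OCRResultSketch(text="Invoice 42", confidence=0.9)

try:
    result.confidence = 1.5  # out of range, rejected on assignment
except ValidationError as exc:
    print(f"rejected on assignment: {exc.error_count()} error(s)")
```

Without validate_assignment=True, the range check would run only at construction time, so a later mutation could silently store an out-of-range value.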

Test plan

  • 73 tests pass covering all models, validation, edge cases, and JSON roundtrips
  • models.py at 99% coverage (the uncovered lines are defensive branches for None/empty-string inputs that Pydantic skips)
  • ruff check passes
  • ruff format --check passes
  • ty check passes
  • Overall coverage 95% (well above 80% threshold)
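
For readers unfamiliar with the JSON roundtrip tests mentioned in the test plan, one could look roughly like the sketch below, assuming Pydantic v2. PageTextSketch and its fields are hypothetical; the actual PageText definition and the real test names are not shown here.

```python
from pydantic import BaseModel


class PageTextSketch(BaseModel):
    """Hypothetical fields; the real PageText model may differ."""

    page_number: int
    text: str


def test_json_roundtrip() -> None:
    original = PageTextSketch(page_number=1, text="Hello")
    # Serialize to a JSON string, then validate it back into a model.
    restored = PageTextSketch.model_validate_json(original.model_dump_json())
    assert restored == original
```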

Closes #5

🤖 Generated with Claude Code

zalun and others added 4 commits February 27, 2026 11:24
Define 7 pipeline models (ProcessingJob, PageText, OCRResult,
VisionResult, ReconciledDocument, Classification, ProcessedDocument)
with validators for confidence ranges, date parsing, and string
constraints. Includes ~25 tests covering creation, defaults,
validation, and JSON roundtrips.

Closes #5

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace duplicate confidence_in_range validators with Annotated type
- Add _parse_date validator to ProcessedDocument.document_date
- Remove ambiguous US date format from _DATE_FORMATS
- Add validate_assignment=True to ProcessingJob for runtime safety
- Add min_length=1 to ProcessingJob.file_type
- Add whitespace-only rejection to Classification fields
- Add tests for new validations (non-string date, whitespace, assignment)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
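
The whitespace-only rejection could be implemented along these lines; this is a sketch assuming Pydantic v2, not the code from this commit. The field names follow those listed for Classification elsewhere in this PR, while the class name and validator name are hypothetical.

```python
from pydantic import BaseModel, ValidationError, field_validator


class ClassificationSketch(BaseModel):
    """Hypothetical stand-in for Classification."""

    recipient: str
    category: str
    subject: str

    @field_validator("recipient", "category", "subject")
    @classmethod
    def _reject_blank(cls, value: str) -> str:
        # A string of only spaces or tabs passes min_length but carries no content.
        if not value.strip():
            raise ValueError("must not be blank")
        return value


try:
    ClassificationSketch(recipient="   ", category="invoice", subject="Q1")
except ValidationError as exc:
    print(f"{exc.error_count()} validation error(s)")
```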
…ound 2

- Handle datetime-is-subclass-of-date in _parse_date by extracting .date()
- Strip whitespace from Classification fields (recipient, category, subject)
- Add blank-check validator to ProcessingJob.file_type
- Add tests for datetime-to-date conversion, whitespace stripping, blank file_type

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
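
The datetime-is-subclass-of-date issue comes from the standard library: every datetime instance also passes isinstance(value, date), so a date check alone lets a full timestamp through. A minimal sketch of the fix, using a hypothetical helper name rather than the actual _parse_date body:

```python
from datetime import date, datetime


def normalize_date(value: date) -> date:
    """Hypothetical helper: collapse datetime inputs to a plain date."""
    # Check datetime first, because every datetime is also a date.
    if isinstance(value, datetime):
        return value.date()
    return value


stamp = datetime(2026, 2, 27, 11, 24)
print(isinstance(stamp, date))  # True
print(normalize_date(stamp))    # 2026-02-27
```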
- Add validate_assignment=True to Classification, ReconciledDocument,
  and ProcessedDocument so validators run on mutation
- Strip and lowercase file_type in ProcessingJob validator
- Strip whitespace from date strings before parsing in _parse_date
- Add ISO datetime string fallback in _parse_date for LLM outputs
- Add tests for assignment validation, datetime strings, whitespace dates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
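
The string-handling path described in this commit (strip, try the known formats, then fall back to an ISO datetime string) might look like the sketch below. It uses only the standard library. The contents of _DATE_FORMATS and the function name parse_date_sketch are assumptions; the PR states which formats are supported but does not show the actual tuple or the _parse_date body.

```python
from datetime import date, datetime

# Assumed contents: ISO, European slash, and dot-separated formats.
_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def parse_date_sketch(value: str) -> date:
    """Hypothetical approximation of the string branch of _parse_date."""
    text = value.strip()  # tolerate padding around the date string
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    # Fallback for full ISO datetime strings such as "2026-02-27T11:24:00".
    try:
        return datetime.fromisoformat(text).date()
    except ValueError:
        raise ValueError(f"unrecognized date: {value!r}") from None


print(parse_date_sketch(" 27.02.2026 "))
print(parse_date_sketch("2026-02-27T11:24:00"))
```

Trying the fixed formats before the ISO datetime fallback keeps the common cases explicit, and leaving the US mm/dd/YYYY format out of the list avoids silently misreading a day as a month.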
@zalun zalun merged commit abfcc1f into main Feb 27, 2026
2 of 3 checks passed
@zalun zalun deleted the 03-data-models branch February 27, 2026 11:11

Development

Successfully merging this pull request may close these issues.

Task 03: Pydantic data models for processing pipeline

1 participant