v3.0.0

@Goldziher Goldziher released this 23 Mar 15:57

Enhancements:

  • added support for multiple OCR backends, adding PaddleOCR and EasyOCR (feature)
  • added support for having no OCR backend (feature)
  • changed Tesseract OCR to optional (enhancement)
  • added support for creating and registering custom extractors (feature)
  • added support for overriding builtin extractors (feature)
  • added support for post-processing hooks (feature)
  • added support for validation hooks (feature)
  • added PDF metadata extraction using Playa-PDF (feature)
  • added optional chunking support (feature)
  • added documentation site (documentation)

Breaking Changes:

  • Changed ExtractionResults from NamedTuple to TypedDict (breaking change; api)
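For callers, the practical effect of moving from a NamedTuple to a TypedDict is that attribute access and tuple unpacking give way to key access. The sketch below illustrates this with two stand-in types; `OldResult`, `NewResult`, and the `content` and `mime_type` fields are assumptions made for illustration, not the library's actual definitions.

```python
from typing import NamedTuple, TypedDict


class OldResult(NamedTuple):
    """Hypothetical pre-3.0 shape: fields are read as attributes."""

    content: str
    mime_type: str


class NewResult(TypedDict):
    """Hypothetical 3.0 shape: fields are read as dictionary keys."""

    content: str
    mime_type: str


old = OldResult(content="hello", mime_type="text/plain")
new = NewResult(content="hello", mime_type="text/plain")

# Before: attribute access and positional unpacking both work.
old_text = old.content
old_content, old_mime = old

# After: key access only; a TypedDict is a plain dict at runtime.
new_text = new["content"]
new_mime = new.get("mime_type")

# Unpacking a dict iterates over its keys, not its values, so
# old-style `content, mime_type = result` no longer yields the data.
first, second = new
```

Code that unpacked the old result positionally is the easiest to miss during migration, because unpacking a dict still runs without raising and simply returns the key names.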

Internal:

  • Reworked internals into a class-based architecture to allow extensibility (internal; architecture)