v0.6.0rc1

Pre-release

@kba kba released this 10 Oct 14:35
· 568 commits to main since this release

Fixed:

  • continue processing when no columns detected but text regions exist
  • convert marginalia to main text if no main text is present
  • reset deskewing angle to 0° when text covers <30% image area and detected angle >45°
  • 🔥 polygons: avoid invalid paths (use Polygon.buffer() instead of dilation etc.)
  • return_boxes_of_images_by_order_of_reading_new: avoid NumPy dtype mismatch, simplify
  • return_boxes_of_images_by_order_of_reading_new: log any exceptions instead of ignoring
  • filter_contours_without_textline_inside: avoid removing from duplicate lists twice
  • get_marginals: exit early if no peaks found to avoid spurious overlap mask
  • get_smallest_skew: after shifting search range of rotation angle, use overall best result
  • Dockerfile: fix CUDA installation (cuDNN contested between Torch and TF due to extra OCR)
  • OCR: reinstate missing methods and fix utils_ocr function calls
  • mbreorder/enhancement CLIs: add missing imports
  • 🔥 writer: SeparatorRegion needs SeparatorRegionType (not ImageRegionType)
    f458e3e
  • tests: switch from pytest-subtests to parametrize so we can use pytest-isolate
    (so CUDA memory gets freed between tests if running on GPU)
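
The deskewing reset rule listed above (text covering <30% of the image area combined with a detected angle >45°) can be illustrated with a minimal sketch. The function name, the parameter names, and the use of the absolute value of the angle are assumptions made for illustration; they are not taken from the eynollah code base.

```python
def effective_skew_angle(
    angle_deg: float,
    text_area_fraction: float,
    min_text_fraction: float = 0.30,   # threshold from the changelog entry
    max_trusted_angle: float = 45.0,   # threshold from the changelog entry
) -> float:
    """Return 0.0 if the skew estimate looks implausible, else the angle unchanged.

    Hypothetical helper: treating negative angles symmetrically via abs()
    is an assumption, not something the release notes state.
    """
    sparse_text = text_area_fraction < min_text_fraction
    extreme_angle = abs(angle_deg) > max_trusted_angle
    if sparse_text and extreme_angle:
        # Little text and a very large angle: distrust the estimate.
        return 0.0
    return angle_deg
```

Both conditions must hold for the reset, so a large angle on a text-dense page, or a small angle on a sparse page, passes through unchanged.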

Added:

  • 🔥 layout CLI: new option --model_version to override default choices
  • test coverage for OCR options in layout
  • test coverage for table detection in layout
  • CI linting with ruff

Changed:

  • polygons: slightly widen for regions and lines, increase for separators
  • various refactorings, some code style and identifier improvements
  • deskewing/multiprocessing: switch back to ProcessPoolExecutor (faster),
    but use shared memory if necessary, and switch back from loky to stdlib,
    and shutdown in __del__() instead of atexit
  • 🔥 OCR: switch CNN-RNN model to 20250930 version compatible with TF 2.12 on CPU, too
  • OCR: allow running -tr without -fl, too
  • 🔥 writer: use @type='heading' instead of 'header' for headings
  • 🔥 performance gains via refactoring (simplification, less copy-code, vectorization,
    avoiding unused calculations, avoiding unnecessary 3-channel image operations)
  • 🔥 heuristic reading order detection: many improvements
    • contour vs splitter box matching:
      • contour must be contained in box exactly instead of heuristics
      • fall back to center matching, where the center must be contained in the box
    • original vs deskewed contour matching:
      • same min-area filter on both sides
      • similar area score in addition to center proximity
      • avoid duplicate and missing mappings by allowing N:M
        matches and splitting+joining where necessary
  • CI: update+improve model caching
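
The contour vs splitter box matching described above (exact containment first, center containment as fallback) might look roughly like the following sketch. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`, contours given as lists of `(x, y)` vertices, and the vertex mean as the "center"; all names are hypothetical and the actual implementation may differ.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def contains_point(box: Box, point: Point) -> bool:
    """True if the point lies inside the box (borders included)."""
    x_min, y_min, x_max, y_max = box
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


def match_contour_to_box(contour: Sequence[Point], boxes: List[Box]) -> Optional[int]:
    """Return the index of the matching box, or None if nothing matches."""
    # First pass: the contour must be contained in the box exactly,
    # i.e. every vertex lies inside it.
    for index, box in enumerate(boxes):
        if all(contains_point(box, p) for p in contour):
            return index
    # Fallback: match by center (here: mean of the vertices, an assumption),
    # which must itself be contained in the box.
    center = (
        sum(x for x, _ in contour) / len(contour),
        sum(y for _, y in contour) / len(contour),
    )
    for index, box in enumerate(boxes):
        if contains_point(box, center):
            return index
    return None
```

A contour that straddles two boxes fails the exact test and is then assigned to whichever box holds its center; a contour whose center lies in no box stays unmatched.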

Merged PRs