
parse_document() raises KeyError: 'has_ocr_text' when document.use_ocr is False. #335

@Creykish

Description

Parsing a document raises a KeyError when OCR is disabled. See the traceback below, from the parse_document function in helpers/document_layout.py.

    849 else:
    850     decision = {"should_ocr": False}
--> 852 if decision["has_ocr_text"]:  # prevent MD styling if already OCR'd
    853     page_full_ocred = True
    855 if decision["should_ocr"]:
    856     # We should be OCR: check full-page vs. text-only
     856     # We should be OCR: check full-page vs. text-only

KeyError: 'has_ocr_text'

This happens because, when document.use_ocr is False, parse_document bypasses the check_ocr.should_ocr_page() method, which would normally populate the decision dictionary with the "has_ocr_text" field. Instead it builds its own sparse dictionary that lacks this field, so the lookup that follows fails. See the annotated snippet below.

def parse_document(
        ...
        if document.use_ocr:
            decision = check_ocr.should_ocr_page(
                page,
                dpi=ocr_dpi,
                edge_thresh=0.015,
                blocks=blocks,
            )
        else:
            decision = {"should_ocr": False}  # <- dict created here has no "has_ocr_text" key

        if decision["has_ocr_text"]:  # <- so this lookup raises KeyError
            page_full_ocred = True
        ...
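
For anyone who wants to confirm the failure without a real document, here is a minimal, self-contained reproduction of the same branch. The helper names (choose_decision, page_already_ocred) are hypothetical stand-ins, and the dict returned in the use_ocr branch is only an assumed shape for what check_ocr.should_ocr_page() returns, not the library's actual output.

```python
# Minimal reproduction with stand-in names; this is not the library's real code.
def choose_decision(use_ocr: bool) -> dict:
    """Mimic the if/else branch from parse_document."""
    if use_ocr:
        # Assumed shape of the dict produced by check_ocr.should_ocr_page().
        return {"should_ocr": True, "has_ocr_text": False}
    # Sparse dict from the else branch: "has_ocr_text" is absent.
    return {"should_ocr": False}


def page_already_ocred(decision: dict) -> bool:
    # Same unconditional lookup as the failing line in the traceback.
    return decision["has_ocr_text"]


missing_key = None
try:
    page_already_ocred(choose_decision(use_ocr=False))
except KeyError as exc:
    missing_key = exc.args[0]
    print(f"KeyError: {missing_key!r}")
```

With use_ocr=True the stubbed dict contains both keys, so the same lookup succeeds; only the OCR-disabled path fails.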

Suggested fixes, either: (1) add this field to the dict when bypassing the method (quick fix), or (2) refactor decision into a dataclass or similar, so that required fields are always present with appropriate defaults.
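
A rough sketch of both options, to make the suggestion concrete. OcrDecision, from_dict, and skipped_ocr_decision are hypothetical names, and I am assuming "should_ocr" and "has_ocr_text" are the only keys that matter here; should_ocr_page() may well return more, in which case those would need fields too.

```python
from dataclasses import dataclass


# Option 1 (quick fix): the bypass dict carries every key the caller reads.
def skipped_ocr_decision() -> dict:
    return {"should_ocr": False, "has_ocr_text": False}


# Option 2: a dataclass, so both fields always exist with safe defaults.
@dataclass
class OcrDecision:
    should_ocr: bool = False
    has_ocr_text: bool = False

    @classmethod
    def from_dict(cls, raw: dict) -> "OcrDecision":
        # Tolerate dicts that lack either key by falling back to False.
        return cls(
            should_ocr=bool(raw.get("should_ocr", False)),
            has_ocr_text=bool(raw.get("has_ocr_text", False)),
        )


# The sparse dict from the OCR-disabled branch no longer raises.
decision = OcrDecision.from_dict({"should_ocr": False})
page_full_ocred = decision.has_ocr_text
```

Option 1 is a one-line change at the else branch. Option 2 touches more call sites (decision["..."] becomes decision.attr), but a missing field then shows up as a default rather than a runtime KeyError.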

Thanks!

C.
