parse_document() produces KeyError: 'has_ocr_text' when document.use_ocr is False. #335

Parsing a document raises a `KeyError` when OCR is disabled. See the traceback below, from the `parse_document` function in `helpers/document_layout.py`.

```
     849 else:
     850     decision = {"should_ocr": False}
--> 852 if decision["has_ocr_text"]:  # prevent MD styling if already OCR'd
     853     page_full_ocred = True
     855 if decision["should_ocr"]:
     856     # We should be OCR: check full-page vs. text-only

KeyError: 'has_ocr_text'

```

This happens because, when `document.use_ocr` is false, `parse_document` bypasses the `check_ocr.should_ocr_page()` call that would normally populate the `decision` dictionary with this field. Instead it builds its own sparse dictionary, which lacks the field. See the annotated snippet below.


```
def parse_document(...):
        ...
        if document.use_ocr:
            decision = check_ocr.should_ocr_page(
                page,
                dpi=ocr_dpi,
                edge_thresh=0.015,
                blocks=blocks,
            )
        else:
            decision = {"should_ocr": False}  # <- dict created here has no "has_ocr_text" field

        if decision["has_ocr_text"]:  # <- which raises KeyError here
            page_full_ocred = True
        ...

```

Suggested fixes: either 1) add this field to the dict when bypassing the call (quick fix), or 2) refactor `decision` into a dataclass or similar, so that required fields are always present with appropriate defaults.
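
For illustration, here is a rough sketch of both options. `SKIP_OCR_DECISION`, `OcrDecision` and `from_dict` are hypothetical names, not part of the codebase, and I'm assuming `False` is an acceptable default for `has_ocr_text` when OCR is disabled; the real `decision` dict may carry other fields I haven't listed.

```python
from dataclasses import dataclass

# Option 1 (quick fix): the bypass dict carries every key that is read later.
# Assumption: False is a sensible default for "has_ocr_text" when OCR is off.
SKIP_OCR_DECISION = {"should_ocr": False, "has_ocr_text": False}


# Option 2: a hypothetical dataclass so the fields always exist with defaults.
@dataclass
class OcrDecision:
    should_ocr: bool = False
    has_ocr_text: bool = False

    @classmethod
    def from_dict(cls, raw: dict) -> "OcrDecision":
        # Missing keys fall back to the defaults instead of raising KeyError.
        return cls(
            should_ocr=bool(raw.get("should_ocr", False)),
            has_ocr_text=bool(raw.get("has_ocr_text", False)),
        )


# The sparse dict from the traceback no longer breaks the later check.
decision = OcrDecision.from_dict({"should_ocr": False})
page_full_ocred = decision.has_ocr_text
```

With option 2, the downstream checks would read `decision.has_ocr_text` and `decision.should_ocr` rather than indexing into a dict, so a missing field can't surface as a `KeyError` at use time.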

Thanks!

C. 
