Parsing a document raises a KeyError when OCR is disabled. See the traceback below, from the parse_document function in helpers/document_layout.py.
    849 else:
    850     decision = {"should_ocr": False}
--> 852 if decision["has_ocr_text"]:  # prevent MD styling if already OCR'd
    853     page_full_ocred = True
    855 if decision["should_ocr"]:
    856     # We should be OCR: check full-page vs. text-only
KeyError: 'has_ocr_text'
This happens because, when document.use_ocr is false, parse_document bypasses check_ocr.should_ocr_page(), which would normally populate the decision dictionary with this field. Instead it builds its own sparse dictionary, which lacks the field. See the annotated snippet below.
def parse_document(
    ...
    if document.use_ocr:
        decision = check_ocr.should_ocr_page(
            page,
            dpi=ocr_dpi,
            edge_thresh=0.015,
            blocks=blocks,
        )
    else:
        decision = {"should_ocr": False}  # <- dict created here has no "has_ocr_text" field
    if decision["has_ocr_text"]:  # <- which raises the KeyError here
        page_full_ocred = True
    ...
Suggested fixes, either: (1) add this field to the dict when bypassing this method (quick fix), or (2) refactor decision to use a dataclass or similar, so that required fields are always present with appropriate defaults.
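To make the two options concrete, here is a rough sketch. It is not the library's actual code: OcrDecision and from_dict are hypothetical names, and I am assuming that False is the right default for has_ocr_text when OCR is disabled, and that should_ocr_page() returns a plain dict that may carry other keys.

```python
from dataclasses import dataclass, fields

# Option 1 (quick fix): include the missing key in the bypass dict.
# Assumption: False is the appropriate value when OCR is disabled.
decision_quick = {"should_ocr": False, "has_ocr_text": False}


# Option 2: a small dataclass so required fields always exist.
# OcrDecision is a hypothetical name, not part of the library.
@dataclass
class OcrDecision:
    should_ocr: bool = False
    has_ocr_text: bool = False

    @classmethod
    def from_dict(cls, raw: dict) -> "OcrDecision":
        """Build a decision from a dict, ignoring keys this class does not model."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


# OCR disabled: the defaults apply, so attribute access cannot fail.
decision = OcrDecision()
if decision.has_ocr_text:
    page_full_ocred = True
```

With option 2, the OCR-enabled branch could wrap the existing result, e.g. OcrDecision.from_dict(check_ocr.should_ocr_page(...)), and the later checks would read decision.has_ocr_text and decision.should_ocr instead of indexing a dict.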
Thanks!
C.