Adding whitelist changes whitespace detection behavior #2923

CIRLOAM · 2020-03-13T00:03:58Z

Environment:

Python / shell input

Tesseract Version:
Tesseract Open Source OCR Engine v4.1.1-rc2-21-gf4ef with Leptonica
Tesseract Open Source OCR Engine v5.0.0-alpha-635-g90405 with Leptonica
Commit Number:
N/A
Platform:
Linux 4.15.0-88-generic

Current Behavior:

Using test image attached as example and running command:
tesseract -l eng test-image.png test-out

produces output with correct spaces, as in the attached output:

However, when you add a character whitelist like so:
tesseract -l eng -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz test-image.png test-out-2

Tesseract no longer recognizes the whitespace (spaces between words):

Confirmed with both Tesseract versions listed above, and with both the stock eng.traineddata and the best eng.traineddata files. However, if you use the combined eng.traineddata and add --oem 0 to the command, whitespace detection works as expected even with a whitelist, though OCR quality is much worse. I originally encountered this while using Python bindings to access Tesseract, and it also occurs if you request hOCR output instead of plain text.

Expected Behavior:

Whitespace detection should not change when adding character whitelist

Suggested Fix:

(I tried to look through the codebase but had no luck finding a better reference than the following)
Since it works with --oem 0, I suspect there is a change in how blobs are classified as whitespace vs. character in --oem 1, and that the whitelist somehow interacts with that decision tree. Changing the tree so that the tessedit_char_whitelist flag does not affect it should correct the issue.

test-out-5_00.txt
test-out-5_00-2.txt


Shreeshrii · 2020-03-13T01:24:30Z

Have you tried putting your whitelist in quotes and including a space in it?
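
For readers calling Tesseract from Python, here is a minimal sketch of the suggestion above. It assumes the tesseract binary is invoked through subprocess rather than through any particular binding, and the output name test-out-3 is a made-up placeholder. Passing the arguments as a list keeps the trailing space in the whitelist without any shell quoting. On the command line, the equivalent is to wrap the whole value in quotes, for example -c "tessedit_char_whitelist=ABC...xyz " with a space before the closing quote.

```python
import os
import shutil
import string
import subprocess

# Letters from the original report plus a trailing space, as suggested above.
WHITELIST = string.ascii_uppercase + string.ascii_lowercase + " "


def build_command(image, out_base, whitelist=WHITELIST):
    """Return the tesseract argv; the list form preserves the space as-is."""
    return [
        "tesseract", "-l", "eng",
        "-c", f"tessedit_char_whitelist={whitelist}",
        image, out_base,
    ]


# "test-out-3" is a hypothetical output base name, not one from the report.
cmd = build_command("test-image.png", "test-out-3")

# Only run OCR when both the binary and the test image are actually present.
if shutil.which("tesseract") and os.path.exists("test-image.png"):
    subprocess.run(cmd, check=True)
```

If the command is built as a single string and run through a shell instead, the space has to be protected by quotes, otherwise the shell splits the value at the space before Tesseract ever sees it.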

CIRLOAM · 2020-03-13T20:19:13Z

I thought I had, but when I double-checked with a quoted whitelist that includes a space, spaces were detected correctly. Looks like this can be closed. Thanks for the quick response.

Fixes #9151. The current version of libavfilter/vf_ocr.c does not have white space in the default whitelist, but it is recommended to include white space. See tesseract-ocr/tesseract#2923. Signed-off-by: Marton Balint <cus@passwd.hu>

lucaswiman · 2023-08-10T19:02:59Z

For anyone else running into this issue, it seems to be a behavior change between 4.0.0 and 4.1.1, which can be triggered by upgrading from Debian buster to bullseye and running apt install tesseract-ocr. The changelog for 4.1.0 lists "Implemented support for whitelist/blacklist in LSTM engine.", so it seems that the -c tessedit_char_whitelist parameter was previously a no-op, and that including whitespace only when it is requested is the correct behavior.
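
A rough sketch of how a script could guard against this change, assuming the version banner has the form shown in the Environment section of the report. The helper names parse_version and lstm_honors_whitelist are made up for illustration, and the 4.1.0 threshold is taken only from the changelog line quoted above. Pre-release builds of 4.1.0 would also compare as 4.1.0 here, which may or may not match their actual behavior.

```python
import re


def parse_version(banner):
    """Extract (major, minor, patch) from a Tesseract version banner."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError(f"no version number found in {banner!r}")
    return tuple(int(part) for part in match.groups())


def lstm_honors_whitelist(banner):
    """True if the LSTM engine is expected to apply the whitelist.

    Assumption: support arrived in 4.1.0, per the changelog entry
    quoted in the comment above.
    """
    return parse_version(banner) >= (4, 1, 0)


banner = "Tesseract Open Source OCR Engine v4.1.1-rc2-21-gf4ef with Leptonica"
if lstm_honors_whitelist(banner):
    print("Whitelist is applied by the LSTM engine; add a space to keep word gaps.")
else:
    print("Whitelist is likely ignored by the LSTM engine on this version.")
```

A check like this would let the same calling code add the space to the whitelist only where the LSTM engine is expected to enforce it.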

CIRLOAM closed this as completed Mar 13, 2020

amitdo added the allowlist / denylist label Oct 1, 2021
