Adding whitelist changes whitespace detection behavior #2923

Closed
CIRLOAM opened this issue Mar 13, 2020 · 3 comments

Comments


CIRLOAM commented Mar 13, 2020

Environment:

Python / shell input

  • Tesseract Version:
    Tesseract Open Source OCR Engine v4.1.1-rc2-21-gf4ef with Leptonica
    Tesseract Open Source OCR Engine v5.0.0-alpha-635-g90405 with Leptonica

  • Commit Number:
    N/A

  • Platform:
    Linux 4.15.0-88-generic

Current Behavior:

Using the attached test image as an example and running the command:
tesseract -l eng test-image.png test-out

produces output with the correct spaces (see the attached output files below):

However, when you add a character whitelist like so:
tesseract -l eng -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz test-image.png test-out-2

Tesseract no longer recognizes the whitespace (spaces between words):

Confirmed with both Tesseract versions above, and with both the stock eng.traineddata and the best eng.traineddata files. However, if you use the combined eng.traineddata and add --oem 0 to the command, whitespace detection works as expected even with a whitelist, but the OCR quality is much worse. I originally encountered this while using Python bindings to access Tesseract, and it also occurs if you request hOCR output instead of plain text.

Expected Behavior:

Whitespace detection should not change when a character whitelist is added.

Suggested Fix:

(I tried to look through the codebase but had no luck finding a better reference than the following)
Since it works with --oem 0, I suspect that blobs are classified as whitespace vs. character differently in --oem 1, and that the whitelist somehow interacts with that decision tree. Changing the tree so that the tessedit_char_whitelist flag does not affect it should correct the issue.

test-image
test-out-5_00.txt
test-out-5_00-2.txt

@Shreeshrii
Collaborator

Have you tried putting your whitelist in quotes and including a space in it?

@CIRLOAM
Author

CIRLOAM commented Mar 13, 2020

I thought I had, but when I double-checked with the whitelist in quotes and a space included, it detected spaces correctly. Looks like this can be closed. Thanks for the quick response.
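
For readers driving Tesseract from Python, here is a minimal sketch of the same fix: the whitelist from the original command with a literal space appended. The helper names `build_tesseract_cmd` and `run_ocr` and the constant `WHITELIST` are hypothetical, made up for this illustration; only the `tesseract` flags already shown in this thread are used. Passing the arguments as a list avoids shell quoting entirely, so the trailing space stays inside a single argument.

```python
import string
import subprocess


def build_tesseract_cmd(image, out_base, whitelist, lang="eng"):
    """Return the argv list for a tesseract run with a character whitelist.

    Hypothetical helper: it mirrors the command line from the report above.
    """
    return [
        "tesseract", "-l", lang,
        # One argv element, so a space inside the whitelist needs no quoting.
        "-c", f"tessedit_char_whitelist={whitelist}",
        image, out_base,
    ]


def run_ocr(cmd):
    """Run the command; needs tesseract on PATH and the input image on disk."""
    subprocess.run(cmd, check=True)


# Letters from the original whitelist plus a literal space.
WHITELIST = string.ascii_uppercase + string.ascii_lowercase + " "

cmd = build_tesseract_cmd("test-image.png", "test-out-2", WHITELIST)
# run_ocr(cmd)  # uncomment to actually invoke tesseract
```

When the command is typed in a shell instead, the whole `tessedit_char_whitelist=...` value has to be quoted so that the space is not treated as an argument separator.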

CIRLOAM closed this as completed Mar 13, 2020

TimothyGu pushed a commit to FFmpeg/FFmpeg that referenced this issue Mar 19, 2021
Fixes #9151. The current version of libavfilter/vf_ocr.c does not have white
space in the default whitelist. But it is recommended to include white
space. See tesseract-ocr/tesseract#2923

Signed-off-by: Marton Balint <cus@passwd.hu>

lucaswiman commented Aug 10, 2023

For anyone else running into this issue, it seems to be a behavior change between 4.0.0 and 4.1.1, which can be triggered by upgrading from Debian buster to bullseye and running apt install tesseract-ocr. The changelog for 4.1.0 lists "Implemented support for whitelist/blacklist in LSTM engine.", so it seems that the -c tessedit_char_whitelist parameter was previously a no-op for the LSTM engine, and that emitting whitespace only when the whitelist includes it is the correct behavior.
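
A minimal sketch of how a script could guard against this change, assuming the 4.1.0 threshold from the changelog entry quoted above. The helpers `parse_version`, `lstm_honors_whitelist`, and `with_space` are hypothetical names for this illustration; the banner strings are the ones from the original report.

```python
import re


def parse_version(banner):
    """Extract (major, minor, patch) from a tesseract version banner line."""
    match = re.search(r"v?(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError(f"no version found in {banner!r}")
    return tuple(int(part) for part in match.groups())


def lstm_honors_whitelist(version):
    """Assumption: the LSTM engine applies the whitelist from 4.1.0 onward,
    per the changelog entry quoted in this thread."""
    return version >= (4, 1, 0)


def with_space(whitelist):
    """Append a space to the whitelist unless it already contains one."""
    return whitelist if " " in whitelist else whitelist + " "


banner = "Tesseract Open Source OCR Engine v4.1.1-rc2-21-gf4ef with Leptonica"
version = parse_version(banner)
whitelist = "ABCabc"
if lstm_honors_whitelist(version):
    # On these versions a whitelist without a space drops the word gaps.
    whitelist = with_space(whitelist)
```

Adding the space unconditionally is also reasonable, since by the reading above the parameter had no effect on the LSTM engine in earlier versions.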
