Tesseract Version:
Tesseract Open Source OCR Engine v4.1.1-rc2-21-gf4ef with Leptonica
Tesseract Open Source OCR Engine v5.0.0-alpha-635-g90405 with Leptonica
Commit Number:
N/A
Platform:
Linux 4.15.0-88-generic
Current Behavior:
Using the attached test image as an example, running the command:
tesseract -l eng test-image.png test-out
produces output with correct spaces, as in the attachment below:
However, when you add a character whitelist like so:
tesseract -l eng -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz test-image.png test-out-2
Tesseract no longer recognizes the whitespace (spaces between words):
Confirmed with both Tesseract versions listed above, and with both the stock eng.traineddata and the best eng.traineddata files. However, if you use the combined eng.traineddata and add --oem 0 to the command, whitespace detection works as expected even with a whitelist, although the OCR quality is much worse. I originally encountered this while using Python bindings to access Tesseract, and I know it also occurs if you request hOCR output instead of plain text.
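One way to check for the symptom described above is to compare the word counts of the two output files. The following is only a minimal sketch: the helper names are hypothetical, and it assumes that test-out.txt and test-out-2.txt were produced by the two commands shown earlier. It does not call Tesseract itself.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens in OCR output."""
    return len(text.split())


def spaces_lost(baseline: str, whitelisted: str) -> bool:
    """Return True if the whitelisted run produced fewer words than the
    baseline run, which is what happens when the spaces inside a line are
    dropped and each line collapses into a single token."""
    return count_words(whitelisted) < count_words(baseline)


# Hypothetical usage with the output files from the two commands above:
#   baseline = open("test-out.txt", encoding="utf-8").read()
#   whitelisted = open("test-out-2.txt", encoding="utf-8").read()
#   print(spaces_lost(baseline, whitelisted))
```

A lower word count is only a rough signal; a whitelist can also change which characters are recognized, so the two outputs may differ for other reasons.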
Expected Behavior:
Whitespace detection should not change when a character whitelist is added.
Suggested Fix:
(I tried to look through the codebase but had no luck finding a better reference than the following)
Since it works with --oem 0, I suspect there is a change to how blobs are classified as whitespace vs. character in --oem 1, and somehow the whitelist interacts with that decision tree. Changing that tree so that the tessedit_char_whitelist flag does not affect it should correct the issue.
test-out-5_00.txt
test-out-5_00-2.txt
Fixes #9151. The current version of libavfilter/vf_ocr.c does not have white
space in the default whitelist. But it is recommended to include white
space. See tesseract-ocr/tesseract#2923
Signed-off-by: Marton Balint <cus@passwd.hu>
For anyone else running into this issue: it appears to be a behavior change between 4.0.0 and 4.1.1, which you can hit by upgrading from Debian buster to bullseye and running apt install tesseract-ocr. The changelog for 4.1.0 lists "Implemented support for whitelist/blacklist in LSTM engine.", so it seems that the -c tessedit_char_whitelist parameter was previously a no-op in the LSTM engine, and that emitting whitespace only when the whitelist includes it is the correct behavior.
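Following the commit message and the comment above, a whitelist that should keep word spacing can include the space character itself. The sketch below only builds the argument list; the helper name and the output base name test-out-3 are hypothetical, and whether the added space restores spacing on a given Tesseract version is an assumption to verify locally.

```python
import string


def tesseract_args(image: str, out_base: str, whitelist: str, lang: str = "eng") -> list:
    """Build the argument list for a whitelisted tesseract run.

    Passing the arguments as a list (rather than one shell string) means
    a space inside the whitelist needs no shell quoting.
    """
    return [
        "tesseract",
        "-l", lang,
        "-c", f"tessedit_char_whitelist={whitelist}",
        image,
        out_base,
    ]


# Same letters as the whitelist in the report, plus a trailing space.
LETTERS_AND_SPACE = string.ascii_uppercase + string.ascii_lowercase + " "

args = tesseract_args("test-image.png", "test-out-3", LETTERS_AND_SPACE)

# To actually run it (requires tesseract on PATH):
#   import subprocess
#   subprocess.run(args, check=True)
```

From a shell, the equivalent would be to quote the whole -c value so that the space stays part of the whitelist.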
Environment:
Python / shell input