Tesseract thresholding eliminates text on bright background colors #1990

Closed
jbarlow83 opened this issue Oct 14, 2018 · 10 comments

Comments


jbarlow83 commented Oct 14, 2018

Environment

  • Tesseract Version: 4.0.0-rc1-8-g1bee homebrew build
  • Platform: macOS High Sierra

Current Behavior:

When tesseract is run on the attached image, the text on highlight backgrounds is missing from output.

[Image: highlight]

tesseract -c tessedit_write_images=true  _.png stdout
Not highlighted text

The thresholder blacks out the text (this is tessinput.tif):

[Image: tessinput]

Expected Behavior:

Thresholder should treat highlights as background so that Tesseract recognizes all of the text.

Collaborator

amitdo commented Oct 15, 2018

I wonder why Tesseract uses its own Otsu binarization rather than Leptonica's.

You might want to test this image with Leptonica's binarization options.

It's also possible that your issue is caused by some filter applied before or after the binarization.
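
For readers unfamiliar with Otsu's method, here is a toy sketch of how a single global threshold can swallow a mid-toned region. It is not Tesseract's or Leptonica's implementation, and it does not claim to be the cause of this issue. The gray levels (0 for text, 150 for a highlight, 255 for paper) and the pixel counts are invented for illustration.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t are treated as dark (foreground).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of their gray levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1/total**2.
        score = w0 * w1 * (m0 - m1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


# Invented example: 60% paper, 30% highlight, 10% text.
pixels = [255] * 600 + [150] * 300 + [0] * 100
t = otsu_threshold(pixels)
highlight_is_dark = 150 <= t
print(t, highlight_is_dark)
```

With these invented counts the best split separates paper from everything else, so the highlight lands on the dark side together with the text. A different mix of pixel counts can move the threshold below the highlight instead, which is why a global method is sensitive to how much of the page is highlighted.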

Collaborator

amitdo commented Oct 15, 2018

I thresholded the image in GIMP and gave the resulting binary image to Tesseract.

Output:

Not highlighted text

Highlighted text

 

 

Highlighted background

Collaborator

amitdo commented Oct 15, 2018

related: #242 (comment)

Collaborator

amitdo commented Oct 16, 2018


zhangpy commented Nov 6, 2018

@amitdo Could you show me how to threshold an image with GIMP? Is there a command for it? Thanks.

Collaborator

amitdo commented Nov 6, 2018

GIMP is a GUI tool. You might want to try ImageMagick instead, which is a command-line tool.
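
As a rough sketch of what that could look like from a script, the snippet below builds and runs an ImageMagick command that converts an image to grayscale and applies a fixed threshold. The helper names `build_threshold_command` and `binarize`, the file names, and the 50% cut-off are assumptions for illustration. The executable is `magick` on ImageMagick 7 and `convert` on older installs, so check which one your system provides.

```python
import shutil
import subprocess


def build_threshold_command(src, dst, percent=50, tool="convert"):
    """Return the argument list for a fixed-threshold binarization.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return [tool, src, "-colorspace", "Gray", "-threshold", f"{percent}%", dst]


def binarize(src, dst, percent=50):
    """Run ImageMagick to write a black-and-white copy of src to dst."""
    # Prefer the ImageMagick 7 entry point when it is installed.
    tool = "magick" if shutil.which("magick") else "convert"
    subprocess.run(build_threshold_command(src, dst, percent, tool), check=True)


# Example (assumes ImageMagick is installed and input.png exists):
# binarize("input.png", "binarized.png", percent=50)
```

The binarized file can then be passed to Tesseract in place of the original image. The right percentage depends on the image, so expect to try a few values.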

Collaborator

amitdo commented Nov 16, 2018

CC: @jbreiden

@Shreeshrii
Collaborator

I wonder why Tesseract uses its own Otsu binarization rather than Leptonica's.

https://github.com/DanBloomberg/leptonica/blob/master/src/binarize.c

 *  ===================================================================
 *  Image binarization algorithms are found in:
 *    grayquant.c:   standard, simple, general grayscale quantization
 *    adaptmap.c:    local adaptive; mostly gray-to-gray in preparation
 *                   for binarization
 *    binarize.c:    special binarization methods, locally adaptive and
 *                   global.
 *  ================================================================

Collaborator

amitdo commented May 7, 2021

With my patch in #3418 and the eng.traineddata from tessdata_best, I get:


Not highlighted text

Highlighted text

Highlighted background

I used -c thresholding_method=2, which selects the Sauvola thresholding method.
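
For background, Sauvola is a locally adaptive method: each pixel is compared against a threshold computed from the mean and standard deviation of its own neighbourhood, so a uniformly tinted highlight can still count as background. The sketch below is a plain-Python illustration of the textbook formula T = m * (1 + k * (s / R - 1)), not the code from #3418 or from Leptonica. The window size, `k = 0.34`, `R = 128`, and the gray levels are assumed values.

```python
import math


def sauvola_binarize(image, window=3, k=0.34, r=128.0):
    """Return a grid of booleans; True marks foreground (dark) pixels.

    image is a list of rows of 8-bit gray levels. window, k and r are
    illustrative defaults, not values taken from Tesseract.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            values = [
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = sum(values) / len(values)
            variance = sum((v - mean) ** 2 for v in values) / len(values)
            std = math.sqrt(variance)
            # Local threshold: never above the local mean for 8-bit data.
            threshold = mean * (1.0 + k * (std / r - 1.0))
            row.append(image[y][x] < threshold)
        result.append(row)
    return result


# Invented example: a 5x5 highlight patch (gray 150) with one text pixel (0).
highlight = [[150] * 5 for _ in range(5)]
highlight[2][2] = 0
mask = sauvola_binarize(highlight)
foreground_count = sum(v for row in mask for v in row)
```

In this toy patch only the dark centre pixel is marked as foreground, because the threshold inside the highlight sits below the highlight's own gray level. Real images also need a sensible window size, since sharp boundaries between regions can still produce edge artifacts.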

Collaborator

amitdo commented May 10, 2021

Fixed in #3418.
