Accuracy problem 4.0 beta3 -> 4.0 final? #2048

Closed
niksedk opened this issue Nov 11, 2018 · 14 comments
niksedk commented Nov 11, 2018

I'm having some issues with accuracy when upgrading from 4.0 beta 3 to 4.0 final.

Setup:

cmd: tesseract image output -l eng --oem 1
tessdata: https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
platform: Windows 7 (Tesseract compiled via vcpkg)

Results:

a.png
tesseract 4.0.0 final: In @ crowded city, as | bump shoulders, I'm all alone
tesseract 4.0 beta 3: In a crowded city, as | bump shoulders, I'm all alone

b.png
tesseract 4.0.0 final: ane unexpectedly, ror whatever reason,
tesseract 4.0 beta 3: and unexpectedly, for whatever reason,

c.png
tesseract 4.0.0 final: they show me kincdness, | loetl
tesseract 4.0 beta 3: they show me kindness, | bet!

d.png
tesseract 4.0.0 final: [t wash t my rault...l
tesseract 4.0 beta 3: It wasn't my fault...!

test-images.zip

Contributor

stweil commented Nov 11, 2018

Thank you for reporting this. I can confirm your results and will look into what caused this regression.

Contributor

stweil commented Nov 11, 2018

4.0.0-rc1 was the first bad version.

Contributor

stweil commented Nov 11, 2018

Here is the result of git bisect:

5fe1390748a15c0e445a5c57c834edff27ff2f4d is the first bad commit
commit 5fe1390748a15c0e445a5c57c834edff27ff2f4d
Author: zdenop <zdenop@gmail.com>
Date:   Thu Sep 27 19:40:15 2018 +0200

    remove alpha channel from png: issue #1914

:040000 040000 fe9b65d6b515cd8c6c04fd9cfe1eaed10e38cb58 685d0cfb0f4449435456bd788438639e6b53119c M	src

So the fix for issue #1914 (commit 5fe1390) caused an accuracy regression for other images.

Contributor

zdenop commented Nov 12, 2018

@niksedk: just wondering, how did you build tesseract with vcpkg? As far as I can see, it supports only 3.05.02...

Author

niksedk commented Nov 12, 2018

@zdenop: For example: vcpkg install tesseract:x86-windows-static --head (this only supports master; being able to give a commit ID would have been nice)

Contributor

zdenop commented Nov 12, 2018

The problem is related to the input image: the letters are white with a black outline, and the outline becomes exposed when the alpha channel is removed and replaced with white:
no_aplha
And after binarization, this is what is used for OCR:

4 0-tessinput
And as we know, tesseract is not very good with outline characters...

Just a note: removal of the alpha channel was implemented because (some?) PNG images were not processed correctly by tesseract, and PDF creation did it anyway (just for PDF output, not for OCR).

If the alpha channel is not removed, tesseract binarizes in a different way (if I understood it correctly, it inverts the input image) and uses this image for OCR:
rc0-tessinput
I do not have time to test it in depth, but it seems that tesseract interprets the missing background as a black background and inverts the image for binarization.
Just a quick test: pixRemoveAlpha(pix) is equivalent to pixAlphaBlendUniform(pix, 0xffffff00). If tesseract used black to replace the alpha channel there (pixAlphaBlendUniform(pix, 0x00000000)), we would get the same result as before the removal of the alpha channel was implemented:
tessinput_black
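
As a rough illustration of the white-versus-black choice described above, here is a minimal pure-Python sketch of blending RGBA pixels over a uniform background. It does not call Leptonica: `blend_uniform` and the toy scanline are hypothetical names made up for this example, and the rounding is an assumption that may differ from what pixAlphaBlendUniform does internally.

```python
def blend_uniform(pixels, background):
    """Composite RGBA pixels over a solid RGB background colour.

    pixels: list of (r, g, b, a) tuples with 0-255 components.
    background: (r, g, b) tuple.
    Returns a list of opaque (r, g, b) tuples.
    """
    out = []
    for r, g, b, a in pixels:
        blended = tuple(
            (c * a + bg * (255 - a) + 127) // 255  # rounded integer blend
            for c, bg in zip((r, g, b), background)
        )
        out.append(blended)
    return out


# Toy scanline through one subtitle stroke (assumed data, not from the
# test images): transparent background, black outline, white fill,
# black outline, transparent background.
row = [
    (0, 0, 0, 0),
    (0, 0, 0, 255),
    (255, 255, 255, 255),
    (0, 0, 0, 255),
    (0, 0, 0, 0),
]

on_white = blend_uniform(row, (255, 255, 255))
on_black = blend_uniform(row, (0, 0, 0))

print(on_white)  # the outline is the only dark part: an outlined glyph
print(on_black)  # the outline merges with the background: solid white text
```

On this toy row, blending over white leaves only the thin outline dark, while blending over black gives a solid white stroke on a black background.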

I feel stuck:

  1. If we drop the removal of the alpha channel from PNG, some users will need to remove the alpha channel themselves in case of problems.
  2. If we interpret the alpha channel as white (today's solution), some users will need to fix the alpha channel before sending the image to tesseract in case of problems.
  3. If we interpret the alpha channel as black (the image viewers I use do it this way), some users will need to fix the alpha channel before sending the image to tesseract in case of problems (when a white background is expected).

So whatever we do, some users will not be satisfied...

Collaborator

amitdo commented Nov 12, 2018

According to your analysis, tesseract is doing the right thing.

And as we know, tesseract is not very good with outline characters...

This knowledge should be documented somewhere, maybe in the ImproveQuality page.

Collaborator

amitdo commented Nov 12, 2018

And as we know, tesseract is not very good with outline characters...

If the input is not similar to the trained data, you should not expect good results.

Contributor

zdenop commented Nov 14, 2018

I added a remark to the ImproveQuality page.
I am closing this issue as "wontfix", because tesseract is working as expected, just image need to be pre-processed (inverted or use black background)

zdenop closed this as completed Nov 14, 2018
zdenop added the wontfix label Nov 14, 2018

Author

niksedk commented Nov 14, 2018

Thx for the info.

@zdenop: So black letters without outline and a white background is likely the way to go?

@cypherbits

I don't think this is logical at all.

tesseract is working as expected, just image need to be pre-processed (inverted or use black background).

Now: how do you know if the image needs this pre-processing if you are building an automated system when all kinds of images are input? Tesseract should figure out what to do, outlined texts are texts too.
I feel that you closed the bug and blamed the text itself just to work less on a fix. Tesseract should work on all kinds of texts.

@jbreiden
Contributor

There are a lot of really weird PNG images out there. I was too afraid to touch how recognition works when I was looking at the alpha channel problem during PDF generation. Some have text shapes defined entirely in the alpha channel. It was never clear to me what the best thing to do is; at one point I had considered running recognition on each color channel separately, including alpha.

gmail-logo-without-alpha

logo1-alpha

logo1
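
To make the per-channel idea above concrete, here is a small pure-Python sketch that splits RGBA pixels into four single-channel planes and reports which planes carry any variation. It is only an illustration, not something tesseract does; `split_channels` and the toy row are hypothetical.

```python
def split_channels(pixels):
    """Split a non-empty list of RGBA tuples into four planes."""
    r, g, b, a = (list(channel) for channel in zip(*pixels))
    return {"R": r, "G": g, "B": b, "A": a}


# Toy row (assumed data): every pixel is black, so the shape exists
# only in the alpha channel.
row = [
    (0, 0, 0, 0),
    (0, 0, 0, 255),
    (0, 0, 0, 255),
    (0, 0, 0, 0),
]

planes = split_channels(row)

# A plane with a single constant value cannot contain any text shape.
informative = [name for name, plane in planes.items() if len(set(plane)) > 1]
print(informative)
```

For such an image the colour planes are constant, so only the alpha plane would be worth passing to a recognizer.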

Contributor

zdenop commented Nov 17, 2018

@cypherbits: tesseract is an OCR engine. It has always been communicated that the user should preprocess the image; the wiki page Improving quality is one of the oldest. So the user is always responsible for the quality of the input image. There was never any intention that tesseract would figure out how to improve the image (and there have been plenty of requests for automatic screenshot OCR).

Yes, tesseract does some image processing, e.g. binarization, but it uses the Otsu algorithm for that. It does not work for all images, but if you do not like it, you can binarize the image yourself with another algorithm.
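
For readers who want to binarize images themselves, here is a minimal sketch of the textbook global Otsu method in pure Python. It is not tesseract's actual implementation, which may differ in detail; `otsu_threshold` and the sample grey levels are hypothetical names and data for this example.

```python
def otsu_threshold(gray):
    """Return the threshold t that maximises between-class variance.

    gray: iterable of integer grey levels in 0..255.
    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * 256
    for v in gray:
        hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the dark class
    sum0 = 0  # sum of grey levels in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break     # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Toy image (assumed data): half dark pixels, half bright pixels.
gray = [10] * 50 + [200] * 50
t = otsu_threshold(gray)
binary = [1 if v > t else 0 for v in gray]
print(t, sum(binary))
```

A single global threshold like this works well on clearly bimodal images; for uneven lighting a local or adaptive method is usually a better fit.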

By the same logic, we could do nothing about the alpha channel (i.e. some images will work, some will not). Because it is usually difficult for a user to identify that the alpha channel is the cause of an OCR problem, I chose the strategy of doing something: replacing the alpha channel with white...

What you are asking for is that we change tesseract from an OCR engine into an OCR suite. This is simply not a goal, because of a lack of resources (programmers willing to contribute to open source). But you can take it as your business opportunity, as several people have ;-)

Patches that make tesseract handle all kinds of text are welcome.

Contributor

zdenop commented Nov 17, 2018

@niksedk: basically you can use black text on white, or white text on black, but the text should not be outlined... If you are OCRing subtitles, you should be able to improve the image easily: remove the transparency using the outline color (maybe just inverting the image could work)...
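
A possible sketch of the "just invert" step, in pure Python on a list of grey levels. The polarity check is a simple heuristic assumed for this example (the background is taken to cover more than half of the pixels), and `invert` and `dark_on_light` are hypothetical helper names, not part of tesseract or Leptonica.

```python
def invert(gray):
    """Invert 8-bit grey levels."""
    return [255 - v for v in gray]


def dark_on_light(gray):
    """Return the image as dark text on a light background.

    Heuristic assumption: the background covers more than half of the
    pixels, so a mostly dark image is treated as light text on dark.
    """
    dark = sum(1 for v in gray if v < 128)
    if dark * 2 > len(gray):
        return invert(gray)
    return list(gray)


# Toy row (assumed data): white text pixels on a black background, as
# obtained after filling the transparency with the black outline colour.
frame = [0, 0, 0, 255, 255, 0, 0, 0]
print(dark_on_light(frame))
```

An image that is already dark on light passes through unchanged, so the step can be applied to every frame.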
