Accuracy problem 4.0 beta3 -> 4.0 final? #2048

niksedk · 2018-11-11T06:09:44Z

I'm having some issues with accuracy when upgrading from 4.0 beta 3 to 4.0 final.

Setup:

cmd: tesseract image output -l eng --oem 1
tessdata: https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
platform: Windows 7 (Tesseract compiled via vcpkg)

Results:

a.png
tesseract 4.0.0 final: In @ crowded city, as | bump shoulders, I'm all alone
tesseract 4.0 beta 3: In a crowded city, as | bump shoulders, I'm all alone

b.png
tesseract 4.0.0 final: ane unexpectedly, ror whatever reason,
tesseract 4.0 beta 3: and unexpectedly, for whatever reason,

c.png
tesseract 4.0.0 final: they show me kincdness, | loetl
tesseract 4.0 beta 3: they show me kindness, | bet!

d.png
tesseract 4.0.0 final: [t wash t my rault...l
tesseract 4.0 beta 3: It wasn't my fault...!

test-images.zip


stweil · 2018-11-11T07:53:39Z

Thank you for reporting this. I can confirm your results and will look into what caused this regression.

stweil · 2018-11-11T07:59:37Z

4.0.0-rc1 was the first bad version.

stweil · 2018-11-11T08:08:27Z

Here is the result of git bisect:

5fe1390748a15c0e445a5c57c834edff27ff2f4d is the first bad commit
commit 5fe1390748a15c0e445a5c57c834edff27ff2f4d
Author: zdenop <zdenop@gmail.com>
Date:   Thu Sep 27 19:40:15 2018 +0200

    remove alpha channel from png: issue #1914

:040000 040000 fe9b65d6b515cd8c6c04fd9cfe1eaed10e38cb58 685d0cfb0f4449435456bd788438639e6b53119c M	src

So the fix for issue #1914 (commit 5fe1390) caused an accuracy regression for other images.

zdenop · 2018-11-12T15:05:07Z

@niksedk: just wondering, how did you build tesseract with vcpkg? As far as I can see, it supports only 3.05.02...

niksedk · 2018-11-12T15:13:30Z

@zdenop: For example: vcpkg install tesseract:x86-windows-static --head (this only supports master; being able to pin a commit ID would have been nice)

zdenop · 2018-11-12T20:19:34Z

The problem is related to the input image: the letters are white with a black outline, which becomes exposed when the alpha channel is removed and replaced with white.

After binarization, the outlined letters are what is used for OCR.

And as we know, tesseract is not very good with outline characters...

Just a note: removal of the alpha channel was implemented because (some?) PNG images were not processed correctly by tesseract, and PDF creation did it anyway (only for the PDF output, not for OCR).

If the alpha channel is not removed, tesseract binarizes differently (if I understand it correctly, it inverts the input image) and uses the inverted image for OCR.

I do not have time to test this in depth, but it seems that tesseract interprets the missing background as a black background and inverts the image for binarization.
A quick test: pixRemoveAlpha(pix) is equivalent to pixAlphaBlendUniform(pix, 0xffffff00). If tesseract used black to replace the alpha channel (pixAlphaBlendUniform(pix, 0x00000000)), we would get the same result as before the alpha-channel removal was implemented.
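To make the white-versus-black choice concrete, here is a minimal pure-Python sketch of uniform alpha blending. It is not Leptonica's code: blend_uniform and the three sample pixels are hypothetical names and values chosen for illustration, and it assumes the usual "source over solid background" compositing formula.

```python
def blend_uniform(rgba, background):
    """Composite one RGBA pixel (0-255 per channel) over a solid RGB background."""
    r, g, b, a = rgba
    alpha = a / 255.0
    return tuple(
        round(alpha * fg + (1.0 - alpha) * bg)
        for fg, bg in zip((r, g, b), background)
    )

WHITE = (255, 255, 255)
BLACK = (0, 0, 0)

# Hypothetical pixels from a subtitle-style image (assumed, not taken from a.png).
transparent = (0, 0, 0, 0)           # fully transparent background
outline = (0, 0, 0, 255)             # opaque black glyph outline
letter_fill = (255, 255, 255, 255)   # opaque white glyph interior

for name, bg in (("white", WHITE), ("black", BLACK)):
    row = [blend_uniform(p, bg) for p in (transparent, outline, letter_fill)]
    print(name, row)
# white [(255, 255, 255), (0, 0, 0), (255, 255, 255)]
# black [(0, 0, 0), (0, 0, 0), (255, 255, 255)]
```

Over white, the glyph interior merges with the background and only the thin outline stays dark. Over black, the outline merges with the background and the letter becomes a solid white shape, which is consistent with the behaviour described above.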

I feel stuck:

If we drop the removal of the alpha channel from PNG, some users will have to remove the alpha channel themselves when problems occur.
If we interpret the alpha channel as white (today's solution), some users will have to fix the alpha channel before sending the image to tesseract when problems occur.
If we interpret the alpha channel as black (the image viewers I use do it this way), some users will have to fix the alpha channel before sending the image to tesseract when problems occur (when a white background is expected).

So whatever we do, some users will not be satisfied...

amitdo · 2018-11-12T20:37:38Z

According to your analysis tesseract is doing the right thing.

And as we know, tesseract is not very good with outline characters...

This knowledge should be documented somewhere, maybe in the ImproveQuality page.

amitdo · 2018-11-12T20:42:38Z

And as we know, tesseract is not very good with outline characters...

If the input is not similar to the trained data, you should not expect good results.

zdenop · 2018-11-14T16:08:42Z

I added a remark to ImproveQuality.
I am closing this issue as "wontfix" because tesseract is working as expected; the image just needs to be pre-processed (inverted, or given a black background).

niksedk · 2018-11-14T20:46:05Z

Thx for the info.

@zdenop: So black letters without outline and a white background is likely the way to go?

cypherbits · 2018-11-15T08:42:16Z

I don't think this is logical at all.

tesseract is working as expected; the image just needs to be pre-processed (inverted, or given a black background).

Now: how do you know if the image needs this pre-processing if you are building an automated system when all kinds of images are input? Tesseract should figure out what to do, outlined texts are texts too.
I feel that you closed the bug and blamed the text itself just to work less on a fix. Tesseract should work on all kinds of texts.
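For an automated pipeline, one cheap first check is whether a PNG declares transparency at all. The sketch below reads only the chunk headers, relying on the PNG format's IHDR color types (4 and 6 carry an alpha channel) and the tRNS chunk. The function name png_has_alpha is hypothetical, and the check says nothing about whether the text itself is outlined.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_has_alpha(data: bytes) -> bool:
    """Return True if the PNG bytes declare transparency (alpha channel or tRNS chunk)."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 8 <= len(data):
        # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"IHDR":
            color_type = body[9]  # follows width (4), height (4), bit depth (1)
            if color_type in (4, 6):  # grayscale+alpha, RGB+alpha
                return True
        elif ctype == b"tRNS":
            return True  # transparency info for palette, grayscale or RGB images
        elif ctype in (b"IDAT", b"IEND"):
            break  # tRNS must appear before the image data
        pos += 12 + length
    return False

# Usage (a.png is the file name from the report above):
# with open("a.png", "rb") as f:
#     print(png_has_alpha(f.read()))
```

An image flagged this way could then be flattened onto a chosen background before being passed to tesseract, instead of relying on the default.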

jbreiden · 2018-11-16T07:54:36Z

There are a lot of really weird PNG images out there. I was too afraid to touch how recognition works when I was looking at the alpha channel problem during PDF generation. Some have text shapes defined entirely in the alpha channel. It was never clear to me what the best thing to do is; at one point I considered running recognition on each color channel separately, including alpha.

zdenop · 2018-11-17T09:33:36Z

@cypherbits: tesseract is an OCR engine. It has always been communicated that the user should preprocess the image. The wiki page "Improving quality" is one of the oldest. So the user is always responsible for the quality of the input image. There was never any intention that tesseract would figure out how to improve the image (and there have been plenty of requests for automatic screenshot OCR).

Yes, tesseract does some image processing, e.g. binarization, but it uses the Otsu algorithm for that. It does not work for all images, but if you do not like it, you can binarize the image yourself with another algorithm.
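For readers unfamiliar with Otsu's method, here is a minimal textbook sketch in pure Python. It is not tesseract's implementation; it only shows the idea of picking one global threshold that maximizes the between-class variance of an 8-bit grayscale histogram. The function name and the sample values are assumptions for illustration.

```python
def otsu_threshold(pixels):
    """Return threshold t for 8-bit gray values: one class is <= t, the other > t."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray values <= t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break  # all pixels are in the lower class; nothing left to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t

# Hypothetical two-tone image: dark pixels at 10, bright pixels at 200.
sample = [10] * 50 + [200] * 50
t = otsu_threshold(sample)
binary = [1 if p > t else 0 for p in sample]
```

Because the threshold is derived from the histogram alone, thin outlines and filled letters that share the same gray levels are treated the same way, which is one reason a different binarization or an explicit inversion step can help on images like the ones in this report.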

By the same logic, we could do nothing about the alpha channel (i.e. some images would work, some would not). Because it is usually difficult for a user to identify that the alpha channel is the cause of an OCR problem, I chose the strategy of doing something: replacing the alpha channel with white...

What you are asking for is that we change tesseract from an OCR engine into an OCR suite. This is simply not a goal, because of a lack of resources (programmers willing to contribute to open source). But you can take it as your business opportunity, as several people did ;-)

Your patches for tesseract to handle all kinds of text are welcome.

zdenop · 2018-11-17T09:39:41Z

@niksedk: basically you can use black text on white, or white text on black, but the text should not be outlined... If you are OCRing subtitles, you should be able to improve the image easily: replace the transparency with the outline color (maybe just inverting the image would work)...

stweil added the accuracy label Nov 11, 2018

zdenop closed this as completed Nov 14, 2018

zdenop added the wontfix label Nov 14, 2018

amitdo added the alpha channel label Oct 4, 2022

niksedk mentioned this issue Jan 10, 2023

Subtitle Edit - VOBSUB and OCR results in random characters SubtitleEdit/subtitleedit#6587

Closed
