Tesseract fails to recognize big text #3480

Euclisa · 2021-06-30T12:31:12Z

Environment

Tesseract Version: 4.1.1
Platform: Linux arch 5.10.46-1-lts

Current Behavior:

Tesseract can't recognize the text in this image.

Suggested Fix:

I reduced the size of the image and then it works, but shouldn't Tesseract recognize it without any extra transformations?
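The downscaling workaround can be sketched in Python. This is only a sketch under assumptions: a capital-letter height of roughly 30 px is a commonly cited rule of thumb, not something established in this thread; Pillow is assumed to be installed; and `scaled_size` and `downscale` are hypothetical helper names.

```python
def scaled_size(width, height, cap_height_px, target_cap_px=30):
    """Return (new_width, new_height) so capital letters end up ~target_cap_px tall.

    cap_height_px is the measured height of a capital letter in the input.
    target_cap_px=30 is an assumed rule-of-thumb value, not a Tesseract constant.
    """
    if cap_height_px <= 0:
        raise ValueError("cap_height_px must be positive")
    factor = target_cap_px / cap_height_px
    # Never round a dimension down to zero pixels.
    return max(1, round(width * factor)), max(1, round(height * factor))


def downscale(src_path, dst_path, cap_height_px, target_cap_px=30):
    """Resize the image at src_path and save it to dst_path (requires Pillow)."""
    from PIL import Image  # imported lazily so scaled_size works without Pillow

    with Image.open(src_path) as img:
        new_size = scaled_size(img.width, img.height, cap_height_px, target_cap_px)
        img.resize(new_size, Image.LANCZOS).save(dst_path)
    return new_size
```

A call might look like `downscale("big.jpg", "small.jpg", cap_height_px=300)`; the value 300 is purely illustrative and would have to be measured on the actual image.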

nocun · 2021-06-30T22:39:12Z

I think this may be an issue with the text being white-on-black. Can you check by manually inverting the colors and trying again?
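One way to try the inversion is sketched below. It assumes Pillow is installed and that the background covers most of the image, so a low mean intensity hints at light text on a dark background. The helper names and the threshold of 128 are illustrative assumptions, not part of Tesseract.

```python
def looks_light_on_dark(gray_pixels, threshold=128):
    """Guess whether 8-bit grayscale pixel values come from a dark-background image.

    Assumes the background dominates the image, so a mean intensity below
    the (assumed) threshold suggests light text on a dark background.
    """
    pixels = list(gray_pixels)
    if not pixels:
        raise ValueError("no pixels given")
    return sum(pixels) / len(pixels) < threshold


def invert_if_needed(src_path, dst_path):
    """Save a dark-on-light grayscale copy of the image (requires Pillow).

    Returns True if the image was inverted.
    """
    from PIL import Image, ImageOps  # imported lazily; only needed here

    with Image.open(src_path) as img:
        gray = img.convert("L")  # 8-bit grayscale, one byte per pixel
        inverted = looks_light_on_dark(gray.tobytes())
        if inverted:
            gray = ImageOps.invert(gray)
        gray.save(dst_path)
    return inverted
```

Running Tesseract on the file written by `invert_if_needed` and comparing the result with the original would show whether polarity is the cause.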

wollmers · 2021-07-01T04:25:10Z

Tried it with Tesseract version 5 and it works perfectly:

$ tesseract big.jpg big.jpg  --tessdata-dir /usr/local/share/tessdata txt hocr
Tesseract Open Source OCR Engine v5.0.0-alpha-773-gd33ed with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 1841
$ cat big.jpg.txt
Pot: 192,100

nocun · 2021-07-01T09:04:58Z

@wollmers Did you use tessdata best or fast? I tried Tesseract 5 with fast and it failed.

Euclisa · 2021-07-01T16:34:57Z

@nocun Yes, I tried and it works. But I have some pictures with the same text size and on some of them Tesseract fails.

wollmers · 2021-07-01T19:12:09Z

> @wollmers Did you use tessdata best or fast? I tried Tesseract 5 with fast and it failed.

Neither best nor fast.

I just downloaded it today for a fresh compile/installation on another server:

wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata

$ tesseract big.jpg big.w4.oem1.jpg  --oem 1 --tessdata-dir /usr/local/share/tessdata txt hocr
Tesseract Open Source OCR Engine v5.0.0-alpha-20210401 with Leptonica
Estimating resolution as 1841
$ cat  big.w4.oem1.jpg.txt
Pot: 192,100
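To compare model sets on the same image, the invocation above can be wrapped in a small script. This is a sketch only: it assumes the `tesseract` binary is on `PATH`, and the helper names and any model directories other than `/usr/local/share/tessdata` are hypothetical.

```python
import subprocess


def tesseract_command(image, out_base, tessdata_dir, oem=None):
    """Build a tesseract argument list producing .txt and .hocr output."""
    cmd = ["tesseract", image, out_base, "--tessdata-dir", tessdata_dir]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    cmd += ["txt", "hocr"]  # config names, as in the commands in this thread
    return cmd


def run_with_models(image, model_dirs, oem=1):
    """Run tesseract once per tessdata directory.

    model_dirs maps a label to a tessdata directory.
    Returns a dict mapping each label to the recognized text.
    """
    results = {}
    for label, tessdata_dir in model_dirs.items():
        out_base = f"{image}.{label}"
        subprocess.run(
            tesseract_command(image, out_base, tessdata_dir, oem), check=True
        )
        with open(out_base + ".txt", encoding="utf-8") as fh:
            results[label] = fh.read().strip()
    return results
```

For instance, `run_with_models("big.jpg", {"standard": "/usr/local/share/tessdata", "fast": "/path/to/tessdata_fast"})` would return the recognized text per model set; the second path is a placeholder.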

stweil · 2021-07-02T13:31:59Z

Here is the image used for this issue (the link above might not be available in some weeks).

wollmers · 2021-07-02T16:07:30Z

> Here is the image used for this issue (the link above might not be available in some weeks).

Images that small (33 KB) should be in the issue thread.

When I test an issue, I try to record the case in my own GitHub repository, like this: https://github.com/wollmers/ocr-tess-issues/tree/main/issues/issue_3304_big_text

amitdo · 2021-09-19T22:36:26Z

> Yes, I tried and it works.

So, no issue recognizing the image you provided. Getting different results with different models is normal.

> But I have some pictures with the same text size and on some of them Tesseract fails.

So provide 2-3 examples. Use the latest 5.0.0 codebase.

amitdo · 2021-09-30T10:42:54Z

No feedback from the OP.

amitdo closed this as completed Sep 30, 2021
