Tesseract fails to recognize big text #3480

Closed
Euclisa opened this issue Jun 30, 2021 · 9 comments

Comments

@Euclisa

Euclisa commented Jun 30, 2021

Environment

  • Tesseract Version: 4.1.1
  • Platform: Linux arch 5.10.46-1-lts

Current Behavior:

Tesseract can't recognize the text in this image.

Suggested Fix:

I reduced the size of the image and then it works, but shouldn't Tesseract recognize it without any extra transformations?
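
The downscaling workaround can be sketched roughly as follows. This is a minimal sketch, not something taken from the thread: it assumes the height of a capital letter is measured by hand, and that a target of about 30 px is reasonable (a commonly cited rule of thumb, treated here as an assumption). The names `scale_for_text_height` and `scaled_size` are hypothetical.

```python
from typing import Tuple


def scale_for_text_height(measured_px: float, target_px: float = 30.0) -> float:
    """Return the resize factor that maps measured_px onto target_px.

    The factor is capped at 1.0, so text that is already small
    enough is left alone instead of being upscaled.
    """
    if measured_px <= 0 or target_px <= 0:
        raise ValueError("heights must be positive")
    return min(1.0, target_px / measured_px)


def scaled_size(width: int, height: int, factor: float) -> Tuple[int, int]:
    """Apply factor to the image dimensions, keeping each side >= 1 px."""
    return max(1, round(width * factor)), max(1, round(height * factor))
```

The resulting size could then be handed to any image tool (for example ImageMagick's `-resize`) before running Tesseract on the smaller file.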

@nocun
Contributor

nocun commented Jun 30, 2021

I think the issue may be that the text is white-on-black. Can you check by manually inverting the colors and trying again?
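
To illustrate the inversion check, here is a minimal pure-Python sketch. It assumes the image has already been decoded into rows of 8-bit grayscale values, and that the background covers most of the image so the mean brightness indicates polarity; both are assumptions, and all function names are hypothetical.

```python
from typing import List

# Rows of 8-bit grayscale values: 0 is black, 255 is white.
Gray = List[List[int]]


def looks_dark_background(pixels: Gray, threshold: float = 127.5) -> bool:
    """Guess polarity from mean brightness (assumes background dominates)."""
    count = sum(len(row) for row in pixels)
    if count == 0:
        raise ValueError("empty image")
    total = sum(sum(row) for row in pixels)
    return total / count < threshold


def invert(pixels: Gray) -> Gray:
    """Return a new image with every value v replaced by 255 - v."""
    return [[255 - v for v in row] for row in pixels]


def normalize_polarity(pixels: Gray) -> Gray:
    """Invert only when the image looks like light text on a dark background."""
    return invert(pixels) if looks_dark_background(pixels) else pixels
```

In practice the same operation is available in image tools, for example `ImageOps.invert` in Pillow or `-negate` in ImageMagick.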

@wollmers

wollmers commented Jul 1, 2021

Tried it with Tesseract version 5 and it works perfectly:

$ tesseract big.jpg big.jpg  --tessdata-dir /usr/local/share/tessdata txt hocr
Tesseract Open Source OCR Engine v5.0.0-alpha-773-gd33ed with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 1841
$ cat big.jpg.txt
Pot: 192,100

@nocun
Contributor

nocun commented Jul 1, 2021

@wollmers Did you use tessdata best or fast? I tried Tesseract 5 with fast and it failed.

@Euclisa
Author

Euclisa commented Jul 1, 2021

@nocun Yes, I tried and it works. But I have some pictures with the same text size and on some of them Tesseract fails.

@wollmers

wollmers commented Jul 1, 2021

> @wollmers Did you use tessdata best or fast? I tried Tesseract 5 with fast and it failed.

Neither best nor fast.

I just downloaded the model today for a fresh compile/installation on another server:

wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata

$ tesseract big.jpg big.w4.oem1.jpg  --oem 1 --tessdata-dir /usr/local/share/tessdata txt hocr
Tesseract Open Source OCR Engine v5.0.0-alpha-20210401 with Leptonica
Estimating resolution as 1841
$ cat  big.w4.oem1.jpg.txt
Pot: 192,100

@stweil
Contributor

stweil commented Jul 2, 2021

Here is the image used for this issue (the link above might not be available in some weeks).

[attached image]

@wollmers

wollmers commented Jul 2, 2021

> Here is the image used for this issue (the link above might not be available in some weeks).

Images that small (33 KB) should be in the issue thread.

When I test an issue, I try to record the case in my own GitHub repository, like this: https://github.com/wollmers/ocr-tess-issues/tree/main/issues/issue_3304_big_text.

@amitdo
Collaborator

amitdo commented Sep 19, 2021

> Yes, I tried and it works.

So, no issue recognizing the image you provided. Getting different results with different models is normal.

> But I have some pictures with the same text size and on some of them Tesseract fails.

So provide 2-3 examples. Use the latest 5.0.0 codebase.
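
When collecting such examples, it may help to run each image against every model set and attach the outputs. The following is a minimal sketch under stated assumptions: the model directories are hypothetical local paths, `tesseract_command` and `comparison_commands` are illustrative names, and the flags mirror the ones used earlier in this thread (`--tessdata-dir`, `--oem`, the `txt` config) plus `-l`.

```python
from typing import Dict, List


def tesseract_command(image: str, output_base: str, tessdata_dir: str,
                      lang: str = "eng", oem: int = 1) -> List[str]:
    """Build the argument list for one Tesseract run with text output."""
    return [
        "tesseract", image, output_base,
        "--tessdata-dir", tessdata_dir,
        "-l", lang,
        "--oem", str(oem),
        "txt",
    ]


def comparison_commands(image: str,
                        model_dirs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each model label to a command writing to '<image>.<label>.txt'."""
    return {
        label: tesseract_command(image, f"{image}.{label}", path)
        for label, path in model_dirs.items()
    }
```

Each command list can be passed to `subprocess.run(cmd, check=True)`, and the resulting `.txt` files compared side by side.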

@amitdo
Collaborator

amitdo commented Sep 30, 2021

No feedback from the OP.

amitdo closed this as completed on Sep 30, 2021