LSTM: Non-dictionary words with combination of letters and numbers not recognized. #733

Open
Shreeshrii opened this issue Feb 22, 2017 · 10 comments

@Shreeshrii
Collaborator

Shreeshrii commented Feb 22, 2017

https://groups.google.com/d/msgid/tesseract-ocr/1a3e8773-7151-48f9-92bb-fda888293eab%40googlegroups.com?utm_medium=email&utm_source=footer

While the single "S" is recognized correctly, the text "2S" is recognized as "25".

Here is a link to the test image:

https://03054610326450256607.googlegroups.com/attach/b8b86693ac072/2s.png?part=0.4&view=1

@Shreeshrii
Collaborator Author

Shreeshrii commented Feb 22, 2017 via email

@andrewisplinghoff

Yes, the legacy engine (--oem 0) gets this one right.

tesseract4 --psm 7 --oem 0 2s.png 2s-out-oem0-psm7.txt

2s-out-oem0-psm7.txt
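
To make the legacy vs. LSTM comparison repeatable, the command above can be wrapped in a small script. This is only a sketch: the function names are made up for illustration, the binary name is an assumption (the command above uses `tesseract4`), and `--oem 0` only works if the installed traineddata includes the legacy model.

```python
import subprocess


def build_cmd(image, oem, psm=7, binary="tesseract"):
    """Build a Tesseract command line that prints recognized text to stdout.

    `binary` is an assumption; pass "tesseract4" to match the command above.
    """
    return [binary, image, "stdout", "--psm", str(psm), "--oem", str(oem)]


def run_engine(image, oem, psm=7, binary="tesseract"):
    """Run one engine mode on the image and return the stripped text."""
    result = subprocess.run(
        build_cmd(image, oem, psm, binary),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def compare_engines(image, binary="tesseract"):
    """Return the output of the legacy (--oem 0) and LSTM (--oem 1) engines."""
    return {
        "legacy": run_engine(image, 0, binary=binary),
        "lstm": run_engine(image, 1, binary=binary),
    }
```

Calling `compare_engines("2s.png", binary="tesseract4")` should show whether the two engines disagree on the image from this issue.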

@Shreeshrii Shreeshrii changed the title from "LSTM: Poor recognition quality of characters following digits" to "LSTM: Non-dictionary words with combination of letters and numbers not recognized." on Mar 28, 2018

@Shreeshrii
Collaborator Author

@zdenop Please label: accuracy.

@Shreeshrii
Collaborator Author

Shreeshrii commented Mar 28, 2018

Another instance was reported in the forum, in the context of recognizing license plates.

Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/qxB-aCa3r6E

Test image is

minus-4l

@Shreeshrii
Collaborator Author

numbers-dawg has patterns of numbers with punctuation and letters. However, there is currently no way to specify patterns such as license plates, VINs, or product IDs, which are non-dictionary words made of arbitrary combinations of numbers and letters.

Here are the other two images from error reports:

minus-0o

2s

@theraysmith

Is there a variable which can be set for better accuracy in such cases?

@Shreeshrii
Collaborator Author

Shreeshrii commented Apr 30, 2018

Another issue was reported in the forum:

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/6a6sKOXdZsA

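When the letter follows a run of digits, "I" is recognized as "1" and "A" as "4". The same letters are recognized correctly after a run of letters: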

- an image containing "12345678I" => `123456781`
- an image containing "GLOTHUVFI" => `GLOTHUVFI`
- an image containing "12345678H" => `12345678H`
- an image containing "GLOTHUVFH" => `GLOTHUVFH`
- an image containing "12345678A" => `123456784`
- an image containing "GLOTHUVFA" => `GLOTHUVFA`
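
The six results above can be compared character by character to make the pattern explicit. The following is a minimal sketch; the `confusions` helper is hypothetical, and the data is copied from the list above.

```python
def confusions(expected, recognized):
    """Return (position, expected_char, recognized_char) for each mismatch."""
    if len(expected) != len(recognized):
        raise ValueError("strings differ in length; align them first")
    return [
        (i, e, r)
        for i, (e, r) in enumerate(zip(expected, recognized))
        if e != r
    ]


# (ground truth, OCR output) pairs copied from the report above
RESULTS = [
    ("12345678I", "123456781"),
    ("GLOTHUVFI", "GLOTHUVFI"),
    ("12345678H", "12345678H"),
    ("GLOTHUVFH", "GLOTHUVFH"),
    ("12345678A", "123456784"),
    ("GLOTHUVFA", "GLOTHUVFA"),
]

for truth, ocr in RESULTS:
    print(truth, "->", ocr, confusions(truth, ocr))
```

Only the two digit-prefixed strings ending in "I" and "A" produce a mismatch, both at the final character. The "H" after digits is read correctly, which suggests the errors are limited to letters with a visually similar digit.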

@kolakao

kolakao commented Apr 14, 2019

Unfortunately, I've fallen into the same pit. Is there any solution yet?
I think I've tried everything, and every thread on this topic that I could find online was left without a solution.

@FrancescoSaverioZuppichini

Same problem here

@ghost

ghost commented Feb 24, 2022

Hello, do you have datasets somewhere available for testing?

@SHANDLEMAN

This thread has been open for 5 years. Has anyone come up with a method for reliably getting tesseract to read a combination of letters and numbers?
