LSTM engine needs more training #2221

Open
YamashitaRen opened this issue Feb 3, 2019 · 5 comments

@YamashitaRen

YamashitaRen commented Feb 3, 2019

According to the Tesseract 4.0.0 release notes:

Added a new OCR engine that uses neural network system based on LSTMs, with major accuracy gains.

My testing with this new OCR engine does not corroborate this statement. The results I get are vastly inferior to the ones I was getting with the legacy engine. (As in, if there is a difference between the two outputs, 99 times out of 100 the legacy engine's result will be WAY better.)

Example: the output of Tesseract v4.0.0 when run on this picture with the tessdata_best fra.traineddata is "ROUTIER".

This issue seems critical, as the LSTM engine is now the default and distros like Ubuntu 18.04 no longer provide traineddata that includes the legacy models.

@dagnelies

@Shreeshrii
Collaborator

@dagnelies

Please share the test image, command used and error received to identify what is broken.

@dagnelies

Hi @Shreeshrii

I used the same image as the OP ( https://user-images.githubusercontent.com/4103637/52183232-3d9b1800-2806-11e9-831d-25aa090eab34.jpg )

The first two traineddata resulted in "ROUTIER" (one of them with an additional whitespace).

The tessdata_fast resulted in "ce jour-là," which is correct.

Dunno what is going on with the other two traineddata sets, or whether I missed something, but they seem to produce nonsense.

@Shreeshrii
Collaborator

The test image is white text on a black background. If you invert it, all three traineddata sets give the correct result.

Since tessdata_fast produces the correct result in both cases, it seems to have been trained with samples of white text on a black background too.

$ convert fra-black.jpg -negate fra.jpg

$ tesseract fra.jpg stdout -l fra --oem 1 --tessdata-dir ~/tessdata_fast
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 322
ce jour-là,
$ tesseract fra.jpg stdout -l fra --oem 1 --tessdata-dir ~/tessdata_best
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 322
ce jour-là,
$ tesseract fra.jpg stdout -l fra --oem 1 --tessdata-dir ~/tessdata
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 322
ce jour-là,
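
The workaround shown in this transcript could be scripted roughly as follows. This is only a sketch: the function names, the 0.5 brightness threshold, and the `TESSERACT` override variable are assumptions made for illustration and are not part of Tesseract. It assumes ImageMagick's `convert` is installed and that the three model sets are checked out under the home directory, as in the commands above.

```shell
#!/bin/sh
# Sketch: negate an image if it looks like light text on a dark background,
# then run the LSTM engine (--oem 1) against each model set.

# Succeeds (exit 0) when the mean gray level $1 (0..1) is below the threshold.
# The default threshold of 0.5 is an arbitrary assumption.
is_dark() {
  awk -v m="$1" -v t="${2:-0.5}" 'BEGIN { exit !(m + 0 < t + 0) }'
}

# Writes a dark-on-light version of image $1 to $2 (plain copy if already light).
normalize_polarity() {
  mean=$(convert "$1" -colorspace Gray -format '%[fx:mean]' info:)
  if is_dark "$mean"; then
    convert "$1" -negate "$2"
  else
    cp "$1" "$2"
  fi
}

# Runs OCR on image $1 (language $2, default fra) with each model set.
# TESSERACT is a hypothetical hook for pointing at a different binary.
ocr_all_models() {
  for set in tessdata_fast tessdata_best tessdata; do
    echo "== $set =="
    "${TESSERACT:-tesseract}" "$1" stdout -l "${2:-fra}" --oem 1 \
      --tessdata-dir "$HOME/$set"
  done
}
```

A possible invocation would be `normalize_polarity fra-black.jpg fra.jpg && ocr_all_models fra.jpg`. A mean-brightness check is a crude heuristic and may misjudge images with large bright or dark regions that are not text.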

@YamashitaRen
Author

YamashitaRen commented Feb 16, 2019

Nice find, @Shreeshrii!

According to my testing, LSTM is still worse than Legacy.
It seems to (always?) miss ellipses and italics.

Ellipses:

$ tesseract 00h03m04s200-00h03m07s920.jpg stdout -l fra --oem 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 387
Je n'ai jamais entendu parler
d'un tel lieu.

Italics:

$ tesseract 00h05m24s400-00h05m26s480.jpg stdout -l fra --oem 1 hocr > output.txt
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 464

output.txt

Apart from these, and if you disregard how MUCH longer the processing takes, it seems to be on par with the Legacy engine.
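
For anyone wanting to reproduce this kind of comparison, a side-by-side run of the two engines might look like the sketch below. It uses `--oem 0` (legacy only) and `--oem 1` (LSTM only); the function name, the `TESSERACT` override variable, and the default `~/tessdata` location are assumptions for illustration. The legacy run only works with a traineddata file that still contains the legacy model, which is assumed here to be the one in `~/tessdata`.

```shell
#!/bin/sh
# Sketch: OCR one image with the legacy and LSTM engines and diff the results.
# Arguments: image, language (default fra), tessdata dir (default ~/tessdata).
compare_engines() {
  img="$1"
  lang="${2:-fra}"
  data="${3:-$HOME/tessdata}"
  tmp=$(mktemp -d) || return 1

  # --oem 0 selects the legacy engine, --oem 1 the LSTM engine.
  "${TESSERACT:-tesseract}" "$img" stdout -l "$lang" --oem 0 \
    --tessdata-dir "$data" > "$tmp/legacy.txt" 2>/dev/null
  "${TESSERACT:-tesseract}" "$img" stdout -l "$lang" --oem 1 \
    --tessdata-dir "$data" > "$tmp/lstm.txt" 2>/dev/null

  # Print a unified diff, or a short note when both outputs match.
  if diff -u "$tmp/legacy.txt" "$tmp/lstm.txt"; then
    echo "identical output"
  fi
  rm -rf "$tmp"
}
```

Running `compare_engines 00h03m04s200-00h03m07s920.jpg` on the ellipsis sample would then show only the lines where the engines disagree, which makes it easier to attach a minimal reproduction to a report.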
