LSTM engine needs more training #2221

YamashitaRen · 2019-02-03T22:04:12Z

According to Tesseract 4.0.0 Release Notes :

Added a new OCR engine that uses neural network system based on LSTMs, with major accuracy gains.

My testing with this new OCR engine does not corroborate this statement. The results I get are vastly inferior to the ones I was getting with the legacy engine. (As in, whenever the two outputs differ, 99 times out of 100 the legacy engine's result is WAY better.)

Example : the output of Tesseract v4.0.0 when used on this picture with tessdata_best fra.traineddata is "ROUTIER".

This issue seems critical, as the LSTM engine is now the default and distros like Ubuntu 18.04 no longer provide traineddata that includes the legacy models.

dagnelies · 2019-02-11T14:06:11Z

From a brief test, it appears that the following models are "broken":

Only this one appears to produce meaningful results:

https://github.com/tesseract-ocr/tessdata_fast/raw/master/fra.traineddata

Shreeshrii · 2019-02-11T14:53:35Z

@dagnelies

Please share the test image, command used and error received to identify what is broken.

dagnelies · 2019-02-11T15:03:44Z

Hi @Shreeshrii

I used the same image as the OP ( https://user-images.githubusercontent.com/4103637/52183232-3d9b1800-2806-11e9-831d-25aa090eab34.jpg )

The first two traineddata resulted in "ROUTIER" (one of them with an additional whitespace).

The tessdata_fast resulted in "ce jour-là," which is correct.

Dunno what is going on with the two other trained data, or if I missed something, but they seem to produce nonsense.

Shreeshrii · 2019-02-12T10:49:12Z

The test image is white text on a black background. If you invert it, all three traineddata files give the correct result.

Since tessdata_fast produces the correct result in both cases, it seems to have been trained with samples of white text on a black background too.

$ convert fra-black.jpg -negate fra.jpg

$ tesseract fra.jpg stdout -l fra --oem 1 --tessdata-dir ~/tessdata_fast
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 322
ce jour-là,
$ tesseract fra.jpg stdout -l fra --oem 1 --tessdata-dir ~/tessdata_best
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 322
ce jour-là,
$ tesseract fra.jpg stdout -l fra --oem 1 --tessdata-dir ~/tessdata
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 322
ce jour-là,
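
The inversion step above could be automated for batches where only some images are white on black. This is a minimal sketch, assuming ImageMagick's `convert` and `tesseract` are on the PATH. The function names `needs_invert` and `ocr_auto_invert`, the 0.5 brightness threshold, and the temporary file path are illustrative assumptions, not something Tesseract provides:

```shell
# Hypothetical helper: succeeds (exit 0) when the mean brightness,
# given as a number in [0, 1], is below 0.5 (an assumed threshold),
# i.e. the image is probably light text on a dark background.
needs_invert() {
  awk -v m="$1" 'BEGIN { exit !(m < 0.5) }'
}

# Hypothetical wrapper: negate dark images before running the LSTM engine.
# Usage: ocr_auto_invert IMAGE [LANG] [TESSDATA_DIR]
ocr_auto_invert() {
  img="$1"
  lang="${2:-fra}"
  dir="${3:-$HOME/tessdata_best}"

  # Mean gray level of the image, normalized to [0, 1] by ImageMagick.
  mean=$(convert "$img" -colorspace Gray -format '%[fx:mean]' info:)

  if needs_invert "$mean"; then
    # Write the negated copy to a temporary file (path is an assumption).
    tmp="${TMPDIR:-/tmp}/ocr_inverted_$$.png"
    convert "$img" -negate "$tmp"
    tesseract "$tmp" stdout -l "$lang" --oem 1 --tessdata-dir "$dir"
    rm -f "$tmp"
  else
    tesseract "$img" stdout -l "$lang" --oem 1 --tessdata-dir "$dir"
  fi
}
```

For the image in this thread, the call would be `ocr_auto_invert fra-black.jpg fra ~/tessdata_best`. The mean-brightness test is only a heuristic: it assumes the background covers most of the image and will likely misjudge pages that mix dark and light regions.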

YamashitaRen · 2019-02-16T23:27:20Z

Nice find @Shreeshrii !

According to my testing, LSTM is still worse than Legacy.
It seems to (always?) miss ellipses and italics.

Ellipses :

$ tesseract 00h03m04s200-00h03m07s920.jpg stdout -l fra --oem 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 387
Je n'ai jamais entendu parler
d'un tel lieu.

Italics :

$ tesseract 00h05m24s400-00h05m26s480.jpg stdout -l fra --oem 1 hocr > output.txt
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 464

output.txt

Apart from these, if you disregard how MUCH longer the processing takes, it seems to be on par with the Legacy engine.

Shreeshrii mentioned this issue Feb 12, 2019

Unused function PrepareDistortedPix() #1052

Closed

dagnelies mentioned this issue Mar 27, 2019

No output at all for a subject line #2354

Closed

stweil added the accuracy label Jun 25, 2019

amitdo added the textlnes inversion label Jul 1, 2022
