LSTM engine needs more training #2221
Comments
From a brief test, it appears the following models are "broken":
Only this one appears to produce meaningful results:
Please share the test image, the command used, and the error received so that we can identify what is broken.
Hi @Shreeshrii, I used the same image as the OP ( https://user-images.githubusercontent.com/4103637/52183232-3d9b1800-2806-11e9-831d-25aa090eab34.jpg ). The first two The Dunno what is going on with the other two traineddata files, or if I missed something, but they seem to produce nonsense.
The test image is white text on a black background. If you invert it, all three traineddata files give the correct result. Since
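To illustrate the inversion workaround mentioned above, here is a minimal sketch in plain Python. It assumes the image is already available as rows of 8-bit grayscale values. The helper names (`mean_intensity`, `invert`, `ensure_dark_on_light`) and the mean-below-128 heuristic are hypothetical choices for illustration and are not part of Tesseract. In practice the same step would usually be done with an image library or tool before passing the file to the OCR engine.

```python
def mean_intensity(pixels):
    """Average gray level of a non-empty grid of 8-bit values (0 = black)."""
    total = sum(sum(row) for row in pixels)
    count = sum(len(row) for row in pixels)
    return total / count


def invert(pixels):
    """Return a photographic negative of the grid (black <-> white)."""
    return [[255 - p for p in row] for row in pixels]


def ensure_dark_on_light(pixels, threshold=128):
    """Invert the image only if it looks like light text on a dark background.

    Assumption: a mostly dark image (mean below `threshold`) is treated as
    white-on-black. This is a rough heuristic, not Tesseract's own logic.
    """
    if mean_intensity(pixels) < threshold:
        return invert(pixels)
    return pixels


# Tiny example: mostly black pixels with a few white "text" pixels.
sample = [
    [0, 0, 255],
    [0, 255, 0],
]
normalized = ensure_dark_on_light(sample)
```

An image that is already dark text on a light background passes through unchanged, so the check can be applied to every input before OCR.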
Nice find @Shreeshrii! According to my testing, LSTM is still worse than Legacy.
Ellipses:
Italics:
Apart from these, if you disregard how MUCH longer the processing takes, it seems to be on par with the Legacy engine.
According to the Tesseract 4.0.0 Release Notes:
My testing with this new OCR engine does not corroborate this statement. The results I get are vastly inferior to the ones I was getting with the legacy engine. (As in, if there is a difference between the two outputs, 99 times out of 100 the legacy engine's result will be WAY better.)
Example: the output of Tesseract v4.0.0 when used on this picture with tessdata_best fra.traineddata is "ROUTIER".
This issue seems critical, as the LSTM engine is now the default and distros like Ubuntu 18.04 no longer provide traineddata that includes legacy models.