Automatic Page segmentation mode is not working in Tesseract 4.0 #1273
Comments
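I investigated this further with the latest code from GitHub.
1. eng (tessdata_best and tessdata_fast) work ok with phototest.tif.
2. Latin (tessdata_best and tessdata_fast) work ok with phototest.tif.
With the image of Nepali text in Devanagari script provided as a test case by Niranjan:
1. san (tessdata_best and tessdata_fast) work ok.
2. Devanagari (tessdata_best and tessdata_fast) BOTH DO NOT WORK for psm 1 and 3.
3. hin, mar, nep (tessdata_best) work ok.
4. hin, mar, nep (tessdata_fast) DO NOT WORK for psm 1 and 3.
I have not tested other languages. @jbreiden You may want to check on this before bundling 'fast' traineddata for Debian. |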
Hmm... This was not expected.
On Jan 12, 2018 9:21 PM, "Shreeshrii" wrote:
I investigated this further with latest code from github.
1. eng (tessdata_best and tessdata_fast) work ok with phototest.tif.
2. Latin (tessdata_best and tessdata_fast) work ok with phototest.tif.
The image with Nepali text in Devanagari script provided as test case by
Niranjan.
1. san (tessdata_best and tessdata_fast) work ok.
2. Devanagari (tessdata_best and tessdata_fast) BOTH DO NOT WORK for
psm 1 and 3.
3. hin, mar, nep (tessdata_best) works ok.
4. hin, mar, nep (tessdata_fast) DO NOT WORK for psm 1 and 3.
I have not tested for other languages.
@jbreiden <https://github.com/jbreiden> You may want to check on this
before bundling 'fast' traineddata for debian.
|
For the sake of completeness, I also tried with traineddata from the tessdata repo, using --oem 1.
'work' means it produces OCR output; I have not compared the accuracy of the recognition. I also looked at the version strings in the traineddata in tessdata_fast and tessdata_best. I had assumed (incorrectly, it seems) that the tessdata_fast traineddata files were integer versions of the tessdata_best files. However, that might not be the case, at least not for ALL languages. E.g., for Nepali, tessdata_best has Lfx512 vs. Lfx128 in tessdata_fast. |
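The comparison above (each language against each page segmentation mode, where "works" only means that some OCR output is produced) could be repeated with a small script. This is a minimal sketch, not the procedure actually used in this thread: the function names, the language list, and the PSM list are hypothetical, and it assumes the `tesseract` 4.0 command line with its `--tessdata-dir`, `-l`, `--oem`, and `--psm` options.

```python
import subprocess

# Hypothetical defaults for illustration; adjust to the local setup.
LANGS = ["san", "hin", "mar", "nep", "Devanagari"]
PSMS = [1, 3, 4, 6]  # 1 and 3 are the automatic modes reported as failing


def build_command(image, lang, psm, tessdata_dir):
    """Return the tesseract argv for one (language, psm) combination."""
    return [
        "tesseract", image, "stdout",
        "--tessdata-dir", tessdata_dir,
        "-l", lang,
        "--oem", "1",
        "--psm", str(psm),
    ]


def works(ocr_text):
    """'Works' in the sense used in this thread: any non-whitespace output."""
    return bool(ocr_text.strip())


def run_matrix(image, tessdata_dir, langs=LANGS, psms=PSMS):
    """Run every combination and map (lang, psm) to True or False."""
    results = {}
    for lang in langs:
        for psm in psms:
            proc = subprocess.run(
                build_command(image, lang, psm, tessdata_dir),
                capture_output=True, text=True, encoding="utf-8",
            )
            results[(lang, psm)] = proc.returncode == 0 and works(proc.stdout)
    return results
```

Calling `run_matrix` once with a tessdata_fast directory and once with a tessdata_best directory gives two tables that can be compared entry by entry. Like the manual test, it says nothing about recognition accuracy.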
#1167 could be related: "Segfault on using -psm 0 when using fast eng.traineddata" |
@stweil, |
I see, thank you. That makes things even more confusing. So there is no general rule for how the fast data was produced. The version string for fast English does not even tell me the LSTM parameters (it is |
https://groups.google.com/forum/?hl=en#!topic/tesseract-dev/6lcwffUpK1U Far greater performance improvements can be made by making the network smaller. As I already indicated, I have had some very good results in this area, with a network 3x faster than the legacy code (for English) and much faster than the legacy code for complex scripts. |
I figured out why the tessdata_fast traineddata for Devanagari-script languages fails with the default PSM. None of these (except san) has a lang.config file, and config files are used for layout analysis. tessdata_best has the config files for san, hin, nep and mar, but not for Devanagari. Adding the config files to these traineddata files fixes the problem. I will make PRs for tessdata_fast (Devanagari, hin, nep and mar) and tessdata_best (Devanagari). It is possible that config files are needed for other languages as well; I haven't looked beyond this set of files. |
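The fix described above, copying a config component into a traineddata file that lacks one, can be sketched with `combine_tessdata`. This is only an illustration: the function names and paths are hypothetical, and it assumes that `combine_tessdata -e` extracts the component named by the output file's extension and that `-o` also adds a component that was previously absent.

```python
import subprocess
from pathlib import Path


def config_fix_commands(donor, target, lang, workdir="."):
    """Return the two combine_tessdata invocations that copy the config
    component from `donor` into `target`.

    -e extracts a component (selected by the output file's extension);
    -o overwrites that component in an existing traineddata file
    (assumed here to also add it when it was missing).
    """
    config_path = str(Path(workdir) / f"{lang}.config")
    return [
        ["combine_tessdata", "-e", str(donor), config_path],
        ["combine_tessdata", "-o", str(target), config_path],
    ]


def apply_config_fix(donor, target, lang, workdir="."):
    """Run both steps; raises CalledProcessError if either one fails."""
    for cmd in config_fix_commands(donor, target, lang, workdir):
        subprocess.run(cmd, check=True)
```

For example, `apply_config_fix("tessdata_best/hin.traineddata", "tessdata_fast/hin.traineddata", "hin")` would take the hin config from tessdata_best, which has one according to the comment above, and write it into the tessdata_fast file. Since `-o` modifies the target in place, working on a copy is the safer choice.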
|
@Shreeshrii, I just tested |
Good news, package migration to Ubuntu went much faster than expected. If this is a critical change, I think we can attempt another packaging iteration. |
@stweil
I won't be able to try till tomorrow.
The change is actually quite simple.
You can unpack Devanagari.traineddata from tessdata_fast, which I have
modified, and add the Devanagari.config file to Devanagari.traineddata in
tessdata_best using combine_tessdata.
I was trying a web upload to my fork of the repo, which got the error.
On 26-Feb-2018 10:29 PM, "Stefan Weil" wrote:
@Shreeshrii <https://github.com/shreeshrii>, I just tested git push from
a local git clone and had no problem with the size, see my test commit
<stweil/tessdata_best@fcf6d8c>.
If that does not work for you, you could perhaps provide your patch in a
different form, so I can upload it?
|
Yes, it will be good to update the traineddata files. The PR in
tessdata_fast has already been merged.
It would also help if we can check if any other languages were similarly
impacted.
|
The current code in Ubuntu 18.04 is based on 766b7bd. |
I updated tesseract-lang and tesseract in PPA. |
Great! Done! |
Are the language and script packs also updated in the PPA for trusty etc?
|
@Shreeshrii Yes |
@AlexanderP
Thank you.
Your PPA is very helpful in making tesseract 4.00 accessible to users with
older versions of Ubuntu. Especially so for the Indian languages, as there is
a great deal of improvement in the accuracy with the newer version.
|
Today is the Ubuntu 18.04 cutoff date. Alexander's latest packages made it. Total success. https://launchpad.net/ubuntu/+source/tesseract |
https://packages.ubuntu.com/search?keywords=libtesseract&searchon=names&suite=bionic&section=all |
Oh no, I read the dashboard wrong!
|
They were late changes. I am still happy that 4.00... has been packaged
and will ship :-)
What is the next date for updates, so that we can plan better?
|
@jbreiden, an update: https://packages.ubuntu.com/search?keywords=libtesseract&searchon=names&suite=bionic&section=all |
Environment
Current Behavior:
I am using Tesseract with nep.traineddata, but automatic page segmentation returns an empty result. Only --psm 4 and 6 (single column or uniform block of text) work with the Nepali text image attached here. Both automatic page segmentation modes, with or without OSD, give no result.
The discussion on tesseract-ocr forum is here.
Expected Behavior:
The automatic page segmentation mode should determine on its own whether to treat the image as a single-column or multi-column document.
Suggested Fix:
Fix the automatic PSM modes.