Automatic Page segmentation mode is not working in Tesseract 4.0 #1273

nirajan-pant · 2018-01-13T03:46:10Z

Environment

Tesseract Version: tesseract 4.00.00alpha (windows executables from UB Mannheim)
Platform: Windows 10 64-bit

Current Behavior:

I am using Tesseract with nep.traineddata, but automatic page segmentation produces an empty result. Only --psm 4 and 6 (single column or single uniform block of text) work with the Nepali text image attached here. Both automatic page segmentation modes, with OSD (--psm 1) and without OSD (--psm 3), give no output.

The discussion on tesseract-ocr forum is here.

Expected Behavior:

Automatic page segmentation mode should determine on its own whether to treat the image as a single-column or a multi-column document.

Suggested Fix:

Fix the automatic PSM modes.

Shreeshrii · 2018-01-13T05:20:51Z

I investigated this further with the latest code from GitHub.

1. eng (tessdata_best and tessdata_fast) works OK with phototest.tif.
2. Latin (tessdata_best and tessdata_fast) works OK with phototest.tif.

With the image of Nepali text in Devanagari script provided as a test case by Niranjan:

1. san (tessdata_best and tessdata_fast) works OK.
2. Devanagari (tessdata_best and tessdata_fast): BOTH DO NOT WORK for psm 1 and 3.
3. hin, mar, nep (tessdata_best) work OK.
4. hin, mar, nep (tessdata_fast) DO NOT WORK for psm 1 and 3.

I have not tested for other languages.

@jbreiden You may want to check on this before bundling 'fast' traineddata for debian.

jbreiden · 2018-01-13T05:55:03Z

Hmm... This was not expected.


Shreeshrii · 2018-01-13T10:32:36Z

For the sake of completeness, I also tried the traineddata from the tessdata repo, using --oem 1.

Devanagari - traineddata does not exist
hin - psm 1, 3, 4, 6, 11 - ALL work ok
mar - psm 1 and 3 don't work
nep - psm 1 and 3 don't work
san - psm 1 and 3 don't work.

'Work' means that it produces OCR output; I have not compared the accuracy of the recognition.
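The check used in these comments (does a given language and PSM combination produce any OCR output at all?) can be scripted. The following is a minimal sketch, not the procedure actually used in this thread. It assumes the `tesseract` executable is on the PATH; the helper names, the image file name and the language list are placeholders chosen for illustration.

```python
import subprocess


def tesseract_command(image, lang, psm, tessdata_dir=None, oem=None):
    """Build a tesseract command line that writes the OCR text to stdout."""
    cmd = ["tesseract", image, "stdout", "-l", lang, "--psm", str(psm)]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def produces_output(text):
    """'Works' in the sense used here: any non-whitespace output.

    Accepts str or bytes. strip() also removes a trailing form feed,
    so an output consisting only of a page separator counts as empty.
    """
    return bool(text.strip())


def sweep(image, langs, psms=(1, 3, 4, 6), tessdata_dir=None, oem=None):
    """Run tesseract for every (lang, psm) pair and record whether it 'works'."""
    results = {}
    for lang in langs:
        for psm in psms:
            cmd = tesseract_command(image, lang, psm, tessdata_dir, oem)
            # Capture raw bytes to avoid console-encoding problems with Devanagari text.
            proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            results[(lang, psm)] = proc.returncode == 0 and produces_output(proc.stdout)
    return results
```

A call such as `sweep("nepali.png", ["hin", "mar", "nep", "san"], tessdata_dir="tessdata_fast")` (hypothetical file and directory names) would return a dictionary keyed by `(lang, psm)` that can be compared across the tessdata, tessdata_best and tessdata_fast directories.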

I also looked at the version strings in the traineddata in tessdata_fast and tessdata_best.

I had assumed (incorrectly, it seems) that the tessdata_fast traineddata files were integer versions of the tessdata_best files. However, that might not be the case, at least not for all languages.

E.g. for Nepali:

tessdata_best
4.00.00alpha:nep:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]

tessdata_fast
4.00.00alpha:nep:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx128O1c1]

best has Lfx512 vs Lfx128 in fast.
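The comparison of the two version strings can be automated. The sketch below is only an illustration under stated assumptions: it assumes the version string is a colon-separated sequence of Tesseract version, language, training tag and an optional bracketed network specification (as in the two Nepali strings quoted here), and it only looks at the LSTM layer tokens. The function names are hypothetical.

```python
import re

# LSTM layer tokens in the network spec, e.g. Lfys64, Lfx96, Lrx96, Lfx512.
LSTM_LAYER = re.compile(r"L[frb][xy]s?\d+")


def parse_version_string(version):
    """Split a version string into named fields; 'network' may be absent."""
    parts = version.split(":", 3)
    return dict(zip(("tesseract", "lang", "training", "network"), parts))


def lstm_layers(version):
    """Return the LSTM layer tokens of the network spec, in order."""
    network = parse_version_string(version).get("network", "")
    return LSTM_LAYER.findall(network)


def differing_layers(a, b):
    """Pair up LSTM layers position by position and keep the pairs that differ."""
    return [(x, y) for x, y in zip(lstm_layers(a), lstm_layers(b)) if x != y]


best = "4.00.00alpha:nep:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]"
fast = "4.00.00alpha:nep:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx128O1c1]"
print(differing_layers(best, fast))  # [('Lfx512', 'Lfx128')]
```

A version string without a network specification yields an empty layer list, so the comparison silently reports no differences in that case.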

Shreeshrii · 2018-01-13T10:45:53Z

#1167 could be related

Segfault on using -psm 0 when using fast eng.traineddata

amitdo · 2018-02-08T15:58:16Z

@stweil,
Shree's comments are related to your question about fast vs. best data in the dev forum.

stweil · 2018-02-08T16:22:45Z

I see, thank you. That makes things even more confusing. So there is no general rule for how the fast data was produced. The version string for fast English does not even tell me the LSTM parameters (it is 4.00.00alpha:eng:synth20170629).

amitdo · 2018-02-08T16:59:15Z

best has Lfx512 vs Lfx128 in fast.

https://groups.google.com/forum/?hl=en#!topic/tesseract-dev/6lcwffUpK1U

Far greater performance improvements can be made by making the network smaller. As I already indicated, I have had some very good results in this area, with a network 3x faster than the legacy code (for English) and much faster than the legacy code for complex scripts.

Shreeshrii · 2018-02-26T14:25:25Z

@jbreiden @AlexanderP

I figured out why the traineddata files for Devanagari-script languages from tessdata_fast fail with the default PSM: none of them (except san) contains a lang.config file.

Config files are used for layout analysis. tessdata_best has the config files for san, hin, nep and mar, but not for Devanagari.

Adding the config files to these traineddata files fixes the problem. I will make PRs for the tessdata_fast (Devanagari, hin, nep and mar) and tessdata_best (Devanagari).

It is possible that config files are needed for other languages also. I haven't looked beyond this set of files.
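Checking other languages for a missing config component could be scripted along these lines. This is a rough sketch rather than the method used for the fix: it assumes `combine_tessdata` is on the PATH and relies on its `-u` (unpack) mode, which writes each component as `<prefix><component>`. The function names are hypothetical.

```python
import os
import subprocess
import tempfile


def unpack_command(traineddata, prefix):
    """combine_tessdata -u unpacks every component to files named <prefix><component>."""
    return ["combine_tessdata", "-u", traineddata, prefix]


def has_config(unpack_dir, lang):
    """True if the unpacked components include <lang>.config."""
    return os.path.isfile(os.path.join(unpack_dir, lang + ".config"))


def traineddata_has_config(traineddata):
    """Unpack a traineddata file into a temporary directory and look for its config."""
    lang = os.path.splitext(os.path.basename(traineddata))[0]
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, lang + ".")
        subprocess.run(unpack_command(traineddata, prefix), check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return has_config(tmp, lang)
```

Looping `traineddata_has_config` over every `*.traineddata` file in a checkout of tessdata_fast or tessdata_best would give a list of candidates; whether a missing config actually breaks automatic PSM for a given language would still need to be tested.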

Shreeshrii · 2018-02-26T14:56:29Z

I am not able to upload Devanagari.traineddata for tessdata_best because it is larger than 25 MB.
I created a PR for tessdata_fast:

tesseract-ocr/tessdata_fast#10

stweil · 2018-02-26T16:58:51Z

@Shreeshrii, I just tested git push from a local git clone and had no problem with the size, see my test commit. If that does not work for you, you could perhaps provide your patch in a different form, so I can upload it?

jbreiden · 2018-02-26T17:07:17Z

Good news, package migration to Ubuntu went much faster than expected. If this is a critical change, I think we can attempt another packaging iteration.

Shreeshrii · 2018-02-26T17:35:59Z

@stweil I won't be able to try till tomorrow. The change is actually quite simple. You can unpack Devanagari.traineddata from tessdata_fast, which I have modified, and add the Devanagari.config file to Devanagari.traineddata in tessdata_best using combine_tessdata. I was trying a web upload to my fork of the repo, which got the error.
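The two steps described in this comment map onto two `combine_tessdata` invocations: `-u` to unpack the modified tessdata_fast file and `-o` to overwrite (or add) a component in the tessdata_best file. The sketch below only builds the command lines; the paths are placeholders and the function name is hypothetical. Note that `-o` modifies the target traineddata in place, so working on a copy is advisable.

```python
import os


def transfer_config_commands(fast_traineddata, best_traineddata, workdir):
    """Return the two combine_tessdata commands that copy <lang>.config
    from one traineddata file into another."""
    lang = os.path.splitext(os.path.basename(fast_traineddata))[0]
    prefix = os.path.join(workdir, lang + ".")
    config = prefix + "config"
    return [
        # 1. Unpack all components of the fast file to <workdir>/<lang>.<component>.
        ["combine_tessdata", "-u", fast_traineddata, prefix],
        # 2. Write the extracted config into the best file (in place).
        ["combine_tessdata", "-o", best_traineddata, config],
    ]


for cmd in transfer_config_commands(
        os.path.join("tessdata_fast", "Devanagari.traineddata"),
        os.path.join("tessdata_best", "Devanagari.traineddata"),
        "work"):
    print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the `work` directory exists.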


Shreeshrii · 2018-02-26T17:42:40Z

Yes, it will be good to update the traineddata files. The PR in tessdata_fast has already been merged. It would also help if we can check if any other languages were similarly impacted.

amitdo · 2018-02-26T17:58:00Z

The current code in Ubuntu 18.04 is based on 766b7bd.
It is missing a few fixes from master.

AlexanderP · 2018-02-26T19:33:55Z

I updated tesseract-lang and tesseract in PPA.
AlexanderP/tesseract-lang-debian@4c7bf8a
AlexanderP/tesseract-debian@412b06c

jbreiden · 2018-02-27T06:19:39Z

Great! Done!

Shreeshrii · 2018-02-27T08:18:03Z

Are the language and script packs also updated in the PPA for trusty etc?

AlexanderP · 2018-02-27T17:31:05Z

@Shreeshrii Yes

Shreeshrii · 2018-02-28T05:21:49Z

@AlexanderP Thank you. Your PPA is very helpful in making tesseract 4.00 accessible to users with older versions of Ubuntu, especially for the Indian languages, as the accuracy has improved a great deal in the newer version.

jbreiden · 2018-03-01T18:15:33Z

Today is the Ubuntu 18.04 cutoff date. Alexander's latest packages made it. Total success.

https://launchpad.net/ubuntu/+source/tesseract
https://launchpad.net/ubuntu/+source/tesseract-lang

amitdo · 2018-03-01T20:46:12Z

https://packages.ubuntu.com/search?keywords=libtesseract&searchon=names&suite=bionic&section=all
libtesseract4 & libtesseract-dev
4.00~git2207-766b7bd6-3.1

jbreiden · 2018-03-01T21:23:19Z

Oh no, I read the dashboard wrong!

The Bionic Beaver (active development) | Tesseract trunk series
--
4.00~git2207-766b7bd6-3.1 | release (universe) | 2018-02-23
4.00~git2219-40f43111-1.2 | proposed (universe) | 2018-02-28

Shreeshrii · 2018-03-02T04:56:20Z

They were late changes. I am still happy that 4.00... has been packaged and will ship :-) What is the next date for updates, so that we can plan better?

amitdo · 2018-03-02T22:42:21Z

@jbreiden, an update:

https://packages.ubuntu.com/search?keywords=libtesseract&searchon=names&suite=bionic&section=all
libtesseract4 & libtesseract-dev
4.00~git2219-40f43111-1.2

Shreeshrii referenced this issue in nguyenq/VietOCR3 Jan 14, 2018

Update URLs to best lang packs

fc231ed

Shreeshrii mentioned this issue Feb 26, 2018

Add config files to fix auto PSM issue 1273 tesseract-ocr/tessdata_fast#10

Merged

zdenop closed this as completed in tesseract-ocr/tessdata_fast#10 Feb 26, 2018
