Automatic Page segmentation mode is not working in Tesseract 4.0 #1273

Closed
nirajan-pant opened this issue Jan 13, 2018 · 24 comments

@nirajan-pant

Environment

  • Tesseract Version: tesseract 4.00.00alpha (windows executables from UB Mannheim)
  • Platform: Windows 10 64-bit

Current Behavior:

I am using Tesseract with nep.traineddata, but automatic page segmentation produces empty output. Only --psm 4 and --psm 6 (single column or single uniform block of text) work with the Nepali text image attached here. Both automatic page segmentation modes, with and without OSD, give no result.

The discussion on tesseract-ocr forum is here.
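
The behaviour described above could be reproduced with a small sweep over the page segmentation modes. This is only a sketch: the image name and tessdata path in the usage note are placeholders, and it assumes the `tesseract` executable is on the PATH and accepts `stdout` as the output base, as the 4.0 command line does.

```python
import subprocess

# PSM values mentioned in the report: 1 and 3 are the automatic modes
# (with and without OSD), 4 and 6 assume a single column / uniform block.
PSM_MODES = (1, 3, 4, 6)


def build_command(image, lang, psm, tessdata_dir=None):
    """Return the tesseract argv that writes recognised text to stdout."""
    cmd = ["tesseract", image, "stdout", "-l", lang, "--psm", str(psm)]
    if tessdata_dir is not None:
        # Optional: point tesseract at a specific set of traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def sweep(image, lang, tessdata_dir=None):
    """Map each PSM to True if tesseract printed any non-blank text."""
    results = {}
    for psm in PSM_MODES:
        proc = subprocess.run(
            build_command(image, lang, psm, tessdata_dir),
            capture_output=True,
            encoding="utf-8",
            errors="replace",
        )
        results[psm] = bool(proc.stdout.strip())
    return results
```

Calling `sweep("nepali.png", "nep")` (hypothetical file name) would return a dictionary keyed by PSM; in the situation reported here, the entries for 1 and 3 would be expected to be False while 4 and 6 are True.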

Expected Behavior:

The automatic page segmentation mode should determine on its own whether to treat the image as a single-column or a multi-column document.

Suggested Fix:

Fix the automatic PSM modes.

@Shreeshrii
Collaborator

Shreeshrii commented Jan 13, 2018

I investigated this further with the latest code from GitHub.

  1. eng (tessdata_best and tessdata_fast) work ok with phototest.tif.
  2. Latin (tessdata_best and tessdata_fast) work ok with phototest.tif.

Results for the image with Nepali text in Devanagari script, provided as a test case by Nirajan:

  1. san (tessdata_best and tessdata_fast) work ok.
  2. Devanagari (tessdata_best and tessdata_fast) BOTH DO NOT WORK for psm 1 and 3.
  3. hin, mar, nep (tessdata_best) work ok.
  4. hin, mar, nep (tessdata_fast) DO NOT WORK for psm 1 and 3.

I have not tested for other languages.

@jbreiden You may want to check on this before bundling the 'fast' traineddata for Debian.

@jbreiden
Contributor

jbreiden commented Jan 13, 2018 via email

@Shreeshrii
Collaborator

For the sake of completeness, I also tried the traineddata from the tessdata repo, using --oem 1.

  1. Devanagari - traineddata does not exist
  2. hin - psm 1, 3, 4, 6, 11 - ALL work ok
  3. mar - psm 1 and 3 don't work
  4. nep - psm 1 and 3 don't work
  5. san - psm 1 and 3 don't work.

Here 'work' means that it produces OCR output; I have not compared the accuracy of the recognition.


I also looked at the version strings in the traineddata in tessdata_fast and tessdata_best.

I had assumed (incorrectly, it seems) that the tessdata_fast traineddata files were integer versions of the tessdata_best files. However, that might not be the case, at least not for all languages.

E.g. for Nepali:

tessdata_best
4.00.00alpha:nep:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]

tessdata_fast
4.00.00alpha:nep:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx128O1c1]

best has Lfx512 vs Lfx128 in fast.
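
The comparison above can also be done mechanically. The following is a minimal sketch that splits the two version strings quoted above and lists the LSTM layers that differ. The field names (`release`, `lang`, `training`, `spec`) are my own labels rather than official terminology, and the regular expression only covers the layer forms that appear in these two strings.

```python
import re

# Version strings quoted in this comment.
BEST = "4.00.00alpha:nep:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]"
FAST = "4.00.00alpha:nep:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx128O1c1]"

# Matches LSTM layer tokens such as Lfx96, Lrx96 or Lfys64.
LSTM_LAYER = re.compile(r"L[fr][xy]s?\d+")


def parse_version(version):
    """Split a version string into labelled fields (labels are assumptions)."""
    keys = ("release", "lang", "training", "spec")
    return dict(zip(keys, version.split(":", 3)))


def lstm_layers(version):
    """Return the LSTM layer tokens found in the network spec, if any."""
    return LSTM_LAYER.findall(parse_version(version).get("spec", ""))


def layer_differences(a, b):
    """Pair up LSTM layers by position and keep the pairs that differ."""
    return [(x, y) for x, y in zip(lstm_layers(a), lstm_layers(b)) if x != y]


print(layer_differences(BEST, FAST))  # → [('Lfx512', 'Lfx128')]
```

A version string that carries no network spec simply yields an empty layer list, so nothing can be compared in that case.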

@Shreeshrii
Collaborator

#1167 could be related

Segfault on using -psm 0 when using fast eng.traineddata

Shreeshrii referenced this issue in nguyenq/VietOCR3 Jan 14, 2018
@amitdo
Collaborator

amitdo commented Feb 8, 2018

@stweil,
Shree's comments are related to your question about fast vs best data in the dev forum.

@stweil
Contributor

stweil commented Feb 8, 2018

I see, thank you. That makes things even more confusing. So there is no general rule for how the fast data was produced. The version string for fast English does not even tell me the LSTM parameters (it is 4.00.00alpha:eng:synth20170629).

@amitdo
Collaborator

amitdo commented Feb 8, 2018

best has Lfx512 vs Lfx128 in fast.

https://groups.google.com/forum/?hl=en#!topic/tesseract-dev/6lcwffUpK1U

Far greater performance improvements can be made by making the network smaller. As I already indicated, I have had some very good results in this area, with a network 3x faster than the legacy code (for English) and much faster than the legacy code for complex scripts.

@Shreeshrii
Collaborator

@jbreiden @AlexanderP

I figured out why the default PSM fails with the tessdata_fast traineddata for Devanagari script languages: none of these files (except san) contains a lang.config file.

Config files are used for layout analysis. tessdata_best has the config files for san, hin, nep and mar, but not for Devanagari.

Adding the config files to these traineddata files fixes the problem. I will make PRs for the tessdata_fast (Devanagari, hin, nep and mar) and tessdata_best (Devanagari).

It is possible that config files are needed for other languages also. I haven't looked beyond this set of files.
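
One possible way to carry a config file over from a tessdata_best file to the matching tessdata_fast file is with the `combine_tessdata` tool, using `-e` to extract a component and `-o` to overwrite it in another traineddata file. This is a sketch, not necessarily the procedure used for the PRs: the paths are placeholders, and it assumes `combine_tessdata` is on the PATH and identifies the component by the `.config` suffix of the file name.

```python
import subprocess


def _require_config_suffix(config_path):
    # Assumption: combine_tessdata recognises the component by file suffix.
    if not config_path.endswith(".config"):
        raise ValueError("config file name must end with '.config'")


def extract_config_cmd(src_traineddata, config_path):
    """argv that extracts the config component from src_traineddata."""
    _require_config_suffix(config_path)
    return ["combine_tessdata", "-e", src_traineddata, config_path]


def overwrite_config_cmd(dst_traineddata, config_path):
    """argv that writes config_path into dst_traineddata as its config."""
    _require_config_suffix(config_path)
    return ["combine_tessdata", "-o", dst_traineddata, config_path]


def copy_config(src_traineddata, dst_traineddata, config_path):
    """Extract the config from src and overwrite it into dst (modifies dst)."""
    for cmd in (
        extract_config_cmd(src_traineddata, config_path),
        overwrite_config_cmd(dst_traineddata, config_path),
    ):
        subprocess.run(cmd, check=True)
```

For example, `copy_config("tessdata_best/nep.traineddata", "tessdata_fast/nep.traineddata", "/tmp/nep.config")` (hypothetical paths) would modify the fast file in place, so it is worth working on a copy. Running `combine_tessdata -d` on the result lists the components it contains.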

@Shreeshrii
Collaborator

  1. I was not able to upload Devanagari.traineddata for tessdata_best because it is larger than 25 MB.

  2. Created PR for tessdata_fast

tesseract-ocr/tessdata_fast#10

@stweil
Contributor

stweil commented Feb 26, 2018

@Shreeshrii, I just tested git push from a local git clone and had no problem with the size, see my test commit. If that does not work for you, you could perhaps provide your patch in a different form, so I can upload it?

@jbreiden
Contributor

Good news, package migration to Ubuntu went much faster than expected. If this is a critical change, I think we can attempt another packaging iteration.

@Shreeshrii
Collaborator

Shreeshrii commented Feb 26, 2018 via email

@Shreeshrii
Collaborator

Shreeshrii commented Feb 26, 2018 via email

@amitdo
Collaborator

amitdo commented Feb 26, 2018

The current code in Ubuntu 18.04 is based on 766b7bd.
It is missing a few fixes from master.

@AlexanderP

I updated tesseract-lang and tesseract in PPA.
AlexanderP/tesseract-lang-debian@4c7bf8a
AlexanderP/tesseract-debian@412b06c

@jbreiden
Contributor

Great! Done!

@Shreeshrii
Collaborator

Shreeshrii commented Feb 27, 2018 via email

@AlexanderP

@Shreeshrii Yes

@Shreeshrii
Collaborator

Shreeshrii commented Feb 28, 2018 via email

@jbreiden
Contributor

jbreiden commented Mar 1, 2018

Today is the Ubuntu 18.04 cutoff date. Alexander's latest packages made it. Total success.

https://launchpad.net/ubuntu/+source/tesseract
https://launchpad.net/ubuntu/+source/tesseract-lang

@amitdo
Collaborator

amitdo commented Mar 1, 2018

https://packages.ubuntu.com/search?keywords=libtesseract&searchon=names&suite=bionic&section=all
libtesseract4 & libtesseract-dev
4.00~git2207-766b7bd6-3.1

@jbreiden
Contributor

jbreiden commented Mar 1, 2018

Oh no, I read the dashboard wrong!

The Bionic Beaver (active development) | Tesseract trunk series
--
4.00~git2207-766b7bd6-3.1 | release (universe) | 2018-02-23
4.00~git2219-40f43111-1.2 | proposed (universe) | 2018-02-28

@Shreeshrii
Collaborator

Shreeshrii commented Mar 2, 2018 via email

@amitdo
Collaborator

amitdo commented Mar 2, 2018

@jbreiden, an update:

https://packages.ubuntu.com/search?keywords=libtesseract&searchon=names&suite=bionic&section=all
libtesseract4 & libtesseract-dev
4.00~git2219-40f43111-1.2
