Page segmentation output ocr_float #42

Closed
JamesOwers opened this issue Jul 6, 2015 · 6 comments

@JamesOwers

I'm trying to reproduce the results achieved at the ICDAR page segmentation competitions [1,2] with Tesseract. I'm struggling to get the tool to output the hOCR tags that I'm expecting for tables, figures, etc. [3]. At the moment I'm calling tesseract with pagesegmode 1. Should I be adding other options via a config file to get the full extent of Tesseract's segmentation and labelling ability? (I'm less interested in the character recognition element.)

  1. Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical Book Recognition – HBR2013
  2. Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical Newspaper Layout Analysis – HNLA2013
  3. Breuel (2010) The hOCR Embedded OCR Workflow and Output Format
@mittagessen

You can use the C-API to only retrieve the page segmentation without doing character recognition. Use TessBaseAPISetPageSegMode to set the segmentation mode, call TessBaseAPIProcessPages, and finally retrieve the page iterator using TessBaseAPIAnalyseLayout. Iterate using the TessPageIteratorNext function at the lowest level and check with TessPageIteratorIsAtBeginningOf if the current symbol is at the start of a new block. All in all it shouldn't be more than a few lines of C code and you're skipping the recognition part of tesseract completely.

@zdenop
Contributor

zdenop commented Jul 10, 2015

For support, please use the tesseract-ocr user forum. See the FAQ [1].

[1] https://github.com/tesseract-ocr/tesseract/wiki/FAQ#rules-and-advice

zdenop closed this as completed Jul 10, 2015
@JamesOwers
Author

@zdenop thanks for clarifying. Here is the link to my forum post (which contains another answer): https://groups.google.com/forum/#!topic/tesseract-ocr/1Frh-5ggNxg

@jimregan
Contributor

This issue is currently the top search result for 'ocr_float'; it lacks a simple summary: Tesseract (currently) does not support ocr_float.
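
If you want to check this against your own output, a minimal sketch along these lines tallies which hOCR classes actually appear in a file. It uses only the Python standard library. The names `HocrClassCounter` and `count_hocr_classes` are made up for illustration, and `SAMPLE` is a hand-written fragment, not real Tesseract output; in practice you would read the .hocr file that Tesseract wrote and pass its text in.

```python
from collections import Counter
from html.parser import HTMLParser


class HocrClassCounter(HTMLParser):
    """Tally hOCR class names (ocr_* and ocrx_*) seen on start tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                for cls in value.split():
                    if cls.startswith(("ocr_", "ocrx_")):
                        self.counts[cls] += 1


def count_hocr_classes(hocr_text):
    """Return a Counter mapping each hOCR class name to its occurrence count."""
    parser = HocrClassCounter()
    parser.feed(hocr_text)
    parser.close()
    return parser.counts


# Hand-written sample for illustration only (assumption: not real output).
SAMPLE = """
<div class='ocr_page' id='page_1' title='bbox 0 0 100 100'>
 <div class='ocr_carea' id='block_1_1' title='bbox 10 10 90 40'>
  <p class='ocr_par' id='par_1_1'>
   <span class='ocr_line' id='line_1_1'>
    <span class='ocrx_word' id='word_1_1'>Hello</span>
    <span class='ocrx_word' id='word_1_2'>world</span>
   </span>
  </p>
 </div>
</div>
"""

counts = count_hocr_classes(SAMPLE)
for cls, n in sorted(counts.items()):
    print(cls, n)
print("ocr_float present:", "ocr_float" in counts)
```

If `ocr_float` never shows up in the tally for pages that clearly contain figures or tables, that is consistent with the summary above.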

@JamesOwers
Author

@jimregan Cheers! I'm reproducing your answer on the linked forum page (the preferred help location).

@jimregan
Contributor

That seems a bit redundant; I was merely summarising what you were told there :)
