Page segmentation output ocr_float #42

JamesOwers · 2015-07-06T10:31:07Z

I'm trying to reproduce results achieved at the ICDAR page segmentation competitions [1,2] with Tesseract. I'm struggling to get the tool to output the hOCR tags that I'm expecting for tables, figures, etc. [3]. At the moment I'm calling tesseract with pagesegmode 1. Should I be adding other options via a config file to get the full extent of Tesseract's segmentation and labelling ability? (I'm less interested in the character recognition element.)

[1] Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical Book Recognition – HBR2013
[2] Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical Newspaper Layout Analysis – HNLA2013
[3] Breuel (2010) The hOCR Embedded OCR Workflow and Output Format

mittagessen · 2015-07-09T18:56:07Z

You can use the C-API to only retrieve the page segmentation without doing character recognition. Use TessBaseAPISetPageSegMode to set the segmentation mode, call TessBaseAPIProcessPages, and finally retrieve the page iterator using TessBaseAPIAnalyseLayout. Iterate using the TessPageIteratorNext function at the lowest level and check with TessPageIteratorIsAtBeginningOf if the current symbol is at the start of a new block. All in all it shouldn't be more than a few lines of C code and you're skipping the recognition part of tesseract completely.

zdenop · 2015-07-10T14:53:14Z

For support, please use the tesseract-ocr user forum. See the FAQ [1].

[1] https://github.com/tesseract-ocr/tesseract/wiki/FAQ#rules-and-advice

JamesOwers · 2015-07-13T15:08:37Z

@zdenop thanks for clarifying. Here is the link to my forum post (which contains another answer): https://groups.google.com/forum/#!topic/tesseract-ocr/1Frh-5ggNxg

jimregan · 2015-07-18T18:36:10Z

This issue is currently the top search result for 'ocr_float'; it lacks a simple summary: Tesseract (currently) does not support ocr_float.
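
For anyone who wants to check which hOCR classes actually appear in an output file, here is a minimal sketch using only the Python standard library. The names `HocrClassCounter` and `count_hocr_classes` are made up for illustration and are not part of Tesseract or any hOCR tool, and the sample markup is hand-written in the style of typical hOCR output rather than copied from a real run.

```python
from collections import Counter
from html.parser import HTMLParser


class HocrClassCounter(HTMLParser):
    """Count hOCR class names (ocr_* / ocrx_*) seen on any element."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    if cls.startswith(("ocr_", "ocrx_")):
                        self.counts[cls] += 1


def count_hocr_classes(hocr_text):
    """Return a Counter mapping each hOCR class name to its number of uses."""
    parser = HocrClassCounter()
    parser.feed(hocr_text)
    parser.close()
    return parser.counts


# Hand-written sample (an assumption for illustration, not real output).
SAMPLE = """
<div class='ocr_page' id='page_1' title='bbox 0 0 100 100'>
 <div class='ocr_carea' id='block_1_1' title='bbox 10 10 90 40'>
  <p class='ocr_par' id='par_1_1'>
   <span class='ocr_line' id='line_1_1'>
    <span class='ocrx_word' id='word_1_1'>Hello</span>
    <span class='ocrx_word' id='word_1_2'>world</span>
   </span>
  </p>
 </div>
</div>
"""

if __name__ == "__main__":
    counts = count_hocr_classes(SAMPLE)
    for cls, n in sorted(counts.items()):
        print(cls, n)
    # A missing class simply has a count of zero.
    print("ocr_float present:", counts["ocr_float"] > 0)
```

To inspect a real file, read its contents and pass the text to `count_hocr_classes`; if `ocr_float` has a count of zero, the engine did not emit that class for the page.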

JamesOwers · 2015-07-20T12:22:07Z

@jimregan Cheers! I'm reproducing your answer on the linked forum page (the preferred help location).

jimregan · 2015-07-20T12:53:09Z

That seems a bit redundant; I was merely summarising what you were told there :)

zdenop closed this as completed Jul 10, 2015

