Handle image and other non-text regions in output formats #3715

stweil · 2022-01-08T23:38:30Z

Internally Tesseract detects different kinds of regions, not only text regions.

Currently, regions for images and for horizontal or vertical lines are also written to ALTO, hOCR and text output as paragraphs, lines and (empty) words, which unnecessarily increases the output file size and hides the relevant information.

For text files such regions should be skipped.

For ALTO and hOCR those regions are useful, but they need the correct representation.

PDF output still has to be examined. Maybe skipping the non-text regions is reasonable there, too.
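The per-format handling proposed above can be sketched roughly as follows. This is only an illustration in Python, not Tesseract's C++ code: `RegionKind`, `Region`, `render_plain_text` and `render_structured` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from enum import Enum, auto


class RegionKind(Enum):
    """Hypothetical region kinds, loosely mirroring the ones named in this issue."""
    TEXT = auto()
    IMAGE = auto()
    HORIZONTAL_LINE = auto()
    VERTICAL_LINE = auto()


@dataclass
class Region:
    kind: RegionKind
    text: str = ""


def render_plain_text(regions):
    """Text output: join text regions only, skipping all non-text regions."""
    return "\n".join(r.text for r in regions if r.kind is RegionKind.TEXT)


def render_structured(regions):
    """ALTO/hOCR-like output: keep every region, labelled by kind,
    instead of emitting empty paragraphs, lines and words."""
    return [
        (r.kind.name, r.text if r.kind is RegionKind.TEXT else None)
        for r in regions
    ]


page = [
    Region(RegionKind.TEXT, "Hello"),
    Region(RegionKind.IMAGE),
    Region(RegionKind.HORIZONTAL_LINE),
    Region(RegionKind.TEXT, "World"),
]
print(render_plain_text(page))
print(render_structured(page))
```

The point of the split is that the decision to drop or keep a region is made per output format, based on the region kind, rather than forcing every region through the text hierarchy.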

gunnar-ifp · 2022-01-11T11:47:00Z

I have been dabbling with PDF for the last year for work, and I am combining PDF and hOCR with some magic code to create marked content tags in the PDF (for headings and paragraphs; I even add the word confidence to each word for later filtering). It would of course be easier if the PDFRenderer did all this directly. I wanted to make a fork and do this to give back to you guys if I have some time. I haven't done C++ in a long time, so the hurdle is a bit high.

This allows for simple text extraction without layout analysis and helps with screen readers and the like. You can specify the language and even the text direction (for Arabic text, for example) as well as font hints. The tags alone don't hurt anybody and can be added without much work, but one needs a "tagged PDF" for this to work "officially", and I have been looking into this, too. All it would need is a structure tree root, and probably adding MCIDs to the root-level tags on each page and adding these to the structure tree. Once this step has been reached, adding image references to the structure tree root is the next step.
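To make the marked content idea concrete, here is a minimal Python sketch that wraps a piece of page content in a `BDC` ... `EMC` marked-content sequence with an `MCID` and an optional `Lang` entry. The helper names (`pdf_string`, `show_text`, `marked_content`) and the font name `F1` are assumptions made for this example; it only handles ASCII text in a simple font, and it does not build the structure tree that a real tagged PDF also needs.

```python
def pdf_string(text):
    """Encode ASCII text as a PDF literal string, escaping \\, ( and )."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def show_text(text, font="F1", size=12):
    """Content-stream fragment that selects a font and shows a string."""
    return f"BT /{font} {size} Tf {pdf_string(text)} Tj ET"


def marked_content(tag, mcid, body, lang=None):
    """Wrap a content-stream fragment in a marked-content sequence.

    The MCID is what a structure tree entry would later refer to.
    """
    props = f"/MCID {mcid}"
    if lang is not None:
        props += f" /Lang {pdf_string(lang)}"
    return f"/{tag} <<{props}>> BDC\n{body}\nEMC"


print(marked_content("P", 0, show_text("A heading or paragraph")))
print(marked_content("Span", 1, show_text("Hello (world)"), lang="en"))
```

Each page would number its MCIDs from 0, and the structure tree would then point at them per page; that part is left out here.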

I would really like that. My plan is to selectively compress the PDF like commercial tools do, where you use oversampled b/w image masks for the text and store the image areas as pictures. This way you can reduce the file size a lot.

For now I might simply use the Java binding for testing this; once I get this figured out, there should be a way back to Tesseract. If it is stored in the hOCR, I could extract it from there and wouldn't need to go the Java binding route.
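If the image and line regions do end up in the hOCR, pulling their bounding boxes out could look roughly like the Python sketch below, which uses only the standard library. The class names `ocr_photo` and `ocr_separator` come from the hOCR specification; that Tesseract writes exactly these classes is an assumption here, so the set is easy to change. `NonTextRegionParser` and `non_text_regions` are names made up for this example.

```python
import re
from html.parser import HTMLParser

# Assumption: non-text regions are marked with these hOCR classes.
NON_TEXT_CLASSES = {"ocr_photo", "ocr_separator"}

BBOX_PATTERN = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class NonTextRegionParser(HTMLParser):
    """Collect (class, (x0, y0, x1, y1)) for every non-text hOCR element."""

    def __init__(self, classes=NON_TEXT_CLASSES):
        super().__init__()
        self.classes = set(classes)
        self.regions = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        element_classes = set((attributes.get("class") or "").split())
        matched = element_classes & self.classes
        if not matched:
            return
        # hOCR stores the bounding box in the title attribute.
        bbox = BBOX_PATTERN.search(attributes.get("title") or "")
        if bbox:
            box = tuple(int(value) for value in bbox.groups())
            self.regions.append((sorted(matched)[0], box))


def non_text_regions(hocr):
    """Return all non-text regions found in an hOCR document string."""
    parser = NonTextRegionParser()
    parser.feed(hocr)
    parser.close()
    return parser.regions
```

The returned boxes could then be used to crop the picture areas from the page image for the selective compression described above.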

As said, the hurdle with C++ is a bit high, like what container classes to use...

stweil · 2022-02-10T13:45:22Z

Non-text regions are now handled for text, ALTO and hOCR output (see commit 424b17f).

TSV ~~and PDF~~ output still has to be done.

amitdo · 2022-11-16T16:09:28Z

TSV and PDF output still has to be done.

PDF: #3959.

stweil created this issue from a note in Tesseract next (In progress) Jan 8, 2022

stweil moved this from In progress to To do: Bug fixes for release 5 in Tesseract next Jan 8, 2022

amitdo added the output issues related output formats label Jan 9, 2022

stweil moved this from To do: Bug fixes for release 5 to Done in Tesseract next Feb 10, 2022

amitdo added the feature request label Jun 12, 2022
