Handle image and other non-text regions in output formats #3715

Open
stweil opened this issue Jan 8, 2022 · 3 comments
Labels
feature request output issues related output formats

Comments

@stweil
Contributor

stweil commented Jan 8, 2022

Internally Tesseract detects different kinds of regions, not only text regions.

Currently, regions for images and for horizontal or vertical lines are also written to ALTO, hOCR and text output as paragraphs, lines and (empty) words, which unnecessarily increases the output file size and hides the relevant information.

For text files such regions should be skipped.

For ALTO and hOCR those regions are useful, but they need the correct representation.
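The difference between the two behaviours can be sketched outside of Tesseract. The following is a minimal Python illustration, not Tesseract's implementation: the `Region` type, the three region kinds and the `region_N` ids are hypothetical, and the mapping to `ocr_photo` and `ocr_separator` (class names defined by the hOCR specification) is an assumption about what a "correct representation" could look like.

```python
import html
from dataclasses import dataclass

# Simplified stand-ins for the region kinds a layout analysis might report.
TEXT, IMAGE, SEPARATOR = "text", "image", "separator"

# Assumed mapping from region kind to an hOCR element class.
HOCR_CLASS = {TEXT: "ocr_par", IMAGE: "ocr_photo", SEPARATOR: "ocr_separator"}


@dataclass
class Region:
    kind: str
    bbox: tuple  # (x0, y0, x1, y1) in pixels
    text: str = ""


def render_text(regions):
    """Plain-text output: non-text regions are skipped entirely."""
    return "\n".join(r.text for r in regions if r.kind == TEXT)


def render_hocr(regions):
    """hOCR-like output: every region is kept, each with its own class."""
    lines = []
    for i, r in enumerate(regions, start=1):
        x0, y0, x1, y1 = r.bbox
        tag = "p" if r.kind == TEXT else "div"
        lines.append(
            f"<{tag} class='{HOCR_CLASS[r.kind]}' id='region_{i}' "
            f"title='bbox {x0} {y0} {x1} {y1}'>{html.escape(r.text)}</{tag}>"
        )
    return "\n".join(lines)
```

With this split, an image region produces no empty paragraph in the text output, while the hOCR-like output keeps its bounding box under a class that a consumer can filter on.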

PDF output still has to be examined. Maybe skipping the non-text regions is reasonable there, too.

@stweil stweil created this issue from a note in Tesseract next (In progress) Jan 8, 2022
@stweil stweil moved this from In progress to To do: Bug fixes for release 5 in Tesseract next Jan 8, 2022
@amitdo amitdo added the output issues related output formats label Jan 9, 2022
@gunnar-ifp

I have been dabbling with PDF for the last year due to work, and I am combining PDF and hOCR with some magic code to create marked-content tags in the PDF (for headings and paragraphs; I even add the word confidence to each word for later filtering). It would of course be easier if the PDFRenderer did all of this directly. I wanted to make a fork and do this to give back to you guys if I have some time. I haven't done C++ in a long time, so the hurdle is a bit high.

This allows for simple text extraction without layout analysis and helps with screen readers and such. You can give the language and even the text direction (e.g. for Arabic text) as well as font hints. The tags alone don't hurt anybody and can be added without much work, but one needs a "tagged PDF" for this to work "officially", and I have been looking into this, too. All it would need is a structure tree root, and probably MCIDs on the root-level tags on each page, which are then added to the structure tree. Once this step has been reached, adding image references to the structure tree root is the next step.
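For readers unfamiliar with marked content, here is a rough Python sketch of what such tags look like in a page content stream. It only builds the operator text (`BDC` ... `EMC` with an `/MCID` property, plus an optional `/Span` carrying `/Lang`); the function names, the font name `F1` and the coordinates are hypothetical, the string encoding is simplified to a plain literal string, and the structure tree that a real tagged PDF also needs is not shown.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def tagged_paragraph(mcid, text, lang=None, font="F1", size=12, x=72, y=700):
    """Wrap a text-showing sequence in a /P marked-content sequence.

    mcid is the marked-content identifier that a structure tree entry
    would refer to; lang, if given, is attached through a nested /Span.
    """
    show = f"BT /{font} {size} Tf {x} {y} Td ({escape_pdf_string(text)}) Tj ET"
    if lang:
        show = f"/Span <</Lang ({lang})>> BDC\n{show}\nEMC"
    return f"/P <</MCID {mcid}>> BDC\n{show}\nEMC"


print(tagged_paragraph(0, "Hello", lang="en-US"))
```

The tags by themselves are harmless to viewers that ignore them, which matches the point above that they can be added before the structure tree exists.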

I would really like that. My plan is to selectively compress the PDF like commercial tools do, where you use oversampled black-and-white image masks for the text and store the image areas as pictures. This way you can really reduce the file size a lot.

For now I might simply use the Java binding for testing this; once I get this figured out, there should be a way back to Tesseract. If the region information is stored in the hOCR, I could extract it from there and wouldn't need to go the Java binding route.
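Extracting image areas from hOCR could look roughly like the sketch below. It uses only the Python standard library and assumes that non-text regions appear as elements whose `class` is one of `ocr_photo`, `ocr_image` or `ocr_separator` and whose `title` holds a `bbox x0 y0 x1 y1` entry; which classes a given Tesseract version actually emits should be checked against its output.

```python
import re
from html.parser import HTMLParser

# Assumed hOCR classes for non-text regions (names from the hOCR spec).
NON_TEXT_CLASSES = {"ocr_photo", "ocr_image", "ocr_separator"}
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class NonTextRegionParser(HTMLParser):
    """Collect (class, bbox) pairs for non-text hOCR elements."""

    def __init__(self):
        super().__init__()
        self.regions = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        found = classes & NON_TEXT_CLASSES
        match = BBOX_RE.search(attrs.get("title") or "")
        if found and match:
            bbox = tuple(int(v) for v in match.groups())
            self.regions.append((sorted(found)[0], bbox))


def extract_non_text_regions(hocr):
    """Return a list of (class_name, (x0, y0, x1, y1)) in document order."""
    parser = NonTextRegionParser()
    parser.feed(hocr)
    parser.close()
    return parser.regions
```

The resulting bounding boxes could then be used to crop picture areas from the page image for separate compression.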

As said, the hurdle with C++ is a bit high, like what container classes to use...

@stweil
Contributor Author

stweil commented Feb 10, 2022

Non-text regions are now handled for text, ALTO and hOCR output (see commit 424b17f).

TSV and PDF output still has to be done.

@stweil stweil moved this from To do: Bug fixes for release 5 to Done in Tesseract next Feb 10, 2022
@amitdo
Collaborator

amitdo commented Nov 16, 2022

TSV and PDF output still has to be done.

PDF: #3959.
