Add a text layer over OCR-ed PDF files #12400

Open
@laurent22

When we OCR a PDF file, we essentially take a screenshot of each page and process it. During OCR, Tesseract also returns coordinates that tell us where in the image each piece of text is, and we already save this information in the ocr_details column.

Using this information, we should recreate the PDF and overlay the text on top of it. The appearance of the PDF needs to be identical to the original document, so it will basically be a file made of multiple screenshots, one per page. On top of each page there should be a text overlay, probably an invisible one, and it should be possible to select, copy, and paste the text.
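To place the overlay, the OCR coordinates have to be mapped onto the PDF page. Here is a minimal sketch of that mapping, assuming the stored boxes are in image pixels with the origin at the top-left (as Tesseract reports them) and that the page uses default PDF user space (points, origin at the bottom-left). The `PixelBox` and `PdfRect` types and the `toPdfRect` function are hypothetical names for illustration, not the actual shape of ocr_details.

```typescript
// Hypothetical shape of one OCR bounding box, in image pixels,
// with (0, 0) at the top-left corner of the page image.
interface PixelBox {
  x0: number; // left
  y0: number; // top
  x1: number; // right
  y1: number; // bottom
}

// Rectangle in PDF user space: points, with (0, 0) at the bottom-left.
interface PdfRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Scale a pixel box to page points and flip the vertical axis,
// because image rows grow downwards while PDF y grows upwards.
function toPdfRect(
  box: PixelBox,
  imageWidthPx: number,
  imageHeightPx: number,
  pageWidthPt: number,
  pageHeightPt: number,
): PdfRect {
  const scaleX = pageWidthPt / imageWidthPx;
  const scaleY = pageHeightPt / imageHeightPx;
  return {
    x: box.x0 * scaleX,
    // The bottom edge of the box becomes the lower edge of the PDF rectangle.
    y: pageHeightPt - box.y1 * scaleY,
    width: (box.x1 - box.x0) * scaleX,
    height: (box.y1 - box.y0) * scaleY,
  };
}
```

For example, a box at pixels (100, 200) to (300, 260) on a 1000 x 2000 px screenshot of a 500 x 1000 pt page would map to a 100 x 30 pt rectangle whose lower-left corner is at (50, 870).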

The aim is to make PDF files processed by Joplin more accessible. Some users have to deal with PDF documents that are essentially images, which screen readers can't read. Adding the actual text to the document should make them readable.

For now, we can expose the feature through the right-click menu on the document, with a menu item next to "View OCR text", for example "Create accessible document". It would take the PDF file, combine it with ocr_details, and output a new document.
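As a rough illustration of what the output could contain, the sketch below builds the content stream for one page: it paints the page screenshot, then writes each OCR word with PDF text rendering mode 3 (`3 Tr`), which draws the text neither filled nor stroked, so it is invisible but can still be selected and read. The `PlacedWord` type, the `buildPageContent` function, and the resource names `/Im0` and `/F1` are assumptions for this example. Words are assumed to be already converted to PDF points. Text outside the font's encoding and fitting each word to the exact width of its box would need extra handling that is not shown here.

```typescript
// Hypothetical shape: one OCR word already positioned in PDF points.
interface PlacedWord {
  text: string;
  x: number; // left edge of the word box
  y: number; // bottom edge of the word box, used as an approximate baseline
  height: number; // box height, used as an approximate font size
}

// Escape the characters that are special inside a PDF literal string.
function escapePdfString(s: string): string {
  return s.replace(/([\\()])/g, '\\$1');
}

// Build the content stream for one page. Assumes the page resources
// define an image XObject named /Im0 and a font named /F1.
function buildPageContent(pageWidth: number, pageHeight: number, words: PlacedWord[]): string {
  const ops: string[] = [];

  // Paint the screenshot so that it fills the whole page.
  ops.push('q');
  ops.push(`${pageWidth} 0 0 ${pageHeight} 0 0 cm`);
  ops.push('/Im0 Do');
  ops.push('Q');

  // Write the text layer on top, in invisible rendering mode.
  ops.push('BT');
  ops.push('3 Tr');
  for (const w of words) {
    ops.push(`/F1 ${w.height.toFixed(2)} Tf`);
    ops.push(`1 0 0 1 ${w.x.toFixed(2)} ${w.y.toFixed(2)} Tm`);
    ops.push(`(${escapePdfString(w.text)}) Tj`);
  }
  ops.push('ET');

  return ops.join('\n');
}
```

In practice a PDF library would probably generate these operators rather than hand-written strings. The sketch is only meant to show that the text layer is ordinary PDF text drawn after the image.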

Example of document: Arrete22pages.pdf

Labels

accessibility (Related to accessibility), desktop (All desktop platforms), enhancement (Feature requests and code enhancements), high (High priority issues)
