Add a text layer over OCR-ed PDF files #12400

When we OCR a PDF file, we essentially make a screenshot of each page and process it. During the OCR process, Tesseract also provides coordinates that tell us where in the image each piece of text is located, and we already save this information in the `ocr_details` column.

Using this information, we should recreate the PDF and overlay the text on top of it. The new PDF needs to look identical to the original document, so it will basically be a file made of one screenshot per page. On top of each page there should be a text overlay, probably an invisible one, so that it is possible to select, copy, and paste the text.
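As a rough illustration of the overlay step, the sketch below turns OCR word boxes into PDF content-stream operators for an invisible text layer. It relies on two facts from the PDF format: the page origin is at the bottom-left (so the Y axis has to be flipped relative to image coordinates), and text rendering mode 3 (`3 Tr`) draws text that is neither filled nor stroked, so it stays selectable but invisible. Everything else is an assumption made for the example: the `OcrWord` shape is hypothetical and may not match what is actually stored in `ocr_details`, and `buildInvisibleTextLayer`, `escapePdfString`, and the font resource name `F1` are made-up names.

```typescript
// Hypothetical shape of one OCR word. The real `ocr_details` layout may differ.
interface OcrWord {
	text: string;
	// Bounding box in image pixels, with the origin at the top-left corner.
	bbox: { x0: number; y0: number; x1: number; y1: number };
}

interface Size {
	width: number;
	height: number;
}

// Escape the characters that are special inside a PDF literal string.
function escapePdfString(s: string): string {
	return s.replace(/([\\()])/g, '\\$1');
}

// Build content-stream operators that place each word invisibly at its OCR position.
// `image` is the size of the OCR-ed screenshot in pixels, `page` the PDF page size in points.
function buildInvisibleTextLayer(words: OcrWord[], image: Size, page: Size, fontName = 'F1'): string {
	const scaleX = page.width / image.width;
	const scaleY = page.height / image.height;

	// "3 Tr" selects text rendering mode 3: neither fill nor stroke (invisible).
	const ops: string[] = ['BT', '3 Tr'];

	for (const word of words) {
		const x = word.bbox.x0 * scaleX;
		// Flip Y (PDF origin is bottom-left) and use the bottom edge of the box as the baseline.
		const y = page.height - word.bbox.y1 * scaleY;
		// Approximate the font size by the height of the box.
		const fontSize = (word.bbox.y1 - word.bbox.y0) * scaleY;

		ops.push(`/${fontName} ${fontSize.toFixed(2)} Tf`);
		ops.push(`1 0 0 1 ${x.toFixed(2)} ${y.toFixed(2)} Tm`);
		ops.push(`(${escapePdfString(word.text)}) Tj`);
	}

	ops.push('ET');
	return ops.join('\n');
}
```

The returned string would be added as a content stream drawn after the page image, on a page whose resources define the font used (`F1` here). This sketch only handles text that the chosen font encoding can represent as a literal string; accented or non-Latin text would need a suitable embedded font and encoding, and the word width is not stretched to match the box.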

The aim is to make PDF files processed by Joplin more accessible. Some users have to deal with PDF documents that are essentially images, which screen readers can't read. Adding the actual text to the document should make them readable.

For now, we can expose the feature through the right-click menu of the document, with a menu item next to "View OCR text", for example "Create accessible document". It would take the PDF file, combine it with `ocr_details`, and output a new document.

Example document: [Arrete22pages.pdf](https://github.com/user-attachments/files/20624513/Arrete22pages.pdf)
