Description
When we OCR a PDF file, we essentially take a screenshot of each page and process it. During OCR, Tesseract also returns coordinates that tell us where each piece of text sits in the image, and we already save this information in the ocr_details column.
Using this information, we should recreate the PDF and overlay the text on top of it. The PDF must look identical to the original document, so it will basically be a file made of one screenshot per page. On top of each page there should be a text overlay, probably an invisible one, so that the text can be selected, copied and pasted.
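As a rough sketch of the overlay step, the snippet below maps a pixel bounding box from the page image to PDF coordinates and builds the content-stream operators for one invisible word. It rests on several assumptions: the `OcrWord` shape is hypothetical (the real ocr_details layout may differ), the bounding box is in image pixels with the origin at the top-left, and a font named `F1` is assumed to exist in the page resources. `3 Tr` is the standard PDF text rendering mode for text that is neither filled nor stroked, which makes it invisible but still selectable.

```typescript
// Hypothetical shape for one OCR word; the real ocr_details layout may differ.
interface OcrWord {
  text: string;
  // Pixel bounding box in the rendered page image, origin at the top-left.
  bbox: { x0: number; y0: number; x1: number; y1: number };
}

interface PdfRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Map a pixel box to PDF user space (points, origin at the bottom-left).
function toPdfRect(
  bbox: OcrWord['bbox'],
  imageWidthPx: number,
  imageHeightPx: number,
  pageWidthPt: number,
  pageHeightPt: number,
): PdfRect {
  const scaleX = pageWidthPt / imageWidthPx;
  const scaleY = pageHeightPt / imageHeightPx;
  return {
    x: bbox.x0 * scaleX,
    // Flip the vertical axis: the bottom edge of the box becomes its PDF y.
    y: pageHeightPt - bbox.y1 * scaleY,
    width: (bbox.x1 - bbox.x0) * scaleX,
    height: (bbox.y1 - bbox.y0) * scaleY,
  };
}

// Escape characters that are special inside a PDF literal string.
function escapePdfString(s: string): string {
  return s.replace(/([\\()])/g, '\\$1');
}

// Build content-stream operators for one invisible word.
// The font size is approximated by the box height; "F1" is an assumed font resource name.
function invisibleTextOps(word: OcrWord, rect: PdfRect, fontName = 'F1'): string {
  const size = rect.height.toFixed(2);
  const x = rect.x.toFixed(2);
  const y = rect.y.toFixed(2);
  return `BT 3 Tr /${fontName} ${size} Tf ${x} ${y} Td (${escapePdfString(word.text)}) Tj ET`;
}
```

This only covers positioning. Non-ASCII text would also need a suitable font encoding, and in practice a PDF library would probably handle the drawing, so treat this as an illustration of the coordinate mapping and not as a proposed implementation.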
The aim is to make PDF files processed by Joplin more accessible. Some users have to deal with PDF documents that are essentially images, which screen readers can't read. Adding the actual text to the document should make them readable.
For now, we can expose the feature through the right-click menu on the document, with a menu item next to "View OCR text", for example "Create accessible document". It would take the PDF file, combine it with ocr_details, and output a new document.
Example of document: Arrete22pages.pdf