Desktop: Resolves #12400: Add option to recreate PDF with text transcription #12565

pedr · 2025-06-19T20:57:18Z

Resolves #12400

Summary

I'm adding a new feature that combines the existing PDF with the transcription generated by the Tesseract library.

Since the library outputs the bounding box of each word, it is possible to recreate the PDF by:

- splitting each page into its own image
- positioning the text over the image with CSS
- adding some other HTML wrapper
- sending the HTML to be "printed" as a PDF
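The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `OcrWord`, `OcrPage`, `wordToSpan`, `pageToHtml` and `documentToHtml` names are hypothetical, the bounding box shape (`x0`, `y0`, `x1`, `y1`) is assumed to match what the OCR library reports, and coordinates are assumed to be in the same pixel space as the page image. The final "print" step is left out; on desktop it could be something like Electron's `webContents.printToPDF`, which is also an assumption here.

```typescript
// Hypothetical shapes for illustration only; the PR's real types may differ.
interface BoundingBox { x0: number; y0: number; x1: number; y1: number; }
interface OcrWord { text: string; bbox: BoundingBox; }
interface OcrPage { imageDataUrl: string; width: number; height: number; words: OcrWord[]; }

// Escape OCR output before embedding it in HTML, since recognised text is untrusted.
const escapeHtml = (text: string): string =>
  text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');

// Position one word over the page image using its bounding box.
const wordToSpan = (word: OcrWord): string => {
  const { x0, y0, x1, y1 } = word.bbox;
  const style = [
    'position:absolute',
    `left:${x0}px`,
    `top:${y0}px`,
    `width:${x1 - x0}px`,
    `height:${y1 - y0}px`,
    'font-size:8px', // fixed size, as described in the PR notes
    'color:transparent', // keep the text selectable but invisible
    'white-space:pre',
  ].join(';');
  return `<span style="${style}">${escapeHtml(word.text)}</span>`;
};

// One page: the rendered image as background, words layered on top.
const pageToHtml = (page: OcrPage): string => {
  const spans = page.words.map(wordToSpan).join('');
  const pageStyle = `position:relative;width:${page.width}px;height:${page.height}px;page-break-after:always`;
  return `<div class="page" style="${pageStyle}">` +
    `<img src="${escapeHtml(page.imageDataUrl)}" style="width:100%;height:100%">` +
    `${spans}</div>`;
};

// Wrap all pages in a minimal HTML document ready to be printed to PDF.
const documentToHtml = (pages: OcrPage[]): string =>
  '<!DOCTYPE html><html><head><meta charset="utf-8">' +
  '<style>@page{margin:0}body{margin:0}</style></head>' +
  `<body>${pages.map(pageToHtml).join('')}</body></html>`;
```

Making the overlay text transparent is one possible design choice: the scanned image stays the visible layer, while the text remains selectable and searchable in the resulting PDF.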

Images

When the resource is ready to be created

When the resource already has text annotation

After pressing "Reprocess file"

When the resource is not a PDF

Some things to consider:

- The generated PDF is much bigger than the original, so we should probably consider decreasing the image quality before printing.
- Should we replace the original resource with the modified one? It seems like it could be useful, but it could also be annoying, depending on the user's perspective.
- Right now I'm using a font size of 8px for all text. Maybe we could calculate the real size of the text. This could be useful when someone wants to copy/paste the content of a single line, for example.
  - Try to match the font size by counting pixels from the bounding box
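One possible way to estimate the font size from a bounding box is sketched below. It is only an illustration under stated assumptions, not the approach implemented in this PR: `estimateFontSizePx` and `MeasureWidth` are hypothetical names, and the width-measuring function is injected so the logic can run outside a browser. The idea is to start from the box height and shrink the size if the text would be wider than the box at that size.

```typescript
// Hypothetical callback: returns the rendered width in px of `text` at `fontSizePx`.
type MeasureWidth = (text: string, fontSizePx: number) => number;

const estimateFontSizePx = (
  text: string,
  boxWidth: number,
  boxHeight: number,
  measureWidth: MeasureWidth,
  minSizePx = 1,
): number => {
  // First guess: the font is about as tall as the bounding box.
  const initial = Math.max(boxHeight, minSizePx);
  const measured = measureWidth(text, initial);
  if (measured <= 0 || boxWidth <= 0) return initial;

  // Text width grows roughly linearly with font size, so rescale to fit the box width.
  const fitted = initial * (boxWidth / measured);

  // Never exceed the box height, and never go below the minimum size.
  return Math.max(minSizePx, Math.min(initial, fitted));
};
```

In a browser or Electron renderer, `measureWidth` could be backed by `CanvasRenderingContext2D.measureText` after setting `ctx.font` to the candidate size; which font family to measure with is left open here.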

Comparing outputs

| Configuration | Size | File |
| --- | --- | --- |
| original | 590 kB | scanned_doc_3.pdf |
| scaleFactor 1, JPG, quality 0.7 | 1.1 MB | scale_factor_1.pdf |
| scaleFactor 1.5, JPG, quality 0.7 | 1.9 MB | scale_factor_1_5.pdf |
| scaleFactor 2, JPG, quality 0.7 | 2.9 MB | scale_factor_2.pdf |
| scaleFactor 2, WebP, quality 0.1 | 1.9 MB | webp_quality_01.pdf |

Testing

To test it, I'm using the PDF linked in the parent issue. I should look for other PDFs to make sure it works in more cases.

pedr added 5 commits June 19, 2025 10:35

reach working point

3f711c5

refactor

fa579c7

fix and improve htmlgeneration

e2e28b2

fixing margin issue

29e9240

renaming function

f0ff79c

pedr requested a review from personalizedrefrigerator June 19, 2025 20:57

pedr added enhancement desktop labels Jun 19, 2025

pedr added 2 commits June 23, 2025 14:31

changing reset ocr to method that already exist

4f14b39

remove unnecessary css

03b95dc

pedr marked this pull request as ready for review June 30, 2025 14:24

pedr requested a review from laurent22 June 30, 2025 14:24

pedr added 17 commits July 1, 2025 16:48

workable font size

11160ee

improve word boxing by using canvas textmeasure

efd60fd

renaming things

d84369f

simplest way to add customPageSizes

544c3c3

fix bounding box driffting apart in the document

c70e093

add sanitization

f1fc17e

small refactor to make print work again

e66d324

fixing parameter of function

fd7a244

fix landscape not being created properly

eb8eb30

add page property to keep track of which page the lines belongs

956aafb

refactoring ocrservice to be called on contextmenu

6f80493

refactor code to make it easier to understand

c322d90

move function to ocr utils

2570d61

refactoring code

a5e134a

removing unnecessary fields and checks

825c919

refactoring name of variables

1c3011b

renaming variables

12c7154
