Desktop: Resolves #12400: Add option to recreate PDF with text transcription #12565

Open, 24 commits to merge into base: dev

@pedr pedr commented Jun 19, 2025

Resolves #12400

Summary

I'm adding a new feature that combines the existing PDF with the transcription generated by the Tesseract library.

Since the library outputs the bounding box of each word, it is possible to recreate the PDF by:

  • rendering each page as its own image
  • positioning the text over the image with CSS
  • wrapping the result in some additional HTML
  • sending the HTML to be "printed" as a PDF
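The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `WordBox` and `PageInput` shapes, the function names, and the transparent text colour are all assumptions made for the example. Only the fixed 8px font size comes from the description. The sketch stops at producing the HTML string; printing it to PDF is left to whatever print mechanism the app uses.

```typescript
// Hypothetical shapes for illustration only; not the PR's actual types.
interface WordBox {
  text: string;
  // Bounding box in pixels, relative to the top-left corner of the page image.
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

interface PageInput {
  imageSrc: string; // e.g. a data URL of the rendered page
  width: number; // page image width in pixels
  height: number; // page image height in pixels
  words: WordBox[];
}

// Escape text so OCR output cannot break the generated markup.
const escapeHtml = (s: string): string =>
  s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');

// One page: the image as background, one absolutely positioned span per word.
function renderPage(page: PageInput): string {
  const spans = page.words
    .map((w: WordBox) => {
      const style = [
        'position:absolute',
        `left:${w.x0}px`,
        `top:${w.y0}px`,
        `width:${w.x1 - w.x0}px`,
        `height:${w.y1 - w.y0}px`,
        'font-size:8px', // fixed size, as in the description
        'color:transparent', // assumption: keep text selectable but invisible
        'white-space:nowrap',
      ].join(';');
      return `<span style="${style}">${escapeHtml(w.text)}</span>`;
    })
    .join('');

  const pageStyle =
    `position:relative;width:${page.width}px;height:${page.height}px;` +
    'page-break-after:always';
  const img = `<img src="${escapeHtml(page.imageSrc)}" style="width:100%;height:100%">`;
  return `<div class="page" style="${pageStyle}">${img}${spans}</div>`;
}

// Wrap all pages in a minimal HTML document ready to be printed.
function renderDocument(pages: PageInput[]): string {
  const body = pages.map(renderPage).join('');
  return `<!DOCTYPE html><html><head><meta charset="utf-8"></head><body style="margin:0">${body}</body></html>`;
}
```

Each word becomes its own element so that selection and search land on the right spot of the scanned image; the trade-off is a larger DOM for dense pages.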

Images

When the resource is ready to be created

2025-07-14_09-57

When the resource already has text annotation

2025-07-14_09-56

After pressing "Reprocess file"

2025-07-14_09-55

When the resource is not a PDF

image

Some things to consider:

  • The generated PDF is much bigger than the original, so we should probably consider decreasing the image quality before printing.
  • Should we replace the original resource with the modified one? It seems like it could be useful, but it could also be annoying, depending on the user's perspective.
  • Right now I'm using a font-size of 8px for all text. Maybe we could calculate the real size of the text instead. This could be useful when someone wants to copy/paste the content of a single line, for example.
    • Try to match the size of the font by counting pixels from the bounding box
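One possible way to derive a font size from the bounding box, as suggested in the last point, is to use the box height as a rough proxy. The sketch below is an assumption-laden illustration: `BoundingBox`, `estimateFontSizePx`, and the `minPx` floor are hypothetical names, and the division by `scaleFactor` assumes the boxes were measured on an image rendered at that scale.

```typescript
// Hypothetical bounding box shape, in pixels of the rendered page image.
interface BoundingBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

// Estimate a CSS font size from the height of a word's bounding box.
// scaleFactor: how much the page image was scaled up before OCR (assumed).
// minPx: floor so that tiny or degenerate boxes still yield usable text.
function estimateFontSizePx(box: BoundingBox, scaleFactor = 1, minPx = 4): number {
  const heightPx = Math.max(0, box.y1 - box.y0) / scaleFactor;
  return Math.max(minPx, Math.round(heightPx));
}
```

The box height is only an approximation: a word without ascenders or descenders (for example "one") has a shorter box than "high" at the same font size, so taking the height of the whole line, if available, may give steadier results.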

Comparing outputs

| Configuration | Size | File |
|---|---|---|
| original | 590 kB | scanned_doc_3.pdf |
| scaleFactor 1, JPG, quality 0.7 | 1.1 MB | scale_factor_1.pdf |
| scaleFactor 1.5, JPG, quality 0.7 | 1.9 MB | scale_factor_1_5.pdf |
| scaleFactor 2, JPG, quality 0.7 | 2.9 MB | scale_factor_2.pdf |
| scaleFactor 2, webp, quality 0.1 | 1.9 MB | webp_quality_01.pdf |
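To put the sizes above in perspective, the growth relative to the original file can be computed as below. This is only arithmetic on the reported numbers; it assumes decimal units (1 MB = 1000 kB), and the names `growthFactor` and `outputs` are made up for the example.

```typescript
const ORIGINAL_KB = 590;

// Sizes as reported in the comparison, converted to kB (assuming 1 MB = 1000 kB).
const outputs = [
  { config: 'scaleFactor 1, JPG, quality 0.7', sizeKb: 1100 },
  { config: 'scaleFactor 1.5, JPG, quality 0.7', sizeKb: 1900 },
  { config: 'scaleFactor 2, JPG, quality 0.7', sizeKb: 2900 },
  { config: 'scaleFactor 2, webp, quality 0.1', sizeKb: 1900 },
];

// Ratio of output size to original size, rounded to one decimal place.
function growthFactor(sizeKb: number, originalKb: number): number {
  return Math.round((sizeKb / originalKb) * 10) / 10;
}

for (const o of outputs) {
  console.log(`${o.config}: ${growthFactor(o.sizeKb, ORIGINAL_KB)}x the original`);
}
```

Even the smallest configuration is close to twice the size of the original, which supports looking at image quality before printing.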

Testing

To test it, I'm using the PDF linked in the parent issue. I should look for other PDFs to make sure it works in more cases.

@pedr added the enhancement (Feature requests and code enhancements) and desktop (All desktop platforms) labels on Jun 19, 2025
@pedr marked this pull request as ready for review on June 30, 2025 14:24
@pedr requested a review from laurent22 on June 30, 2025 14:24
Development

Successfully merging this pull request may close these issues.

Add a text layer over OCR-ed PDF files