Question: overlaying OCR'd text in package scope? #2

Closed
treysp opened this issue Mar 27, 2019 · 0 comments
treysp commented Mar 27, 2019

Hello,

Thanks so much for pdftools, qpdf, and all the other ropensci packages!

I recently received scanned PDFs and needed to make them searchable. The OCRmyPDF library accomplishes that by running OCR with Tesseract and then adding an invisible text layer over the base raster layer.

It appears that OCRmyPDF uses pikepdf as its primary PDF manipulation tool, and pikepdf is built on QPDF.

I'm not sure whether making PDFs searchable is a common enough need to warrant building, but if it is, would that be in scope for this package, or would it belong somewhere else?

Best,
Trey

EDIT:
Tesseract has a text-only PDF output option that may allow using qpdf's overlay function to create the searchable text layer. Discussion at Tesseract issue 660.

Apparently OCRmyPDF uses that Tesseract output to create the overlaid PDF page. I can't quite figure out if the "sandwich renderer" is a name they came up with or an actual external tool.
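
A minimal sketch of the text-only-plus-overlay idea in Python, assuming the `tesseract` and `qpdf` command-line tools are installed (a Tesseract version that supports the `textonly_pdf` setting and a qpdf release that has `--overlay`), that the scan is a single page already available as an image, and that the text-only page comes out the same size as the scanned page. The function names are hypothetical, not part of either tool, and this is not how OCRmyPDF itself is implemented.

```python
import subprocess
from pathlib import Path
from typing import List


def tesseract_text_only_cmd(image: str, out_base: str) -> List[str]:
    """Build a Tesseract call that writes <out_base>.pdf containing only OCR text."""
    # textonly_pdf=1 leaves the page image out, so only the invisible text is written.
    return ["tesseract", image, out_base, "-c", "textonly_pdf=1", "pdf"]


def qpdf_overlay_cmd(scan_pdf: str, text_pdf: str, out_pdf: str) -> List[str]:
    """Build a qpdf call that lays text_pdf over the pages of scan_pdf."""
    # The bare "--" ends the overlay options; the output file comes after it.
    return ["qpdf", scan_pdf, "--overlay", text_pdf, "--", out_pdf]


def make_searchable(page_image: str, scan_pdf: str, out_pdf: str,
                    run=subprocess.run) -> None:
    """Run both steps; `run` is injectable so the commands can be inspected."""
    # Hypothetical naming scheme for the intermediate text-only PDF.
    text_base = str(Path(out_pdf).with_suffix("")) + "_text"
    run(tesseract_text_only_cmd(page_image, text_base), check=True)
    run(qpdf_overlay_cmd(scan_pdf, text_base + ".pdf", out_pdf), check=True)
```

Calling `make_searchable("page.png", "scan.pdf", "searchable.pdf")` would run Tesseract first and qpdf second. Multi-page scans would need each page rendered to an image first, which this sketch leaves out.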

@treysp treysp closed this as completed Nov 12, 2020