option to disable the OCR'ing of a PDF #271
Comments
This is definitely something worth doing, but I don't know where to start with it. If you can point me to a Python library that can capture the embedded text, then I'm happy to write a short-circuit for the consumption process in this case. Either that, or just submit a PR of your own ;-)
On Debian, you could use
@danielquinn I just read your comment in #158 too. I think this snippet of code does the trick with PDFMiner as well. I just tested it on an already OCR'd PDF 👍 I shamelessly altered https://github.com/euske/pdfminer/blob/8150458718e9024c80b00e74965510b20206e588/tools/pdf2txt.py for simplicity. @MasterofJOKers thoughts? (particularly

    #!/usr/bin/env python
    import getopt
    import sys

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage


    # main
    def main(argv):
        def usage():
            print('usage: %s [-o output] [-t text]'
                  ' file ...' % argv[0])
            return 100

        try:
            (opts, args) = getopt.getopt(argv[1:], 'o:t:')
        except getopt.GetoptError:
            return usage()
        if not args:
            return usage()

        # output option (-t is accepted but ignored in this trimmed-down version)
        outfile = None
        laparams = LAParams()
        for (k, val) in opts:
            if k == '-o':
                outfile = val

        rsrcmgr = PDFResourceManager()
        # file() only exists on Python 2; open() works on both
        if outfile:
            outfp = open(outfile, 'w')
        else:
            outfp = sys.stdout
        device = TextConverter(rsrcmgr, outfp, laparams=laparams)

        # write the embedded text of every page of every input file to outfp
        for fname in args:
            with open(fname, 'rb') as pdf:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                for page in PDFPage.get_pages(pdf):
                    interpreter.process_page(page)
        device.close()
        if outfp is not sys.stdout:
            outfp.close()
        return


    if __name__ == '__main__':
        sys.exit(main(sys.argv))
Doesn't look simple at all, but seems to work. 👍 How would the short-circuit work? Does it disable OCR by option, or as soon as some text is found in the PDF? There might be mixed-content PDFs ...
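One way to handle the mixed-content case raised above is to decide per page rather than per document. The following is only a minimal sketch, not Paperless code: `pages_needing_ocr` and the `min_chars` threshold are hypothetical names, and it assumes some extractor has already produced one string per page.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return the indices of pages whose embedded text looks too short to trust.

    page_texts: list of strings, one per page, from any text extractor.
    min_chars:  hypothetical threshold; pages with fewer non-whitespace
                characters are treated as image-only scans.
    """
    needs_ocr = []
    for index, text in enumerate(page_texts):
        visible = "".join(text.split())  # drop all whitespace before counting
        if len(visible) < min_chars:
            needs_ocr.append(index)
    return needs_ocr


if __name__ == "__main__":
    pages = ["Invoice 2017-11-02, total due 42.00 EUR", "   \n", "p. 3"]
    print(pages_needing_ocr(pages))
```

With a list like this, only the pages that come back would be sent through OCR, and the embedded text of the others could be used directly. Whether a fixed character count is a good enough signal is an open question; a scan with a stamped page number would pass a very low threshold.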
This looks even simpler: https://github.com/jalan/pdftotext
Regarding where... I'll have to take a look at the Paperless codebase, but presumably it'd be right before invoking the existing OCR functionality.
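For comparison with the PDFMiner script, here is roughly what the same extraction might look like with the pdftotext library linked above. Treat it as a sketch: `read_pages` and `embedded_text` are names made up for this example, and the `reader` parameter exists only so the joining logic can be tried without a real PDF. The library itself needs poppler installed.

```python
def read_pages(path):
    """Return one string per page using the third-party pdftotext library."""
    import pdftotext  # imported lazily so the rest works without it installed

    with open(path, "rb") as handle:
        return list(pdftotext.PDF(handle))


def embedded_text(path, reader=read_pages):
    """Join the non-empty pages of a PDF into a single string.

    reader: any callable taking a path and returning a list of page strings
            (hypothetical hook, defaults to the pdftotext-based reader).
    """
    pages = reader(path)
    return "\n\n".join(page.strip() for page in pages if page.strip())
```

Calling `embedded_text("scan.pdf")` on an image-only scan should give back an empty string, which is the signal the consumer would need before deciding whether to OCR.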
Maybe it's as simple as adding an
Looks to be the right spot :)
Yup, @MasterofJOKers is right, modifying the Updating Paperless to do this will require more than just a few Python tweaks: the documentation will have to be amended to include instructions on how to get the dependencies installed, and adding some tests to the RasterisedDocumentParser would be a good idea as well. But I like this and see no reason why it shouldn't happen. PRs are welcome, otherwise, it'll come along in due time.
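To make the short-circuit idea concrete, a sketch of the control flow might look like the following. Nothing here is taken from the Paperless codebase: `get_text`, `force_ocr`, and `min_chars` are hypothetical, and the extractor and OCR step are passed in as plain callables so the logic can be tested with stubs, in the spirit of the tests suggested above.

```python
def get_text(path, extract_embedded, run_ocr, force_ocr=False, min_chars=50):
    """Return document text, skipping OCR when usable embedded text exists.

    extract_embedded: callable(path) -> str, e.g. a PDFMiner or pdftotext wrapper
    run_ocr:          callable(path) -> str, the existing (slow) OCR path
    force_ocr:        hypothetical option to always OCR, whatever the PDF contains
    min_chars:        hypothetical threshold below which embedded text is ignored
    """
    if not force_ocr:
        text = extract_embedded(path)
        if len(text.strip()) >= min_chars:
            return text  # short-circuit: the PDF already carries a text layer
    return run_ocr(path)


if __name__ == "__main__":
    # Stubs standing in for the real extractor and OCR engine.
    has_text = lambda path: "already searchable text " * 5
    no_text = lambda path: ""
    ocr = lambda path: "text produced by OCR"

    print(get_text("a.pdf", has_text, ocr))
    print(get_text("b.pdf", no_text, ocr))
    print(get_text("a.pdf", has_text, ocr, force_ocr=True))
```

Keeping a `force_ocr` switch would cover both answers to the earlier question: OCR is skipped automatically when text is found, but a user with a poor embedded text layer could still opt back in.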
Update: @BastianPoe has issued a PR to do this: #290. There are still a few things that need to be done to ease the transition, but it looks like this is going to happen in the next week or so.
From the description of #290 it seems that this issue can be closed. Or is there anything still missing that this issue requests?
Hooray for closing issues!
I have a Doxie Go scanner. After retrieving the scans and transferring them to the native OS X/macOS app, the software can produce an OCR'd PDF from those scanned images.
I then want to transfer them over to Paperless, but the problem is that Paperless' OCR process more or less ruins the already well-made searchable PDF, whose text is better than what Paperless produces. Is there a way to disable the OCR step, or to have Paperless detect text that might already exist in a PDF?