This repository has been archived by the owner on Feb 19, 2021. It is now read-only.

option to disable the OCR'ing of a PDF #271

Closed
stgarf opened this issue Nov 20, 2017 · 11 comments

Comments

@stgarf
Contributor

stgarf commented Nov 20, 2017

I have a Doxie Go scanner. After retrieving the scans and transferring them to the native macOS app, the software can produce an OCR'd PDF from those scanned images.

I then want to transfer them over to Paperless, but the problem is that Paperless's OCR process kinda ruins the already pretty well done searchable PDF (it's better than what comes out after it goes through Paperless). Is there a way to disable this, or to have Paperless detect text that might already exist in a PDF?

@danielquinn
Collaborator

This is definitely something worth doing, but I don't know where to start with it. If you can point me to a Python library that can capture the embedded text, then I'm happy to write a short-circuit for the consumption process in this case. Either that, or just submit a PR of your own ;-)

@MasterofJOKers
Contributor

On Debian, you could use pdftotext to extract the text part of a PDF. It's included in the poppler-utils package. Haven't found any library bindings for it, though.
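Without bindings, one option is to shell out to the pdftotext binary. The following is only a minimal sketch: the function name, the timeout, and the choice to return an empty string on any failure are assumptions made for illustration, not anything Paperless does. Passing `-` as the output file asks pdftotext to write to stdout.

```python
import shutil
import subprocess


def extract_text_with_pdftotext(pdf_path, timeout=60):
    """Return the text embedded in a PDF, or "" if it cannot be extracted.

    Hypothetical helper: it treats a missing binary, a non-zero exit code
    and a timeout the same way, by returning an empty string.
    """
    # pdftotext ships with poppler-utils; bail out early if it is not on PATH.
    if shutil.which("pdftotext") is None:
        return ""
    try:
        result = subprocess.run(
            ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
            check=True,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        return ""
    return result.stdout.decode("utf-8", errors="replace")
```

Swallowing every error keeps the caller simple (an empty result just means "fall back to OCR"), at the cost of hiding why extraction failed.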

@stgarf
Contributor Author

stgarf commented Nov 20, 2017

@danielquinn I just read your comment in #158 too. I think this snippet of code does the trick with PDFMiner as well; I just tested it on an already OCR'd PDF 👍

I shamelessly altered https://github.com/euske/pdfminer/blob/8150458718e9024c80b00e74965510b20206e588/tools/pdf2txt.py for simplicity. @MasterofJOKers thoughts?

(particularly interpreter.process_page(page) which outputs the PDF text)

#!/usr/bin/env python
import getopt
import sys

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def main(argv):
    def usage():
        print('usage: %s [-o output] [-t text]'
              ' file ...' % argv[0])
        return 100

    try:
        # -t is still accepted for compatibility, but it is ignored below.
        (opts, args) = getopt.getopt(argv[1:], 'o:t:')
    except getopt.GetoptError:
        return usage()

    if not args:
        return usage()

    # Output option: write to the file given with -o, otherwise to stdout.
    outfile = None
    for (k, val) in opts:
        if k == '-o':
            outfile = val

    rsrcmgr = PDFResourceManager()
    # file() only exists in Python 2; open() works on both 2 and 3.
    outfp = open(outfile, 'w') if outfile else sys.stdout
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for fname in args:
        with open(fname, 'rb') as pdf:
            for page in PDFPage.get_pages(pdf):
                # Sends the text found on this page to the output device.
                interpreter.process_page(page)
    device.close()
    # Only close the output if we opened it ourselves, never sys.stdout.
    if outfp is not sys.stdout:
        outfp.close()
    return 0


if __name__ == '__main__':
    sys.exit(main(sys.argv))

@MasterofJOKers
Contributor

Doesn't look simple at all, but seems to work. 👍

How would the short-circuit work? Does it disable OCR by option, or as soon as some text is in the PDF? There might be mixed-content PDFs ...
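To make the two variants and the mixed-content case concrete, here is a rough sketch of a per-page decision. Everything in it is hypothetical: the function names, the `force_ocr` flag standing in for "by option", and the `min_chars` threshold of 20 are illustrative choices, not existing Paperless settings.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return the indexes of pages whose embedded text is too short to trust."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


def choose_ocr_strategy(page_texts, force_ocr=False, min_chars=20):
    """Decide between "skip_ocr", "ocr_some" and "ocr_all".

    page_texts holds one string of already-extracted text per page.
    force_ocr models an explicit option that always runs OCR.
    """
    if force_ocr or not page_texts:
        return "ocr_all"
    missing = pages_needing_ocr(page_texts, min_chars)
    if not missing:
        return "skip_ocr"   # every page already carries text
    if len(missing) == len(page_texts):
        return "ocr_all"    # plain scan with no text layer at all
    return "ocr_some"       # mixed content: OCR only the pages in `missing`
```

With a two-page document where only the first page has a text layer, `choose_ocr_strategy` returns `"ocr_some"` and `pages_needing_ocr` returns `[1]`.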

@stgarf
Contributor Author

stgarf commented Nov 20, 2017

This looks even simpler: https://github.com/jalan/pdftotext
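For comparison, a minimal sketch of reading a PDF with that library, going by its README: `pdftotext.PDF` takes a file opened in binary mode and yields one string per page. The helper names and the `pdf_factory` parameter are assumptions added here so the logic can be tried without the library installed.

```python
def read_pages(pdf_path, pdf_factory=None):
    """Return a list with the embedded text of each page.

    pdf_factory is a hypothetical hook: any callable that takes a binary
    file object and returns an iterable of page strings.
    """
    if pdf_factory is None:
        # Third-party package; it needs the poppler development headers
        # to build, so it is imported only when actually used.
        import pdftotext
        pdf_factory = pdftotext.PDF
    with open(pdf_path, "rb") as handle:
        return [page for page in pdf_factory(handle)]


def embedded_text(pdf_path, pdf_factory=None):
    """Join all pages into one string, separated by blank lines."""
    return "\n\n".join(read_pages(pdf_path, pdf_factory))
```

Returning the pages as a list rather than one string keeps the door open for treating pages with and without text differently.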

Regarding where... I'll have to take a look at the Paperless codebase but presumably it'd be right before invoking the existing OCR functionality.

@MasterofJOKers
Contributor

Maybe it's as simple as adding an if to this method: https://github.com/danielquinn/paperless/blob/master/src/paperless_tesseract/parsers.py#L49

@stgarf
Contributor Author

stgarf commented Nov 20, 2017

Looks to be the right spot :)

@danielquinn
Collaborator

Yup, @MasterofJOKers is right, modifying the RasterisedDocumentParser class is definitely the way to go for this one. As for which library to use, @stgarf's suggestion certainly looks like the cleanest, though the dependencies are a bit hefty.
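As a rough illustration of what such a short-circuit could look like, the sketch below uses a stand-in class rather than the real RasterisedDocumentParser, whose actual methods are not shown in this thread. The class name, the `get_text` method, the two injected callables, and the `min_chars` threshold are all hypothetical.

```python
class ShortCircuitParser:
    """Hypothetical stand-in for a parser that normally always runs OCR."""

    def __init__(self, path, extract_embedded_text, run_ocr, min_chars=50):
        self.path = path
        # Both callables take a file path and return a string. They are
        # injected so the sketch does not depend on any particular
        # text-extraction library or OCR engine.
        self.extract_embedded_text = extract_embedded_text
        self.run_ocr = run_ocr
        self.min_chars = min_chars

    def get_text(self):
        # The short-circuit: if the PDF already carries enough text,
        # return it and skip OCR entirely.
        text = self.extract_embedded_text(self.path)
        if len(text.strip()) >= self.min_chars:
            return text
        # Otherwise fall back to the OCR path.
        return self.run_ocr(self.path)
```

A length threshold is a crude test for "already OCR'd"; it would not catch a mixed-content PDF where only some pages have a text layer.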

Updating Paperless to do this will require more than just a few Python tweaks: the documentation will have to be amended to include instructions on how to get the dependencies installed, and adding some tests to the RasterisedDocumentParser would be a good idea as well.

But I like this and see no reason why it shouldn't happen. PRs are welcome, otherwise, it'll come along in due time.

@danielquinn
Collaborator

Update: @BastianPoe has issued a PR to do this: #290. There are still a few things that need to be done to ease the transition, but it looks like this is going to happen in the next week or so.

@retog

retog commented Jun 16, 2018

From the description of #290 it seems that this issue can be closed. Or is there anything still missing that this issue requests?

@danielquinn
Collaborator

Hooray for closing issues!
