option to disable the OCR'ing of a PDF #271

stgarf · 2017-11-20T09:12:02Z

I have a Doxie Go scanner and after retrieving the scans and transferring them to the native os x/macos app, the software can produce an OCR'd pdf from those scanned images.

I then want to transfer them over to Paperless, but the problem is that Paperless' OCR process kinda ruins the already pretty well done searchable PDF (which is better than what Paperless itself produces). Is there a way this can be disabled, or could Paperless detect text that already exists in a PDF?

danielquinn · 2017-11-20T10:16:06Z

This is definitely something worth doing, but I don't know where to start with it. If you can point me to a Python library that can capture the embedded text, then I'm happy to write a short-circuit for the consumption process in this case. Either that, or just submit a PR of your own ;-)

MasterofJOKers · 2017-11-20T10:22:04Z

On Debian, you could use pdftotext to extract the text part of a PDF. It's included in the poppler-utils package. Haven't found any library bindings for it, though.
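Even without library bindings, the command-line tool can be driven from Python. The following is a minimal sketch, assuming `pdftotext` from poppler-utils is on the `PATH`; the function name `extract_embedded_text` and its `binary` parameter are made up for illustration and are not part of Paperless.

```python
import shutil
import subprocess


def extract_embedded_text(pdf_path, binary="pdftotext"):
    """Return the text layer of pdf_path, or None if it cannot be read.

    Hypothetical helper: shells out to poppler's pdftotext.
    """
    if shutil.which(binary) is None:
        return None  # the tool (poppler-utils) is not installed
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        [binary, pdf_path, "-"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode != 0:
        return None  # unreadable or damaged PDF
    return result.stdout.decode("utf-8", errors="replace")
```

Returning `None` rather than raising lets a caller fall back to OCR when the tool is missing or the PDF cannot be parsed.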

stgarf · 2017-11-20T10:27:00Z

@danielquinn I just read your comment in #158 too.. I think this snippet of code does the trick with PDFMiner as well. I just tested it on an already OCR'd pdf 👍

I shamelessly altered https://github.com/euske/pdfminer/blob/8150458718e9024c80b00e74965510b20206e588/tools/pdf2txt.py for simplicity. @MasterofJOKers thoughts?

(particularly interpreter.process_page(page) which outputs the PDF text)

#!/usr/bin/env python3
import getopt
import sys

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def main(argv):
    def usage():
        print('usage: %s [-o output] file ...' % argv[0])
        return 100

    try:
        opts, args = getopt.getopt(argv[1:], 'o:')
    except getopt.GetoptError:
        return usage()
    if not args:
        return usage()

    # Output option: write to a file if -o is given, otherwise to stdout.
    outfile = None
    for k, val in opts:
        if k == '-o':
            outfile = val
    outfp = open(outfile, 'w') if outfile else sys.stdout

    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for fname in args:
        # file() no longer exists in Python 3; open() replaces it.
        with open(fname, 'rb') as pdf:
            for page in PDFPage.get_pages(pdf):
                # process_page() sends the page's embedded text to outfp.
                interpreter.process_page(page)
    device.close()
    # Only close the output if we opened it ourselves; leave stdout alone.
    if outfp is not sys.stdout:
        outfp.close()
    return 0


if __name__ == '__main__':
    sys.exit(main(sys.argv))

MasterofJOKers · 2017-11-20T10:36:56Z

Doesn't look simple at all, but seems to work. 👍

How would the short-circuit work? Does it disable OCR by option, or as soon as some text is in the PDF? There might be mixed-content PDFs ...
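One possible answer to the mixed-content question is to decide per page rather than per document. The sketch below assumes the embedded text is already available as one string per page (from whichever extractor is chosen); the function name `pages_needing_ocr` and the `min_chars` threshold of 50 are arbitrary illustrations, not anything Paperless defines.

```python
def pages_needing_ocr(page_texts, min_chars=50):
    """Return the indices of pages with too little embedded text.

    page_texts: list of strings, one per page.
    min_chars:  assumed cut-off below which a page is treated as image-only.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


# A three-page PDF where only the first page carries a real text layer.
pages = ["a" * 60, "   ", "b" * 10]
print(pages_needing_ocr(pages))  # [1, 2]
```

An empty result means OCR could be skipped entirely; a non-empty one means only those pages would be sent through OCR.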

stgarf · 2017-11-20T10:55:29Z

Looks even simpler https://github.com/jalan/pdftotext

Regarding where... I'll have to take a look at the Paperless codebase but presumably it'd be right before invoking the existing OCR functionality.
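For comparison with the PDFMiner script, here is a rough sketch of the same extraction using the pdftotext bindings linked above. It assumes the package is installed (it needs the poppler development headers to build); the helper names `read_pages` and `joined_text` are invented for this example.

```python
def read_pages(pdf_path):
    """Return a list with the embedded text of each page."""
    # Imported lazily so the module loads even if pdftotext is absent.
    import pdftotext

    with open(pdf_path, "rb") as handle:
        # pdftotext.PDF behaves like a sequence of page strings.
        return list(pdftotext.PDF(handle))


def joined_text(pages):
    """Join per-page strings into one document, blank line between pages."""
    return "\n\n".join(pages)
```

With these, `joined_text(read_pages("scan.pdf"))` would give the whole text layer as a single string.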

MasterofJOKers · 2017-11-20T10:57:18Z

Maybe it's as simple as adding an if to this method: https://github.com/danielquinn/paperless/blob/master/src/paperless_tesseract/parsers.py#L49

stgarf · 2017-11-20T11:04:15Z

Looks to be the right spot :)
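The "add an if" idea could look roughly like the following. This is only a sketch: `ShortCircuitParser`, its constructor arguments, and the `MIN_CHARS` threshold are hypothetical stand-ins, and the real parser method in Paperless may be shaped quite differently.

```python
class ShortCircuitParser:
    """Hypothetical stand-in for a parser class; not Paperless's real API."""

    MIN_CHARS = 50  # assumed threshold for "this PDF already has text"

    def __init__(self, extract_embedded_text, run_ocr):
        # Both callables take a file path and return text (or None).
        self._extract = extract_embedded_text
        self._run_ocr = run_ocr

    def get_text(self, path):
        embedded = self._extract(path) or ""
        if len(embedded.strip()) >= self.MIN_CHARS:
            return embedded  # short-circuit: the PDF is already searchable
        return self._run_ocr(path)  # otherwise fall back to OCR
```

Passing the extractor and the OCR step in as callables keeps the decision logic easy to test without real PDFs or Tesseract.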

danielquinn · 2017-11-20T11:15:31Z

Yup, @MasterofJOKers is right, modifying the RasterisedDocumentParser class is definitely the way to go for this one. As for which library to use, @stgarf's suggestion certainly looks like the cleanest, though the dependencies are a bit hefty.

Updating Paperless to do this will require more than just a few Python tweaks: the documentation will have to be amended to include instructions on how to get the dependencies installed, and adding some tests to the RasterisedDocumentParser would be a good idea as well.

But I like this and see no reason why it shouldn't happen. PRs are welcome, otherwise, it'll come along in due time.

danielquinn · 2018-01-29T11:01:15Z

Update: @BastianPoe has issued a PR to do this: #290. There are still a few things that need to be done to ease the transition, but it looks like this is going to happen in the next week or so.

retog · 2018-06-16T10:17:32Z

From the description of #290 it seems that this issue can be closed. Or is there anything still missing that this issue requests?

danielquinn · 2018-07-08T15:51:02Z

Hooray for closing issues!

danielquinn added the help wanted label Nov 20, 2017

danielquinn added the enhancement label Nov 20, 2017

bkanuka mentioned this issue Nov 26, 2017

Hybridise PDFs with combined OCR'd text #19

Closed

danielquinn closed this as completed Jul 8, 2018
