Trouble processing multi-page documents #345
Comments
Hi again! Here is an update after several tries. What could be done to keep the app from getting stuck when it encounters a blank page? Many thanks, |
Paperless correctly handles blank pages on my installation, so this is definitely a bug. There seems to be no error in the log; could you increase the log level to DEBUG? It might be a bug in the underlying OCR library. |
Sure, I would be glad to provide a more detailed debug report on the issue. Thanks, |
Here is my
and
|
Are all Raspberry Pis ARM? It seems like issue #337 could be related. I'm not sure where or how the logging level is set up. Putting |
Okay, I also use an RPi 3 (ARMv7) to test Paperless. Maybe it would have been better to use a server, but it was much easier to just give it a try this way, since I saw it was doable using docker-compose. And yes, all RPis are ARM-based. Some more info,
Then,
Finally,
|
I will restart the docker-compose stack with this environment variable set. |
The .env file is updated and the stack has been restarted.
|
I removed the
So it looks like the consumer is crashing and restarting each time. Someone will have to look deeper into that error. Would you be willing to share that file (or another one that produces the same result) so we can verify the fix? |
Basically, it occurred with every document containing a blank page that I fed to the consumer. |
Hello everyone, here is the log for a new document I tried to process using Paperless (I removed the webserver parts):
Please find enclosed the aforementioned PDF. I hope it will help in understanding why this is occurring ^^ Thanks a lot, |
|
I also managed to create a simple PDF file which causes the 2018-04-22T171610_Scan_000250.pdf But it is a single page. |
Handling all exceptions on line 288 in src/paperless_tesseract/parsers.py:
try:
    orientation = ocr.detect_orientation(f, lang=lang)
    f = f.rotate(orientation["angle"], expand=1)
except:
    pass
works, but it might not be an ideal solution (Python newbie :) ). |
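As a rough illustration of the workaround above, here is a minimal sketch of a slightly narrower variant: catch only the errors you expect, log them, and fall back to leaving the page unrotated. Everything in it is hypothetical and not taken from Paperless or pyocr: `OrientationError`, `detect_orientation_stub`, and `safe_angle` are stand-ins invented for the example, and the real exception type raised by the OCR backend would need to be checked.

```python
import logging

logger = logging.getLogger(__name__)


class OrientationError(Exception):
    """Hypothetical stand-in for whatever the OCR backend raises."""


def detect_orientation_stub(image):
    """Hypothetical detector that fails on blank pages, as in this thread."""
    if image.get("blank"):
        raise OrientationError("no orientation found")
    return {"angle": image.get("angle", 0)}


def safe_angle(image, detect=detect_orientation_stub, errors=(OrientationError,)):
    """Return the detected rotation angle, or 0 if detection fails."""
    try:
        return detect(image)["angle"]
    except errors as exc:
        # Log instead of silently passing, so the failure stays visible.
        logger.warning("Orientation detection failed, page left unrotated: %s", exc)
        return 0


if __name__ == "__main__":
    print(safe_angle({"angle": 90}))    # normal page
    print(safe_angle({"blank": True}))  # blank page falls back to 0
```

Compared with a bare `except: pass`, this keeps unexpected errors (anything not listed in `errors`) visible instead of swallowing them, at the cost of having to know which exceptions the backend can raise.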
Ah ok I've found the problem. It looks like a bug in pyocr. Basically an exception is being triggered because orientation can't be found (totally normal) and that exception is being caught, but the handling of that exception is breaking on newer versions of Python that don't have a I've just patched Paperless to also account for an |
@danielquinn works for me 👍 |
Hooray! |
Thanks for looking into this @danielquinn! Paperless works fine again now. I also very much appreciate your pull request in pyocr rather than just working around the issue in Paperless. |
PyOCR has been patched, and when they do a release, I'll update Paperless to require that version. |
Yeehaa! |
It works perfectly! |
Hi everyone,
First of all, kudos for the project, it is really interesting.
I am giving it a try right now.
The setup runs in a Raspberry Pi Docker environment.
All was going fine so far: my first single-page PDFs were processed correctly.
Then I dropped two 2-page recto-verso PDFs (so 4 sheets per doc) into the consumer_folder, plus a new single-page document for which I had created a specific 'correspondent' rule in the web app to see how the feature works.
Finally I ended up with an infinite loop over the first 2-page pdf:
Then I deleted it from the folder and the consumer started to process the second 2-page pdf:
It was heading the same way, so I manually deleted that one from the consumer's folder too.
Then it started to process the single page pdf with success:
I did not get an automatic correspondent match, because the literal expression I specified did not appear in the final OCR'ed output, but that is another topic.
My question is: is Paperless able to process multi-page PDFs, and if so, what could cause the loops I experienced on this type of document?
Many thanks,