Trouble processing multi-page documents #345

GarethFox · 2018-04-26T10:25:37Z

Hi everyone,

First of all, big up for the project, really interesting.
I am giving it a try right now.
It is running in a Docker environment on a Raspberry Pi.
All was going fine so far: my first single-page PDFs were processed correctly.
Then I dropped two 2-page recto-verso PDFs (so 4 sheets per doc) into the consumer folder, plus a new single-page document for which I had created a specific 'correspondent' rule in the web app to see how the feature works.
Finally I ended up with an infinite loop over the first 2-page PDF:

  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational

Then I deleted it from the folder and the consumer started to process the second 2-page pdf:

April 26, 2018, 10 a.m. | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0002.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0002.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational

It was heading the same way, so I manually deleted that one from the consumer folder too.
Then it processed the single-page PDF successfully:

April 26, 2018, 10:05 a.m. | Document 20180329000000: SKM_C454e18042611340 consumption finished | Informational
  | Completed | Informational
  | Detected document date 2018-03-29T00:00:00+00:00 based on string 29/03/2018 | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611340.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational

I did not get an automatic correspondent match, because the literal expression I specified did not appear in the final OCR'ed output, but that is another topic.

My question is: is Paperless able to process multi-page PDFs, and if so, what could cause the loops I experienced on this type of document?

Many thanks,

GarethFox · 2018-04-26T13:59:40Z

Hi again!

Here is an update after several tries.
It seems to be caused by blank pages inside documents, which is a really common case in the usage I am planning for Paperless.
To be sure, I tried with a PDF containing 4 grouped single pages, each with a blank verso.
The result was a loop in the log for that document.
After deleting it from the consumer folder, I recreated the PDF without the blank pages.
This time the result was positive and the document was fully processed.
Finally, I split a filled recto and a blank verso into 2 distinct PDFs.
Once dropped into the consumer folder, it started looping over the second page (the blank one); I deleted it, and it then correctly processed the recto containing the information.

What solution could there be to stop the app getting stuck when it encounters a blank page?

Many thanks,

ovv · 2018-04-26T14:09:43Z

Paperless correctly handles blank pages on my installation, so this is definitely a bug.

There seems to be no error in the log. Could you increase the level to DEBUG? It might be a bug in the underlying OCR library.

GarethFox · 2018-04-26T14:13:25Z

Sure,

I would be glad to provide a more detailed debug report on the issue.
I am using a Raspberry Pi with the docker-compose setup.
Which commands should I run, or which log file should I look at?

Thx,

GarethFox · 2018-04-26T14:17:28Z

Here is my docker-compose.yml

version: '2.1'

services:
    webserver:
        build: ./
        # uncomment the following line to start automatically on system boot
        restart: unless-stopped
        ports:
            # You can adapt the port you want Paperless to listen on by
            # modifying the part before the `:`.
            - "8000:8000"
        healthcheck:
            test: ["CMD", "curl" , "-f", "http://localhost:8000"]
            interval: 30s
            timeout: 10s
            retries: 5
        volumes:
            - data:/usr/src/paperless/data
            - media:/usr/src/paperless/media
        env_file: docker-compose.env
        # The reason the line is here is so that the webserver that doesn't do
        # any text recognition and doesn't have to install unnecessary
        # languages the user might have set in the env-file by overwriting the
        # value with nothing.
        environment:
            - PAPERLESS_OCR_LANGUAGES=
        command: ["runserver", "--insecure", "--noreload", "0.0.0.0:8000"]

    consumer:
        build: ./
        # uncomment the following line to start automatically on system boot
        restart: unless-stopped
        depends_on:
            webserver:
                condition: service_healthy
        volumes:
            - data:/usr/src/paperless/data
            - media:/usr/src/paperless/media
            # You have to adapt the local path you want the consumption
            # directory to mount to by modifying the part before the ':'.
            - ./consume:/consume
            # Likewise, you can add a local path to mount a directory for
            # exporting. This is not strictly needed for paperless to
            # function, only if you're exporting your files: uncomment
            # it and fill in a local path if you know you're going to
            # want to export your documents.
            # - /path/to/another/arbitrary/place:/export
        env_file: docker-compose.env
        command: ["document_consumer"]

volumes:
    data:
    media:

and docker-compose.env file:

# Environment variables to set for Paperless
# Commented out variables will be replaced by a default within Paperless.

# Passphrase Paperless uses to encrypt and decrypt your documents
PAPERLESS_PASSPHRASE=123456

# The amount of threads to use for text recognition
# PAPERLESS_OCR_THREADS=4

# Additional languages to install for text recognition
PAPERLESS_OCR_LANGUAGES=deu eng fra

# You can change the default user and group id to a custom one
# USERMAP_UID=1000
# USERMAP_GID=1000

ovv · 2018-04-26T14:19:18Z

Are all Raspberry Pis ARM? It seems issue #337 could be related.

I'm not sure where or how the logging level is set up. Putting DJANGO_LOG_LEVEL=DEBUG in docker-compose.env might do it.
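
For reference, here is a minimal sketch of what a settings module would have to do for that variable to have any effect. Whether Paperless's own settings actually read DJANGO_LOG_LEVEL is not confirmed in this thread; the function name `build_logging_config` and the "INFO" fallback are assumptions made for illustration only.

```python
import logging
import logging.config
import os


def build_logging_config(environ=os.environ):
    """Build a logging dict whose root level comes from DJANGO_LOG_LEVEL.

    Hypothetical helper: this is not taken from Paperless's settings.
    """
    # Fall back to INFO when the variable is not set (assumed default).
    level = environ.get("DJANGO_LOG_LEVEL", "INFO").upper()
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "handlers": {"console": {"class": "logging.StreamHandler"}},
        "root": {"handlers": ["console"], "level": level},
    }


# Apply the configuration as if DJANGO_LOG_LEVEL=DEBUG were in the env file.
logging.config.dictConfig(build_logging_config({"DJANGO_LOG_LEVEL": "DEBUG"}))
```

If the settings do not read the variable in some way like this, adding it to docker-compose.env will change nothing, and the output of docker-compose logs is the better place to look.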

GarethFox · 2018-04-26T14:26:20Z

Okay,

I am also using an RPi 3 (ARMv7) to test Paperless. Maybe it would have been better to use a server, but it was really easier to just give it a try this way, as I saw it was doable using docker-compose. And yes, all RPis are ARM-based.
By the way, apart from that issue, everything is running fine for me so far ^^, great job!

Some more info, sudo docker version output:

Client:
 Version:       18.04.0-ce
 API version:   1.37
 Go version:    go1.9.4
 Git commit:    3d479c0
 Built: Tue Apr 10 18:25:24 2018
 OS/Arch:       linux/arm
 Experimental:  false
 Orchestrator:  swarm

Server:
 Engine:
  Version:      18.04.0-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   3d479c0
  Built:        Tue Apr 10 18:21:25 2018
  OS/Arch:      linux/arm
  Experimental: false

Then, sudo docker info output:

Live Restore Enabled: false
WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: No oom kill disable support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support

Finally, sudo docker-compose version output:

docker-compose version 1.21.0, build 5920eb0
docker-py version: 3.3.0
CPython version: 3.5.3
OpenSSL version: OpenSSL 1.1.0f  25 May 2017

GarethFox · 2018-04-26T14:29:36Z

I will restart the docker-compose stack with this environment variable added.
Where can I access the debug log to give you the output once done?

GarethFox · 2018-04-26T14:57:01Z

The .env file is updated and the stack has been restarted.
I can't see more detailed information through the web app's documents/logs page, but I found these lines related to the document currently looping by running sudo docker-compose logs:

...
adapted below
...

ovv · 2018-04-26T16:02:06Z

I removed the webserver_1 logs to have a better look at consumer_1.

consumer_1   | Operations to perform:
consumer_1   |   Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
consumer_1   | Running migrations:
consumer_1   |   No migrations to apply.
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/armhf/APKINDEX.tar.gz
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/armhf/APKINDEX.tar.gz
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/armhf/APKINDEX.tar.gz
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/armhf/APKINDEX.tar.gz
consumer_1   | (1/1) Installing tesseract-ocr-data-deu (3.05.01-r2)
consumer_1   | OK: 263 MiB in 122 packages
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/armhf/APKINDEX.tar.gz
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/armhf/APKINDEX.tar.gz
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/armhf/APKINDEX.tar.gz
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/armhf/APKINDEX.tar.gz
consumer_1   | (1/1) Installing tesseract-ocr-data-fra (3.05.01-r2)
consumer_1   | OK: 299 MiB in 123 packages
consumer_1   | Starting document consumer at /consume
consumer_1   | Parsers available: RasterisedDocumentParser
consumer_1   | Consuming /consume/SKM_C454e18042615130_0001.pdf
consumer_1   | convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `/tmp/paperless/paperless-fru4z33m/convert-%04d.png' @ warning/png.c/MagickPNGWarningHandler/1744.
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0001.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0001.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0004.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0004.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0000.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0000.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0005.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0005.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74baa490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74b92490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74bdf490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74bd1490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [image2 @ 0x749909d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749909d0] Encoder did not produce proper pts, making some up.
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0003.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0003.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74bc3490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [image2 @ 0x749dd9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749dd9d0] Encoder did not produce proper pts, making some up.
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0002.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0002.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74b84490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [image2 @ 0x749a89d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749a89d0] Encoder did not produce proper pts, making some up.
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0006.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0006.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74b80490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [image2 @ 0x749cf9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749cf9d0] Encoder did not produce proper pts, making some up.
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0007.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0007.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74b7c490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [image2 @ 0x749c19d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749c19d0] Encoder did not produce proper pts, making some up.
consumer_1   | [image2 @ 0x7497e9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x7497e9d0] Encoder did not produce proper pts, making some up.
consumer_1   | [image2 @ 0x749829d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749829d0] Encoder did not produce proper pts, making some up.
consumer_1   | [image2 @ 0x7497a9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x7497a9d0] Encoder did not produce proper pts, making some up.
consumer_1   | OCRing the document
consumer_1   | Parsing for eng
consumer_1   | Parsing for deu
consumer_1   | multiprocessing.pool.RemoteTraceback:
consumer_1   | """
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 214, in detect_orientation
consumer_1   |     angle = int(output.get('Rotate', output['Orientation in degrees']))
consumer_1   | KeyError: 'Orientation in degrees'
consumer_1   |
consumer_1   | During handling of the above exception, another exception occurred:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
consumer_1   |     result = (True, func(*args, **kwds))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
consumer_1   |     return list(map(*args))
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in image_to_string
consumer_1   |     orientation = ocr.detect_orientation(f, lang=lang)
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 224, in detect_orientation
consumer_1   |     % (ex.message, original_output))
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | """
consumer_1   |
consumer_1   | The above exception was the direct cause of the following exception:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/src/paperless/src/manage.py", line 18, in <module>
consumer_1   |     execute_from_command_line(sys.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
consumer_1   |     utility.execute()
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
consumer_1   |     self.fetch_command(subcommand).run_from_argv(self.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
consumer_1   |     self.execute(*args, **cmd_options)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
consumer_1   |     output = self.handle(*args, **options)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 85, in handle
consumer_1   |     self.loop(mail_delta=mail_delta)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 105, in loop
consumer_1   |     self.file_consumer.run()
consumer_1   |   File "/usr/src/paperless/src/documents/consumer.py", line 123, in run
consumer_1   |     date = parsed_document.get_date()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 209, in get_date
consumer_1   |     text = self.get_text()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 80, in get_text
consumer_1   |     self._text = self._get_ocr(images)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 161, in _get_ocr
consumer_1   |     return self._ocr(imgs, ISO639[guessed_language])
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 189, in _ocr
consumer_1   |     r = pool.map(image_to_string, itertools.product(imgs, [lang]))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
consumer_1   |     return self._map_async(func, iterable, mapstar, chunksize).get()
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
consumer_1   |     raise self._value
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | Operations to perform:
consumer_1   |   Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
consumer_1   | Running migrations:
consumer_1   |   No migrations to apply.
consumer_1   | Starting document consumer at /consume
consumer_1   | Parsers available: RasterisedDocumentParser
consumer_1   | Consuming /consume/SKM_C454e18042615130_0001.pdf
consumer_1   | convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `/tmp/paperless/paperless-f80ru5f_/convert-%04d.png' @ warning/png.c/MagickPNGWarningHandler/1744.

So it looks like the consumer is crashing and restarting each time. Someone will have to look deeper into that error. Would you be willing to share that file (or another one that produces the same result) so we can verify the fix?
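
To make the traceback easier to read, here is a small sketch of what seems to be happening. It is built only from the two pyocr lines visible in the log (lines 214 and 224 of tesseract.py) and is not pyocr's actual source. The function names, the `RuntimeError` stand-in, and the assumption that a blank page yields no orientation data are mine. The certain part is that Python 3 exceptions have no `.message` attribute, so the error handler itself fails and the original `KeyError` is hidden behind an `AttributeError`.

```python
def parse_orientation_buggy(output):
    """Mimics the failing pattern seen in the traceback (sketch only)."""
    try:
        # The default argument is evaluated eagerly, so a missing
        # 'Orientation in degrees' key raises KeyError right here.
        return int(output.get('Rotate', output['Orientation in degrees']))
    except (KeyError, ValueError) as ex:
        # Python 3 exceptions have no .message, so this line raises
        # AttributeError instead of the intended error.
        raise RuntimeError("No orientation: %s (%r)" % (ex.message, output))


def parse_orientation_fixed(output):
    """Same logic, but formats the exception with str(ex)."""
    try:
        return int(output.get('Rotate', output['Orientation in degrees']))
    except (KeyError, ValueError) as ex:
        raise RuntimeError("No orientation: %s (%r)" % (str(ex), output)) from ex


def orientation_or_none(output):
    """Hypothetical caller-side guard: unknown orientation becomes None."""
    try:
        return parse_orientation_fixed(output)
    except RuntimeError:
        return None
```

With an empty dict standing in for the output of a page without orientation data, the first function raises `AttributeError` (as in the log), while the guarded variant returns `None` and would let the caller carry on with the remaining pages.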

GarethFox · 2018-04-26T17:20:34Z

Basically it occurred with every document containing a blank page that I fed to the consumer.
As those were sensitive invoice documents, I will give it another try tomorrow with a neutral source document.
No problem sharing that kind of document, of course ^^

GarethFox · 2018-04-27T07:12:48Z

Hello fellows,

Here is the log for a new document I tried to process using Paperless (I removed the webserver parts):

consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
consumer_1   |     return self._map_async(func, iterable, mapstar, chunksize).get()
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
consumer_1   |     raise self._value
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | Operations to perform:
consumer_1   |   Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
consumer_1   | Running migrations:
consumer_1   |   No migrations to apply.
consumer_1   | Starting document consumer at /consume
consumer_1   | Parsers available: RasterisedDocumentParser
consumer_1   | Consuming /consume/SKM_C454e18042708530_0001.pdf
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-pikj79ce/convert-0000.pnm -> /tmp/paperless/paperless-pikj79ce/convert-0000.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-pikj79ce/convert-0001.pnm -> /tmp/paperless/paperless-pikj79ce/convert-0001.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74bb2490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74b25490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [image2 @ 0x749b09d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749b09d0] Encoder did not produce proper pts, making some up.
consumer_1   | [image2 @ 0x749239d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749239d0] Encoder did not produce proper pts, making some up.
consumer_1   | OCRing the document
consumer_1   | Parsing for eng
consumer_1   | multiprocessing.pool.RemoteTraceback:
consumer_1   | """
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 214, in detect_orientation
consumer_1   |     angle = int(output.get('Rotate', output['Orientation in degrees']))
consumer_1   | KeyError: 'Orientation in degrees'
consumer_1   |
consumer_1   | During handling of the above exception, another exception occurred:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
consumer_1   |     result = (True, func(*args, **kwds))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
consumer_1   |     return list(map(*args))
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in image_to_string
consumer_1   |     orientation = ocr.detect_orientation(f, lang=lang)
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 224, in detect_orientation
consumer_1   |     % (ex.message, original_output))
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | """
consumer_1   |
consumer_1   | The above exception was the direct cause of the following exception:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/src/paperless/src/manage.py", line 18, in <module>
consumer_1   |     execute_from_command_line(sys.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
consumer_1   |     utility.execute()
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
consumer_1   |     self.fetch_command(subcommand).run_from_argv(self.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
consumer_1   |     self.execute(*args, **cmd_options)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
consumer_1   |     output = self.handle(*args, **options)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 85, in handle
consumer_1   |     self.loop(mail_delta=mail_delta)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 105, in loop
consumer_1   |     self.file_consumer.run()
consumer_1   |   File "/usr/src/paperless/src/documents/consumer.py", line 123, in run
consumer_1   |     date = parsed_document.get_date()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 209, in get_date
consumer_1   |     text = self.get_text()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 80, in get_text
consumer_1   |     self._text = self._get_ocr(images)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 140, in _get_ocr
consumer_1   |     raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 189, in _ocr
consumer_1   |     r = pool.map(image_to_string, itertools.product(imgs, [lang]))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
consumer_1   |     return self._map_async(func, iterable, mapstar, chunksize).get()
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
consumer_1   |     raise self._value
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | Operations to perform:
consumer_1   |   Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
consumer_1   | Running migrations:
consumer_1   |   No migrations to apply.
consumer_1   | Starting document consumer at /consume
consumer_1   | Parsers available: RasterisedDocumentParser
consumer_1   | Consuming /consume/SKM_C454e18042708530_0001.pdf
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-5cnilohm/convert-0001.pnm -> /tmp/paperless/paperless-5cnilohm/convert-0001.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-5cnilohm/convert-0000.pnm -> /tmp/paperless/paperless-5cnilohm/convert-0000.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74b95490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74b40490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [image2 @ 0x7493e9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x7493e9d0] Encoder did not produce proper pts, making some up.
consumer_1   | [image2 @ 0x749939d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749939d0] Encoder did not produce proper pts, making some up.
consumer_1   | OCRing the document
consumer_1   | Parsing for eng
consumer_1   | multiprocessing.pool.RemoteTraceback:
consumer_1   | """
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 214, in detect_orientation
consumer_1   |     angle = int(output.get('Rotate', output['Orientation in degrees']))
consumer_1   | KeyError: 'Orientation in degrees'
consumer_1   |
consumer_1   | During handling of the above exception, another exception occurred:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
consumer_1   |     result = (True, func(*args, **kwds))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
consumer_1   |     return list(map(*args))
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in image_to_string
consumer_1   |     orientation = ocr.detect_orientation(f, lang=lang)
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 224, in detect_orientation
consumer_1   |     % (ex.message, original_output))
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | """
consumer_1   |
consumer_1   | The above exception was the direct cause of the following exception:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/src/paperless/src/manage.py", line 18, in <module>
consumer_1   |     execute_from_command_line(sys.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
consumer_1   |     utility.execute()
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
consumer_1   |     self.fetch_command(subcommand).run_from_argv(self.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
consumer_1   |     self.execute(*args, **cmd_options)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
consumer_1   |     output = self.handle(*args, **options)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 85, in handle
consumer_1   |     self.loop(mail_delta=mail_delta)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 105, in loop
consumer_1   |     self.file_consumer.run()
consumer_1   |   File "/usr/src/paperless/src/documents/consumer.py", line 123, in run
consumer_1   |     date = parsed_document.get_date()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 209, in get_date
consumer_1   |     text = self.get_text()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 80, in get_text
consumer_1   |     self._text = self._get_ocr(images)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 140, in _get_ocr
consumer_1   |     raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 189, in _ocr
consumer_1   |     r = pool.map(image_to_string, itertools.product(imgs, [lang]))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
consumer_1   |     return self._map_async(func, iterable, mapstar, chunksize).get()
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
consumer_1   |     raise self._value
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | Operations to perform:
consumer_1   |   Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
consumer_1   | Running migrations:
consumer_1   |   No migrations to apply.
consumer_1   | Starting document consumer at /consume
consumer_1   | Parsers available: RasterisedDocumentParser
consumer_1   | Consuming /consume/SKM_C454e18042708530_0001.pdf
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-9b53x_g5/convert-0000.pnm -> /tmp/paperless/paperless-9b53x_g5/convert-0000.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-9b53x_g5/convert-0001.pnm -> /tmp/paperless/paperless-9b53x_g5/convert-0001.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74b60490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74bee490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [image2 @ 0x7495e9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x7495e9d0] Encoder did not produce proper pts, making some up.
consumer_1   | [image2 @ 0x749ec9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749ec9d0] Encoder did not produce proper pts, making some up.
consumer_1   | OCRing the document
consumer_1   | Parsing for eng
consumer_1   | multiprocessing.pool.RemoteTraceback:
consumer_1   | """
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 214, in detect_orientation
consumer_1   |     angle = int(output.get('Rotate', output['Orientation in degrees']))
consumer_1   | KeyError: 'Orientation in degrees'
consumer_1   |
consumer_1   | During handling of the above exception, another exception occurred:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
consumer_1   |     result = (True, func(*args, **kwds))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
consumer_1   |     return list(map(*args))
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in image_to_string
consumer_1   |     orientation = ocr.detect_orientation(f, lang=lang)
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 224, in detect_orientation
consumer_1   |     % (ex.message, original_output))
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | """
consumer_1   |
consumer_1   | The above exception was the direct cause of the following exception:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/src/paperless/src/manage.py", line 18, in <module>
consumer_1   |     execute_from_command_line(sys.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
consumer_1   |     utility.execute()
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
consumer_1   |     self.fetch_command(subcommand).run_from_argv(self.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
consumer_1   |     self.execute(*args, **cmd_options)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
consumer_1   |     output = self.handle(*args, **options)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 85, in handle
consumer_1   |     self.loop(mail_delta=mail_delta)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 105, in loop
consumer_1   |     self.file_consumer.run()
consumer_1   |   File "/usr/src/paperless/src/documents/consumer.py", line 123, in run
consumer_1   |     date = parsed_document.get_date()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 209, in get_date
consumer_1   |     text = self.get_text()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 80, in get_text
consumer_1   |     self._text = self._get_ocr(images)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 140, in _get_ocr
consumer_1   |     raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 189, in _ocr
consumer_1   |     r = pool.map(image_to_string, itertools.product(imgs, [lang]))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
consumer_1   |     return self._map_async(func, iterable, mapstar, chunksize).get()
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
consumer_1   |     raise self._value
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | Operations to perform:
consumer_1   |   Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
consumer_1   | Running migrations:
consumer_1   |   No migrations to apply.
consumer_1   | Starting document consumer at /consume
consumer_1   | Parsers available: RasterisedDocumentParser
consumer_1   | Consuming /consume/SKM_C454e18042708530_0001.pdf
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-6ikf7uep/convert-0001.pnm -> /tmp/paperless/paperless-6ikf7uep/convert-0001.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-6ikf7uep/convert-0000.pnm -> /tmp/paperless/paperless-6ikf7uep/convert-0000.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74b25490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74b7c490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [image2 @ 0x749239d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749239d0] Encoder did not produce proper pts, making some up.

Please find attached the aforementioned PDF; I hope it helps in understanding why this is occurring ^^

Thanks a lot,

SKM_C454e18042708530_0001.pdf

thepill · 2018-04-27T09:08:46Z

AttributeError: 'KeyError' object has no attribute 'message' also occurred yesterday in my test setup. If I can find a document without sensitive information to share, I will.

thepill · 2018-04-28T09:38:02Z

I also managed to create a simple PDF file which causes the KeyError: 'Orientation in degrees' error.

2018-04-22T171610_Scan_000250.pdf

But it is a single page.

thepill · 2018-04-28T10:13:13Z

Handling all exceptions on line 288 in src/paperless_tesseract/parsers.py:

    try:
        orientation = ocr.detect_orientation(f, lang=lang)
        f = f.rotate(orientation["angle"], expand=1)
    except:
        pass

works, but might not be an ideal solution (Python newbie :) ).
Should we just catch the KeyError in addition to the existing (TesseractError, OtherTesseractError)?
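A narrower handler than a bare `except:` could look like the sketch below. It is only an illustration: `FakeImage`, `failing_detect` and `rotate_if_possible` are hypothetical names standing in for the real image object and for `ocr.detect_orientation`, and the tuple of caught exceptions is an assumption. Note that the exception reaching Paperless in the traceback above is an `AttributeError`, not the original `KeyError`.

```python
class FakeImage:
    """Hypothetical stand-in for an image object; only tracks its rotation."""

    def __init__(self, angle=0):
        self.angle = angle

    def rotate(self, angle, expand=0):
        # Return a new object, as an image library's rotate() typically does.
        return FakeImage(self.angle + angle)


def failing_detect(image):
    # Mimics the failure from the traceback: the error that escapes
    # orientation detection is an AttributeError.
    raise AttributeError("'KeyError' object has no attribute 'message'")


def rotate_if_possible(image, detect, errors=(KeyError, AttributeError)):
    """Rotate the image when orientation is known, else return it unchanged."""
    try:
        orientation = detect(image)
    except errors:
        # Orientation could not be determined; keep the page as it is.
        return image
    return image.rotate(orientation["angle"], expand=1)


page = FakeImage()
unchanged = rotate_if_possible(page, failing_detect)
rotated = rotate_if_possible(page, lambda img: {"angle": 90})
```

Listing the exceptions explicitly means unrelated errors (for example a typo in the surrounding code) still surface instead of being silently swallowed.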

danielquinn · 2018-04-28T11:21:29Z

Ah ok I've found the problem. It looks like a bug in pyocr. Basically an exception is being triggered because orientation can't be found (totally normal) and that exception is being caught, but the handling of that exception is breaking on newer versions of Python that don't have a .message attribute on a KeyError instance.

I've just patched Paperless to also account for an AttributeError which should fix the problem for you all. Please give it a try and let me know if this is still happening and we can tinker with it from there.
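The mechanism described above can be reproduced in a few lines. This is a minimal sketch, not pyocr's actual code: both functions are hypothetical stand-ins that copy only the dictionary lookup shown in the traceback, and `ValueError` is used as a placeholder for whatever error the library raises.

```python
def detect_orientation_like(output):
    """Hypothetical handler that reads ex.message, as in the traceback."""
    try:
        return int(output.get("Rotate", output["Orientation in degrees"]))
    except KeyError as ex:
        # Python 2 exceptions had a .message attribute; Python 3 ones do not,
        # so this line raises AttributeError while handling the KeyError.
        raise ValueError("no orientation: %s" % ex.message)


def detect_orientation_portable(output):
    """Same lookup, but formats the exception itself instead of .message."""
    try:
        return int(output.get("Rotate", output["Orientation in degrees"]))
    except KeyError as ex:
        # Formatting the exception object works without .message.
        raise ValueError("no orientation: %s" % ex)


try:
    detect_orientation_like({})
except AttributeError as err:
    caught = err  # the KeyError is preserved as caught.__context__
```

On Python 3 the first function never reaches its intended `raise`; the caller sees an `AttributeError` whose `__context__` is the original `KeyError`, which matches the "During handling of the above exception, another exception occurred" section of the log.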

thepill · 2018-04-28T12:05:14Z

@danielquinn works for me 👍

danielquinn · 2018-04-28T12:11:55Z

Hooray!

ddddavidmartin · 2018-04-29T06:08:10Z

Thanks for looking into this @danielquinn! Paperless works fine again now. I also very much appreciate your pull request in pyocr rather than just working around the issue in Paperless.

danielquinn · 2018-04-30T09:08:32Z

PyOCR has been patched, and when they do a release, I'll update Paperless to require that version.

GarethFox · 2018-04-30T11:28:41Z

Yeehaa!
Thanks @danielquinn!
I can't personally test on my setup this week.
I will try to pull the patch next week and let you know what the outcome is on my side ;)

GarethFox · 2018-05-07T11:22:18Z

It works perfectly!
Thanks a lot!

danielquinn added a commit that referenced this issue Apr 28, 2018

Account for KeyError problem in #345

c983e73

danielquinn added a commit that referenced this issue Apr 28, 2018

Account for KeyError problem in #345

82f9dde

danielquinn closed this as completed Apr 28, 2018

danielquinn mentioned this issue Apr 28, 2018

Consumer shuts down when malformated filename are present #341

Closed
