This repository has been archived by the owner on Feb 19, 2021. It is now read-only.

Trouble processing multi-page documents #345

Closed
GarethFox opened this issue Apr 26, 2018 · 21 comments

Comments

@GarethFox

GarethFox commented Apr 26, 2018

Hi everyone,

First of all, kudos for the project, it is really interesting.
I am giving it a try right now.
I am running it in a Docker environment on a Raspberry Pi.
All was going fine so far: my first single-page PDFs were processed correctly.
Then I dropped two 2-page recto-verso PDFs (so 4 sheets per doc) into the consumer folder, plus a new single-page document for which I created a specific 'correspondent' rule in the web app to see how the feature works.
In the end I got an infinite loop over the first 2-page PDF:

  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0001.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational

Then I deleted it from the folder and the consumer started to process the second 2-page pdf:

April 26, 2018, 10 a.m. | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0002.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611000_0002.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational
  | Starting document consumer at /consume | Informational

It was heading the same way, so I manually deleted that one from the consumer folder too.
Then it processed the single-page PDF successfully:

April 26, 2018, 10:05 a.m. | Document 20180329000000: SKM_C454e18042611340 consumption finished | Informational
  | Completed | Informational
  | Detected document date 2018-03-29T00:00:00+00:00 based on string 29/03/2018 | Informational
  | Parsing for fra | Informational
  | Parsing for eng | Informational
  | OCRing the document | Informational
  | Consuming /consume/SKM_C454e18042611340.pdf | Informational
  | Parsers available: RasterisedDocumentParser | Informational

I did not get an automatic correspondent match, because the literal expression I specified did not appear in the final OCR output, but that is another topic.

My question is: is Paperless able to process multi-page PDFs, and if so, what could cause the loops I experienced with this type of document?

Many thanks,

@GarethFox
Author

Hi again!

Here is an update after several tries.
It seems to be caused by blank pages inside documents, which is a really common case in the usage I am planning for Paperless.
To be sure, I tried a PDF containing 4 grouped single pages, each with a blank verso.
The result was a loop in the log for that document.
After deleting it from the consumer folder, I recreated the PDF without the blank pages.
This time the document was fully processed.
Finally, I split a filled recto and a blank verso into 2 distinct PDFs.
Once they were dropped into the consumer folder, it started looping over the second page (the blank one); I deleted it, and the recto containing the information was then processed correctly.

What could be done to keep the app from getting stuck when it encounters a blank page?

Many thanks,

@ovv
Contributor

ovv commented Apr 26, 2018

Paperless correctly handles blank pages on my installation, so this is definitely a bug.

There seems to be no error in the log. Could you increase the level to DEBUG? It might be a bug in the underlying OCR library.

@GarethFox
Author

Sure,

I would be glad to provide a more detailed debug report on the issue.
I am using a Raspberry Pi with the docker-compose setup.
Which commands should I run, or which log file should I look at?

Thx,

@GarethFox
Author

GarethFox commented Apr 26, 2018

Here is my docker-compose.yml

version: '2.1'

services:
    webserver:
        build: ./
        # uncomment the following line to start automatically on system boot
        restart: unless-stopped
        ports:
            # You can adapt the port you want Paperless to listen on by
            # modifying the part before the `:`.
            - "8000:8000"
        healthcheck:
            test: ["CMD", "curl" , "-f", "http://localhost:8000"]
            interval: 30s
            timeout: 10s
            retries: 5
        volumes:
            - data:/usr/src/paperless/data
            - media:/usr/src/paperless/media
        env_file: docker-compose.env
        # The reason the line is here is so that the webserver that doesn't do
        # any text recognition and doesn't have to install unnecessary
        # languages the user might have set in the env-file by overwriting the
        # value with nothing.
        environment:
            - PAPERLESS_OCR_LANGUAGES=
        command: ["runserver", "--insecure", "--noreload", "0.0.0.0:8000"]

    consumer:
        build: ./
        # uncomment the following line to start automatically on system boot
        restart: unless-stopped
        depends_on:
            webserver:
                condition: service_healthy
        volumes:
            - data:/usr/src/paperless/data
            - media:/usr/src/paperless/media
            # You have to adapt the local path you want the consumption
            # directory to mount to by modifying the part before the ':'.
            - ./consume:/consume
            # Likewise, you can add a local path to mount a directory for
            # exporting. This is not strictly needed for paperless to
            # function, only if you're exporting your files: uncomment
            # it and fill in a local path if you know you're going to
            # want to export your documents.
            # - /path/to/another/arbitrary/place:/export
        env_file: docker-compose.env
        command: ["document_consumer"]

volumes:
    data:
    media:

and docker-compose.env file:

# Environment variables to set for Paperless
# Commented out variables will be replaced by a default within Paperless.

# Passphrase Paperless uses to encrypt and decrypt your documents
PAPERLESS_PASSPHRASE=123456

# The amount of threads to use for text recognition
# PAPERLESS_OCR_THREADS=4

# Additional languages to install for text recognition
PAPERLESS_OCR_LANGUAGES=deu eng fra

# You can change the default user and group id to a custom one
# USERMAP_UID=1000
# USERMAP_GID=1000

@ovv
Contributor

ovv commented Apr 26, 2018

Are all Raspberry Pis ARM? It seems issue #337 could be related.

I'm not sure where or how the logging level is set up. Putting DJANGO_LOG_LEVEL=DEBUG in docker-compose.env might do it.

@GarethFox
Author

GarethFox commented Apr 26, 2018

Okay,

I also use an RPi 3 (ARMv7) to test Paperless. Maybe it would have been better to use a server, but it was really easier to just give it a try this way, as I saw it was doable using docker-compose. And yes, all RPis are ARM-based.
By the way, apart from that issue, everything is running fine for me so far ^^, great job!

Some more info, sudo docker version output:

Client:
 Version:       18.04.0-ce
 API version:   1.37
 Go version:    go1.9.4
 Git commit:    3d479c0
 Built:         Tue Apr 10 18:25:24 2018
 OS/Arch:       linux/arm
 Experimental:  false
 Orchestrator:  swarm

Server:
 Engine:
  Version:      18.04.0-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   3d479c0
  Built:        Tue Apr 10 18:21:25 2018
  OS/Arch:      linux/arm
  Experimental: false

Then, sudo docker info output:

Live Restore Enabled: false
WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: No oom kill disable support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support

Finally, sudo docker-compose version output:

docker-compose version 1.21.0, build 5920eb0
docker-py version: 3.3.0
CPython version: 3.5.3
OpenSSL version: OpenSSL 1.1.0f  25 May 2017

@GarethFox
Author

I will restart the docker-compose stack with this environment variable added.
Where can I access the debug log so I can give you the output once that is done?

@GarethFox
Author

GarethFox commented Apr 26, 2018

The .env file is updated and the stack has been restarted.
I can't see more detailed information in the web app under documents/logs, but I found these lines related to the document currently looping by running sudo docker-compose logs:

...
adapted below
...

@ovv
Contributor

ovv commented Apr 26, 2018

I removed the webserver_1 logs to have a better look at consumer_1.

consumer_1   | Operations to perform:
consumer_1   |   Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
consumer_1   | Running migrations:
consumer_1   |   No migrations to apply.
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/armhf/APKINDEX.tar.gz
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/armhf/APKINDEX.tar.gz
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/armhf/APKINDEX.tar.gz
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/armhf/APKINDEX.tar.gz
consumer_1   | (1/1) Installing tesseract-ocr-data-deu (3.05.01-r2)
consumer_1   | OK: 263 MiB in 122 packages
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/armhf/APKINDEX.tar.gz
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/armhf/APKINDEX.tar.gz
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/armhf/APKINDEX.tar.gz
consumer_1   | fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/armhf/APKINDEX.tar.gz
consumer_1   | (1/1) Installing tesseract-ocr-data-fra (3.05.01-r2)
consumer_1   | OK: 299 MiB in 123 packages
consumer_1   | Starting document consumer at /consume
consumer_1   | Parsers available: RasterisedDocumentParser
consumer_1   | Consuming /consume/SKM_C454e18042615130_0001.pdf
consumer_1   | convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `/tmp/paperless/paperless-fru4z33m/convert-%04d.png' @ warning/png.c/MagickPNGWarningHandler/1744.
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0001.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0001.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0004.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0004.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0000.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0000.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0005.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0005.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74baa490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74b92490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74bdf490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74bd1490] Stream #0: not enough frames to estimate rate; consider increasing probesize                                                                                                                                                                                                            
consumer_1   | [image2 @ 0x749909d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.                                                                                                                                                                                     
consumer_1   | [image2 @ 0x749909d0] Encoder did not produce proper pts, making some up.                                                                                                                                                                                                                                       
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0003.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0003.unpaper.pnm                                                                                                                                                                           
consumer_1   | [pgm_pipe @ 0x74bc3490] Stream #0: not enough frames to estimate rate; consider increasing probesize                                                                                                                                                                                                            
consumer_1   | [image2 @ 0x749dd9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.                                                                                                                                                                                     
consumer_1   | [image2 @ 0x749dd9d0] Encoder did not produce proper pts, making some up.                                                                                                                                                                                                                                       
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0002.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0002.unpaper.pnm                                                                                                                                                                           
consumer_1   | [pgm_pipe @ 0x74b84490] Stream #0: not enough frames to estimate rate; consider increasing probesize                                                                                                                                                                                                            
consumer_1   | [image2 @ 0x749a89d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.                                                                                                                                                                                     
consumer_1   | [image2 @ 0x749a89d0] Encoder did not produce proper pts, making some up.                                                                                                                                                                                                                                       
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0006.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0006.unpaper.pnm                                                                                                                                                                           
consumer_1   | [pgm_pipe @ 0x74b80490] Stream #0: not enough frames to estimate rate; consider increasing probesize                                                                                                                                                                                                            
consumer_1   | [image2 @ 0x749cf9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.                                                                                                                                                                                     
consumer_1   | [image2 @ 0x749cf9d0] Encoder did not produce proper pts, making some up.                                                                                                                                                                                                                                       
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-fru4z33m/convert-0007.pnm -> /tmp/paperless/paperless-fru4z33m/convert-0007.unpaper.pnm                                                                                                                                                                           
consumer_1   | [pgm_pipe @ 0x74b7c490] Stream #0: not enough frames to estimate rate; consider increasing probesize                                                                                                                                                                                                            
consumer_1   | [image2 @ 0x749c19d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.                                                                                                                                                                                     
consumer_1   | [image2 @ 0x749c19d0] Encoder did not produce proper pts, making some up.                                                                                                                                                                                                                                       
consumer_1   | [image2 @ 0x7497e9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.                                                                                                                                                                                     
consumer_1   | [image2 @ 0x7497e9d0] Encoder did not produce proper pts, making some up.                                                                                                                                                                                                                                       
consumer_1   | [image2 @ 0x749829d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749829d0] Encoder did not produce proper pts, making some up.
consumer_1   | [image2 @ 0x7497a9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x7497a9d0] Encoder did not produce proper pts, making some up.
consumer_1   | OCRing the document
consumer_1   | Parsing for eng
consumer_1   | Parsing for deu
consumer_1   | multiprocessing.pool.RemoteTraceback:
consumer_1   | """
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 214, in detect_orientation
consumer_1   |     angle = int(output.get('Rotate', output['Orientation in degrees']))
consumer_1   | KeyError: 'Orientation in degrees'
consumer_1   |
consumer_1   | During handling of the above exception, another exception occurred:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
consumer_1   |     result = (True, func(*args, **kwds))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
consumer_1   |     return list(map(*args))
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in image_to_string
consumer_1   |     orientation = ocr.detect_orientation(f, lang=lang)
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 224, in detect_orientation
consumer_1   |     % (ex.message, original_output))
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | """
consumer_1   |
consumer_1   | The above exception was the direct cause of the following exception:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/src/paperless/src/manage.py", line 18, in <module>
consumer_1   |     execute_from_command_line(sys.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
consumer_1   |     utility.execute()
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
consumer_1   |     self.fetch_command(subcommand).run_from_argv(self.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
consumer_1   |     self.execute(*args, **cmd_options)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
consumer_1   |     output = self.handle(*args, **options)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 85, in handle
consumer_1   |     self.loop(mail_delta=mail_delta)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 105, in loop
consumer_1   |     self.file_consumer.run()
consumer_1   |   File "/usr/src/paperless/src/documents/consumer.py", line 123, in run
consumer_1   |     date = parsed_document.get_date()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 209, in get_date
consumer_1   |     text = self.get_text()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 80, in get_text
consumer_1   |     self._text = self._get_ocr(images)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 161, in _get_ocr
consumer_1   |     return self._ocr(imgs, ISO639[guessed_language])
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 189, in _ocr
consumer_1   |     r = pool.map(image_to_string, itertools.product(imgs, [lang]))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
consumer_1   |     return self._map_async(func, iterable, mapstar, chunksize).get()
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
consumer_1   |     raise self._value
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | Operations to perform:
consumer_1   |   Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
consumer_1   | Running migrations:
consumer_1   |   No migrations to apply.
consumer_1   | Starting document consumer at /consume
consumer_1   | Parsers available: RasterisedDocumentParser
consumer_1   | Consuming /consume/SKM_C454e18042615130_0001.pdf
consumer_1   | convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `/tmp/paperless/paperless-f80ru5f_/convert-%04d.png' @ warning/png.c/MagickPNGWarningHandler/1744.

So it looks like the consumer is crashing and restarting each time. Someone will have to look deeper into that error. Would you be willing to share that file (or another one that produces the same result) so we can verify the fix?
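
The inner traceback above shows two separate problems: the key lookup fails, and then the error handler itself fails because Python 3 exceptions have no `message` attribute. A minimal sketch of that pattern and one possible correction follows. `angle_buggy` and `angle_fixed` are hypothetical names used for illustration, not pyocr functions; only the lookup expression and the `ex.message` access are taken from the traceback.

```python
def angle_buggy(output):
    """Mirror the failing pattern from the traceback (hypothetical helper)."""
    try:
        # The default argument of .get() is evaluated eagerly, so a missing
        # 'Orientation in degrees' key raises KeyError even if 'Rotate' exists.
        return int(output.get("Rotate", output["Orientation in degrees"]))
    except KeyError as ex:
        # Python 3 exceptions have no .message attribute, so this line raises
        # AttributeError and hides the original KeyError.
        raise RuntimeError("No orientation info: %s" % ex.message)


def angle_fixed(output):
    """One possible correction (an assumption, not the upstream fix)."""
    try:
        return int(output.get("Rotate", output["Orientation in degrees"]))
    except KeyError as ex:
        # str(ex) needs no .message attribute; `from ex` keeps the
        # original KeyError attached to the new error.
        raise RuntimeError("No orientation info: %s" % str(ex)) from ex


if __name__ == "__main__":
    empty_osd = {}  # stand-in for OSD output without any orientation data
    for fn in (angle_buggy, angle_fixed):
        try:
            fn(empty_osd)
        except Exception as exc:
            print(fn.__name__, "->", type(exc).__name__)
```

With the handler repaired, the failure would at least surface as a readable error about the missing orientation data rather than as an `AttributeError` about the exception object.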

@GarethFox
Author

Basically, it occurred with every document containing a blank page that I fed to the consumer.
As those were sensitive invoice documents, I will try again tomorrow with a neutral source document.
No problem sharing that kind of document, of course ^^
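
One conceivable workaround on the Paperless side, until the underlying error is fixed, would be to treat a failed orientation detection as "no rotation" instead of letting the exception end the consumer process. The sketch below is only that: `safe_orientation` and the fallback value are hypothetical, the detector is passed in as a plain callable so the example runs without pyocr, and the `{"angle": ..., "confidence": ...}` result shape is an assumption.

```python
import logging

logger = logging.getLogger("orientation_sketch")

# Assumed result shape; a page that cannot be analysed is treated as upright.
NO_ROTATION = {"angle": 0, "confidence": 0.0}


def safe_orientation(detect, image, lang):
    """Return detect(image, lang=lang), or a copy of NO_ROTATION on failure."""
    try:
        return detect(image, lang=lang)
    except Exception as exc:  # broad on purpose: a blank page must not be fatal
        logger.warning("Orientation detection failed (%s); assuming 0 degrees", exc)
        return dict(NO_ROTATION)


def failing_detector(image, lang=None):
    """Stand-in that fails the way the traceback above does."""
    raise AttributeError("'KeyError' object has no attribute 'message'")


def working_detector(image, lang=None):
    """Stand-in for a page where detection succeeds."""
    return {"angle": 90, "confidence": 7.5}


if __name__ == "__main__":
    print(safe_orientation(failing_detector, "blank-page.png", "eng"))
    print(safe_orientation(working_detector, "text-page.png", "eng"))
```

Whether silently assuming an upright page is acceptable depends on the scans; logging the failure keeps it visible for later inspection.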

@GarethFox
Author

Hello fellows,

Here is the log for a new document I tried to process with Paperless (I removed the webserver parts):

consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
consumer_1   |     return self._map_async(func, iterable, mapstar, chunksize).get()
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
consumer_1   |     raise self._value
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | Operations to perform:
consumer_1   |   Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
consumer_1   | Running migrations:
consumer_1   |   No migrations to apply.
consumer_1   | Starting document consumer at /consume
consumer_1   | Parsers available: RasterisedDocumentParser
consumer_1   | Consuming /consume/SKM_C454e18042708530_0001.pdf
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-pikj79ce/convert-0000.pnm -> /tmp/paperless/paperless-pikj79ce/convert-0000.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-pikj79ce/convert-0001.pnm -> /tmp/paperless/paperless-pikj79ce/convert-0001.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74bb2490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74b25490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [image2 @ 0x749b09d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749b09d0] Encoder did not produce proper pts, making some up.
consumer_1   | [image2 @ 0x749239d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749239d0] Encoder did not produce proper pts, making some up.
consumer_1   | OCRing the document
consumer_1   | Parsing for eng
consumer_1   | multiprocessing.pool.RemoteTraceback:
consumer_1   | """
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 214, in detect_orientation
consumer_1   |     angle = int(output.get('Rotate', output['Orientation in degrees']))
consumer_1   | KeyError: 'Orientation in degrees'
consumer_1   |
consumer_1   | During handling of the above exception, another exception occurred:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
consumer_1   |     result = (True, func(*args, **kwds))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
consumer_1   |     return list(map(*args))
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in image_to_string
consumer_1   |     orientation = ocr.detect_orientation(f, lang=lang)
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 224, in detect_orientation
consumer_1   |     % (ex.message, original_output))
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | """
consumer_1   |
consumer_1   | The above exception was the direct cause of the following exception:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/src/paperless/src/manage.py", line 18, in <module>
consumer_1   |     execute_from_command_line(sys.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
consumer_1   |     utility.execute()
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
consumer_1   |     self.fetch_command(subcommand).run_from_argv(self.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
consumer_1   |     self.execute(*args, **cmd_options)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
consumer_1   |     output = self.handle(*args, **options)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 85, in handle
consumer_1   |     self.loop(mail_delta=mail_delta)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 105, in loop
consumer_1   |     self.file_consumer.run()
consumer_1   |   File "/usr/src/paperless/src/documents/consumer.py", line 123, in run
consumer_1   |     date = parsed_document.get_date()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 209, in get_date
consumer_1   |     text = self.get_text()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 80, in get_text
consumer_1   |     self._text = self._get_ocr(images)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 140, in _get_ocr
consumer_1   |     raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 189, in _ocr
consumer_1   |     r = pool.map(image_to_string, itertools.product(imgs, [lang]))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
consumer_1   |     return self._map_async(func, iterable, mapstar, chunksize).get()
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
consumer_1   |     raise self._value
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | Operations to perform:
consumer_1   |   Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
consumer_1   | Running migrations:
consumer_1   |   No migrations to apply.
consumer_1   | Starting document consumer at /consume
consumer_1   | Parsers available: RasterisedDocumentParser
consumer_1   | Consuming /consume/SKM_C454e18042708530_0001.pdf
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-5cnilohm/convert-0001.pnm -> /tmp/paperless/paperless-5cnilohm/convert-0001.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-5cnilohm/convert-0000.pnm -> /tmp/paperless/paperless-5cnilohm/convert-0000.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74b95490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74b40490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [image2 @ 0x7493e9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x7493e9d0] Encoder did not produce proper pts, making some up.
consumer_1   | [image2 @ 0x749939d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749939d0] Encoder did not produce proper pts, making some up.
consumer_1   | OCRing the document
consumer_1   | Parsing for eng
consumer_1   | multiprocessing.pool.RemoteTraceback:
consumer_1   | """
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 214, in detect_orientation
consumer_1   |     angle = int(output.get('Rotate', output['Orientation in degrees']))
consumer_1   | KeyError: 'Orientation in degrees'
consumer_1   |
consumer_1   | During handling of the above exception, another exception occurred:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
consumer_1   |     result = (True, func(*args, **kwds))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
consumer_1   |     return list(map(*args))
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in image_to_string
consumer_1   |     orientation = ocr.detect_orientation(f, lang=lang)
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 224, in detect_orientation
consumer_1   |     % (ex.message, original_output))
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | """
consumer_1   |
consumer_1   | The above exception was the direct cause of the following exception:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/src/paperless/src/manage.py", line 18, in <module>
consumer_1   |     execute_from_command_line(sys.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
consumer_1   |     utility.execute()
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
consumer_1   |     self.fetch_command(subcommand).run_from_argv(self.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
consumer_1   |     self.execute(*args, **cmd_options)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
consumer_1   |     output = self.handle(*args, **options)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 85, in handle
consumer_1   |     self.loop(mail_delta=mail_delta)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 105, in loop
consumer_1   |     self.file_consumer.run()
consumer_1   |   File "/usr/src/paperless/src/documents/consumer.py", line 123, in run
consumer_1   |     date = parsed_document.get_date()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 209, in get_date
consumer_1   |     text = self.get_text()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 80, in get_text
consumer_1   |     self._text = self._get_ocr(images)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 140, in _get_ocr
consumer_1   |     raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 189, in _ocr
consumer_1   |     r = pool.map(image_to_string, itertools.product(imgs, [lang]))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
consumer_1   |     return self._map_async(func, iterable, mapstar, chunksize).get()
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
consumer_1   |     raise self._value
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | Operations to perform:
consumer_1   |   Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
consumer_1   | Running migrations:
consumer_1   |   No migrations to apply.
consumer_1   | Starting document consumer at /consume
consumer_1   | Parsers available: RasterisedDocumentParser
consumer_1   | Consuming /consume/SKM_C454e18042708530_0001.pdf
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-9b53x_g5/convert-0000.pnm -> /tmp/paperless/paperless-9b53x_g5/convert-0000.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-9b53x_g5/convert-0001.pnm -> /tmp/paperless/paperless-9b53x_g5/convert-0001.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74b60490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74bee490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [image2 @ 0x7495e9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x7495e9d0] Encoder did not produce proper pts, making some up.
consumer_1   | [image2 @ 0x749ec9d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749ec9d0] Encoder did not produce proper pts, making some up.
consumer_1   | OCRing the document
consumer_1   | Parsing for eng
consumer_1   | multiprocessing.pool.RemoteTraceback:
consumer_1   | """
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 214, in detect_orientation
consumer_1   |     angle = int(output.get('Rotate', output['Orientation in degrees']))
consumer_1   | KeyError: 'Orientation in degrees'
consumer_1   |
consumer_1   | During handling of the above exception, another exception occurred:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
consumer_1   |     result = (True, func(*args, **kwds))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
consumer_1   |     return list(map(*args))
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in image_to_string
consumer_1   |     orientation = ocr.detect_orientation(f, lang=lang)
consumer_1   |   File "/usr/lib/python3.6/site-packages/pyocr/tesseract.py", line 224, in detect_orientation
consumer_1   |     % (ex.message, original_output))
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | """
consumer_1   |
consumer_1   | The above exception was the direct cause of the following exception:
consumer_1   |
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/src/paperless/src/manage.py", line 18, in <module>
consumer_1   |     execute_from_command_line(sys.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
consumer_1   |     utility.execute()
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
consumer_1   |     self.fetch_command(subcommand).run_from_argv(self.argv)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
consumer_1   |     self.execute(*args, **cmd_options)
consumer_1   |   File "/usr/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
consumer_1   |     output = self.handle(*args, **options)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 85, in handle
consumer_1   |     self.loop(mail_delta=mail_delta)
consumer_1   |   File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 105, in loop
consumer_1   |     self.file_consumer.run()
consumer_1   |   File "/usr/src/paperless/src/documents/consumer.py", line 123, in run
consumer_1   |     date = parsed_document.get_date()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 209, in get_date
consumer_1   |     text = self.get_text()
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 80, in get_text
consumer_1   |     self._text = self._get_ocr(images)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 140, in _get_ocr
consumer_1   |     raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
consumer_1   |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 189, in _ocr
consumer_1   |     r = pool.map(image_to_string, itertools.product(imgs, [lang]))
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
consumer_1   |     return self._map_async(func, iterable, mapstar, chunksize).get()
consumer_1   |   File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
consumer_1   |     raise self._value
consumer_1   | AttributeError: 'KeyError' object has no attribute 'message'
consumer_1   | Operations to perform:
consumer_1   |   Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
consumer_1   | Running migrations:
consumer_1   |   No migrations to apply.
consumer_1   | Starting document consumer at /consume
consumer_1   | Parsers available: RasterisedDocumentParser
consumer_1   | Consuming /consume/SKM_C454e18042708530_0001.pdf
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-6ikf7uep/convert-0001.pnm -> /tmp/paperless/paperless-6ikf7uep/convert-0001.unpaper.pnm
consumer_1   | Processing sheet #1: /tmp/paperless/paperless-6ikf7uep/convert-0000.pnm -> /tmp/paperless/paperless-6ikf7uep/convert-0000.unpaper.pnm
consumer_1   | [pgm_pipe @ 0x74b25490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [pgm_pipe @ 0x74b7c490] Stream #0: not enough frames to estimate rate; consider increasing probesize
consumer_1   | [image2 @ 0x749239d0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
consumer_1   | [image2 @ 0x749239d0] Encoder did not produce proper pts, making some up.

Please find attached the aforementioned PDF; I hope it will help in understanding why this is occurring ^^

Thanks a lot,

SKM_C454e18042708530_0001.pdf

@thepill

thepill commented Apr 27, 2018

AttributeError: 'KeyError' object has no attribute 'message' also occurred yesterday in my test setup. If I can share a document without sensitive information, I will.

@thepill

thepill commented Apr 28, 2018

I also managed to create a simple PDF file which causes the KeyError: 'Orientation in degrees' error.

2018-04-22T171610_Scan_000250.pdf

But it is a single page.

@thepill

thepill commented Apr 28, 2018

Handling all exceptions on line 288 in src/paperless_tesseract/parsers.py

    try:
        orientation = ocr.detect_orientation(f, lang=lang)
        f = f.rotate(orientation["angle"], expand=1)
    except Exception:
        # Catch-all: orientation detection failed, so keep the image as it is.
        pass

works, but might not be an ideal solution (Python newbie :) ).
Should we just accept the KeyError in addition to the given (TesseractError, OtherTesseractError)?
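
One way to approach the question above without a catch-all is to name the recoverable exception types explicitly. The sketch below is only an illustration under stated assumptions: `TesseractError` is a local stand-in class rather than the real pyocr one, and `FakeImage`, `RECOVERABLE` and `rotate_if_possible` are hypothetical names that do not exist in Paperless.

```python
class TesseractError(Exception):
    """Local stand-in for pyocr's error class (assumption, illustration only)."""


class FakeImage:
    """Hypothetical minimal image that only tracks its accumulated rotation."""

    def __init__(self, angle=0):
        self.angle = angle

    def rotate(self, angle, expand=0):
        # Return a new image, as an image library's rotate() typically would.
        return FakeImage((self.angle + angle) % 360)


# Exceptions treated as "orientation unknown"; anything else still propagates.
RECOVERABLE = (TesseractError, KeyError, AttributeError)


def rotate_if_possible(image, detect_orientation):
    """Rotate image by the detected angle, or return it unchanged on failure."""
    try:
        orientation = detect_orientation(image)
    except RECOVERABLE:
        return image
    return image.rotate(orientation["angle"], expand=1)
```

Listing the types keeps unrelated bugs visible: an exception outside `RECOVERABLE` still stops the consumer instead of being silently swallowed.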

@danielquinn
Collaborator

Ah ok I've found the problem. It looks like a bug in pyocr. Basically an exception is being triggered because orientation can't be found (totally normal) and that exception is being caught, but the handling of that exception is breaking on newer versions of Python that don't have a .message attribute on a KeyError instance.

I've just patched Paperless to also account for an AttributeError which should fix the problem for you all. Please give it a try and let me know if this is still happening and we can tinker with it from there.
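
The masking described above can be reproduced without pyocr. In the sketch below, `broken_handler` and `fixed_handler` are hypothetical stand-ins that only imitate the pattern visible in the traceback (a `KeyError` handler that reads `ex.message`). They are not pyocr's actual code, and `ValueError` is an arbitrary choice for the error being reported.

```python
def broken_handler(output):
    """Imitates the failing pattern: the except block itself raises."""
    try:
        # The default argument is evaluated eagerly, so a missing
        # 'Orientation in degrees' key raises KeyError even before .get() runs.
        return int(output.get("Rotate", output["Orientation in degrees"]))
    except KeyError as ex:
        # Python 3 exceptions have no .message attribute, so this line raises
        # AttributeError and the intended error is never reported.
        raise ValueError("no orientation found: %s" % ex.message)


def fixed_handler(output):
    """Same logic, but formats the exception in a Python 3 compatible way."""
    try:
        return int(output.get("Rotate", output["Orientation in degrees"]))
    except KeyError as ex:
        raise ValueError("no orientation found: %s" % ex) from ex


if __name__ == "__main__":
    for handler in (broken_handler, fixed_handler):
        try:
            handler({})
        except Exception as err:
            print(handler.__name__, "raised", type(err).__name__)
```

A caller that only expects the reported error type will not catch the `AttributeError` from the broken variant, which is consistent with the consumer crashing in the logs earlier in this thread.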

@thepill

thepill commented Apr 28, 2018

@danielquinn works for me 👍

@danielquinn
Collaborator

Hooray!

@ddddavidmartin
Contributor

Thanks for looking into this @danielquinn! Paperless works fine again now. I also very much appreciate your pull request in pyocr rather than just working around the issue in Paperless.

@danielquinn
Collaborator

PyOCR has been patched, and when they do a release, I'll update Paperless to require that version.

@GarethFox
Author

Yeehaa!
Thanks @danielquinn !
I can't personally test on my setup this week.
I will try to pull the patch next week and let you know what the outcome is on my side ;)

@GarethFox
Author

It works perfectly!
Thanks a lot!
