This repository has been archived by the owner on Feb 19, 2021. It is now read-only.

Tag on whole words #48

Closed
ghost opened this issue Feb 18, 2016 · 11 comments

Comments

@ghost

ghost commented Feb 18, 2016

Hi Daniel,

I want to tag documents by year but I have a bank account with 2002 in the number. You can guess what happens ;-)

I patched Tag.matches to match only on whole words using this:

text = re.sub(r'\W+', ' ', text.lower()).split()

I wonder if it would be useful to add an option for whole word matching.

Cheers,
Jason.
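
A minimal sketch of the whole-word idea in the comment above, wrapped around the same one-line tokenisation. The function name `matches_whole_word` is hypothetical and is not the actual `Tag.matches` implementation; it only illustrates the technique.

```python
import re


def matches_whole_word(needle, text):
    """Return True if `needle` appears in `text` as a complete word.

    Hypothetical helper, not the project's Tag.matches method.
    """
    # Collapse every run of non-word characters into a single space, then
    # split into tokens (the same expression as the one-line patch).
    words = re.sub(r'\W+', ' ', text.lower()).split()
    return needle.lower() in words


print(matches_whole_word("2002", "Statement for account 1234200256"))  # False
print(matches_whole_word("2002", "Tax return 2002, final copy"))       # True
```

One caveat with this approach: an account number written with separators, such as 12-2002-34, is split into separate tokens and would still match.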

@danielquinn
Collaborator

Heh. Yeah, that's a good point. I've just been using the Regex matcher, but I think it's probably best to make all the matching algorithms smart enough to match on word boundaries. I'll try to do this in the next couple of days.
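
For illustration, matching on word boundaries can be sketched with the `\b` anchor from Python's `re` module. The helper name `matches_on_boundary` is an assumption made for this example and does not describe what was later committed to master.

```python
import re


def matches_on_boundary(needle, text):
    """Return True if `needle` occurs in `text` delimited by word boundaries.

    Hypothetical helper; the name and signature are assumptions.
    """
    # re.escape keeps any regex metacharacters in the tag text literal;
    # \b requires a word/non-word transition on both sides of the match.
    pattern = r'\b' + re.escape(needle) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(matches_on_boundary("2002", "IBAN DE89370400440532002000"))  # False
print(matches_on_boundary("2002", "Invoice dated 2002-03-01"))     # True
```

Unlike splitting the text into tokens, this also handles multi-word tags such as "bank statement". It assumes the tag text starts and ends with a word character, since `\b` behaves differently next to punctuation.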

@danielquinn
Collaborator

Hey there, I patched master with some better logic on the .matches() function. When you have a minute, please give it a shot with the document you had the problem with, and if I don't hear from you in a few days, I'll consider this fixed :-)

@ghost
Author

ghost commented Feb 19, 2016

Hi Daniel, I pulled the latest version and now the consumer dies with this:

consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |
consumer_1 |    **** This file had errors that were repaired or ignored.
consumer_1 |    **** The file was produced by:
consumer_1 |    **** >>>> Mac OS X 10.8.2 Quartz PDFContext <<<<
consumer_1 |    **** Please notify the author of the software that produced this
consumer_1 |    **** file that it does not conform to Adobe's published PDF
consumer_1 |    **** specification.
consumer_1 |
consumer_1 | multiprocessing.pool.RemoteTraceback:
consumer_1 | """
consumer_1 | multiprocessing.pool.RemoteTraceback:
consumer_1 | """
consumer_1 | Traceback (most recent call last):
consumer_1 |   File "/usr/local/lib/python3.5/site-packages/pyocr/tesseract.py", line 171, in detect_orientation
consumer_1 |     angle = int(output['Orientation in degrees'])
consumer_1 | KeyError: 'Orientation in degrees'
consumer_1 |
consumer_1 | During handling of the above exception, another exception occurred:
consumer_1 |
consumer_1 | Traceback (most recent call last):
consumer_1 |   File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 119, in worker
consumer_1 |     result = (True, func(*args, **kwds))
consumer_1 |   File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
consumer_1 |     return list(map(*args))
consumer_1 |   File "/usr/src/paperless/src/documents/consumer.py", line 32, in image_to_string
consumer_1 |     orientation = self.OCR.detect_orientation(f, lang=lang)
consumer_1 |   File "/usr/local/lib/python3.5/site-packages/pyocr/tesseract.py", line 180, in detect_orientation
consumer_1 |     % original_output)
consumer_1 | pyocr.tesseract.TesseractError: (-1, 'No script found in image (Too few characters. Skipping this page)')
consumer_1 | """
consumer_1 |
consumer_1 | The above exception was the direct cause of the following exception:
consumer_1 |
consumer_1 | Traceback (most recent call last):
consumer_1 |   File "/usr/src/paperless/src/documents/consumer.py", line 196, in _get_ocr
consumer_1 |     return self._ocr(pngs, ISO639[guessed_language])
consumer_1 |   File "/usr/src/paperless/src/documents/consumer.py", line 231, in _ocr
consumer_1 |     image_to_string, itertools.product([self], pngs, [lang]))
consumer_1 |   File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 260, in map
consumer_1 |     return self._map_async(func, iterable, mapstar, chunksize).get()
consumer_1 |   File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 608, in get
consumer_1 |     raise self._value
consumer_1 | pyocr.tesseract.TesseractError: (-1, 'No script found in image (Too few characters. Skipping this page)')

It looks like an error in PyOCR. I'll file an issue there, but I guess it might be worth wrapping the call in a try/except.

Cheers,
Jason.

@ghost
Author

ghost commented Feb 19, 2016

Filed as openpaperwork/pyocr#33

@pitkley
Member

pitkley commented Feb 19, 2016

This is not an issue with PyOCR as far as I can tell, but actually expected behaviour on their side. If a document page doesn't have enough data to identify the orientation of the page, it throws an error.

The fault is actually on our part, since we should catch and -- in this case -- ignore the error. I'll get a PR going, and you should probably close the issue with PyOCR.
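
A rough sketch of the catch-and-ignore approach, not the contents of the actual PR. To keep it runnable without PyOCR installed, `TesseractError` and `FakeOCR` below are stand-ins invented for this example, and the shape of the returned dictionary (an `"angle"` key) is an assumption. `safe_orientation` is likewise a hypothetical name.

```python
class TesseractError(Exception):
    """Stand-in for pyocr.tesseract.TesseractError (assumption for this sketch)."""


class FakeOCR:
    """Hypothetical OCR tool that fails on sparse pages, as in the log above."""

    def detect_orientation(self, image, lang=None):
        # Here a "page" is just a string; short strings simulate a page
        # with too few characters to detect a script.
        if len(image) < 10:
            raise TesseractError(-1, "No script found in image")
        return {"angle": 90, "confidence": 5.0}


def safe_orientation(ocr, image, lang=None):
    """Return the detected angle, or None if orientation cannot be determined."""
    try:
        return ocr.detect_orientation(image, lang=lang)["angle"]
    except TesseractError:
        # Too little text to judge orientation: ignore the error so the
        # caller can leave this page unrotated instead of crashing.
        return None


print(safe_orientation(FakeOCR(), "x"))       # None
print(safe_orientation(FakeOCR(), "x" * 20))  # 90
```

Returning `None` lets the caller decide whether to skip rotation or fall back to a default angle.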

pitkley added a commit to pitkley/paperless that referenced this issue Feb 19, 2016
Fixes an additional issue that came up in the-paperless-project#48.
@pitkley
Member

pitkley commented Feb 19, 2016

I have opened #52 to fix the orientation issue.

@ghost
Author

ghost commented Feb 19, 2016

Okay, but I guess we can wait to see what the PyOCR dev says. The error mentions 'Orientation in degrees', so perhaps a sane default would be of use there, assuming most documents are portrait.

@ghost
Author

ghost commented Feb 19, 2016

Thanks @pitkley - I'll give it a try, except I don't have much git-fu. How do I switch to the PR?

@danielquinn
Collaborator

I've merged @pitkley's fix so hopefully you should be able to get things working now. Let me know if you're cool with me closing this issue.

@ghost
Author

ghost commented Feb 19, 2016

Hi Daniel,

it looks like it's working. I ran a bank statement with 91 occurrences of 2002 in the account number and it didn't tag it. Perfect!

Cheers,
Jason.

@danielquinn
Collaborator

\o/
