Tag on whole words #48

ghost · 2016-02-18T09:38:48Z

Hi Daniel,

I want to tag documents by year but I have a bank account with 2002 in the number. You can guess what happens ;-)

I patched Tag.matches to match only on whole words using this:

text = re.sub(r'\W+', ' ', text.lower()).split()

I wonder if it would be useful to add an option for whole word matching.
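
To illustrate the idea, here is a minimal sketch of whole-word matching built around the one-liner above. The function names `whole_words` and `matches_whole_word` and the sample strings are assumptions made for this example; they are not part of Paperless or of `Tag.matches`.

```python
import re


def whole_words(text):
    """Lowercase the text and split it into tokens on runs of non-word characters."""
    return re.sub(r'\W+', ' ', text.lower()).split()


def matches_whole_word(tag, text):
    """Return True only if the tag equals one complete token of the text."""
    return tag.lower() in whole_words(text)


# Hypothetical document snippets: the first embeds 2002 inside a longer number.
account_line = "Account no. 1200234"
dated_line = "Invoice dated 12.03.2002"

print(matches_whole_word("2002", account_line))
print(matches_whole_word("2002", dated_line))
```

A plain substring test (`"2002" in account_line`) would succeed on the first snippet, which is the false positive described above. Note that an account number printed in groups, such as `12 2002 34`, would still produce a standalone `2002` token under this approach.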

Cheers,
Jason.

danielquinn · 2016-02-18T11:37:30Z

Heh. Yeah that's a good point. I've just been using the Regex matcher, but I think it's probably best to make all the matching algorithms smart enough to match on word boundary. I'll try to do this in the next couple days.
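
For reference, matching on a word boundary can be sketched with the `\b` anchor from Python's `re` module. This is an illustration under assumptions, not the logic that ended up in `.matches()`; the helper name `matches_on_word_boundary` is hypothetical.

```python
import re


def matches_on_word_boundary(needle, text):
    """Search for needle as a whole word, ignoring case.

    re.escape keeps any special characters in the needle literal, and the
    surrounding \\b anchors require a word/non-word transition on each side.
    """
    pattern = r'\b' + re.escape(needle) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(matches_on_word_boundary("2002", "acct 1200234"))
print(matches_on_word_boundary("2002", "Tax year: 2002."))
```

One caveat of this design: `\b` treats digits, letters, and underscores as word characters, so `2002` inside `ref_2002` would not count as a whole word.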

danielquinn · 2016-02-19T00:46:19Z

Hey there, I patched master with some better logic on the .matches() function. When you have a minute, please give it a shot with the document you had the problem with, and if I don't hear from you in a few days, I'll consider this fixed :-)

ghost · 2016-02-19T07:36:33Z

Hi Daniel, I pulled the latest version and now the consumer dies with this:

consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |
consumer_1 |    **** This file had errors that were repaired or ignored.
consumer_1 |    **** The file was produced by:
consumer_1 |    **** >>>> Mac OS X 10.8.2 Quartz PDFContext <<<<
consumer_1 |    **** Please notify the author of the software that produced this
consumer_1 |    **** file that it does not conform to Adobe's published PDF
consumer_1 |    **** specification.
consumer_1 |
consumer_1 | multiprocessing.pool.RemoteTraceback:
consumer_1 | """
consumer_1 | Traceback (most recent call last):
consumer_1 |   File "/usr/local/lib/python3.5/site-packages/pyocr/tesseract.py", line 171, in detect_orientation
consumer_1 |     angle = int(output['Orientation in degrees'])
consumer_1 | KeyError: 'Orientation in degrees'
consumer_1 |
consumer_1 | During handling of the above exception, another exception occurred:
consumer_1 |
consumer_1 | Traceback (most recent call last):
consumer_1 |   File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 119, in worker
consumer_1 |     result = (True, func(*args, **kwds))
consumer_1 |   File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
consumer_1 |     return list(map(*args))
consumer_1 |   File "/usr/src/paperless/src/documents/consumer.py", line 32, in image_to_string
consumer_1 |     orientation = self.OCR.detect_orientation(f, lang=lang)
consumer_1 |   File "/usr/local/lib/python3.5/site-packages/pyocr/tesseract.py", line 180, in detect_orientation
consumer_1 |     % original_output)
consumer_1 | pyocr.tesseract.TesseractError: (-1, 'No script found in image (Too few characters. Skipping this page)')
consumer_1 | """
consumer_1 |
consumer_1 | The above exception was the direct cause of the following exception:
consumer_1 |
consumer_1 | Traceback (most recent call last):
consumer_1 |   File "/usr/src/paperless/src/documents/consumer.py", line 196, in _get_ocr
consumer_1 |     return self._ocr(pngs, ISO639[guessed_language])
consumer_1 |   File "/usr/src/paperless/src/documents/consumer.py", line 231, in _ocr
consumer_1 |     image_to_string, itertools.product([self], pngs, [lang]))
consumer_1 |   File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 260, in map
consumer_1 |     return self._map_async(func, iterable, mapstar, chunksize).get()
consumer_1 |   File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 608, in get
consumer_1 |     raise self._value
consumer_1 | pyocr.tesseract.TesseractError: (-1, 'No script found in image (Too few characters. Skipping this page)')

It looks like an error in PyOCR. I'll file an issue there but I guess it might be worth wrapping the call in a try.

Cheers,
Jason.

ghost · 2016-02-19T07:43:52Z

Filed as openpaperwork/pyocr#33

pitkley · 2016-02-19T08:49:08Z

This is not an issue with PyOCR as far as I can tell, but actually expected behaviour on their side. If a document page doesn't have enough data to identify the orientation of the page, it throws an error.

The fault is actually on our part, since we should catch and -- in this case -- ignore the error. I'll get a PR going, and you should probably close the issue with PyOCR.
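
The catch-and-ignore approach can be sketched roughly as follows. So that the example runs without Tesseract installed, `OrientationError` and `FakeOCR` are hypothetical stand-ins for `pyocr.tesseract.TesseractError` and the real OCR tool, and the returned dictionary is invented for illustration; this is not the code from the PR.

```python
class OrientationError(Exception):
    """Stand-in for pyocr.tesseract.TesseractError (assumption for this sketch)."""


class FakeOCR:
    """Hypothetical OCR tool that fails on pages with too little text."""

    def detect_orientation(self, image, lang=None):
        if len(image) < 10:
            raise OrientationError(-1, "Too few characters. Skipping this page")
        return {"angle": 90}  # made-up result shape, for illustration only


def safe_orientation(ocr, image, lang=None):
    """Return the detected orientation, or None when detection fails."""
    try:
        return ocr.detect_orientation(image, lang=lang)
    except OrientationError:
        # Not enough data on the page: skip rotation instead of crashing.
        return None


print(safe_orientation(FakeOCR(), "x"))
print(safe_orientation(FakeOCR(), "a full page of readable text"))
```

Returning `None` lets the caller carry on with the page unrotated, so a single sparse page no longer aborts the whole worker pool.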

Fixes an additional issue that came up in the-paperless-project#48.

pitkley · 2016-02-19T08:54:36Z

I have opened #52 to fix the orientation issue.

ghost · 2016-02-19T08:54:41Z

Okay, but I guess we can wait to see what the PyOCR dev says. The error mentions 'Orientation in degrees' and perhaps a sane default might be of use there assuming most documents are portrait.

ghost · 2016-02-19T08:58:16Z

Thanks @pitkley - I'll give it a try, except I don't have much git-fu. How do I switch to the PR?

danielquinn · 2016-02-19T09:15:48Z

I've merged @pitkley's fix so hopefully you should be able to get things working now. Let me know if you're cool with me closing this issue.

ghost · 2016-02-19T09:55:05Z

Hi Daniel,

It looks like it's working. I ran a bank statement with 91 occurrences of 2002 in the account number and it didn't tag it. Perfect!

Cheers,
Jason.

danielquinn · 2016-02-19T11:46:43Z

\o/

danielquinn added a commit that referenced this issue Feb 19, 2016

#48: make the tag matching smarter

ec88ea7

pitkley added a commit to pitkley/paperless that referenced this issue Feb 19, 2016

Ignore error if orientation detection fails

c45f951

Fixes an additional issue that came up in the-paperless-project#48.

pitkley mentioned this issue Feb 19, 2016

Ignore error if orientation detection fails #52

Merged

danielquinn closed this as completed Feb 19, 2016

ghost mentioned this issue Feb 20, 2016

Make sure file is preserved on import failure #57

Closed
