Tag on whole words #48
Comments
Heh. Yeah that's a good point. I've just been using the Regex matcher, but I think it's probably best to make all the matching algorithms smart enough to match on word boundary. I'll try to do this in the next couple days. |
Hey there, I patched master with some better logic on the |
Hi Daniel, I pulled the latest version and now the consumer dies with this:
It looks like an error in PyOCR. I'll file an issue there but I guess it might be worth wrapping the call in a try. Cheers, |
Filed as openpaperwork/pyocr#33 |
This is not an issue with PyOCR as far as I can tell, but actually expected behaviour on their side. If a document page doesn't have enough data to identify the orientation of the page, it throws an error. The fault is actually on our part, since we should catch and -- in this case -- ignore the error. I'll get a PR going, and you should probably close the issue with PyOCR. |
Fixes an additional issue that came up in the-paperless-project#48.
I have opened #52 to fix the orientation issue. |
Okay, but I guess we can wait to see what the PyOCR dev says. The error mentions 'Orientation in degrees' and perhaps a sane default might be of use there assuming most documents are portrait. |
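For anyone following along, the catch-and-ignore idea with a portrait default could look roughly like the sketch below. This is not the code from the PR. `OrientationError`, `safe_orientation` and the `detect` callable are hypothetical stand-ins for whatever the OCR library actually exposes and raises, and the default of 0 degrees is just the "assume portrait" guess.

```python
DEFAULT_ORIENTATION = 0  # degrees; assumption: most documents are portrait


class OrientationError(Exception):
    """Hypothetical stand-in for the error the OCR library raises."""


def safe_orientation(detect, image, default=DEFAULT_ORIENTATION):
    """Call detect(image) and fall back to a default if detection fails.

    `detect` is any callable that returns an angle in degrees or raises
    OrientationError when the page has too little data to decide.
    """
    try:
        return detect(image)
    except OrientationError:
        # Not enough data on the page: ignore the error and use the default.
        return default
```

With something like this, a nearly blank page no longer kills the consumer; it is simply treated as portrait.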
Thanks @pitkley - I'll give it a try except I don't have much git foo. How do I switch to the PR? |
I've merged @pitkley's fix so hopefully you should be able to get things working now. Let me know if you're cool with me closing this issue. |
Hi Daniel, it looks like it's working. I ran a bank statement with 91 occurrences of 2002 in the account number and it didn't tag it. Perfect! Cheers, |
\o/ |
Hi Daniel,
I want to tag documents by year but I have a bank account with 2002 in the number. You can guess what happens ;-)
I patched Tag.matches to match only on whole words using this:
text = re.sub(r'\W+', ' ', text.lower()).split()
I wonder if it would be useful to add an option for whole word matching.
Cheers,
Jason.
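A minimal standalone sketch of the two whole-word approaches discussed in this thread, for comparison. The function names are made up for illustration and this is not the actual `Tag.matches` code; the first function just wraps the one-liner above, the second uses a `\b` word-boundary regex instead.

```python
import re


def matches_whole_word(match, text):
    """Whole-word test by splitting the text into lowercase words."""
    words = re.sub(r"\W+", " ", text.lower()).split()
    return match.lower() in words


def matches_whole_word_regex(match, text):
    """Whole-word test using regex word boundaries (case-insensitive)."""
    pattern = r"\b" + re.escape(match) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None
```

Both reject `2002` when it is buried inside a longer account number. The split version only works for single-word matches, while the `\b` version also copes with a multi-word phrase.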