AttributeError: 'NoneType' object has no attribute 'group' (negative bbox attr) #127
Comments
It would be good to have an example of negative bounding boxes. Is it possible to extract the PDF page that causes this and provide it here? Or does someone else have an example?
How was the input document created? All values of an hocr `bbox` must be non-negative integers. It is more sensible to fix this upstream in whatever tool created the hocr, e.g. by clipping values to zero.
I didn't know that about the hocr bbox. I believe the issue may reside in the gcv2hocr tool I use to convert Google Vision responses to hocr. I'm going to do some further testing and report back with my findings. I cannot share the PDF file I am using because it is confidential. I found another PDF that has a similar issue. It is a 50-page scanned document and the issue occurs a couple of pages in. The weird thing is that the page with the issue doesn't have any text near the edge of the page. I wonder how I am getting negative values. I will report back after further testing.
It looks like Google Vision is giving me negative vertices for one of my pages. I then use the gcv2hocr tool to convert the Google Vision JSON responses to hocr data. This tool is the one that is converting the mistake from Google Vision into a negative bbox. I just don't know why Google Vision is giving me negative vertices for this page. Here is the JSON snippet from the OCR data (search for -5 and you will find the negative values):
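To locate such values in a response without searching by hand, a small recursive walker along the following lines might help. This is only a sketch: the sample JSON is invented for illustration (it is not the response discussed in this thread), `find_negative_vertices` is a hypothetical helper, and it assumes that coordinates live in lists stored under a `vertices` key and that a missing `x` or `y` means 0.

```python
import json


def find_negative_vertices(node, path="$"):
    """Yield (path, vertex) for every vertex with a negative x or y."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "vertices" and isinstance(value, list):
                for i, vertex in enumerate(value):
                    if not isinstance(vertex, dict):
                        continue
                    # Assumption: an omitted coordinate counts as 0.
                    if vertex.get("x", 0) < 0 or vertex.get("y", 0) < 0:
                        yield f"{child}[{i}]", vertex
            else:
                yield from find_negative_vertices(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_negative_vertices(item, f"{path}[{i}]")


# Made-up sample, not taken from the confidential document.
sample = json.loads("""
{"textAnnotations": [{"description": "Invoice",
  "boundingPoly": {"vertices": [{"x": -5, "y": 12}, {"x": 80, "y": 12},
                                {"x": 80, "y": 30}, {"x": -5, "y": 30}]}}]}
""")

for where, vertex in find_negative_vertices(sample):
    print(where, vertex)
```

Reporting the path alongside the vertex makes it easier to see which word or block on the page produced the negative coordinate.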
I had issues when trying to generate the OCR overlay using the hocr-pdf tool from hocr-tools (https://github.com/tmbdev/hocr-tools) because `bbox` should be an unsigned integer, but gcv2hocr.py generates bbox values with negative numbers (if the Google Vision JSON response has negative numbers). Using this fix generates the HOCR correctly by following the spec (and the invisible OCR text shows up in the correct place, so it does not break anything). Here is the HOCR bbox spec for clarification: http://kba.cloud/hocr-spec/1.2/#bbox And here are an issue and a PR I created for hocr-tools before I realized this was an issue with this project instead: ocropus/hocr-tools#127 ocropus/hocr-tools#128
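For anyone who wants to see the idea in isolation, clamping could look roughly like this. It is a minimal sketch and not the actual gcv2hocr code: `vertices_to_bbox` is a hypothetical helper, and it assumes each vertex is a dict whose `x` and `y` may be negative or missing.

```python
def vertices_to_bbox(vertices):
    """Build an hOCR bbox string from polygon vertices, clamping at zero."""
    # Assumption: a missing coordinate is treated as 0.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    corners = (min(xs), min(ys), max(xs), max(ys))
    # Negative values are clipped to 0 so the bbox stays unsigned.
    x0, y0, x1, y1 = (max(0, int(n)) for n in corners)
    return f"bbox {x0} {y0} {x1} {y1}"


# Made-up vertices with the kind of small negative offset described above.
print(vertices_to_bbox([
    {"x": -5, "y": 12}, {"x": 80, "y": 12},
    {"x": 80, "y": 30}, {"x": -5, "y": 30},
]))
```

Clamping only moves a box edge to the page boundary, so the text position inside the page is unchanged.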
I've opened a PR at the gcv2hocr project. This fixes the error by generating HOCR that is in compliance with the spec (forcing bbox to use unsigned integers or default to 0). Closing this issue as my problem is resolved. I hope this helps others who run into the same issue.
Thank you for clarifying this issue.
This appears to be caused by this line not matching negative bounding boxes:
https://github.com/tmbdev/hocr-tools/blob/master/hocr-pdf#L76
I hit this issue on a 60-page PDF around page 14 or so. I don't know why I got negative bounding boxes, but it happens, and it caused my code to fail. Updating the above-mentioned line to match negative numbers fixes this issue and allows my PDF to be created correctly.
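To illustrate the kind of change I mean, here is a rough sketch. The two patterns are my own stand-ins and may not match the exact expression in hocr-pdf, and `parse_bbox` is a hypothetical helper. The point is that a digits-only pattern returns no match for a negative value, and calling `.group()` on that `None` result raises the AttributeError in the title.

```python
import re

# Assumed patterns for illustration; the real line in hocr-pdf may differ.
STRICT_BBOX = re.compile(r"bbox((\s+\d+){4})")    # digits only
SIGNED_BBOX = re.compile(r"bbox((\s+-?\d+){4})")  # optional leading minus


def parse_bbox(title, pattern=SIGNED_BBOX):
    """Return the four bbox values from an hOCR title string, or None."""
    match = pattern.search(title)
    if match is None:
        return None
    return [int(value) for value in match.group(1).split()]


title = "bbox -5 120 340 160; x_wconf 93"  # made-up title attribute

# The strict pattern finds nothing, so .group(1) would fail on None.
print(STRICT_BBOX.search(title))
# The signed pattern parses all four values, including the negative one.
print(parse_bbox(title))
```

Checking for `None` before calling `.group()` also means a malformed bbox can be skipped or reported instead of crashing the whole run.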