AttributeError: 'NoneType' object has no attribute 'group' (negative bbox attr) #127

Closed
skylord123 opened this issue Jul 18, 2018 · 6 comments

Comments

@skylord123
Contributor

	Message: Error from hocr-pdf.exe: [10920] Failed to execute script hocr-pdf
	Traceback (most recent call last):
	  File "hocr-pdf", line 163, in 
	  File "hocr-pdf", line 69, in export_pdf
	  File "hocr-pdf", line 81, in add_text_layer
	AttributeError: 'NoneType' object has no attribute 'group'
	StackTrace: coldfusion.runtime.CustomException: Error from hocr-pdf.exe: [10920] Failed to execute script hocr-pdf
	Traceback (most recent call last):
	  File "hocr-pdf", line 163, in 
	  File "hocr-pdf", line 69, in export_pdf
	  File "hocr-pdf", line 81, in add_text_layer

This appears to be caused by this line, which does not match negative bounding box values:
https://github.com/tmbdev/hocr-tools/blob/master/hocr-pdf#L76

I hit this issue on a 60-page PDF around page 14 or so. I don't know why I got negative bounding boxes, but they occurred and caused my code to fail. Updating the line mentioned above to match negative numbers fixes the issue and allows my PDF to be created correctly.
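
To illustrate the failure mode, here is a minimal sketch. It assumes the bbox pattern in hocr-pdf has roughly the shape shown in `UNSIGNED_BBOX`; the actual source may differ. `SIGNED_BBOX` and the sample `title` string are hypothetical and only show how an optional minus sign changes the match.

```python
import re

# Assumed shape of the existing pattern (the real hocr-pdf source may differ):
# four whitespace-separated runs of digits after "bbox".
UNSIGNED_BBOX = re.compile(r'bbox((\s+\d+){4})')

# Hypothetical variant that also accepts a leading minus sign on each value.
SIGNED_BBOX = re.compile(r'bbox((\s+-?\d+){4})')

# Made-up hOCR title attribute containing one negative coordinate.
title = 'bbox -5 120 310 164; x_wconf 93'

# search() returns None here, so calling .group(1) on the result would raise
# AttributeError: 'NoneType' object has no attribute 'group'.
unsigned_match = UNSIGNED_BBOX.search(title)

# The signed variant matches and yields the four coordinates.
signed_match = SIGNED_BBOX.search(title)
coords = [int(value) for value in signed_match.group(1).split()]
```

Accepting the minus sign only avoids the crash. The values would still fall outside what the hOCR spec allows for a bbox.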

@stweil
Collaborator

stweil commented Jul 19, 2018

It would be good to have an example of negative bounding boxes. Is it possible to extract the PDF page which causes that and provide it here? Or does someone else have an example?

@kba
Contributor

kba commented Jul 19, 2018

How was the input document created? All values of an hocr bbox must be unsigned integers, absolute coordinates on the page. It doesn't make sense to have a bounding box extend beyond the page. Are you creating bounding boxes across double pages?

It is more sensible to fix this upstream in whatever tool created the hocr, e.g. by clipping values to zero.
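
Clipping to zero in the upstream tool could look something like the following sketch. The helper names `clamp_bbox` and `format_bbox` are hypothetical and are not part of hocr-tools or any converter. The sketch only assumes the bbox is available as four numbers before it is written into the hOCR `title` attribute.

```python
def clamp_bbox(bbox):
    """Return the bbox as a tuple of ints with negative coordinates replaced by 0."""
    return tuple(max(0, int(value)) for value in bbox)


def format_bbox(bbox):
    """Render a clamped bbox in the 'bbox x0 y0 x1 y1' form used in hOCR titles."""
    return "bbox " + " ".join(str(value) for value in clamp_bbox(bbox))


# Made-up coordinates with one negative value.
clipped = clamp_bbox((-5, 40, 90, 62))
rendered = format_bbox((-5, 40, 90, 62))
```

This only handles the lower bound. Clipping to the page width and height would need the page dimensions as additional input.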

@skylord123
Contributor Author

I didn't know that about the hocr bbox. I believe the issue may reside in the gcv2hocr tool, which I use to convert Google Vision responses to hocr. I'm going to do some further testing and report back with my findings. I cannot share the PDF file I am using because it is confidential.

I found another PDF that has a similar issue. It is a 50-page scanned document and the issue occurs a couple of pages in. The weird thing is that the page with the issue doesn't have any text near the edge of the page. I wonder how I am getting negative values.

I will report back after further testing.

@skylord123
Contributor Author

skylord123 commented Jul 19, 2018

It looks like Google Vision is giving me negative vertices for one of my pages. I then use the gcv2hocr tool to convert the Google Vision JSON responses to hocr data. That tool carries the negative values from Google Vision over into a negative bbox. Since bbox requires unsigned integers, I will be opening a PR on gcv2hocr to fix this.

I just don't know why Google Vision is giving me negative vertices for this page:
page_021

Here is the JSON snippet from the OCR data (search for -5 and you will find the negative values):
https://gist.github.com/skylord123/52a4eb219da687a2f654b080088c56a0#file-page_021-json-L5679
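
Rather than searching the JSON by hand, negative vertices can be located with a short script. The following is a rough sketch. The function name `find_negative_vertices` is hypothetical, and the sample data is made up rather than taken from the gist. It assumes vertices appear as lists of objects with optional `x` and `y` keys under a `vertices` key, and treats a missing key as 0.

```python
import json


def find_negative_vertices(node, path="$"):
    """Yield (path, vertex) for every vertex with a negative x or y coordinate."""
    if isinstance(node, dict):
        vertices = node.get("vertices")
        if isinstance(vertices, list):
            for index, vertex in enumerate(vertices):
                # A missing x or y is treated as 0 (an assumption of this sketch).
                if isinstance(vertex, dict) and (
                    vertex.get("x", 0) < 0 or vertex.get("y", 0) < 0
                ):
                    yield f"{path}.vertices[{index}]", vertex
        # Recurse into every value so nested annotations are covered as well.
        for key, value in node.items():
            yield from find_negative_vertices(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_negative_vertices(item, f"{path}[{index}]")


# Made-up response fragment with two negative x values.
sample = json.loads("""
{"textAnnotations": [{"description": "Total", "boundingPoly": {"vertices": [
    {"x": -5, "y": 40}, {"x": 90, "y": 40}, {"x": 90, "y": 62}, {"x": -5, "y": 62}
]}}]}
""")

hits = list(find_negative_vertices(sample))
```

For a real response, `sample` would be replaced by the result of `json.load` on the saved file, and each entry in `hits` gives a path to the offending vertex.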

skylord123 added a commit to skylord123/gcv2hocr that referenced this issue Jul 19, 2018
I had issues when trying to generate the OCR overlay using the hocr-pdf tool from hocr-tools (https://github.com/tmbdev/hocr-tools) because `bbox` should be an unsigned integer but gcv2hocr.py generates bbox values with negative numbers (if the JSON vision response has negative numbers). Using this fix generates the HOCR correctly by following the spec (and the invisible OCR text shows up in the correct place, so it does not break anything).

Here is the HOCR bbox spec for clarification:
http://kba.cloud/hocr-spec/1.2/#bbox

And here is an issue and PR I created for the hocr-tools before I realized this was an issue with this project instead:
ocropus/hocr-tools#127
ocropus/hocr-tools#128
@skylord123
Contributor Author

I've opened a PR on the gcv2hocr project. This fixes the error by generating HOCR that complies with the spec (forcing bbox to use unsigned integers, or defaulting to 0).

Closing this issue as my problem is resolved. I hope this helps others that run into the same issue.

@stweil
Collaborator

stweil commented Jul 19, 2018

Thank you for clarifying this issue.
