AttributeError: 'NoneType' object has no attribute 'group' (negative bbox attr) #127

skylord123 · 2018-07-18T16:24:44Z

	Message: Error from hocr-pdf.exe: [10920] Failed to execute script hocr-pdf
	Traceback (most recent call last):
	  File "hocr-pdf", line 163, in <module>
	  File "hocr-pdf", line 69, in export_pdf
	  File "hocr-pdf", line 81, in add_text_layer
	AttributeError: 'NoneType' object has no attribute 'group'
	StackTrace: coldfusion.runtime.CustomException: Error from hocr-pdf.exe: [10920] Failed to execute script hocr-pdf
	Traceback (most recent call last):
	  File "hocr-pdf", line 163, in <module>
	  File "hocr-pdf", line 69, in export_pdf
	  File "hocr-pdf", line 81, in add_text_layer

This appears to be caused by the following line, whose regular expression does not match negative bounding-box values:
https://github.com/tmbdev/hocr-tools/blob/master/hocr-pdf#L76

I hit this issue on a 60-page PDF, around page 14 or so. I don't know why I got negative bounding boxes, but they do occur, and they caused my code to fail. Updating the line mentioned above to match negative numbers fixes the issue and allows my PDF to be created correctly.
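To illustrate the failure mode, here is a minimal sketch. The unsigned pattern is an assumption modelled on the line linked above, not a verbatim copy of it; `parse_bbox` and the signed variant are hypothetical names used only for this example.

```python
import re

# Assumed pattern: four whitespace-separated runs of digits after "bbox".
BBOX_UNSIGNED = re.compile(r'bbox((\s+\d+){4})')
# Hypothetical tolerant variant that also accepts a leading minus sign.
BBOX_SIGNED = re.compile(r'bbox((\s+-?\d+){4})')


def parse_bbox(title, pattern):
    """Return the four bbox values as ints, or None if the pattern does not match."""
    match = pattern.search(title)
    if match is None:
        # Calling .group() on this None is what raises the AttributeError.
        return None
    return [int(value) for value in match.group(1).split()]


title = "bbox -5 120 340 160; baseline 0 -4"
print(parse_bbox(title, BBOX_UNSIGNED))  # None
print(parse_bbox(title, BBOX_SIGNED))    # [-5, 120, 340, 160]
```

Checking the match for `None` before calling `.group()` at least turns the crash into a condition the caller can report or skip.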

stweil · 2018-07-19T05:12:32Z

It would be good to have an example of negative bounding boxes. Is it possible to extract the PDF page which causes that and provide it here? Or does someone else have an example?

kba · 2018-07-19T08:14:19Z

How was the input document created? All values of an hocr bbox must be unsigned integers, absolute coordinates on the page. It doesn't make sense to have a bounding box extend beyond the page. Are you creating bounding boxes across double pages?

It is more sensible to fix this upstream in whatever tool created the hocr, e.g. by clipping values to zero.
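As a sketch of what clipping to zero could look like when applied to an existing hOCR `title` attribute (for instance as a clean-up pass before running hocr-pdf), assuming the attribute carries a `bbox` property with four integers. The helper name `clip_bbox_in_title` is hypothetical and not part of hocr-tools.

```python
import re

# Matches "bbox" followed by four (possibly negative) integers.
BBOX_RE = re.compile(r'bbox((?:\s+-?\d+){4})')


def clip_bbox_in_title(title):
    """Rewrite the bbox in an hOCR title string so no coordinate is below zero."""
    def _clip(match):
        values = [max(0, int(value)) for value in match.group(1).split()]
        return "bbox " + " ".join(str(value) for value in values)

    # Other properties in the title (baseline, x_wconf, ...) are left untouched.
    return BBOX_RE.sub(_clip, title)


print(clip_bbox_in_title("bbox -5 120 340 160; x_wconf 93"))
# bbox 0 120 340 160; x_wconf 93
```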

skylord123 · 2018-07-19T16:57:53Z

I didn't know that about the hOCR bbox. I believe the issue may reside in the gcv2hocr tool I use to convert Google Vision responses to hOCR. I'm going to do some further testing and report back with my findings. I cannot share the PDF file I am using because it is confidential.

I found another PDF that has a similar issue. It is a 50-page scanned document, and the issue occurs a couple of pages in. The weird thing is that the page with the issue doesn't have any text near the edge of the page. I wonder how I am getting negative values.

I will report back after further testing.

skylord123 · 2018-07-19T18:24:03Z

It looks like Google Vision is giving me negative vertices for one of my pages. I then use the gcv2hocr tool to convert the Google Vision JSON responses to hOCR data, and that tool passes the negative values from Google Vision straight through into a negative bbox. Since bbox requires unsigned integers, I will be opening a PR on gcv2hocr to fix this.

I just don't know why Google Vision is giving me negative vertices for this page.

Here is the JSON snippet from the OCR data (search for -5 and you will find the negative values):
https://gist.github.com/skylord123/52a4eb219da687a2f654b080088c56a0#file-page_021-json-L5679

I had issues when trying to generate the OCR overlay using the hocr-pdf tool from hocr-tools (https://github.com/tmbdev/hocr-tools) because `bbox` values should be unsigned integers, but gcv2hocr.py generates a bbox with negative numbers (if the JSON Vision response has negative numbers). Using this fix generates hOCR that follows the spec, and the invisible OCR text still shows up in the correct place, so it does not break anything. Here is the hOCR bbox spec for clarification: http://kba.cloud/hocr-spec/1.2/#bbox And here are the issue and PR I created for hocr-tools before I realized this was an issue with this project instead: ocropus/hocr-tools#127 ocropus/hocr-tools#128

skylord123 · 2018-07-19T18:41:47Z

I've opened a PR on the gcv2hocr project. This fixes the error by generating hOCR that complies with the spec (forcing bbox values to be unsigned integers, or defaulting to 0).
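A rough sketch of that kind of clamping, assuming Vision-style vertices given as a list of dicts whose `x` and `y` keys may be missing or negative. `vertices_to_bbox` is a hypothetical helper written for illustration; it is not the code from the gcv2hocr PR.

```python
def vertices_to_bbox(vertices):
    """Convert polygon vertices to an hOCR-style (x0, y0, x1, y1) tuple.

    Missing coordinates default to 0 and negative ones are clamped to 0,
    so every value in the result is a non-negative integer.
    """
    xs = [max(0, int(vertex.get("x", 0))) for vertex in vertices]
    ys = [max(0, int(vertex.get("y", 0))) for vertex in vertices]
    return min(xs), min(ys), max(xs), max(ys)


# Example polygon with a slightly negative left edge (values are made up).
vertices = [
    {"x": -5, "y": 10},
    {"x": 100, "y": 10},
    {"x": 100, "y": 40},
    {"x": -5, "y": 40},
]
bbox = vertices_to_bbox(vertices)
print("bbox %d %d %d %d" % bbox)  # bbox 0 10 100 40
```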

Closing this issue as my problem is resolved. I hope this helps others that run into the same issue.

stweil · 2018-07-19T19:25:52Z

Thank you for clarifying this issue.

skylord123 mentioned this issue Jul 18, 2018

Fix negative bounding boxes #128

Closed

skylord123 mentioned this issue Jul 19, 2018

Fix negative bbox values (so they do not occur) dinosauria123/gcv2hocr#18

Merged

skylord123 closed this as completed Jul 19, 2018
