makebox doesn't output horizontal coordinates of textangle 90 content #3590

Open
rmast opened this issue Oct 4, 2021 · 2 comments

rmast commented Oct 4, 2021

Environment

tesseract 4.1.1 and 5.0.0-beta-20210916
Linux 5.4.0-81-generic #91-Ubuntu SMP Thu Jul 15 19:09:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Using the nld language.
The nld (and eng) language data come from https://github.com/tesseract-ocr/tessdata, with these sizes:
15400601 eng.traineddata
8903736 nld.traineddata

makebox outputs the following for a part of the image that contains vertically oriented text (textangle 90).
All horizontal coordinates and widths are 0:

2 1968 0 1982 0 0
0 1985 0 1998 0 0
8 2001 0 2014 0 0
4 2016 0 2030 0 0
- 2041 0 2049 0 0
2 2059 0 2073 0 0
/ 2074 0 2082 0 0
2 2083 0 2097 0 0
2 2116 0 2130 0 0
5 2133 0 2146 0 0
1 2150 0 2158 0 0
9 2165 0 2179 0 0
8 2181 0 2195 0 0
0 2197 0 2211 0 0
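
A quick way to spot such lines automatically is to parse the box output and flag boxes with no extent in one dimension. This is a minimal sketch, assuming the usual box-file column order (symbol, left, bottom, right, top, page); the `Box` type and the helper names are made up for illustration and are not part of Tesseract.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the symbol stays intact whatever it contains.
    symbol, left, bottom, right, top, page = line.rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def is_degenerate(box: Box) -> bool:
    # A usable box needs a positive extent in both dimensions.
    return box.right <= box.left or box.top <= box.bottom


# A few lines copied from the makebox output above.
sample = """2 1968 0 1982 0 0
0 1985 0 1998 0 0
- 2041 0 2049 0 0
/ 2074 0 2082 0 0"""

boxes = [parse_box_line(line) for line in sample.splitlines()]
degenerate = [b.symbol for b in boxes if is_degenerate(b)]
print(degenerate)
```

Every sampled line is flagged, because the third and fifth columns are both 0.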

This is the image that was used for this data:
210913.nog.2-000na.zip

A similar issue, #2340, was filed earlier, but its reporter (https://github.com/dev884) gave no pointer to the fix, and that account contains no code at all.

Expected Behavior:

I would expect the horizontal coordinates to resemble those in the word-level hOCR output for the same region of the image:
<span class='ocr_line' id='line_1_1' title="bbox 111 1289 133 1532; textangle 90; x_size 28.416666; x_descenders 7.1041665; x_ascenders 7.1041665">
 <span class='ocrx_word' id='word_1_1' title='bbox 112 1470 133 1532; x_wconf 88'>2084</span>
 <span class='ocrx_word' id='word_1_2' title='bbox 124 1451 127 1459; x_wconf 88'>-</span>
 <span class='ocrx_word' id='word_1_3' title='bbox 111 1403 133 1441; x_wconf 96'>2/2</span>
 <span class='ocrx_word' id='word_1_4' title='bbox 112 1289 133 1384; x_wconf 96'>251980</span>
</span>

Suggested Fix:

None yet. I don't know whether the workaround from the reporter of the earlier issue could be made watertight.


wollmers commented Oct 5, 2021

Reproduced with:

$ tesseract 210913_nog_-000na.tif 210913_nog_-000na -l nld -c hocr_font_info=1 -c hocr_char_boxes=1  --tessdata-dir /usr/local/share/tessdata txt hocr
Tesseract Open Source OCR Engine v5.0.0-alpha-773-gd33ed with Leptonica

That gives in hocr:

  <div class='ocr_page' id='page_1' title='image "210913_nog_-000na.tif"; bbox 0 0 2480 3500; ppageno 0'>
[...]
     <span class='ocr_line' id='line_1_2' title="bbox 111 1289 133 1532; textangle 90; x_size 28.416666; x_descenders 7.1041665; x_ascenders 7.1041665">
      <span class='ocrx_word' id='word_1_3' title='bbox 112 1470 133 1532; x_wconf 79; x_fsize 28'>
       <span class='ocrx_cinfo' title='x_bboxes 1968 3500 1982 3500; x_conf 99.563431'>2</span>
       <span class='ocrx_cinfo' title='x_bboxes 1985 3500 1998 3500; x_conf 99.56974'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 2001 3500 2014 3500; x_conf 99.574326'>8</span>
       <span class='ocrx_cinfo' title='x_bboxes 2016 3500 2030 3500; x_conf 97.113152'>4</span>
      </span>
      <span class='ocrx_word' id='word_1_4' title='bbox 124 1451 127 1459; x_wconf 85; x_fsize 28'>
       <span class='ocrx_cinfo' title='x_bboxes 2041 3500 2049 3500; x_conf 99.021004'>-</span>
      </span>
      <span class='ocrx_word' id='word_1_5' title='bbox 111 1403 133 1441; x_wconf 96; x_fsize 28'>
       <span class='ocrx_cinfo' title='x_bboxes 2059 3500 2073 3500; x_conf 99.569565'>2</span>
       <span class='ocrx_cinfo' title='x_bboxes 2074 3500 2082 3500; x_conf 99.55378'>/</span>
       <span class='ocrx_cinfo' title='x_bboxes 2083 3500 2097 3500; x_conf 99.532211'>2</span>
      </span>
      <span class='ocrx_word' id='word_1_6' title='bbox 112 1289 133 1384; x_wconf 96; x_fsize 28'>
       <span class='ocrx_cinfo' title='x_bboxes 2116 3500 2130 3500; x_conf 99.557686'>2</span>
       <span class='ocrx_cinfo' title='x_bboxes 2133 3500 2146 3500; x_conf 99.572243'>5</span>
       <span class='ocrx_cinfo' title='x_bboxes 2150 3500 2158 3500; x_conf 99.570076'>1</span>
       <span class='ocrx_cinfo' title='x_bboxes 2165 3500 2179 3500; x_conf 99.573807'>9</span>
       <span class='ocrx_cinfo' title='x_bboxes 2181 3500 2195 3500; x_conf 99.573814'>8</span>
       <span class='ocrx_cinfo' title='x_bboxes 2197 3500 2211 3500; x_conf 99.562088'>0</span>
      </span>
     </span>

The character-level bounding boxes in the output are wrong. I would expect them to be in the coordinate system of the page image, like the word-level bounding boxes. That is, they should inherit the textangle of the enclosing line.
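
That expectation can be turned into a simple check: every `ocrx_cinfo` box should lie inside the `bbox` of its enclosing `ocrx_word`. The sketch below is only an illustration. It assumes one element per line, as in the output above, and uses regular expressions rather than a real HTML parser; `chars_outside_word` and `contains` are hypothetical helper names.

```python
import re

WORD_BBOX = re.compile(r"\bbbox (\d+) (\d+) (\d+) (\d+)")
CHAR_BBOX = re.compile(r"x_bboxes (\d+) (\d+) (\d+) (\d+)")
CHAR_TEXT = re.compile(r">([^<]*)</span>")


def contains(outer, inner):
    # Boxes are (x0, y0, x1, y1) in page coordinates.
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def chars_outside_word(hocr: str):
    """Return (text, box) for each character box not inside its word box."""
    bad = []
    word_box = None
    for line in hocr.splitlines():
        if "ocrx_word" in line:
            word_box = tuple(map(int, WORD_BBOX.search(line).groups()))
        elif "ocrx_cinfo" in line and word_box is not None:
            box = tuple(map(int, CHAR_BBOX.search(line).groups()))
            text = CHAR_TEXT.search(line).group(1)
            if not contains(word_box, box):
                bad.append((text, box))
    return bad


# First word of the hOCR output quoted above.
reported = """
<span class='ocrx_word' id='word_1_3' title='bbox 112 1470 133 1532; x_wconf 79; x_fsize 28'>
 <span class='ocrx_cinfo' title='x_bboxes 1968 3500 1982 3500; x_conf 99.563431'>2</span>
 <span class='ocrx_cinfo' title='x_bboxes 1985 3500 1998 3500; x_conf 99.56974'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 2001 3500 2014 3500; x_conf 99.574326'>8</span>
 <span class='ocrx_cinfo' title='x_bboxes 2016 3500 2030 3500; x_conf 99.574326'>4</span>
</span>
"""

outside = chars_outside_word(reported)
print([text for text, _ in outside])
```

All four characters of the word are reported as lying outside their word box.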


rmast commented Oct 18, 2021

I opened a pull request with a fix: #3599
The resulting hOCR output for the same region:

   <div class='ocr_carea' id='block_1_2' title="bbox 111 1289 133 1532">
    <p class='ocr_par' id='par_1_2' lang='nld' title="bbox 111 1289 133 1532">
     <span class='ocr_line' id='line_1_2' title="bbox 111 1289 133 1532; textangle 90; x_size 28.416666; x_descenders 7.1041665; x_ascenders 7.1041665">
      <span class='ocrx_word' id='word_1_3' title='bbox 112 1470 133 1532; x_wconf 95; x_fsize 7'>
       <span class='ocrx_cinfo' title='x_bboxes 112 1518 133 1532; x_conf 99.56881'>2</span>
       <span class='ocrx_cinfo' title='x_bboxes 112 1502 132 1515; x_conf 99.563797'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 112 1486 132 1499; x_conf 99.56678'>8</span>
       <span class='ocrx_cinfo' title='x_bboxes 112 1470 132 1484; x_conf 99.568611'>4</span>
      </span>
      <span class='ocrx_word' id='word_1_4' title='bbox 124 1451 127 1459; x_wconf 88; x_fsize 7'>
       <span class='ocrx_cinfo' title='x_bboxes 124 1451 127 1459; x_conf 99.513245'>-</span>
      </span>
      <span class='ocrx_word' id='word_1_5' title='bbox 111 1403 133 1441; x_wconf 88; x_fsize 7'>
       <span class='ocrx_cinfo' title='x_bboxes 112 1427 133 1441; x_conf 98.927139'>2</span>
       <span class='ocrx_cinfo' title='x_bboxes 111 1418 132 1426; x_conf 99.018196'>/</span>
       <span class='ocrx_cinfo' title='x_bboxes 112 1403 133 1417; x_conf 99.005127'>2</span>
      </span>
      <span class='ocrx_word' id='word_1_6' title='bbox 112 1289 133 1384; x_wconf 96; x_fsize 7'>
       <span class='ocrx_cinfo' title='x_bboxes 112 1370 133 1384; x_conf 99.544136'>2</span>
       <span class='ocrx_cinfo' title='x_bboxes 112 1354 133 1367; x_conf 99.573975'>5</span>
       <span class='ocrx_cinfo' title='x_bboxes 112 1342 133 1350; x_conf 99.509567'>1</span>
       <span class='ocrx_cinfo' title='x_bboxes 112 1321 133 1335; x_conf 99.54306'>9</span>
       <span class='ocrx_cinfo' title='x_bboxes 112 1305 133 1319; x_conf 99.544922'>8</span>
       <span class='ocrx_cinfo' title='x_bboxes 112 1289 133 1303; x_conf 99.564034'>0</span>
      </span>
     </span>
    </p>
   </div>
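
Comparing the numbers in this thread suggests a simple relation: each nonzero value in the earlier output equals the page height (3500, from the `ocr_page` bbox) minus a y coordinate in the output above. The sketch below checks that on the first word. It is only an observation on this one sample, not a description of how Tesseract or #3599 computes the boxes, and the x extent (112 to 133) cannot be recovered this way. `to_page_y` is a hypothetical helper.

```python
PAGE_HEIGHT = 3500  # from the ocr_page bbox "0 0 2480 3500"

# (symbol, first value, second value): the nonzero columns reported earlier.
reported = [("2", 1968, 1982), ("0", 1985, 1998),
            ("8", 2001, 2014), ("4", 2016, 2030)]

# (symbol, y0, y1): vertical extent from the x_bboxes values above.
fixed = [("2", 1518, 1532), ("0", 1502, 1515),
         ("8", 1486, 1499), ("4", 1470, 1484)]


def to_page_y(first: int, second: int, page_height: int) -> tuple[int, int]:
    # Mirror both values against the page height; the order swaps.
    return page_height - second, page_height - first


recovered = [(sym, *to_page_y(a, b, PAGE_HEIGHT)) for sym, a, b in reported]
print(recovered == fixed)
```

The same relation holds for the remaining characters in the line, for example `-`: 3500 - 2049 = 1451 and 3500 - 2041 = 1459.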
