Description
Environment
tesseract 4.1.1 and 5.0.0-beta-20210916
Linux 5.4.0-81-generic #91-Ubuntu SMP Thu Jul 15 19:09:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Language used: nld.
The nld (and eng) language data come from https://github.com/tesseract-ocr/tessdata and have these file sizes:
15400601 eng.traineddata
8903736 nld.traineddata
For a part of the image that contains vertically oriented text (textangle 90), makebox produces this output:
(all horizontal coordinates and widths are 0)
2 1968 0 1982 0 0
0 1985 0 1998 0 0
8 2001 0 2014 0 0
4 2016 0 2030 0 0
- 2041 0 2049 0 0
2 2059 0 2073 0 0
/ 2074 0 2082 0 0
2 2083 0 2097 0 0
2 2116 0 2130 0 0
5 2133 0 2146 0 0
1 2150 0 2158 0 0
9 2165 0 2179 0 0
8 2181 0 2195 0 0
0 2197 0 2211 0 0
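To make the problem easy to check mechanically, here is a minimal Python sketch that parses the lines above and flags boxes with no extent in one direction. It assumes the usual box-file column order (symbol, left, bottom, right, top, page); the names `Box`, `parse_box_line` and `is_degenerate` are only illustrative and not part of tesseract.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the five numeric fields are always separated,
    # whatever the symbol at the start of the line is.
    symbol, *numbers = line.rsplit(maxsplit=5)
    left, bottom, right, top, page = map(int, numbers)
    return Box(symbol, left, bottom, right, top, page)


def is_degenerate(box: Box) -> bool:
    # A usable box needs a positive extent in both directions.
    return box.right <= box.left or box.top <= box.bottom


# The makebox output quoted in this report.
SAMPLE = """\
2 1968 0 1982 0 0
0 1985 0 1998 0 0
8 2001 0 2014 0 0
4 2016 0 2030 0 0
- 2041 0 2049 0 0
2 2059 0 2073 0 0
/ 2074 0 2082 0 0
2 2083 0 2097 0 0
2 2116 0 2130 0 0
5 2133 0 2146 0 0
1 2150 0 2158 0 0
9 2165 0 2179 0 0
8 2181 0 2195 0 0
0 2197 0 2211 0 0
"""

boxes = [parse_box_line(line) for line in SAMPLE.splitlines() if line.strip()]
degenerate = [box for box in boxes if is_degenerate(box)]
print(f"{len(degenerate)} of {len(boxes)} boxes have zero extent")
```

Under that column-order assumption, every box in this sample is flagged, because the third and fifth fields are 0 on every line.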
This is the image that was used for this data:
210913.nog.2-000na.zip
A similar issue was filed earlier as #2340, but its reporter (https://github.com/dev884) did not provide any pointer to his fix, and his account contains no code at all.
Expected Behavior:
I would expect the horizontal coordinates to resemble those in the word-level hOCR output for the same region of the image:
<span class='ocr_line' id='line_1_1' title="bbox 111 1289 133 1532; textangle 90; x_size 28.416666; x_descenders 7.1041665; x_ascenders 7.1041665">
 <span class='ocrx_word' id='word_1_1' title='bbox 112 1470 133 1532; x_wconf 88'>2084</span>
 <span class='ocrx_word' id='word_1_2' title='bbox 124 1451 127 1459; x_wconf 88'>-</span>
 <span class='ocrx_word' id='word_1_3' title='bbox 111 1403 133 1441; x_wconf 96'>2/2</span>
 <span class='ocrx_word' id='word_1_4' title='bbox 112 1289 133 1384; x_wconf 96'>251980</span>
</span>
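As a rough illustration of what the expected box coordinates could look like, the sketch below extracts the word boxes from the hOCR above and converts them from hOCR's top-left origin to the bottom-left origin used in box files. The image height of 3500 pixels is an assumption: it is not stated anywhere in this report and is only inferred from the numbers, since 1968 + 1532 and 2211 + 1289 both equal 3500. The regular expression and helper names are illustrative and not part of tesseract.

```python
import re

# ASSUMPTION: image height in pixels, inferred from the reported numbers only.
ASSUMED_HEIGHT = 3500

# Matches one ocrx_word span and captures x0, y0, x1, y1 and the word text.
WORD_RE = re.compile(
    r"class='ocrx_word'[^>]*title='bbox (\d+) (\d+) (\d+) (\d+)[^']*'>([^<]*)<"
)

HOCR = """\
<span class='ocr_line' id='line_1_1' title="bbox 111 1289 133 1532; textangle 90">
 <span class='ocrx_word' id='word_1_1' title='bbox 112 1470 133 1532; x_wconf 88'>2084</span>
 <span class='ocrx_word' id='word_1_2' title='bbox 124 1451 127 1459; x_wconf 88'>-</span>
 <span class='ocrx_word' id='word_1_3' title='bbox 111 1403 133 1441; x_wconf 96'>2/2</span>
 <span class='ocrx_word' id='word_1_4' title='bbox 112 1289 133 1384; x_wconf 96'>251980</span>
</span>
"""


def hocr_words_as_boxes(hocr: str, height: int):
    """Return (text, left, bottom, right, top) with a bottom-left origin."""
    result = []
    for x0, y0, x1, y1, text in WORD_RE.findall(hocr):
        x0, y0, x1, y1 = int(x0), int(y0), int(x1), int(y1)
        # hOCR measures y downwards from the top edge; box files measure
        # y upwards from the bottom edge, so the two y values swap roles.
        result.append((text, x0, height - y1, x1, height - y0))
    return result


flipped = hocr_words_as_boxes(HOCR, ASSUMED_HEIGHT)
for text, left, bottom, right, top in flipped:
    print(text, left, bottom, right, top)
```

With that assumed height, the converted vertical range of each word (for example 1968 to 2030 for "2084") coincides with the nonzero values makebox reports for its characters, while the horizontal range (roughly 111 to 133) is what is missing from the makebox output.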
Suggested Fix:
I have not thought of one yet. I don't know whether the workaround from the earlier issue could be made watertight.