Textline Box Files Tesseract 4.0 bad wording? #2357

Closed
banderlog opened this issue Mar 27, 2019 · 9 comments

Comments

@banderlog

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 states that:

the required format is still the tiff/box file pair, except that the boxes only need to cover a textline instead of individual characters.

https://github.com/tesseract-ocr/tesseract/wiki/Making-Box-Files---4.0 states pretty much the same:

The required format for LSTM 4.0alpha is still the tiff/box file pair, except that the boxes only need to cover a textline instead of individual characters. 'Newline' boxes with tab as the character must be inserted between textlines to indicate the end-of-line.

But the example box file has individual character bboxes:

T 112 4663 140 4696 0
e 140 4662 160 4686 0
s 163 4662 179 4686 0
s 182 4661 198 4686 0
e 200 4661 220 4685 0
r 221 4662 238 4685 0
a 239 4661 260 4685 0
c 261 4661 281 4685 0
t 281 4661 296 4691 0
  296 4661 311 4696 0
O 311 4661 344 4696 0
C 347 4661 377 4696 0
R 378 4661 414 4695 0
     414 4694 415 4695 0

Does this mean that we need to create character bboxes AND textline bboxes?
If so, the wording should be something like:
"the required format is still the tiff/box file pair, except that the boxes need to cover a textline in addition to individual characters."

Or example box file is wrong?

@amitdo
Collaborator

amitdo commented Mar 27, 2019

LSTM training does not really need individual character coordinates.

For each char, you can give coordinates of its entire line.
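
To illustrate the point about line coordinates, here is a minimal Python sketch (not part of Tesseract) that takes per-character box entries and gives every character the bounding box of its whole text line. It assumes the `symbol left bottom right top page` layout shown in this thread and that each text line ends with a tab entry. The helper names `parse_box_line` and `widen_to_textlines` are made up for illustration, and leaving the tab entry's coordinates untouched is also an assumption.

```python
from typing import List, Tuple

# (symbol, left, bottom, right, top, page); the symbol may be " " or "\t".
Box = Tuple[str, int, int, int, int, int]


def parse_box_line(line: str) -> Box:
    """Split one box entry from the right so a space or tab symbol survives."""
    symbol, left, bottom, right, top, page = line.rstrip("\r\n").rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top), int(page)


def widen_to_textlines(boxes: List[Box]) -> List[Box]:
    """Give each character the union box of its text line.

    Assumption: a tab entry closes a text line and is copied unchanged.
    """
    widened: List[Box] = []
    pending: List[Box] = []

    def flush() -> None:
        if not pending:
            return
        left = min(b[1] for b in pending)
        bottom = min(b[2] for b in pending)
        right = max(b[3] for b in pending)
        top = max(b[4] for b in pending)
        widened.extend((b[0], left, bottom, right, top, b[5]) for b in pending)
        pending.clear()

    for box in boxes:
        if box[0] == "\t":
            flush()
            widened.append(box)
        else:
            pending.append(box)
    flush()  # handle a final line that has no tab entry
    return widened


sample = ["T 112 4663 140 4696 0", "e 140 4662 160 4686 0", "\t 160 4662 161 4663 0"]
for entry in widen_to_textlines([parse_box_line(s) for s in sample]):
    print(*entry)
```

Splitting from the right with `rsplit(" ", 5)` is a deliberate choice: the symbol field itself can be a space or a tab, so a plain `split()` would lose it.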

@amitdo
Collaborator

amitdo commented Mar 27, 2019

Or example box file is wrong?

Well, it's not wrong. Tesseract will accept it.

@Shreeshrii
Collaborator

Shreeshrii commented Mar 27, 2019

Multiple formats of box files are accepted by Tesseract4 for LSTM training, though they are different from the one used by Tesseract 3.

  • Generated by text2image from font files and training text - each char has its own coordinates.
I 114 4655 120 4691 0
n 127 4655 150 4682 0
f 152 4655 169 4692 0
o 168 4654 193 4682 0
r 197 4654 213 4681 0
m 214 4654 250 4681 0
a 255 4654 280 4681 0
t 282 4654 295 4689 0
i 298 4654 304 4690 0
o 308 4654 333 4681 0
n 337 4654 360 4681 0
  360 4653 378 4691 0
G 378 4653 413 4691 0
r 418 4653 434 4680 0
o 434 4653 459 4680 0
u 463 4653 486 4679 0
p 491 4643 515 4680 0
s 517 4653 540 4680 0
  540 4653 555 4690 0
  • Generated by tesseract using lstmbox config from image files - each char uses coordinates of its entire line.
I 114 4640 1912 4692 0
n 114 4640 1912 4692 0
f 114 4640 1912 4692 0
o 114 4640 1912 4692 0
r 114 4640 1912 4692 0
m 114 4640 1912 4692 0
a 114 4640 1912 4692 0
t 114 4640 1912 4692 0
i 114 4640 1912 4692 0
o 114 4640 1912 4692 0
n 114 4640 1912 4692 0
  114 4640 1912 4692 0
G 114 4640 1912 4692 0
r 114 4640 1912 4692 0
o 114 4640 1912 4692 0
u 114 4640 1912 4692 0
p 114 4640 1912 4692 0
s 114 4640 1912 4692 0
  114 4640 1912 4692 0
  • Generated by tesseract using wordstrbox config from image files - uses one WordStr entry with the text of the whole line.
WordStr 114 4640 1907 4692 0 #Information Groups for public OPTIONAL, jaundice Proterozoic Have LOCATION 
	 1908 4640 1912 4692 0
WordStr 112 4544 2015 4592 0 #mixed, Male By TEXT Cove... ¥ INSTABILITY About WERE Crimson THAT HOPKINS 
	 2016 4544 2020 4592 0
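
As a rough aid for telling the three styles above apart, the following Python sketch guesses which one a box file uses. It is a heuristic written for this thread, not something Tesseract provides: `guess_box_style` and both regular expressions are hypothetical, and the rule "all characters of a line share one box means line-wide" is an assumption that cannot decide for files whose lines hold a single character.

```python
import re
from typing import List

# Patterns inferred from the samples in this thread (assumptions, not a spec).
WORDSTR_RE = re.compile(r"^WordStr (\d+) (\d+) (\d+) (\d+) (\d+) #(.*)$")
ENTRY_RE = re.compile(r"^(.+?) (\d+) (\d+) (\d+) (\d+) (\d+)$")


def guess_box_style(lines: List[str]) -> str:
    """Return 'wordstr', 'line-wide', 'per-character' or 'unknown'."""
    textlines = []  # one list of coordinate tuples per text line
    current = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if WORDSTR_RE.match(line):
            return "wordstr"
        match = ENTRY_RE.match(line)
        if match is None:
            continue  # skip blank or unrecognised lines
        symbol, coords = match.group(1), match.groups()[1:5]
        if symbol == "\t":  # tab entry marks end of line
            if current:
                textlines.append(current)
            current = []
        else:
            current.append(coords)
    if current:
        textlines.append(current)

    multi = [t for t in textlines if len(t) > 1]
    if not multi:
        return "unknown"
    if all(len(set(t)) == 1 for t in multi):
        return "line-wide"
    return "per-character"


print(guess_box_style(["I 114 4640 1912 4692 0", "n 114 4640 1912 4692 0"]))
```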

Please note that box files generated using makebox config file are OK for Tesseract3 but not for Tesseract4 LSTM training.

I 114 4654 120 4691 0
n 127 4654 150 4682 0
f 152 4654 169 4692 0
o 168 4654 193 4682 0
r 197 4654 213 4682 0
m 214 4654 250 4682 0
a 230 4653 270 4692 0
t 255 4654 280 4682 0
i 282 4653 304 4691 0
o 308 4653 333 4681 0
n 337 4653 360 4681 0
G 378 4653 413 4691 0
r 395 4643 435 4691 0
o 418 4653 434 4681 0
u 434 4653 459 4681 0
p 463 4653 486 4680 0
s 491 4643 540 4681 0

The attached zip file has a sample tif file and the different types of box files for it, so that it is easy to see the additional line with a TAB character used to mark EOL in the box files for Tesseract4.

boxfiles.zip
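
For a quick check without opening the files by eye, a small Python sketch like the one below counts the kinds of entries in a box file. It relies only on what the samples in this comment show (tab entries for EOL, space entries between words). The function name `count_entry_kinds` is hypothetical, and a count of zero EOL entries is only a hint that the file came from makebox, not proof.

```python
from collections import Counter


def count_entry_kinds(box_text: str) -> Counter:
    """Count 'eol' (tab), 'space' and 'symbol' entries in box file text."""
    kinds = Counter()
    for raw in box_text.split("\n"):
        line = raw.rstrip("\r")
        if not line:
            continue
        if line.startswith("\t "):
            kinds["eol"] += 1      # tab symbol followed by the separator
        elif line.startswith("  "):
            kinds["space"] += 1    # space symbol followed by the separator
        else:
            kinds["symbol"] += 1
    return kinds


makebox_like = "I 114 4654 120 4691 0\nn 127 4654 150 4682 0\n"
kinds = count_entry_kinds(makebox_like)
if kinds["eol"] == 0:
    print("no EOL tab entries found; probably not usable for LSTM training")
```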

@Shreeshrii
Collaborator

Shreeshrii commented Mar 27, 2019

WordStr box files

Create phototest-wordstr.box from phototest.tif:

cd tesseract/test/testing
tesseract phototest.tif phototest-wordstr  -l eng --psm 6 wordstrbox

Review and edit phototest-wordstr.box to have the correct text for each line
Save as phototest.box
Use phototest.tif and phototest.box to create phototest.lstmf files

In case a groundtruth text file is available for the image, you can try to automate the edit process.
This will work if groundtruth textlines match image lines.

Remove the OCRed text from the box file
Delete any blank lines from the ground truth file
Add a blank line after every textline in the ground truth file
Paste both files together

sed -i -e 's/\([0-9] \#\).*$/\1/g'  phototest-wordstr.box
sed '/^$/d' phototest.gold.txt > phototest-gt.txt
sed -i -e 's/$/\n/g' phototest-gt.txt
paste --delimiters="\0"  phototest-wordstr.box  phototest-gt.txt > phototest.box

Review phototest.box to make sure that the lines match.
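
For anyone who prefers not to rely on GNU sed and paste, the same merge can be sketched in Python. This is an illustrative equivalent rather than a tested replacement for the commands above: `merge_ground_truth` is a hypothetical name, it assumes every WordStr entry carries its text after ` #`, and unlike the shell version it refuses to run when the number of text lines and WordStr entries differ.

```python
from typing import List


def merge_ground_truth(box_lines: List[str], gt_lines: List[str]) -> List[str]:
    """Replace the OCRed text of each WordStr entry with ground truth text."""
    # Drop empty lines from the ground truth, like sed '/^$/d'.
    truth = [t.rstrip("\r\n") for t in gt_lines if t.rstrip("\r\n")]
    wordstr_count = sum(1 for b in box_lines if b.startswith("WordStr "))
    if wordstr_count != len(truth):
        raise ValueError(
            f"{wordstr_count} WordStr entries but {len(truth)} ground truth lines"
        )

    truth_iter = iter(truth)
    merged = []
    for raw in box_lines:
        line = raw.rstrip("\r\n")
        if line.startswith("WordStr "):
            prefix = line.split("#", 1)[0]  # keep 'WordStr l b r t page '
            merged.append(prefix + "#" + next(truth_iter))
        else:
            merged.append(line)  # tab (EOL) entries pass through unchanged
    return merged


box = ["WordStr 114 4640 1907 4692 0 #Infomation Grups \n", "\t 1908 4640 1912 4692 0\n"]
print("\n".join(merge_ground_truth(box, ["Information Groups\n", "\n"])))
```

The count check is the main design difference from the shell pipeline: a mismatch there silently shifts every following line, so the merged file should still be reviewed by hand.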

@banderlog
Author

@Shreeshrii many thanks for the great and detailed explanation!

I'll put some changes into the wiki to clarify this question.

@Shreeshrii
Collaborator

The lstmbox and wordstrbox options have been added recently. Please try them out with your image files.

Thank you for changing the wiki to clarify this.

@jbarth-ubhd

»WordStr« format: the lines beginning with a tab must have 1 space character before the first digit.
If not, you get »Encoding of string failed!«
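
A hand-edited box file can be checked for this before training with a few lines of Python. The sketch below is only an assumption-based lint: `malformed_eol_lines` and the pattern `EOL_ENTRY_RE` are made up here and encode nothing more than "tab, one space, five integers" as seen in the samples in this thread.

```python
import re
from typing import List

# Tab, exactly one space, then left bottom right top page.
EOL_ENTRY_RE = re.compile(r"^\t \d+ \d+ \d+ \d+ \d+$")


def malformed_eol_lines(box_lines: List[str]) -> List[int]:
    """Return 1-based numbers of tab-led lines that do not match the pattern."""
    bad = []
    for number, raw in enumerate(box_lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("\t") and not EOL_ENTRY_RE.match(line):
            bad.append(number)
    return bad


lines = [
    "WordStr 114 4640 1907 4692 0 #Information Groups",
    "\t1908 4640 1912 4692 0",   # space after the tab is missing
    "\t 1908 4640 1912 4692 0",  # well-formed
]
print(malformed_eol_lines(lines))
```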

@amitdo
Collaborator

amitdo commented Oct 20, 2020

@kbrajwani,

Please use the forum for asking questions.

@TTnTTT

TTnTTT commented May 28, 2021

Thanks!
