Please add language information to the TSV output #1861

Open
jsbien opened this issue Aug 24, 2018 · 3 comments
Labels
feature request help wanted output issues related output formats

Comments

@jsbien

jsbien commented Aug 24, 2018

This is actually a four-year-old feature request, first formulated in my comment on issue 918
(https://groups.google.com/forum/#!searchin/tesseract-issues/issue$20918/tesseract-issues/U7g6ntPXYds/aAyTuIFBUhwJ), but I really need it now; cf. https://groups.google.com/forum/#!topic/tesseract-dev/d4QQdaq3LF0.

@Shreeshrii
Collaborator

@zdenop Please label as Feature Request.

@jsbien
Author

jsbien commented Nov 17, 2018

A sample of the proposed output, created manually from the original TSV and hOCR output; the last three columns are not relevant to this issue. This file and the source data are available at https://bitbucket.org/jsbien/linde-info in the TesseractTSVextension directory.

`
level page_num block_num par_num line_num word_num left top width height conf text lang x_font x_fsize emph
1 1 0 0 0 0 0 0 4760 6388 -1
2 1 1 0 0 0 4 4073 11 2256 -1 deu-frak
3 1 1 1 0 0 4 4073 11 2256 -1 deu-frak
4 1 1 1 1 0 4 4073 11 2256 -1 deu-frak
5 1 1 1 1 1 4 4073 11 2256 95 deu-frak Century_Schoolbook_L_Bold 135 strong
2 1 2 0 0 0 207 498 2048 698 -1
3 1 2 1 0 0 207 498 2048 698 -1 pol
4 1 2 1 1 0 892 498 682 65 -1 pol Century_Schoolbook_L_Bold 10 strong
5 1 2 1 1 1 892 499 48 63 85 A pol Century_Schoolbook_L_Bold 10 strong
5 1 2 1 1 2 968 501 45 62 95 L pol Century_Schoolbook_L_Bold 10 strong
5 1 2 1 1 3 1040 499 48 64 87 A pol Century_Schoolbook_L_Bold 10 strong
5 1 2 1 1 4 1114 500 45 62 89 B pol Century_Schoolbook_L_Bold 10 strong
5 1 2 1 1 5 1188 499 50 63 81 A pol Century_Schoolbook_L_Bold 10 strong
5 1 2 1 1 6 1283 536 37 12 88 —- deu-frak Century_Schoolbook_L_Bold 10 strong
5 1 2 1 1 7 1365 498 47 63 87 A pol Times_New_Roman 10
5 1 2 1 1 8 1441 500 43 60 93 L pol Century_Schoolbook_L_Bold 10
5 1 2 1 1 9 1512 499 62 62 90 B. pol Century_Schoolbook_L_Bold 10

`
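For anyone who wants to consume output in the proposed shape, here is a minimal Python sketch of a reader, under a few assumptions: the file is tab-separated, the header row matches the one in the sample, and the trailing columns (`lang`, `x_font`, `x_fsize`, `emph`) may be empty or missing on some rows. The helper names `read_extended_tsv` and `words_by_language` are made up for illustration and are not part of Tesseract. The embedded rows were retyped with explicit tabs, which is my reading of the sample, since the listing above lost its separators.

```python
import csv
import io

# Column names copied from the header row of the sample above.
BASE_COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num",
                "word_num", "left", "top", "width", "height", "conf", "text"]
# Proposed extra columns; these are the feature request, not stock TSV output.
EXTRA_COLUMNS = ["lang", "x_font", "x_fsize", "emph"]


def read_extended_tsv(stream):
    """Yield one dict per data row; missing trailing columns become ''."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad short rows so rows without the optional columns still parse.
        row = row + [""] * (len(header) - len(row))
        yield dict(zip(header, row))


def words_by_language(rows):
    """Group non-empty word texts (level 5) by the proposed 'lang' column."""
    groups = {}
    for r in rows:
        if r["level"] == "5" and r["text"].strip():
            groups.setdefault(r.get("lang", ""), []).append(r["text"])
    return groups


# A few rows from the sample, with tabs restored by hand (an assumption).
SAMPLE = "\t".join(BASE_COLUMNS + EXTRA_COLUMNS) + "\n" + "\n".join([
    "1\t1\t0\t0\t0\t0\t0\t0\t4760\t6388\t-1\t",
    "5\t1\t2\t1\t1\t1\t892\t499\t48\t63\t85\tA\tpol\tCentury_Schoolbook_L_Bold\t10\tstrong",
    "5\t1\t2\t1\t1\t6\t1283\t536\t37\t12\t88\t—-\tdeu-frak\tCentury_Schoolbook_L_Bold\t10\tstrong",
    "5\t1\t2\t1\t1\t7\t1365\t498\t47\t63\t87\tA\tpol\tTimes_New_Roman\t10\t",
]) + "\n"

rows = list(read_extended_tsv(io.StringIO(SAMPLE)))
groups = words_by_language(rows)
print(groups)
```

Padding short rows keeps the reader usable on today's 12-column TSV as well: with no `lang` column in the header, every word simply lands in the group keyed by the empty string.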

@jsbien
Author

jsbien commented Nov 30, 2020

Am I correct that this feature request requires just a few lines of code? Unfortunately, I'm not a programmer. Is any kind soul willing to help? I am starting a new project, and this feature would be very useful right now.

DrDub added a commit to DrDub/tesseract that referenced this issue Dec 13, 2023
Also make the font_info flag work on TSV output.