tesstrain.sh doesn't support vertical languages #2989

Closed
davidb1 opened this issue May 22, 2020 · 8 comments

@davidb1

davidb1 commented May 22, 2020

Environment

  • Tesseract Version:
tesseract 5.0.0-alpha-685-g3a3c4
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
  • Platform: Ubuntu 18.04

Current Behavior:

When a _vert language is passed to tesstrain.sh via --lang, for example

tesstrain.sh --lang lang_vert

it throws an error:

ERROR: Error: lang_vert is not a valid language code

as per this line

*) err_exit "Error: ${lang} is not a valid language code"

Expected Behavior:

Either throw a more specific error, e.g. ERROR: Error: vertical languages aren't supported,
or add a config to generate data for vertical languages.

Suggested Fix:

Add a case for _vert languages in https://github.com/tesseract-ocr/tesseract/blob/d8d2f6f48a8ddaf0b668eb1abf18fd6d08470041/src/training/language-specific.sh
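
A minimal sketch of what such a case could look like. The function name `check_lang_code` and the short list of accepted codes are hypothetical and only illustrate the pattern; the real language-specific.sh covers many more languages and reports failures through err_exit.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: report a specific error for *_vert language codes.
# check_lang_code and the code list below are illustrative, not upstream code.
check_lang_code() {
    local lang="$1"
    case "${lang}" in
        *_vert)
            # Matched first so that jpn_vert etc. never reach the generic branch.
            echo "Error: vertical languages aren't supported (${lang})" >&2
            return 2 ;;
        eng | jpn | kor | chi_tra | chi_sim)
            return 0 ;;
        *)
            echo "Error: ${lang} is not a valid language code" >&2
            return 1 ;;
    esac
}
```

Placing the `*_vert` pattern before the catch-all `*)` is what lets the script tell "unsupported vertical variant" apart from "unknown language code".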

@Shreeshrii
Collaborator

Vertical languages seem to be supported indirectly based on font names.

Please see:

# The following fonts will be rendered vertically in phase I.
VERTICAL_FONTS=( \
    "TakaoExGothic" \ # for jpn
    "TakaoExMincho" \ # for jpn
    "AR PL UKai Patched" \ # for chi_tra
    "AR PL UMing Patched Light" \ # for chi_tra
    "Baekmuk Batang Patched" \ # for kor
)

and

# add --writing_mode=vertical-upright to common_args if the font is
# specified to be rendered vertically.
for vfont in "${VERTICAL_FONTS[@]}"; do
    if [[ "${font}" == "${vfont}" ]]; then
        common_args+=" --writing_mode=vertical-upright "
        break
    fi
done

Try adding your font to both the vertical fonts list and the language fonts list, then try again.
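
As a rough, self-contained illustration of that suggestion: the sketch below appends a font to both lists and reuses the matching logic from the loop quoted above. The font name "Example Vertical Font", the array name `LANG_FONTS`, and the helper `writing_mode_arg` are assumptions made for this example and do not come from the training scripts.

```shell
#!/usr/bin/env bash
# Hypothetical font name, used only for illustration.
NEW_FONT="Example Vertical Font"

# Stand-in for the per-language fonts list (the name LANG_FONTS is an assumption).
LANG_FONTS=("TakaoExGothic" "${NEW_FONT}")

# Vertical fonts list with the new font appended.
VERTICAL_FONTS=("TakaoExGothic" "TakaoExMincho" "${NEW_FONT}")

# Mirrors the quoted loop: prints the extra argument if the font is listed
# in VERTICAL_FONTS, and prints nothing otherwise.
writing_mode_arg() {
    local font="$1" vfont
    for vfont in "${VERTICAL_FONTS[@]}"; do
        if [[ "${font}" == "${vfont}" ]]; then
            echo "--writing_mode=vertical-upright"
            return 0
        fi
    done
    return 0
}

# Show which of the language's fonts would be rendered vertically.
for font in "${LANG_FONTS[@]}"; do
    echo "${font}: $(writing_mode_arg "${font}")"
done
```

A font that appears only in the language list would be rendered horizontally, and one that appears only in the vertical list would never be rendered at all, which is why both lists need the entry.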

@davidb1
Author

davidb1 commented Nov 2, 2020

@Shreeshrii so are the _vert languages made just by training on vertical fonts, or are there additional steps?

@Shreeshrii
Collaborator

Shreeshrii commented Nov 2, 2020

I have not trained any CJK languages or any other scripts requiring vertical fonts. I just pointed out what I found by searching for "vert" in the training script.

I suggest you give it a try. If you have more vertical fonts, they need to be added to both lists.

@Shreeshrii
Collaborator

You can try contacting @zodiac3539 for pointers, see https://github.com/zodiac3539/jpn_vert

@davidb1
Author

davidb1 commented Nov 5, 2020

> You can try contacting @zodiac3539 for pointers, see https://github.com/zodiac3539/jpn_vert

I actually did try a couple months back but he doesn't wanna part with his secrets :)

@kamui-fin

kamui-fin commented Nov 12, 2020

Having the same issue here. I tried adding the font to the vertical font list, but all I get is:

Warning in pixScaleSmooth: ridiculously small scaling factor 0.010464
Image too small to scale!! (1x1 vs min width of 3)
Line cannot be recognized!!
Image not trainable

Is it possible to train vertical languages? How was the jpn_vert.traineddata file in the tessdata_best repo made?

@Shreeshrii
Collaborator

Please see comment by Ray at #707 (comment)

So it's possible that the current code is using layout analysis for vertical text rather than a separate language.

@Shreeshrii
Collaborator

Fixed via PR #3223

See #3001 for discussion
