Tesseract 4.0 Multilanguage issue: kor+eng and eng+kor #2639
Comments
Please provide the output of tesseract -v. Which traineddata files are you using: tessdata_fast, tessdata_best, or tessdata? Please try with the HanS, HanT and Hangul traineddata from https://github.com/tesseract-ocr/tessdata_best/tree/master/script
The Tesseract version that I'm using is the C# NuGet package of cosine tesseract. Its description says 4.0.0.17. For Hangul traineddata: (language = "Hangul";) For HanS traineddata: (language = "HanS";) For HanT traineddata: (language = "HanT";) Conclusion: None of Hangul, HanS, or HanT was able to recognise Korean characters. It seems those data files are trained only for Chinese characters (such as 詞), Japanese characters (such as の), and English.
Please use this issue tracker only for issues related to the tesseract executable or documented usage of the C or C++ API (we do not support problems with tesseract wrappers).
I installed the latest tesseract 4.01 version today. For Hangul traineddata: the Hangul traineddata dropped "는" and there were spacing issues after the English letters.
With the kor+eng tessdata_best traineddata in tesseract 4.01: the result was the same for eng+kor and kor+eng.
I can confirm the results by @whatohyou. It is interesting to note that the results depend heavily on the selected page segmentation mode:
$ tesseract /tmp/test.jpg - -l kor+eng
ㅠ65563아는 ㅁ새-3094396를 『60090156하지 못합니다.
$ tesseract /tmp/test.jpg - -l eng+kor
Tesseract= multi-languageS recognised = ELICH
$ tesseract /tmp/test.jpg - -l kor+eng --psm 7
Tesseract= multi-language를 [60090156하지 못합니다.
$ tesseract /tmp/test.jpg - -l eng+kor --psm 7
Tesseract= multi-language 를 [60090156 하 지 못합니다.
Could it be the case that script detection and page segmentation are the source of the inconsistencies?
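To make a comparison like this repeatable, the runs can be scripted. The following is a minimal sketch and not part of the original report: it assumes the `tesseract` executable is on `PATH`, and the helper names `build_commands` and `run_all` are hypothetical. It uses only the `-l` and `--psm` options that appear in the commands in this thread.

```python
import itertools
import subprocess


def build_commands(image, languages, psms):
    """Build one tesseract command per (language order, PSM) combination.

    A PSM of None means "omit --psm", so tesseract uses its default mode.
    """
    commands = []
    for order in itertools.permutations(languages):
        for psm in psms:
            # "-" as the output base sends the recognised text to stdout.
            cmd = ["tesseract", image, "-", "-l", "+".join(order)]
            if psm is not None:
                cmd += ["--psm", str(psm)]
            commands.append(cmd)
    return commands


def run_all(commands):
    """Run each command and return {command string: recognised text}."""
    results = {}
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[" ".join(cmd)] = proc.stdout.strip()
    return results
```

Calling `run_all(build_commands("/tmp/test.jpg", ["kor", "eng"], [None, 7]))` would cover the four combinations tried in this comment, which makes it easier to see whether the language order or the page segmentation mode changes the output.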
@wrznr Page segmentation as a source of inconsistencies has been reported in other issues also. Came across another example today, though this is just with a single language.
Same result with all 3 traineddata repos and OEMs.
Have there been any updates on this issue? I also see that order matters a lot when providing multiple languages to Tesseract. In all the other issues related to multi-language problems, I have seen the discussion drift as it did here. I am using version v5.0.0-alpha (the 2020-03-28 build for Windows from UB Mannheim).
I did some more testing today. It seems to me that multi-language recognition is not supported by the LSTM engine; is that correct? Is it possible to combine the advantages of both worlds and enable the LSTM engine to read multi-language input?
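One way to check this is to run the same image and language string through both engines. The sketch below is an assumption-laden illustration, not a confirmed diagnosis: the helper name `engine_comparison_commands` and the tessdata directory are hypothetical. It relies on the `--oem` option (0 for the legacy engine, 1 for LSTM) and on `--tessdata-dir`. Note that the legacy engine needs traineddata files that contain legacy models, such as those in the `tessdata` repository; `tessdata_best` and `tessdata_fast` are LSTM-only.

```python
# Engine modes accepted by tesseract's --oem option.
OEM_LEGACY = 0
OEM_LSTM = 1


def engine_comparison_commands(image, lang, tessdata_dir):
    """Return {engine name: command} for the same image and language string."""
    commands = {}
    for name, oem in (("legacy", OEM_LEGACY), ("lstm", OEM_LSTM)):
        commands[name] = [
            "tesseract", image, "-",          # "-" writes the text to stdout
            "--tessdata-dir", tessdata_dir,   # hypothetical path, supplied by the caller
            "-l", lang,
            "--oem", str(oem),
        ]
    return commands
```

Running both commands for `kor+eng` and again for `eng+kor` would show whether the order sensitivity is specific to one engine.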
(Tesseract는 multi-language를 recognise하지 못합니다.)
language setting = "kor+eng"
result = ㅠ65563아는 ㅁ새-3094396를 『60090156하지 못합니다.
language setting = "eng+kor"
result = Tesseract= multi-languageS recognised =
Suggested fix: It seems Tesseract works poorly with multiple sets of trained data. It needs a single traineddata file that covers multiple languages that are often used together (e.g. Korean + English + Traditional Chinese).