Tesseract 4.0 Multilanguage issue: kor+eng and eng+kor #2639

Closed
whatohyou opened this issue Sep 5, 2019 · 10 comments

@whatohyou

T1 (test image; it contains the following line)
Tesseract는 multi-language를 recognise하지 못합니다. (English: "Tesseract cannot recognise multi-language.")

language setting = "kor+eng"
result = ㅠ65563아는 ㅁ새-3094396를 『60090156하지 못합니다.

language setting = "eng+kor"
result = Tesseract= multi-languageS recognised =

Suggested fix: Tesseract seems to work poorly when several traineddata files are combined. It needs a single traineddata file that covers languages that are often used together (e.g. Korean + English + Traditional Chinese).

@Shreeshrii
Collaborator

Please provide output of

tesseract -v

Which traineddata files are you using? tessdata_fast, tessdata_best, tessdata?

Please try with HanS, HanT and Hangul traineddata from https://github.com/tesseract-ocr/tessdata_best/tree/master/script

@whatohyou
Author

whatohyou commented Sep 5, 2019

I'm using Tesseract through a C# NuGet package (cosine tesseract). Its description says version 4.0.0.17.
The traineddata files come from tessdata_best and tessdata.

For Hangul traineddata:(language = "Hangul";)
Result = Tesseract詞 multi-language言 recogrises和| 実計日員|
For Hangul traineddata + eng best traineddata:(language = "Hangul+eng";)
Result = Tesseract詞 multi-language吊oooaコのop中
For Hangul traineddata + eng best traineddata:(language = "eng+Hangul";)
Result = Tesseract詞 multi-language吊oooaコのop中

For HanS traineddata:(language = "HanS";)
Result = Tesseract詞 multi-language吊oooaコのop中

For HanT traineddata:(language = "HanT";)
Result = Tesseract詞 multi-language言 recogrises和| 実計日員|

Conclusion: None of Hangul, HanS, or HanT was able to recognise the Korean characters. It seems those data files are trained only for Chinese characters (such as 詞), Japanese characters (such as の), and English.

@zdenop
Contributor

zdenop commented Sep 6, 2019

Please use this issue tracker only for issues related to the tesseract executable or to documented usage of the C or C++ API (we do not support problems with tesseract wrappers).

@whatohyou
Author

I installed the latest Tesseract version, 4.01, today.

For Hangul traineddata:
Result = Tesseract multi-language 를 recognise 하 지 못합니다.
For HanS traineddata:
Result = Tesseract 三 multi-language 旱 recognise 引 | 吴 土 LICH
For HanT traineddata:
Result = Tesseract 三 mult-language 告 recognise5t 入 | 史 竺 刷

Conclusion: The Hangul traineddata came closest to the correct result.
100% correct result = Tesseract는 multi-language를 recognise하지 못합니다.
Hangul traineddata = Tesseract multi-language 를 recognise 하 지 못합니다.

The Hangul traineddata dropped "는", and there were spacing issues after the English letters.
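
To put a number on "closest", the three outputs above can be scored against the expected line with a character error rate (edit distance divided by the length of the expected text). This is only a minimal sketch: the helper names `levenshtein` and `char_error_rate` are made up for illustration, and the strings are copied from the results in this comment.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(expected: str, actual: str) -> float:
    """Edit distance normalised by the length of the expected text."""
    return levenshtein(expected, actual) / max(len(expected), 1)


expected = "Tesseract는 multi-language를 recognise하지 못합니다."
results = {
    "Hangul": "Tesseract multi-language 를 recognise 하 지 못합니다.",
    "HanS": "Tesseract 三 multi-language 旱 recognise 引 | 吴 土 LICH",
    "HanT": "Tesseract 三 mult-language 告 recognise5t 入 | 史 竺 刷",
}

# Print the candidates from lowest to highest error rate.
for name in sorted(results, key=lambda n: char_error_rate(expected, results[n])):
    print(f"{name}: CER = {char_error_rate(expected, results[name]):.2f}")
```

A lower value means fewer character edits are needed to reach the expected line, so spurious spaces and dropped particles both count against a result.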

@whatohyou
Author

With the kor+eng best traineddata in Tesseract 4.01:
Result = Tesseract= multi-language 를 [60090156 하 지 못합니다.

The result was the same for eng+kor and kor+eng.
There is also an over-sensitive spacing glitch where the script changes: after the English characters, 하지 is recognised as 하 지.

@wrznr

wrznr commented Oct 10, 2019

I can confirm the results by @whatohyou. It is interesting to note that the results highly depend on the selected page segmentation mode:

$ tesseract /tmp/test.jpg - -l kor+eng
ㅠ65563아는 ㅁ새-3094396를 『60090156하지 못합니다.
$ tesseract /tmp/test.jpg - -l eng+kor
Tesseract= multi-languageS recognised = ELICH
$ tesseract /tmp/test.jpg - -l kor+eng --psm 7
Tesseract= multi-language를 [60090156하지 못합니다.
$ tesseract /tmp/test.jpg - -l eng+kor --psm 7
Tesseract= multi-language 를 [60090156 하 지 못합니다.

Could it be the case that script detection and page segmentation are the source of the inconsistencies?
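
For anyone who wants to repeat this comparison on their own image, here is a rough sketch that builds the same four command lines (every language order, with and without `--psm 7`) and runs them if a `tesseract` binary is on the PATH. It uses only the flags shown above; the function names `build_commands` and `run_all` are hypothetical, and `/tmp/test.jpg` is just the path from the transcript.

```python
import itertools
import shutil
import subprocess


def build_commands(image, languages, psm_modes):
    """One tesseract command per (language order, psm) pair.
    A psm of None means: do not pass --psm, keep the default."""
    commands = []
    for order in itertools.permutations(languages):
        for psm in psm_modes:
            cmd = ["tesseract", image, "-", "-l", "+".join(order)]
            if psm is not None:
                cmd += ["--psm", str(psm)]
            commands.append(cmd)
    return commands


def run_all(commands):
    """Run each command and map its printable form to the OCR text on stdout."""
    outputs = {}
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              encoding="utf-8", errors="replace")
        outputs[" ".join(cmd)] = proc.stdout.strip()
    return outputs


if __name__ == "__main__":
    cmds = build_commands("/tmp/test.jpg", ["kor", "eng"], [None, 7])
    if shutil.which("tesseract"):
        for shown, text in run_all(cmds).items():
            print(f"$ {shown}\n{text}\n")
    else:
        # No tesseract installed: just show what would be run.
        for cmd in cmds:
            print(" ".join(cmd))
```

Printing the outputs side by side makes it easy to see whether a difference comes from the language order, the segmentation mode, or both.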

@Shreeshrii
Collaborator

@wrznr Page segmentation as a source of inconsistencies has been reported in other issues also.

Came across another example today, though this is just with a single language.

num-comma

$ tesseract -v
tesseract 5.0.0-alpha-479-g247c
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0

$ tesseract num-comma.png - --oem 1 --psm 3 -l eng --tessdata-dir ~/tessdata_best
1092000
924000

504,000
546,000
546,000
882,000
588,000
546,000
462,000

1,092,000
924,000

000,504,000
000,546,000
000,546,000
000,882,000
000,588,000
000,546,000
000,462,000
001,092,000
000,924,000

504,000.00
546,000.00
546,000.00
882,000.00
588,000.00
546,000.00
462,000.00
1,092,000.00
924,000.00

$ tesseract num-comma.png - --oem 1 --psm 6 -l eng --tessdata-dir ~/tessdata_best
1092000 1,092,000 001,092,000 1,092,000.00
924000 924,000 000,924,000 924,000.00

psm 6 recognizes only the last two lines of the image, whereas psm 3 starts with the last two lines in column 1 but recognizes every line in the other columns. This may be related to the fact that the second-to-last line, which is the first one to be recognized, is indented to the left.

@Shreeshrii
Collaborator

Same result with all 3 traineddata repos and oems.

ubuntu@tesseract-ocr:~/TEST$ tesseract num-comma.png - --oem 1 --psm 6 -l eng --tessdata-dir ~/tessdata_best
1092000 1,092,000 001,092,000 1,092,000.00
924000 924,000 000,924,000 924,000.00
ubuntu@tesseract-ocr:~/TEST$ tesseract num-comma.png - --oem 1 --psm 6 -l eng --tessdata-dir ~/tessdata_fast
1092000 1,092,000 001,092,000  1,092,000.00
924000 924,000 000,924,000 924,000.00
ubuntu@tesseract-ocr:~/TEST$ tesseract num-comma.png - --oem 1 --psm 6 -l eng --tessdata-dir ~/tessdata
1092000 1,092,000 001,092,000 1,092,000.00
924000 924,000 000,924,000 924,000.00
ubuntu@tesseract-ocr:~/TEST$ tesseract num-comma.png - --oem 0 --psm 6 -l eng --tessdata-dir ~/tessdata
1092000 1,092,000 001,092,000 1,092,000.00
924000 924,000 000,924,000 924,000.00
ubuntu@tesseract-ocr:~/TEST$ tesseract num-comma.png - --oem 2 --psm 6 -l eng --tessdata-dir ~/tessdata
1092000 1,092,000 001,092,000 1,092,000.00
924000 924,000 000,924,000 924,000.00

@ToelgeKilian

ToelgeKilian commented Feb 2, 2021

Have there been any updates on this issue? I also see that the order matters a lot when providing multiple languages to Tesseract.

In all the other issues about multi-language problems, I have seen the discussion drift the same way it did here.
There clearly seems to be a problem with the order in which languages are passed to Tesseract. How do I know which order is best without manually checking the results against the image? (That rather defeats the purpose of doing OCR in the first place.)

I am using version v5.0.0-alpha (2020-03-28 version for Windows from the UB Mannheim)
PSM=7 (Single Line)
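
One possible heuristic, offered as an untested assumption rather than a documented answer: run each language order with Tesseract's `tsv` output config (for example `tesseract image.png - -l kor+eng --psm 7 tsv`) and keep the order whose words have the highest mean value in the `conf` column. Higher engine confidence is not guaranteed to mean a more accurate result. The sketch below only parses TSV text that has already been captured; `mean_word_confidence` and `pick_best_order` are hypothetical helper names.

```python
import csv
import io


def mean_word_confidence(tsv_text):
    """Average the 'conf' column over rows that carry recognised text.
    Structural rows (page, block, line) report conf = -1 and are skipped."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    confs = []
    for row in reader:
        try:
            conf = float(row["conf"])
        except (KeyError, TypeError, ValueError):
            continue  # malformed or truncated row
        if conf >= 0 and (row.get("text") or "").strip():
            confs.append(conf)
    return sum(confs) / len(confs) if confs else 0.0


def pick_best_order(tsv_by_order):
    """Given {language order: TSV output}, return the order with the
    highest mean word confidence (heuristic only)."""
    return max(tsv_by_order,
               key=lambda order: mean_word_confidence(tsv_by_order[order]))
```

The dictionary passed to `pick_best_order` would map strings such as `"kor+eng"` and `"eng+kor"` to the TSV text each run printed; the result is only as trustworthy as the engine's own confidence estimates.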

@ToelgeKilian

I did some more testing today. It seems to me that multi-language recognition is not supported by the LSTM engine. Is that correct?
The legacy engine returns much better results for the different languages, but sadly makes more spacing and noise errors.

Is it possible to combine the advantages of both and enable the LSTM engine to read multi-language input?
