Tesseract 4.0 Multilanguage issue: kor+eng and eng+kor #2639

whatohyou · 2019-09-05T01:01:33Z

(Tesseract는 multi-language를 recognise하지 못합니다.)
[Text in the test image; in English: "Tesseract cannot recognise multi-language."]

language setting = "kor+eng"
result = ㅠ65563아는 ㅁ새-3094396를 『60090156하지 못합니다.

language setting = "eng+kor"
result = Tesseract= multi-languageS recognised =

Suggested fix: Tesseract seems to work poorly with multiple sets of trained data. It needs a single trained data file that covers multiple languages that are often used together (e.g. Korean + English + Traditional Chinese).

Shreeshrii · 2019-09-05T03:31:53Z

Please provide output of

tesseract -v

Which traineddata files are you using? tessdata_fast, tessdata_best, tessdata?

Please try with HanS, HanT and Hangul traineddata from https://github.com/tesseract-ocr/tessdata_best/tree/master/script
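The details requested above can be gathered in one step. This is a minimal sketch, assuming a `tesseract` binary on the PATH; the helper name `tess_report` and the default directory `$HOME/tessdata_best` are illustrative and not taken from this thread.

```shell
# tess_report is a hypothetical helper; only the tesseract flags are real.
tess_report() {
  # Directory holding the *.traineddata files (assumed default, adjust as needed).
  local dir="${1:-$HOME/tessdata_best}"
  echo "== version =="
  # Some builds print the version to stderr, so merge both streams.
  tesseract -v 2>&1
  echo "== languages found in $dir =="
  tesseract --list-langs --tessdata-dir "$dir" 2>&1
}
```

Calling `tess_report ~/tessdata_best/script` would list the script models (such as HanS, HanT and Hangul) if they have been downloaded into that directory.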

whatohyou · 2019-09-05T21:50:55Z

I'm using Tesseract through a C# NuGet package (cosine tesseract). Its description says version 4.0.0.17.
The trained data is from tessdata_best and tessdata.

For Hangul traineddata (language = "Hangul"):
Result = Tesseract詞 multi-language言 recogrises和| 実計日員|
For Hangul traineddata + eng best traineddata (language = "Hangul+eng"):
Result = Tesseract詞 multi-language吊oooaコのop中
For Hangul traineddata + eng best traineddata (language = "eng+Hangul"):
Result = Tesseract詞 multi-language吊oooaコのop中

For HanS traineddata (language = "HanS"):
Result = Tesseract詞 multi-language吊oooaコのop中

For HanT traineddata (language = "HanT"):
Result = Tesseract詞 multi-language言 recogrises和| 実計日員|

Conclusion: None of Hangul, HanS, or HanT was able to recognise Korean characters. It seems those data files are trained only for Chinese characters (such as 詞), Japanese characters (such as の), and English.

zdenop · 2019-09-06T05:54:29Z

Please use this issue tracker only for issues related to the tesseract executable or documented usage of the C or C++ API (we do not support problems with tesseract wrappers).

whatohyou · 2019-09-07T01:07:51Z

I installed the latest Tesseract version, 4.01, today.

For Hangul traineddata:
Result = Tesseract multi-language 를 recognise 하 지 못합니다.
For HanS traineddata:
Result = Tesseract 三 multi-language 旱 recognise 引 | 吴土 LICH
For HanT traineddata:
Result = Tesseract 三 mult-language 告 recognise5t 入 | 史竺刷

Conclusion: Hangul traineddata was the closest one to get the correct result.
100% correct result = Tesseract는 multi-language를 recognise하지 못합니다.
Hangul traineddata = Tesseract multi-language 를 recognise 하 지 못합니다.

The Hangul traineddata dropped "는", and there were spacing issues after the English letters.
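To separate the spacing problem from genuinely missing characters, the two strings can be compared with spaces removed. A small sketch using only standard tools; `strip_spaces` is a hypothetical helper name, and the strings are copied from the comparison above.

```shell
# Strings copied from the comparison above.
expected='Tesseract는 multi-language를 recognise하지 못합니다.'
hangul='Tesseract multi-language 를 recognise 하 지 못합니다.'

# strip_spaces (hypothetical helper): delete ASCII spaces only.
strip_spaces() { printf '%s' "$1" | tr -d ' '; }

if [ "$(strip_spaces "$expected")" = "$(strip_spaces "$hangul")" ]; then
  echo "only spacing differs"
else
  echo "characters differ even after removing spaces"
fi
```

For these two strings the second branch is taken, because the dropped "는" remains a difference once the spaces are gone.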

whatohyou · 2019-09-07T01:17:39Z

With the kor+eng best traineddata setting in Tesseract 4.01:
Result = Tesseract= multi-language 를 [60090156 하 지 못합니다.

The result was the same for eng+kor and kor+eng.
The over-sensitive spacing glitch at script boundaries is still there: 하지 after the English characters is again recognised as 하 지.

wrznr · 2019-10-10T09:25:13Z

I can confirm the results by @whatohyou. It is interesting to note that the results highly depend on the selected page segmentation mode:

$ tesseract /tmp/test.jpg - -l kor+eng
ㅠ65563아는 ㅁ새-3094396를 『60090156하지 못합니다.
$ tesseract /tmp/test.jpg - -l eng+kor
Tesseract= multi-languageS recognised = ELICH
$ tesseract /tmp/test.jpg - -l kor+eng --psm 7
Tesseract= multi-language를 [60090156하지 못합니다.
$ tesseract /tmp/test.jpg - -l eng+kor --psm 7
Tesseract= multi-language 를 [60090156 하 지 못합니다.

Could it be the case that script detection and page segmentation are the source of the inconsistencies?
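The four runs above can be reproduced as a small sweep, which makes it easier to see which combination changes the output. This is a sketch, assuming `tesseract` is on the PATH; `ocr_sweep` is a hypothetical helper name. The runs without `--psm` above correspond to the default mode, `--psm 3`.

```shell
# ocr_sweep (hypothetical helper): run every language order with every
# page segmentation mode and label each result.
ocr_sweep() {
  local image="$1" langs psm
  for langs in kor+eng eng+kor; do
    for psm in 3 7; do
      echo "--- -l $langs --psm $psm"
      # Discard diagnostics on stderr so only the recognised text is shown.
      tesseract "$image" - -l "$langs" --psm "$psm" 2>/dev/null
    done
  done
}
```

For example, `ocr_sweep /tmp/test.jpg` prints the four results one after another, each under a header line naming its settings.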

Shreeshrii · 2019-10-19T04:37:19Z

@wrznr Page segmentation as a source of inconsistencies has also been reported in other issues.

I came across another example today, though this one involves only a single language.

$ tesseract -v
tesseract 5.0.0-alpha-479-g247c
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0

$ tesseract num-comma.png - --oem 1 --psm 3 -l eng --tessdata-dir ~/tessdata_best
1092000
924000

504,000
546,000
546,000
882,000
588,000
546,000
462,000

1,092,000
924,000

000,504,000
000,546,000
000,546,000
000,882,000
000,588,000
000,546,000
000,462,000
001,092,000
000,924,000

504,000.00
546,000.00
546,000.00
882,000.00
588,000.00
546,000.00
462,000.00
1,092,000.00
924,000.00

$ tesseract num-comma.png - --oem 1 --psm 6 -l eng --tessdata-dir ~/tessdata_best
1092000 1,092,000 001,092,000 1,092,000.00
924000 924,000 000,924,000 924,000.00

With psm 6, only the last two lines of the image are recognized. With psm 3, column 1 also starts with the last two lines, but all lines in the other columns are recognized. This might be related to the fact that the second-to-last line, which is the first one to be recognized, is indented to the left.

Shreeshrii · 2019-10-19T04:41:24Z

The result is the same with all three traineddata repositories and all OEMs.

ubuntu@tesseract-ocr:~/TEST$ tesseract num-comma.png - --oem 1 --psm 6 -l eng --tessdata-dir ~/tessdata_best
1092000 1,092,000 001,092,000 1,092,000.00
924000 924,000 000,924,000 924,000.00
ubuntu@tesseract-ocr:~/TEST$ tesseract num-comma.png - --oem 1 --psm 6 -l eng --tessdata-dir ~/tessdata_fast
1092000 1,092,000 001,092,000  1,092,000.00
924000 924,000 000,924,000 924,000.00
ubuntu@tesseract-ocr:~/TEST$ tesseract num-comma.png - --oem 1 --psm 6 -l eng --tessdata-dir ~/tessdata
1092000 1,092,000 001,092,000 1,092,000.00
924000 924,000 000,924,000 924,000.00
ubuntu@tesseract-ocr:~/TEST$ tesseract num-comma.png - --oem 0 --psm 6 -l eng --tessdata-dir ~/tessdata
1092000 1,092,000 001,092,000 1,092,000.00
924000 924,000 000,924,000 924,000.00
ubuntu@tesseract-ocr:~/TEST$ tesseract num-comma.png - --oem 2 --psm 6 -l eng --tessdata-dir ~/tessdata
1092000 1,092,000 001,092,000 1,092,000.00
924000 924,000 000,924,000 924,000.00
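Comparing such outputs by eye is error-prone, so a checksum per run can help. The sketch below assumes the three model directories sit under `$HOME` as in the commands above; `compare_models` is a hypothetical helper name, and it covers only `--oem 1`.

```shell
# compare_models (hypothetical helper): print one checksum per model
# directory for the same image and settings.
compare_models() {
  local image="$1" dir
  for dir in tessdata_best tessdata_fast tessdata; do
    printf '%s\t' "$dir"
    # cksum prints "<crc> <byte count>" for stdin; identical values mean
    # byte-identical OCR output, so even whitespace changes are detected.
    tesseract "$image" - --oem 1 --psm 6 -l eng \
      --tessdata-dir "$HOME/$dir" 2>/dev/null | cksum
  done
}
```

Running `compare_models num-comma.png` gives three lines. Any line whose checksum differs from the others points at a run worth diffing in full.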

ToelgeKilian · 2021-02-02T17:23:01Z

Have there been any updates on this issue? I also see that order matters a lot when providing multiple languages to Tesseract.

In all the other issues related to multi-language problems, I have seen the discussion drift the way it did here.
There clearly seems to be a problem with the order in which languages are provided to Tesseract. How do I know which order is best without manually cross-checking the results against the image? (That rather defeats the purpose of doing OCR in the first place.)

I am using version v5.0.0-alpha (the 2020-03-28 Windows build from UB Mannheim)
PSM=7 (Single Line)
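One possible way to compare language orders without checking the image by eye is the per-word confidence that Tesseract reports. This is only a sketch under assumptions: mean confidence is a heuristic proxy rather than a measure of correctness, and nothing in this thread confirms that it selects the better order. `mean_conf` is a hypothetical helper name; the `tsv` output config and the `-l` and `--psm` flags are real.

```shell
# mean_conf (hypothetical helper): average word confidence for one
# language setting. In Tesseract's TSV output, column 11 is "conf";
# rows that are not words carry a confidence of -1 and are skipped.
mean_conf() {
  local image="$1" langs="$2" psm="${3:-7}"
  tesseract "$image" - -l "$langs" --psm "$psm" tsv 2>/dev/null |
    awk -F '\t' 'NR > 1 && $11 >= 0 { sum += $11; n++ }
                 END { if (n) printf "%.1f\n", sum / n; else print "n/a" }'
}

# Usage sketch: print the mean confidence for each order.
# for langs in kor+eng eng+kor; do
#   printf '%s\t%s\n' "$langs" "$(mean_conf /tmp/test.jpg "$langs")"
# done
```

A clearly higher mean for one order is a hint, not proof. Garbage output can still score well, so a spot check against a few known images remains advisable.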

ToelgeKilian · 2021-02-03T11:59:28Z

I did some more testing today. It seems to me that multi-language input is not supported by the LSTM engine. Is that correct?
The legacy engine returns much better results for the different languages, but unfortunately makes more spacing and noise errors.

Is it possible to combine the advantages of both and enable the LSTM engine to read multi-language input?

Shreeshrii mentioned this issue Sep 5, 2019: multilingual ocr ara+eng #2626 (Open)

Shreeshrii mentioned this issue Oct 21, 2019: default PSM (--psm 3) accuracy issues #1327 (Open)

amitdo closed this as completed Feb 7, 2021

amitdo added the multilingual ocr label Feb 7, 2021
