tesseract add similar characters in Japanese text (ambiguity management?) #1063

alevillard · 2017-08-04T08:10:44Z

Hi, we have noticed that in Japanese text, Tesseract doubles or even triples some characters that are not in the text. It looks as if Tesseract keeps several very similar candidate characters and writes all of them to the output instead of choosing one. Is there a way to avoid this problem in the output?

Example:
text in image= エンジンコンポーネント
text read by tesseract= エンジンコンポボーネント

As you can see, the character ポ is output as two characters, ポボ,
maybe because both have a similarly high confidence score and Tesseract does not decide which one to use but puts both in the text. Is there a way to avoid this error?

TheSeiko · 2017-08-04T11:37:37Z

For a similar problem in German, ShreeDevi suggested trying the new traineddata files:
https://github.com/tesseract-ocr/tessdata/tree/master/best
See #1060.

alevillard · 2017-08-04T12:39:42Z

I've added --oem 0 as suggested in #1060 and it no longer doubles the characters. But with this setting the system wrongly outputs the character ボ instead of ポ.
Below I add the TSV output for both modes (--oem 0 and --oem 1).
I can see that with --oem 1 Tesseract gives more confidence (conf=86) to the correct character ポ than to the similar character ボ (conf=75), but the left, top, width and height values for ポ are strange: they are the same as the whole-page box (0 0 399 54).
Could you explain what happens? Is there some way to get a completely correct OCR result for that sequence?

image text: エンジンコンポーネント

OCR with tesseract using --oem 0

level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 399 54 -1
2 1 1 0 0 0 10 12 349 30 -1
3 1 1 1 0 0 10 12 349 30 -1
4 1 1 1 1 0 10 12 349 30 -1
5 1 1 1 1 1 10 12 317 30 69 工ンジンコンボ一ネン
5 1 1 1 1 2 339 14 20 28 84 卜

OCR with tesseract using --oem 1

level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 399 54 -1
2 1 1 0 0 0 10 12 349 30 -1
3 1 1 1 0 0 10 12 349 30 -1
4 1 1 1 1 0 10 12 349 30 -1
5 1 1 1 1 1 10 18 30 21 96 エ
5 1 1 1 1 2 45 16 26 25 93 ン
5 1 1 1 1 3 76 16 28 25 94 ジ
5 1 1 1 1 4 109 16 26 25 96 ン
5 1 1 1 1 5 140 17 25 23 81 コ
5 1 1 1 1 6 173 16 26 25 89 ン
5 1 1 1 1 7 0 0 399 54 86 ポ
5 1 1 1 1 8 202 12 31 29 75 ボ
5 1 1 1 1 9 236 27 26 2 95 ー
5 1 1 1 1 10 266 14 31 28 96 ネ
5 1 1 1 1 11 301 16 26 25 93 ン
5 1 1 1 1 12 339 14 20 28 95 ト
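
One possible way to post-process TSV output like the above is sketched below. It is only a heuristic under stated assumptions, not something Tesseract itself does: it assumes that a word-level row whose box equals the whole-page box (as for ポ here) is an alternative reading of the row that follows it, and it keeps whichever of the two has the higher confidence, together with the following row's box. The names `parse_tsv`, `merge_suspect_duplicates` and `SAMPLE_TSV` are made up for this illustration.

```python
# Numeric TSV columns, in the order Tesseract writes them; "text" follows.
FIELDS = ["level", "page_num", "block_num", "par_num", "line_num",
          "word_num", "left", "top", "width", "height", "conf"]

# The --oem 1 rows from this comment (whitespace-separated as pasted).
SAMPLE_TSV = """\
level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 399 54 -1
5 1 1 1 1 1 10 18 30 21 96 エ
5 1 1 1 1 2 45 16 26 25 93 ン
5 1 1 1 1 3 76 16 28 25 94 ジ
5 1 1 1 1 4 109 16 26 25 96 ン
5 1 1 1 1 5 140 17 25 23 81 コ
5 1 1 1 1 6 173 16 26 25 89 ン
5 1 1 1 1 7 0 0 399 54 86 ポ
5 1 1 1 1 8 202 12 31 29 75 ボ
5 1 1 1 1 9 236 27 26 2 95 ー
5 1 1 1 1 10 266 14 31 28 96 ネ
5 1 1 1 1 11 301 16 26 25 93 ン
5 1 1 1 1 12 339 14 20 28 95 ト
"""


def parse_tsv(tsv_text):
    """Parse TSV rows (tab- or space-separated) into a list of dicts."""
    rows = []
    for line in tsv_text.strip().splitlines():
        parts = line.split(None, 11)
        if not parts or parts[0] == "level":  # skip blank lines and the header
            continue
        values = [int(p) for p in parts[:10]] + [float(parts[10])]
        row = dict(zip(FIELDS, values))
        row["text"] = parts[11].strip() if len(parts) > 11 else ""
        rows.append(row)
    return rows


def box(row):
    return (row["left"], row["top"], row["width"], row["height"])


def merge_suspect_duplicates(rows):
    """Heuristic (an assumption, not Tesseract behaviour): a word whose box
    equals the page box is merged with the next word, keeping the text and
    confidence of the more confident one and the box of the next word."""
    page_box = next((box(r) for r in rows if r["level"] == 1), None)
    words = [r for r in rows if r["level"] == 5]
    merged, i = [], 0
    while i < len(words):
        word = words[i]
        if box(word) == page_box and i + 1 < len(words):
            nxt = words[i + 1]
            best = max(word, nxt, key=lambda r: r["conf"])
            merged.append({**nxt, "text": best["text"], "conf": best["conf"]})
            i += 2
        else:
            merged.append(word)
            i += 1
    return merged


if __name__ == "__main__":
    words = merge_suspect_duplicates(parse_tsv(SAMPLE_TSV))
    print("".join(w["text"] for w in words))
```

On the rows above this keeps ポ (conf 86) with the box of the ボ row and drops ボ. Whether the same rule is safe on other images is untested.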

TheSeiko · 2017-08-04T12:47:39Z

--oem 0 is the conventional engine
--oem 1 is the new LSTM engine

As I understood #1060, he was only asking/checking whether --oem 1 is used, not suggesting --oem 0.

He suggested downloading and using the new traineddata from the best directory, uploaded a few days ago: https://github.com/tesseract-ocr/tessdata/tree/master/best
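
A minimal sketch of how the two engine modes could be selected from a script, assuming the `tesseract` command-line tool is on the PATH. The helper name `build_tesseract_cmd`, the image name `sample.png` and the directory `/path/to/best` are placeholders, not taken from this thread; the directory stands for wherever the downloaded traineddata files were saved.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, oem, lang="jpn", tessdata_dir=None):
    """Build a tesseract command line that writes TSV output.

    oem=0 selects the conventional engine, oem=1 the LSTM engine.
    """
    cmd = ["tesseract", image_path, output_base, "-l", lang, "--oem", str(oem)]
    if tessdata_dir is not None:
        # Point Tesseract at the directory holding the traineddata to use.
        cmd += ["--tessdata-dir", tessdata_dir]
    cmd.append("tsv")  # config name that makes Tesseract write <output_base>.tsv
    return cmd


def run_tesseract(cmd):
    """Run the command; raises CalledProcessError if tesseract fails."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder paths: only print the command here instead of running it.
    print(build_tesseract_cmd("sample.png", "out_oem1", 1,
                              tessdata_dir="/path/to/best"))
```

Passing the returned list to `run_tesseract` would perform the actual OCR run. Note that the selected engine only works if the traineddata file contains a model for it.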

Shreeshrii mentioned this issue Apr 22, 2018

Tesseract inserting additional alternative characters #1465

Open

stweil mentioned this issue Nov 7, 2020

Character confusion fix suggestion #3144

Open

amitdo added the diplopia label Mar 17, 2021
