tesseract add similar characters in Japanese text (ambiguity management?) #1063

Open
alevillard opened this issue Aug 4, 2017 · 3 comments

@alevillard

Hi, we have noticed that in Japanese text, tesseract sometimes outputs two or even three characters where the image contains only one. Our guess is that when several very similar characters are candidates, tesseract puts all of them in the output instead of choosing one. Is there a way to avoid this problem in the output?

Example:
text in image = エンジンコンポーネント
text read by tesseract = エンジンコンポボーネント

As you can see, the character ポ is output as two characters, ポボ,
maybe because both have a similarly high confidence score and tesseract does not decide which one to use but puts both in the text. Is there a way to avoid this error?

@TheSeiko

TheSeiko commented Aug 4, 2017

For a similar problem in German (see #1060), ShreeDevi suggested trying the new traineddata files:
https://github.com/tesseract-ocr/tessdata/tree/master/best

@alevillard
Author

I've added --oem 0 as suggested in #1060 and the characters are no longer doubled. But with this setting tesseract outputs the wrong character, ボ instead of ポ.
Below I add the TSV output for both modes (--oem 0 and --oem 1).
I can see that with --oem 1 tesseract gives more confidence (conf = 86) to the correct character ポ than to the similar character ボ (conf = 75), but the left, top, width and height values for ポ are strange: they are 0, 0, 399, 54, the same as the whole page, rather than a single-character box.
Could you explain what is happening? Is there a way to get a completely correct OCR result for this sequence?

image text: エンジンコンポーネント

OCR with tesseract using --oem 0:

level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 399 54 -1
2 1 1 0 0 0 10 12 349 30 -1
3 1 1 1 0 0 10 12 349 30 -1
4 1 1 1 1 0 10 12 349 30 -1
5 1 1 1 1 1 10 12 317 30 69 工ンジンコンボ一ネン
5 1 1 1 1 2 339 14 20 28 84 卜

OCR with tesseract using --oem 1:

level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 399 54 -1
2 1 1 0 0 0 10 12 349 30 -1
3 1 1 1 0 0 10 12 349 30 -1
4 1 1 1 1 0 10 12 349 30 -1
5 1 1 1 1 1 10 18 30 21 96 エ
5 1 1 1 1 2 45 16 26 25 93 ン
5 1 1 1 1 3 76 16 28 25 94 ジ
5 1 1 1 1 4 109 16 26 25 96 ン
5 1 1 1 1 5 140 17 25 23 81 コ
5 1 1 1 1 6 173 16 26 25 89 ン
5 1 1 1 1 7 0 0 399 54 86 ポ
5 1 1 1 1 8 202 12 31 29 75 ボ
5 1 1 1 1 9 236 27 26 2 95 ー
5 1 1 1 1 10 266 14 31 28 96 ネ
5 1 1 1 1 11 301 16 26 25 93 ン
5 1 1 1 1 12 339 14 20 28 95 ト
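
As a possible post-processing workaround for the --oem 1 output above, here is a minimal Python sketch. It rests on an assumption that is not confirmed anywhere in this thread: that a level-5 entry whose box equals the page box (like ポ at 0, 0, 399, 54) is a competing candidate for the entry that follows it, so only the one with the higher conf should be kept. The function names `parse_tsv` and `resolve_duplicates` are hypothetical helpers, not part of tesseract.

```python
import csv
import io

HEADER = ("level page_num block_num par_num line_num word_num "
          "left top width height conf text")

# Rows copied from the --oem 1 TSV output in this issue (page row + word rows).
ROWS = [
    "1 1 0 0 0 0 0 0 399 54 -1",
    "5 1 1 1 1 1 10 18 30 21 96 エ",
    "5 1 1 1 1 2 45 16 26 25 93 ン",
    "5 1 1 1 1 3 76 16 28 25 94 ジ",
    "5 1 1 1 1 4 109 16 26 25 96 ン",
    "5 1 1 1 1 5 140 17 25 23 81 コ",
    "5 1 1 1 1 6 173 16 26 25 89 ン",
    "5 1 1 1 1 7 0 0 399 54 86 ポ",
    "5 1 1 1 1 8 202 12 31 29 75 ボ",
    "5 1 1 1 1 9 236 27 26 2 95 ー",
    "5 1 1 1 1 10 266 14 31 28 96 ネ",
    "5 1 1 1 1 11 301 16 26 25 93 ン",
    "5 1 1 1 1 12 339 14 20 28 95 ト",
]


def to_tsv(header, rows):
    """Rebuild tab-separated text; rows without text get an empty last field."""
    lines = [header.split()]
    for row in rows:
        fields = row.split()
        fields += [""] * (12 - len(fields))
        lines.append(fields)
    return "\n".join("\t".join(fields) for fields in lines)


def parse_tsv(tsv_text):
    """Parse tesseract TSV text into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return list(reader)


def box(row):
    """Return (left, top, width, height) of a row as integers."""
    return tuple(int(row[k]) for k in ("left", "top", "width", "height"))


def resolve_duplicates(rows):
    """Heuristic (assumption): a word with a page-sized box competes with the
    next word; keep whichever of the two has the higher confidence."""
    page_boxes = {box(r) for r in rows if r["level"] == "1"}
    words = [r for r in rows if r["level"] == "5"]
    kept = []
    i = 0
    while i < len(words):
        word = words[i]
        if box(word) in page_boxes and i + 1 < len(words):
            rival = words[i + 1]
            best = max(word, rival, key=lambda r: float(r["conf"]))
            kept.append(best["text"])
            i += 2  # both candidates are consumed
        else:
            kept.append(word["text"])
            i += 1
    return "".join(kept)


result = resolve_duplicates(parse_tsv(to_tsv(HEADER, ROWS)))
print(result)
```

On the rows from this issue the heuristic keeps ポ (conf 86) and drops ボ (conf 75). It only hides the symptom and could discard a legitimate character on other images, so it is not a substitute for a fix in the engine or better traineddata.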

@TheSeiko

TheSeiko commented Aug 4, 2017

--oem 0 is the conventional engine
--oem 1 is the new LSTM engine

My understanding of #1060 is that he was only asking whether --oem 1 was being used, not suggesting --oem 0.

He suggested downloading and using the new traineddata files from the best directory, uploaded a few days ago: https://github.com/tesseract-ocr/tessdata/tree/master/best
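
To try that suggestion, the command line could be assembled roughly as in the sketch below. The image name, output base and tessdata directory are placeholders (assumptions), and `build_tesseract_cmd` is a hypothetical helper; only the `--tessdata-dir`, `-l` and `--oem` options and the `tsv` config name come from the tesseract command line itself.

```python
import shlex


def build_tesseract_cmd(image, out_base, tessdata_dir, lang="jpn", oem=1,
                        configs=("tsv",)):
    """Build the argv list for a tesseract run with the LSTM engine.

    tessdata_dir is assumed to be the folder where jpn.traineddata from the
    'best' directory was saved. The 'tsv' config may need to be present in
    that folder's configs subdirectory.
    """
    cmd = ["tesseract", image, out_base,
           "--tessdata-dir", tessdata_dir,
           "-l", lang,
           "--oem", str(oem)]
    cmd.extend(configs)
    return cmd


# Placeholder paths, for illustration only.
cmd = build_tesseract_cmd("sample.png", "sample_out", "/path/to/tessdata_best")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the arguments as a list rather than one string avoids shell quoting problems if the image path contains spaces.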


3 participants