Wordlist cleaning (lots of incomplete words found in Thai wordlist) #106

Closed
bact opened this Issue Feb 15, 2018 · 8 comments


bact commented Feb 15, 2018

I spotted a number of entries in langdata/tha/tha.wordlist that are certainly not valid Thai words, since they violate word formation rules (for example, a vowel that requires an immediately following consonant, with that consonant missing).

For example (line number, then entry):
165 ส์
207 ต์
335 ย์
404 ท์
428 ด์
527 น์
580 ห์
629 อั
658 ล์
774 ค์
787 นั
798 ชั่
863 เอ็
886 สั
986 มั
1114 ว์
1187 ฮั
1244 ชั
1305 กั
1310 ษ์
1380 ลั
1487 บั
1554 ดั
...
7649 ลั่
7656 ยั
7666 ฉั
7733 เกี๋
7914 ล่
7931 น๊
8008 ส่
8045 ญั
8100 ข์
8148 ด่
...

Should we remove all these entries? They seem to follow some patterns as well, such as:

  • char + u0e31 : c ั
  • char + u0e31 + tonemarks : c ั่
  • char + u0e4c : c ์
  • u0e40 + char + u0e47 : เc็
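
The four patterns above can be sketched as a small classifier. This is only an illustration: the function name `classify_fragment` is hypothetical, and it assumes that "char" means a Thai consonant in the range U+0E01 to U+0E2E and that "tonemarks" means U+0E48 to U+0E4B, neither of which is spelled out in the issue.

```python
from typing import Optional

MAI_HAN_AKAT = "\u0e31"   # the vowel sign written above a consonant
THANTHAKHAT = "\u0e4c"    # the "silent letter" mark
SARA_E = "\u0e40"         # leading vowel
MAITAIKHU = "\u0e47"      # vowel shortener
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"  # assumption: the four Thai tone marks


def is_consonant(ch: str) -> bool:
    # Assumption: "char" in the issue means a consonant, U+0E01..U+0E2E.
    return "\u0e01" <= ch <= "\u0e2e"


def classify_fragment(entry: str) -> Optional[str]:
    """Return the matching pattern label from the issue, or None."""
    if len(entry) == 2 and is_consonant(entry[0]):
        if entry[1] == MAI_HAN_AKAT:
            return "char + u0e31"
        if entry[1] == THANTHAKHAT:
            return "char + u0e4c"
    if len(entry) == 3:
        if (is_consonant(entry[0]) and entry[1] == MAI_HAN_AKAT
                and entry[2] in TONE_MARKS):
            return "char + u0e31 + tonemarks"
        if (entry[0] == SARA_E and is_consonant(entry[1])
                and entry[2] == MAITAIKHU):
            return "u0e40 + char + u0e47"
    return None
```

Entries such as ส์ or อั would be labelled by this sketch, while a complete word such as ชั้น would return None.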

Is each entry in this wordlist meant to be a word in itself, or is it supposed to be a component of a larger word? In the latter case, it is fine to leave the entries as they are. In the former case, we should remove them, as they are not words.

I'm not entirely sure how tesseract utilizes XXX.wordlist in langdata, so please correct me if this is irrelevant. Thank you.

@Shreeshrii

Contributor

Shreeshrii commented Feb 15, 2018

This is a file from 3.04. I would suggest that you unpack the current traineddata, extract the wordlist from it and see if the list is the same.

If it is, try removing the error words from that list, combine the traineddata again and test for accuracy.

The commands should be similar to the following; please change the paths to match your setup.

combine_tessdata -u ./tessdata_best/tha.traineddata ./tessdata_TEST/tha.
dawg2wordlist ./tessdata_TEST/tha.lstm-unicharset ./tessdata_TEST/tha.lstm-word-dawg ./tessdata_TEST/tha.lstm-word-list

REVIEW & EDIT wordlist

wordlist2dawg ./tessdata_TEST/tha.lstm-word-list ./tessdata_TEST/tha.lstm-word-dawg ./tessdata_TEST/tha.lstm-unicharset 
combine_tessdata ./tessdata_TEST/tha.

COMPARE accuracy of ./tessdata_best/tha.traineddata and ./tessdata_TEST/tha.traineddata
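
For repeated experiments, the unpack and repack steps above could be scripted. The following is a minimal sketch, not an official workflow: the helper names `wordlist_roundtrip_commands` and `run_all` are hypothetical, the command lines simply mirror the ones quoted in this comment, and the script assumes the Tesseract training tools are on PATH.

```python
import subprocess
from typing import List, Tuple


def wordlist_roundtrip_commands(
    traineddata: str, workdir: str, lang: str = "tha"
) -> Tuple[List[List[str]], List[List[str]]]:
    """Build (unpack, repack) command lists mirroring the steps above."""
    prefix = f"{workdir}/{lang}."
    unicharset = prefix + "lstm-unicharset"
    dawg = prefix + "lstm-word-dawg"
    wordlist = prefix + "lstm-word-list"
    unpack = [
        ["combine_tessdata", "-u", traineddata, prefix],
        ["dawg2wordlist", unicharset, dawg, wordlist],
    ]
    repack = [
        ["wordlist2dawg", wordlist, dawg, unicharset],
        ["combine_tessdata", prefix],
    ]
    return unpack, repack


def run_all(commands: List[List[str]]) -> None:
    # Stops at the first failing command (check=True raises on non-zero exit).
    for argv in commands:
        subprocess.run(argv, check=True)
```

A typical use would be to call `run_all(unpack)`, edit the extracted word list by hand or with a script, then call `run_all(repack)`.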

@bact


bact commented Feb 15, 2018

Thank you for the detailed instructions. I will try that.

@bact


bact commented Feb 15, 2018

I saw the same error words in the current tha.traineddata (from https://github.com/tesseract-ocr/tessdata_best) as well.

Current ./tessdata_best/tha.lstm-word-list : 9083 lines
Modified ./tessdata_TEST/tha.lstm-word-list : 8811 lines (272 error words removed)

I compared the two tessdata using a screenshot of a short text from https://prachatai.com/journal/2018/02/75448 (two paragraphs containing Thai text only), with the options "--oem 1 -l tha" (LSTM, Thai).

There is not much difference in accuracy, as both performed badly :(
The modified tessdata is slightly (very slightly) better, though.


Example original text:
กิติภูมิ กล่าวว่า มาร์กบอกว่าการต่อสู้ทางชนชั้น

Output text from current tessdata:
ก ิ ต ิ ภู ม ิ ก ล ่ า ว ว ่ า ม า ร ์ ก บ อ ก ว ่ า ก า ร ต ่ อ ส ู ้ ท า ง ชน ชั ้ น

Output text from modified tessdata:
ก ิ ต ิ ภู ม ิ ก ล ่ า ว ว ่ า ม า ร ์ ก บ อ ก ว ่ า ก า ร ต ่ อ ส ู ้ ท า ง ชน ชั้น

The characters were recognized perfectly with both tessdata.
But as you can see, most of the characters are separated by spaces, which they should not be.

The only difference between the outputs here is that, with the modified tessdata, the last word "ชั้น" comes out combined as a proper word, with no spaces in between.
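
One way to check this kind of observation (same characters, different spacing) is to compare the texts with all whitespace removed and then count the extra spaces. This is a minimal sketch; the helper names are hypothetical and it is not part of any Tesseract tooling.

```python
def strip_spaces(text: str) -> str:
    """Remove all whitespace so only the recognized characters remain."""
    return "".join(text.split())


def spurious_space_count(expected: str, actual: str) -> int:
    """Count spaces in `actual` beyond those in `expected`.

    Raises ValueError if the non-space characters differ, because then
    the difference is not purely a spacing problem.
    """
    if strip_spaces(expected) != strip_spaces(actual):
        raise ValueError("texts differ in more than spacing")
    return actual.count(" ") - expected.count(" ")
```

Running `spurious_space_count` on the original text and each OCR output would give a single number per tessdata, which makes small differences like the one in "ชั้น" easier to see.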

In general, removing character combinations that are impossible in Thai from the word list makes the output a little more accurate. But maybe I need to adjust some config.


Current tha.config:

segsearch_max_futile_classifications 10
language_model_ngram_on 1
language_model_ngram_space_delimited_language F
chop_enable 0


These are patterns of words that got removed:
^.[่้๊๋็ํั์]$
^.[ัื][่้๊๋]$
^เ.[็ิีื][่้๊๋]?$
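
As an illustration, the three patterns above could be applied to an extracted word list as follows. This is a sketch under assumptions: the function name `split_wordlist` is hypothetical, and the character classes are my transcription of the patterns into `\u` escapes (U+0E48 to U+0E4B for the tone marks, U+0E47, U+0E4D, U+0E31, U+0E4C, U+0E37, U+0E34, U+0E35, and U+0E40 for เ), so they should be checked against the originals before use.

```python
import re
from typing import Iterable, List, Tuple

TONES = "\u0e48\u0e49\u0e4a\u0e4b"

# Transcription of the three removal patterns listed above.
PATTERNS = [
    re.compile("^.[" + TONES + "\u0e47\u0e4d\u0e31\u0e4c]$"),
    re.compile("^.[\u0e31\u0e37][" + TONES + "]$"),
    re.compile("^\u0e40.[\u0e47\u0e34\u0e35\u0e37][" + TONES + "]?$"),
]


def split_wordlist(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (kept, removed) entries, preserving the input order."""
    kept: List[str] = []
    removed: List[str] = []
    for line in lines:
        word = line.rstrip("\n")
        if any(p.match(word) for p in PATTERNS):
            removed.append(word)
        else:
            kept.append(word)
    return kept, removed
```

Writing `kept` back to the word list file, one entry per line, would give the modified list to repack.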

@Shreeshrii

Contributor

Shreeshrii commented Feb 15, 2018

Extra spaces could be related to issue reported earlier (for a different language) - see tesseract-ocr/tesseract#1009

You may want to try ocr with

-c preserve_interword_spaces=1

to remove extra spaces
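
For scripted tests, the same option can be passed when calling the tesseract command line from Python. This is only a sketch: the helper name `tesseract_command` and the image file name are hypothetical, and the other options are the ones mentioned earlier in this thread ("--oem 1 -l tha").

```python
import subprocess
from typing import List


def tesseract_command(image: str, lang: str = "tha",
                      preserve_spaces: bool = True) -> List[str]:
    """Build a tesseract command line that prints the result to stdout."""
    argv = ["tesseract", image, "stdout", "--oem", "1", "-l", lang]
    if preserve_spaces:
        # The config variable suggested above.
        argv += ["-c", "preserve_interword_spaces=1"]
    return argv


if __name__ == "__main__":
    # "sample.png" is a placeholder; replace it with a real screenshot.
    result = subprocess.run(tesseract_command("sample.png"),
                            capture_output=True, text=True, check=True)
    print(result.stdout)
```

Calling the helper with `preserve_spaces=False` gives the baseline command, so both variants can be run on the same image and compared.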

@bact


bact commented Feb 15, 2018

Thank you! The extra spaces are solved with -c preserve_interword_spaces=1

I tested several different parts of the text from the same web page, and
the current tessdata and the modified tessdata produced exactly the same output.

No improvement in accuracy could be measured in this test.

@Shreeshrii

Contributor

Shreeshrii commented Feb 15, 2018

So it looks like the wordlist is not used much in recognition.

@Shreeshrii

Contributor

Shreeshrii commented Feb 15, 2018

@jbreiden

preserve_interword_spaces=1 should be added to the config files in tessdata_fast for CJK languages and Thai.

@Shreeshrii

Contributor

Shreeshrii commented Feb 20, 2018

Extra space problem identified in the comment above - #106 (comment)

Characters got recognized perfectly in both tessdata.
But as you can see, most of the time characters are separated by space. It shouldn't.

Fixed via
tesseract-ocr/tessdata_fast#7

@zdenop Please close this issue after the PR is merged in tessdata_fast.

@zdenop zdenop closed this Feb 21, 2018
