Wordlist cleaning (lots of incomplete words found in Thai wordlist) #106
I spotted a number of entries in langdata/tha/tha.wordlist that are certainly not valid Thai words, since they go against word formation rules (for example, a vowel that requires an immediately adjacent consonant, where that consonant is missing).
For example [line number][instance]:
Should we remove all these instances? They seem to follow some patterns as well, like:
Is each entry in this wordlist meant to be a word in itself, or is it supposed to be a component of a larger word? In the latter case, it's totally OK to leave them as they are. But in the former case, we should remove them, as they are not words.
I'm not entirely sure how tesseract utilizes XXX.wordlist in langdata, so please correct me if this is irrelevant. Thank you.
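A rough way to flag such entries automatically is sketched below. It is not a full model of Thai orthography, only the two simplified rules implied above: a leading vowel (เ แ โ ใ ไ) must be followed directly by a consonant, and an entry cannot begin with a vowel sign or tone mark that attaches to a preceding consonant. The function names `is_plausible` and `split_wordlist` are hypothetical, and the code point ranges are my assumption about which characters belong to each class.

```python
import re

# Assumed character classes (Unicode Thai block); simplified for illustration.
CONSONANT = "\u0e01-\u0e2e"        # ko kai .. ho nokhuk
LEADING_VOWEL = "\u0e40-\u0e44"    # sara e, sara ae, sara o, sara ai maimuan, sara ai maimalai
# Vowel signs and marks that are written after (or on) a preceding consonant.
TRAILING_MARK = "\u0e30-\u0e3a\u0e45\u0e47-\u0e4e"

# Rule 1: an entry must not start with a sign that needs a preceding consonant.
STARTS_WITH_MARK = re.compile(f"^[{TRAILING_MARK}]")
# Rule 2: a leading vowel must be followed immediately by a consonant.
ORPHAN_LEADING_VOWEL = re.compile(f"[{LEADING_VOWEL}](?![{CONSONANT}])")


def is_plausible(word: str) -> bool:
    """Return False if the entry breaks one of the two simplified rules."""
    if STARTS_WITH_MARK.search(word):
        return False
    if ORPHAN_LEADING_VOWEL.search(word):
        return False
    return True


def split_wordlist(words):
    """Split entries into (kept, dropped), skipping blank lines."""
    kept, dropped = [], []
    for raw in words:
        word = raw.strip()
        if not word:
            continue
        (kept if is_plausible(word) else dropped).append(word)
    return kept, dropped
```

Running `split_wordlist` over the lines of the wordlist gives a candidate list of entries to review by hand. Since the rules are deliberately loose, it will miss some invalid entries and should not be used to delete anything without checking.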
This is a file from 3.04. I would suggest that you unpack the current traineddata, extract the wordlist from it and see if the list is the same.
If it is, try removing the error words from that list, combine the traineddata again and test for accuracy.
The commands should be similar to the following; please change the paths to match your setup.
REVIEW & EDIT wordlist
COMPARE accuracy of ./tessdata_best/tha.traineddata and ./tessdata_TEST/tha.traineddata
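As a rough illustration of the unpack, edit and recombine steps described above, the sketch below builds the command lines and prints them without running anything. It assumes the standard Tesseract training tools `combine_tessdata`, `dawg2wordlist` and `wordlist2dawg` are installed, and that the LSTM components are named `tha.lstm-unicharset`, `tha.lstm-word-dawg` and `tha.lstm-word-list`. The helper `build_commands` and the `./work` directory are hypothetical; adjust the paths for your setup.

```python
import shlex
from pathlib import Path


def build_commands(best_dir, test_dir, work_dir, lang="tha"):
    """Return the argv lists for unpacking, rebuilding and recombining a wordlist."""
    best = Path(best_dir) / f"{lang}.traineddata"
    test = Path(test_dir) / f"{lang}.traineddata"
    # combine_tessdata -u writes each component as <prefix><component-name>.
    prefix = f"{Path(work_dir) / lang}."
    unicharset = prefix + "lstm-unicharset"
    dawg = prefix + "lstm-word-dawg"
    wordlist = prefix + "lstm-word-list"
    return [
        # Unpack the current traineddata into its components.
        ["combine_tessdata", "-u", str(best), prefix],
        # Convert the packed word DAWG back into a plain-text wordlist.
        ["dawg2wordlist", unicharset, dawg, wordlist],
        # (Review and edit the wordlist by hand at this point.)
        # Rebuild the DAWG from the edited wordlist.
        ["wordlist2dawg", wordlist, dawg, unicharset],
        # Copy the original traineddata and overwrite only the word DAWG in the copy.
        ["cp", str(best), str(test)],
        ["combine_tessdata", "-o", str(test), dawg],
    ]


if __name__ == "__main__":
    for argv in build_commands("./tessdata_best", "./tessdata_TEST", "./work"):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to check the paths before running them by hand or passing them to `subprocess.run`.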
Saw those error words in current tha.traineddata (from https://github.com/tesseract-ocr/tessdata_best) as well.
Current ./tessdata_best/tha.lstm-word-list : 9083 lines
Compared the two tessdata using a screenshot of a short text from https://prachatai.com/journal/2018/02/75448 (chose two paragraphs with Thai text only), with options "--oem 1 -l tha" (LSTM, Thai).
Not much difference in accuracy, as both did equally badly :(
Example original text:
Output text from current tessdata:
Output text from modified tessdata:
Characters got recognized perfectly in both tessdata.
The only difference between the outputs from the current tessdata and the modified tessdata here is that the last word "ชั้น" from the modified tessdata actually comes out combined as a proper word, with no spaces in between.
In general, removing impossible combinations of characters in the Thai language from the word list makes the output a little more accurate. But maybe I need to adjust some config.
These are patterns of words that got removed:
Thank you! Extra spaces solved with -c preserve_interword_spaces=1
Tested with several different parts of the text from the same web page.
No improvement in accuracy could be measured from the test.
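One way to put a number on a comparison like this is character error rate (CER): the edit distance between the ground-truth text and the OCR output, divided by the length of the ground truth. The sketch below is a plain-Python version, not anything built into Tesseract, and the function names are hypothetical. It counts Unicode code points, so a Thai vowel sign or tone mark counts as its own character, and a space inserted inside a word counts as one error.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Computing `character_error_rate(ground_truth, output)` once per traineddata on the same set of images gives two figures that can be compared directly, which is easier to track than eyeballing the outputs.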
This was referenced
Feb 20, 2018
Extra space problem identified in the comment above - #106 (comment)
@zdenop Please close this issue, after PR is merged in tessdata_fast.