Issue 1392: Vietnamese dictionaries #9

Open
wants to merge 2 commits into
base: main

Commits on May 13, 2015

  1. Issue 1392: Vietnamese dictionaries

    https://code.google.com/p/tesseract-ocr/issues/detail?id=1392
    
    What steps will reproduce the problem?
    1. Unpack vie.traineddata downloaded from Tesseract repository
    2. Run dawg2wordlist on vie.freq-dawg & vie.word-dawg to recover original lists
    3. Examine the content
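
The reproduction steps above can be sketched as a small Python wrapper around the two Tesseract command-line tools. This is a minimal sketch, assuming `combine_tessdata` and `dawg2wordlist` are on the `PATH` and that `vie.traineddata` is in the current directory; the function names and the `.txt` output file names are hypothetical, chosen only for illustration.

```python
import subprocess


def recovery_commands(lang="vie", traineddata=None):
    """Build the argv lists for unpacking a traineddata file and
    converting its DAWGs back into plain word lists."""
    traineddata = traineddata or f"{lang}.traineddata"
    # Step 1: unpack every component to files named "<lang>.<component>".
    commands = [["combine_tessdata", "-u", traineddata, f"{lang}."]]
    # Step 2: dawg2wordlist takes a unicharset, a DAWG and an output path.
    for dawg in ("freq-dawg", "word-dawg"):
        commands.append([
            "dawg2wordlist",
            f"{lang}.unicharset",
            f"{lang}.{dawg}",
            f"{lang}.{dawg}.txt",  # hypothetical output name
        ])
    return commands


def recover_wordlists(lang="vie"):
    """Run the commands in order, stopping at the first failure."""
    for argv in recovery_commands(lang):
        subprocess.run(argv, check=True)
```

Calling `recover_wordlists()` would leave the recovered lists next to the unpacked components, ready for step 3 (manual inspection).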
    
    What is the expected output? What do you see instead?
    
    The recovered word lists are found to be incomplete and contain many erroneous entries.
    
    Please use the included dictionaries as training data for the Vietnamese language.
    
    Apr 16, 2015
    #1 zdenop
    Can you have a look and review the Vietnamese dictionaries in the langdata repository?
    
    https://code.google.com/p/tesseract-ocr/source/browse/vie/?repo=langdata&name=master
    
    Apr 19, 2015
    #2 nguyenq87
    vie.wordlist.clean should be scrapped entirely: it contains many misspelled Vietnamese and English words, words missing diacritical marks, and words run together (Vietnamese words are mostly monosyllabic).
    
    The provided vie.words_list is composed of several lists commonly used among Vietnamese-language application developers, including those from http://www.informatik.uni-leipzig.de/~duc/software/misc/wordlist.html.
    
    The fourth column in vie.unicharambigs contains many characters that are not Vietnamese, e.g., üûñËÄ. Those characters should not be used as match targets.
    
    http://vietunicode.sourceforge.net/charset/vietalphabet.html
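
The check on the fourth column of vie.unicharambigs could be automated along these lines. This is a rough sketch, not the method used for the commit: it assumes the file is tab-separated with the target characters in the fourth field, it treats all plain ASCII letters as acceptable (including f, j, w, z, which occur only in loanwords), and the function names are hypothetical.

```python
import unicodedata

# Tone marks: grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}
# Vowel-quality marks and the base letters they may attach to.
QUALITY_MARKS = {
    "\u0302": set("aeo"),  # circumflex: â ê ô
    "\u0306": set("a"),    # breve: ă
    "\u031b": set("ou"),   # horn: ơ ư
}
VOWELS = set("aeiouy")


def is_vietnamese_letter(ch):
    """Return True if a single letter fits Vietnamese orthography."""
    if not ch:
        return False
    if ch in "đĐ":
        return True
    decomposed = unicodedata.normalize("NFD", ch)
    base, marks = decomposed[0].lower(), decomposed[1:]
    if not ("a" <= base <= "z"):
        return False
    tones = quality = 0
    for mark in marks:
        if mark in TONE_MARKS and base in VOWELS:
            tones += 1
        elif base in QUALITY_MARKS.get(mark, ()):
            quality += 1
        else:
            return False  # e.g. diaeresis, or a tilde on a consonant
    return tones <= 1 and quality <= 1


def non_vietnamese_targets(lines, column=3):
    """Collect letters in the given tab-separated column (0-based)
    that are not valid Vietnamese letters."""
    flagged = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= column:
            continue
        for ch in unicodedata.normalize("NFC", fields[column]):
            if ch.isalpha() and not is_vietnamese_letter(ch):
                flagged.add(ch)
    return flagged
```

Passing the lines of the file (for example `open("vie.unicharambigs", encoding="utf-8")`) to `non_vietnamese_targets` returns the set of suspicious letters. Digits and punctuation are ignored, since only letters are tested.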
    nguyenq authored and jimregan committed May 13, 2015
    7511e21

Commits on Feb 21, 2018

  1. merge

    jimregan committed Feb 21, 2018
    546d0d2