Q&A: Indic - length of the compressed codes #654

Shreeshrii · 2017-01-12T06:47:28Z

Indic may be troubled by the length of the compressed codes used.

@theraysmith Can you explain a little more about this?

Shreeshrii · 2017-01-12T07:01:46Z

Devanagari script has a large set of ligature forms for consonant conjuncts. These are combinations of Consonant + Viraama + Consonant (CVC), CVCVC, or the even rarer CVCVCVC.

Currently the generated unicharset uses the combination of conjunct ligatures followed by vowel maatraas and vowel modifiers as a recognition unit, leading to a unicharset of 5000+ lines.

You may want to consider recognizing the conjunct cluster as a unit, and the vowel maatraas and vowel modifiers separately. A special case can be the i maatraa, which comes before (to the left of) the consonant in Devanagari.
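As a rough illustration of the split suggested above, here is a minimal Python sketch that separates one akshara into its consonant cluster and its trailing vowel signs and modifiers. The function names are hypothetical, the code-point ranges cover only the basic Devanagari block, and this is not how Tesseract builds its unicharset.

```python
from typing import Tuple

VIRAMA = "\u094D"

def is_consonant(ch: str) -> bool:
    # Devanagari consonants KA..HA (U+0915-U+0939); nukta forms are not handled.
    return "\u0915" <= ch <= "\u0939"

def split_akshara(akshara: str) -> Tuple[str, str]:
    """Split one akshara into (consonant cluster, vowel signs + modifiers)."""
    i = 0
    # The cluster is the leading run of consonants and viramas.
    while i < len(akshara) and (is_consonant(akshara[i]) or akshara[i] == VIRAMA):
        i += 1
    return akshara[:i], akshara[i:]

# kShmyAM: cluster क्ष्म्य followed by the aa maatraa and anusvara.
cluster, signs = split_akshara("\u0915\u094D\u0937\u094D\u092E\u094D\u092F\u093E\u0902")
print(cluster, [hex(ord(c)) for c in signs])
```

Note that the i maatraa (U+093F) is stored after the consonant in Unicode logical order even though it is drawn to the left, so the same split applies to it in the text; only the rendered image differs.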

For a listing of orthographic syllables by frequency for Sanskrit, please see
http://www.sanskritweb.net/itrans/ortho2003.pdf

For a list of ligature sets for Hindi, please see
http://tdil-dc.in/tdildcMain/articles/82170Devanagari%20Script%20Behaviour%20for%20Hindi%20%20ver%201.4.10.pdf

Shreeshrii · 2017-01-20T07:19:45Z

Font Comparison Samples

Attested Hindi Ligatures

http://www.sanskritweb.net/itrans/itmanual2003.pdf Pages 110-130

theraysmith · 2017-01-20T22:04:15Z

The LSTM recognizer is currently trained to recognize the sequence of unicodes for Indic languages. This reduces the size of the output softmax of the network from the 5000+ elements in the unicharset to ~140. (There is an analogous process for Chinese, Japanese, and Korean, that doesn't use the unicode encoding, but it is a similar idea, and the codes are strictly limited in length.)
The unicharset is used as a filter in the beam search to allow only sensible grapheme/syllable combinations of unicodes, so it doesn't output complete garbage text.
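To make the filtering idea concrete, here is a small Python sketch of a prefix check: a partial sequence of unicodes is kept only if it can still grow into some entry of the unicharset. The names `build_prefixes` and `is_viable` and the three-entry unicharset are made up for illustration; the real beam search in Tesseract is considerably more involved.

```python
def build_prefixes(unicharset):
    """Collect every prefix of every allowed unit, as tuples of code points."""
    prefixes = set()
    for unit in unicharset:
        for end in range(1, len(unit) + 1):
            prefixes.add(tuple(unit[:end]))
    return prefixes

def is_viable(partial, prefixes):
    """True if the partial code sequence can still grow into an allowed unit."""
    return tuple(partial) in prefixes

# Hypothetical unicharset: ka, kta, kSha.
unicharset = ["\u0915", "\u0915\u094D\u0924", "\u0915\u094D\u0937"]
prefixes = build_prefixes(unicharset)

print(is_viable("\u0915\u094D", prefixes))        # ka + virama: may become kta or kSha
print(is_viable("\u0915\u094D\u0917", prefixes))  # ka + virama + ga: not in this unicharset
```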

The consequence of this recoding is that it runs a lot faster, but it has to learn to output a long sequence for each grapheme/syllable.
The recoding system that maps from unicharset elements to the sequence of unicodes currently only allows a maximum of 9 unicodes per grapheme/syllable, including any viramas.

I'm running a new training experiment this weekend to try a new coding scheme, in which <virama><consonant> pairs are mapped to a single code, allowing a long CVCVCVC string to be encoded using just CCCC, cutting down from 7 codes to 4. This will probably increase the size of the output softmax to ~170, but reduce the length of the average code sequence by about 1/3, which might be easier for it to learn, without slowing it down much.
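The effect of the proposed pairing on code length can be sketched in a few lines of Python. This only illustrates the counting; the function names are hypothetical and the code is not taken from the Tesseract recoder.

```python
VIRAMA = "\u094D"

def is_consonant(ch):
    # Devanagari consonants KA..HA (U+0915-U+0939).
    return "\u0915" <= ch <= "\u0939"

def encode_plain(text):
    """One code per Unicode code point."""
    return list(text)

def encode_paired(text):
    """Merge each <virama><consonant> pair into a single code."""
    codes = []
    i = 0
    while i < len(text):
        if text[i] == VIRAMA and i + 1 < len(text) and is_consonant(text[i + 1]):
            codes.append(text[i:i + 2])  # the pair becomes one code
            i += 2
        else:
            codes.append(text[i])
            i += 1
    return codes

# ktrya: ka virama ta virama ra virama ya, a CVCVCVC cluster.
cluster = "\u0915\u094D\u0924\u094D\u0930\u094D\u092F"
print(len(encode_plain(cluster)))   # 7
print(len(encode_paired(cluster)))  # 4
```

Each distinct <virama><consonant> pair needs its own output class, which is why the softmax grows while the sequences get shorter.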

It will take a couple of weeks to tell if it works, but if it does I will check in the code, and upload new traineddatas, and close this issue. If it doesn't work, I will have to think again...

Shreeshrii · 2017-01-21T15:46:05Z

Ray,

Thank you for explaining regarding unicharset compression and your new strategy for Indic graphemes.

Since the unicharset is being used as a filter, it will be important to include the most common conjunct clusters in it, which may differ from language to language.

Some more questions:

Are the desired_characters and forbidden_characters used in the process of creating the text corpus for different languages?
How many text lines are you using for training of Devanagari, e.g. Sanskrit, Hindi, Marathi etc.?
Is it all/only from Wikipedia?


theraysmith · 2017-01-23T18:30:29Z

The text corpus is from *all* the www, taken several years ago, plus more recent data from wiki-something. The text is divided by language automatically, so there is a separate stream for each of the Devanagari-based languages (as there is for the Latin-based languages), and each stream is clipped to 1GB.

For each language, the text is frequency counted and cleaned by multiple methods. Sometimes this automatic cleaning is too stringent, or not stringent enough, so forbidden_characters and desired_characters are used as a guide in the cleanup process. There are other lang-specific numbers like a 1-in-n discard ratio for the frequency. For some languages, the amount of data produced at the end is very thin. The unicharset is extracted from what remains, as is the wordlist that is published in langdata.

For the LSTM training, I resorted to using Google's parallel infrastructure to render enough text in all the languages. However much or little corpus text there is, the rendering process makes 50000 chunks of 50 words to render, each in a different combination of font and random degradation, which results in 400000-800000 rendered textlines. The words are chosen to approximately echo the real frequency of conjunct clusters (characters in most languages) in the source text, while also using the most frequent words.

This process is all done without significant manual intervention, but counts of the number of generated textlines indicate when it has gone badly, usually due to a lack of fonts or a lack of corpus text. I recently stopped training chr, iku, khm, mya after discovering that I have no rendered textlines that contain anything other than digits and punctuation.

Community input is therefore extremely useful, and usually results in edits to forbidden_characters and desired_characters, which in turn guide the filtration process.

Community-provided corpus text would be useful for languages that have very little or no training data, given appropriate copyright/licensing clearance. The languages with very little corpus text are: bih chr dzo iku snd syr tgk tir, so these are likely to have poor recognition accuracy.
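The chunking step described above (a fixed number of chunks, each with a fixed number of words chosen to echo corpus frequency) could look roughly like the following Python sketch. The thread does not show the actual selection logic, so the plain frequency-weighted sampling, the function name `make_chunks` and its parameters are all assumptions.

```python
import random
from collections import Counter

def make_chunks(corpus_words, num_chunks, words_per_chunk, seed=0):
    """Sample chunks of words with probability proportional to corpus frequency."""
    rng = random.Random(seed)
    counts = Counter(corpus_words)
    vocab = list(counts)
    weights = [counts[w] for w in vocab]
    return [
        rng.choices(vocab, weights=weights, k=words_per_chunk)
        for _ in range(num_chunks)
    ]

# Toy corpus; the process described in the thread uses 50000 chunks of 50 words.
corpus = "the cat sat on the mat and the dog sat too".split()
chunks = make_chunks(corpus, num_chunks=3, words_per_chunk=5)
for chunk in chunks:
    print(" ".join(chunk))
```

Because the number of chunks is fixed, a very small corpus still yields the same number of textlines, only with much less variety, which fits the remark that thin corpora lead to poor results.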


Shreeshrii · 2017-01-24T05:43:40Z

Ray,

Thank you for the info on corpus building.

I have added links for resources for bih and snd in the langdata repo just now. Please see

Bihari training text not representative langdata#39 (Bihari)
Sindhi Language resources for corpus (Arabic script) langdata#42 (Sindhi in Arabic script)

I also added a link to this discussion at #622 for support regarding Khmer.

I will forward your post to the tesseract-ocr group to reach other community members too.

Shreeshrii · 2017-01-25T04:06:42Z

theraysmith wrote: "I recently stopped training chr, iku, khm, mya after discovering that I have no rendered textlines that contain anything other than digits and punctuation."

@theraysmith

I tried creating training data for Khmer and was able to create box/tiff pairs with Khmer text. It is possible that the fonts directory you used did not have Khmer fonts, or that for some reason 'latin' fonts were used instead of Khmer fonts. I will post the files separately under an issue in langdata.

I used the --find_fonts option of text2image to find the fonts that covered 70% of the Khmer training text.

It may be useful in the training process to check the given font list for coverage and give an error or warning if it falls below a certain threshold, before going ahead with building the box/tiff pairs.
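A coverage check of the kind proposed here might be sketched as below. This is only an illustration in Python: the per-font sets of supported characters are hypothetical inputs (in practice they would have to come from the font files), coverage is measured over distinct characters, and text2image may compute its own coverage figure differently.

```python
def font_coverage(text, supported):
    """Fraction of distinct non-whitespace characters in text that the font supports."""
    needed = {ch for ch in text if not ch.isspace()}
    if not needed:
        return 1.0
    return len(needed & supported) / len(needed)

def fonts_below_threshold(text, fonts, min_coverage=0.7):
    """Return the names of fonts whose coverage falls below min_coverage."""
    return [
        name for name, supported in fonts.items()
        if font_coverage(text, supported) < min_coverage
    ]

# Three Khmer letters as a stand-in for the training text.
training_text = "\u1780\u1781 \u1782"
fonts = {
    "LatinOnly": set("abcdefghijklmnopqrstuvwxyz"),    # hypothetical font
    "KhmerCapable": set("\u1780\u1781\u1782\u1783"),   # hypothetical font
}
for name in fonts_below_threshold(training_text, fonts):
    print("Warning: font", name, "does not cover enough of the training text")
```

Running such a check before rendering would catch the case where only digits and punctuation end up in the rendered textlines.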

edit: --oem 0 works with the khm.traineddata, --oem 1 recognizes it incorrectly.

Shreeshrii · 2017-01-25T11:33:09Z


text2image --find_fonts \
--fonts_dir /usr/share/fonts \
--text ./langdata/ara/ara.training_text \
--min_coverage .8  \
--outputbase ./langdata/ara/ara \
|& grep raw | sed -e 's/ :.*/" \\/g'  | sed -e 's/^/  "/' >./langdata/ara/fontslist.txt

Commands similar to above can be used for getting a fontlist that can be plugged into language-specific.sh to ensure that it calls fonts that are available on the system and have adequate coverage. Here is the output file from the above on my system.

  "Arial" \
  "Arial Bold" \
  "Courier New" \
  "Courier New Bold" \
  "DejaVu Sans" \
  "DejaVu Sans Bold" \
  "DejaVu Sans Mono" \
  "DejaVu Sans Mono Bold" \
  "FreeMono" \
  "FreeMono Bold" \
  "FreeSerif" \
  "FreeSerif Bold" \
  "Times New Roman," \
  "Times New Roman, Bold" \

theraysmith · 2017-03-30T01:18:53Z

Update: after going back to the www to get fresh data, I believe that my corpus text is now good for:
chr
dzo
iku
snd
syr
tgk
tir
I have put a lot of time into cleaners/filters for languages that use 'virama' characters.
I am not convinced that they are perfect, but I will add the code to the github repo in due course, so experts/native speakers can offer suggestions/fixes to make them better. Myanmar in particular needs improvement, as the www data is littered with dotted circles, and the unicode book does not adequately describe the syntax for a well-formed grapheme in Myanmar (or any other language for that matter).

Shreeshrii · 2017-03-30T12:12:21Z

Ray: Regarding Myanmar, please see the discussion on tesseract-ocr/langdata#13

We have 2 types of unicode font. Non standard unicode font and standard unicode font. When I check langdata files for Burmese, most words are incorrect. I guess you have generated mixed contents with non standard unicode contents and standard unicode contents.

https://my.wikipedia.org/
All contents on wikipedia are in standard unicode font.

http://crubadan.org/languages/my lists three primary sources for Myanmar/Burmese. One is the myanmar wikipedia, the other two are:

http://www.unicode.org/udhr/d/udhr_mya.html

http://www.jw.org/mya/

Also see: tesseract-ocr/langdata#46

Myanmar wordlists
https://github.com/kanaung/wordlists

https://github.com/kanyawtech/myanmar-karen-word-lists/blob/master/burmese-word-list.txt?raw=true

https://en.wiktionary.org/wiki/Appendix:Burmese_basic_vocabulary

You may also find the charts at http://www.virtualvinodh.com/wp/character-matrix/ useful for a comparison of various Indic scripts. Please see the rows for Burmese (Myanmar).

Shreeshrii · 2017-06-14T03:53:58Z

@theraysmith You wrote in January:

The LSTM recognizer is currently trained to recognize the sequence of unicodes for Indic languages. This reduces the size of the output softmax of the network from the 5000+ elements in the unicharset to ~140. (There is an analogous process for Chinese, Japanese, and Korean, that doesn't use the unicode encoding, but it is a similar idea, and the codes are strictly limited in length.)

In recent trainings, I still see large unicharsets (eg, with ALL akshara combinations from the training_text in Devanagari).

Appending a new network to an old one!!Setting unichar properties
Setting properties for script Common
Setting properties for script Latin
Setting properties for script Devanagari
Setting properties for script Han
Unichar 1945=र्त्स्न्ये->र्त्स्न्ये is too long to encode!!
Warning: given outputs 105 not equal to unicharset of 3784.

Depending on the training text, this number can go as high as 6-7000. I thought the intention was to reduce this number.

Also, when training with hin.lstm as the starting point for replace top layer training, while the original .lstm file is about 8 MB, the intermediate .lstm files are about 80 MB and the _checkpoint file is about 160MB.

Is this to be expected or is something wrong with the training process?

Shreeshrii · 2017-06-14T12:55:35Z

I'm running a new training experiment this weekend to try a new coding scheme, in which <virama><consonant> pairs are mapped to a single code, allowing a long CVCVCVC string to be encoded using just CCCC, cutting down from 7 codes to 4. This will probably increase the size of the output softmax to ~170, but reduce the length of the average code sequence by about 1/3, which might be easier for it to learn, without slowing it down much.

@theraysmith Did the above approach work?

In https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf, you have described what a character is in Devanagari and used the following example:

rdvika - र्द्विक - 0930 094D 0926 094D 0935 093F 0915

I would actually split the above into two aksharas, each ending in either the implicit a, a maatraa, or a combining mark.

So the above would be:

rdvi - र्द्वि - 0930 094D 0926 094D 0935 093F
ka - क - 0915
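The split above can be reproduced with a small Python sketch: a consonant starts a new akshara unless it directly follows a virama. The function name is hypothetical, and independent vowels, nukta forms and other scripts are deliberately left out.

```python
VIRAMA = "\u094D"

def is_consonant(ch):
    # Devanagari consonants KA..HA (U+0915-U+0939).
    return "\u0915" <= ch <= "\u0939"

def split_aksharas(word):
    """Split a Devanagari word into aksharas (consonant cluster + trailing signs)."""
    aksharas = []
    current = ""
    prev_virama = False
    for ch in word:
        # A consonant that does not follow a virama begins a new akshara.
        if is_consonant(ch) and current and not prev_virama:
            aksharas.append(current)
            current = ""
        current += ch
        prev_virama = (ch == VIRAMA)
    if current:
        aksharas.append(current)
    return aksharas

# rdvika: 0930 094D 0926 094D 0935 093F 0915
word = "\u0930\u094D\u0926\u094D\u0935\u093F\u0915"
for akshara in split_aksharas(word):
    print(akshara, " ".join("%04X" % ord(c) for c in akshara))
```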

To reduce the various akshara combinations, I would suggest splitting them into
consonant clusters (CVCVCVC)
and
maatraas and combining marks, handled separately.

eg. possible combinations with ka (and these do not include the combining vedic accents!!)

क का कि की कु कू कृ के कै को कौ कँ काँ किँ कीँ कुँ कूँ कृँ केँ कैँ कोँ कौँ कं कां किं कीं कुं कूं कृं कें कैं कों कौं कः काः किः कीः कुः कूः कृः केः कैः कोः कौः

Imagine these for every consonant cluster, with every vowel sign (or maatraa) and other signs like candrabindu, anusvara and visarga added to each combination. The number of combinations will be HUGE.

By splitting consonant cluster part separately from maatraa and other signs combination, the number of combinations can be cut down drastically.

   ा  ि  ी  ु  ू  ृ  े  ै  ो  ौ  ँ  ाँ  िँ  ीँ  ुँ  ूँ  ृँ  ेँ  ैँ  ोँ  ौँ  ं  ां  िं  ीं  ुं  ूं  ृं  ें  ैं  ों  ौं  ः  ाः  िः  ीः  ुः  ूः  ृः  ेः  ैः  ोः  ौः

and consonants and consonant clusters such as

क ख ग घ ङ च छ ज झ ञ ट ठ ड ढ ण त थ द ध न प फ ब भ म य र ल व श ष स ह ळ
क्क क्क्य क्ख क्ख्य क्त क्त्य क्त्र क्त्र्य क्त्व क्थ क्न क्न्य क्प क्प्य क्फ क्फ्य क्म क्म्य क्य क्र क्र्य क्ल क्ल्य क्व क्व्य
क्ष क्ष्ण क्ष्म क्ष्म्य क्ष्य क्ष्व क्स ख्य ग्द ग्द्य ग्ध ग्ध्य ग्ध्व ग्न ग्न्य ग्ब ग्ब्र ग्भ ग्भ्य ग्म ग्म्य ग्य ग्र ग्र्य ग्ल ग्व
घ्न घ्म घ्य घ्र ङ्क ङ्क्त ङ्क्ष ङ्क्ष्व ङ्ख ङ्ख्य ङ्ग ङ्ग्य ङ्ग्र ङ्ग्र्य ङ्घ ङ्घ्र ङ्ङ ङ्म ङ्स
च्च च्छ च्छ्य च्छ्र च्छ्व च्म च्य छ्य छ्र छ्र्य छ्व ज्ज ज्ज्ञ ज्ज्व ज्झ ज्ञ ज्ञ्य ज्म ज्य ज्र ज्व
ञ्च ञ्च्य ञ्छ ञ्छ्र ञ्ज ञ्ज्म ञ्ज्य ञ्झ ञ्श
ट्क ट्ट ट्य ट्व ट्स ठ्य ठ्र ड्र ड्ड ड्य ढ्य ढ्र ढ्व ण्ट ण्ठ ण्ड ण्ढ ण्ण ण्म ण्य ण्व
त्क त्त त्त्य त्त्र त्त्व त्थ त्न त्न्य त्प त्प्र त्प्र्य त्फ त्म त्म्य त्य त्र त्र्य त्व त्स त्स्र त्स्र्य त्स्य त्स्व
थ्य द्ग द्ग्र द्द द्द्य द्द्र द्द्व् द्ध द्ध्य द्ध्व द्ध्व्य द्न द्न्य द्ब द्ब्र द्भ द्भ्य द्म द्य द्र द्र्य द्व् द्व्य द्व्र ध्न ध्म ध्य ध्र ध्व
न्त न्त्य न्त्र न्त्स न्थ न्द न्द्ध न्द्र न्द्व् न्ध न्ध्य न्ध्र न्ध्व न्न न्न्य न्प न्प्र न्फ न्म न्य न्व न्स
प्त प्त्य प्त्र प्न प्प प्फ प्म प्य प्र प्ल प्स फ्य ब्घ ब्ज ब्द ब्ध ब्ध्व ब्ब ब्भ ब्य ब्र ब्न भ्य ब्र भ्व
म्न म्प म्प्र म्ब म्ब्य म्भ म्ब्र म्म म्य म्र म्ल
य्य य्व ल्क ल्ग ल्प ल्म ल्य ल्ल ल्व ल्ह व्य व्र व्व श्च श्च्य श्न श्न्य श्म श्य श्र श्र्य श्ल श्व श्व्य श्श
ष्क ष्क्र ष्ट ष्ट्य ष्ट्र ष्ट्र्य ष्ट्व ष्ठ ष्ठ्य ष्ठ्र ष्ण ष्ण्य ष्प ष्प्र ष्म ष्य ष्व
स्क स्क्र स्ख स्त स्त्य स्त्र स्त्व स्थ स्थ्य स्र स्प स्प्र स्फ स्म स्म्य स्य स्र स्व स ह्न ह्म ह्य ह्र ह्ल ह्व

and (with reph)

 र्क र्ख र्ग र्घ र्ङ र्च र्छ र्ज र्झ र्ञ र्ट र्ठ र्ड र्ढ र्ण र्त र्थ र्द र्ध र्न र्प र्फ र्ब र्भ र्म र्य र्र र्ल र्व र्श र्ष र्स र्ह र्ळ
 र्क्क र्क्क्य र्क्ख र्क्ख्य र्क्त र्क्त्य र्क्त्र र्क्त्र्य र्क्त्व र्क्थ र्क्न र्क्न्य र्क्प र्क्प्य र्क्फ र्क्फ्य र्क्म र्क्म्य र्क्य र्क्र र्क्र्य र्क्ल र्क्ल्य र्क्व र्क्व्य
 र्क्ष र्क्ष्ण र्क्ष्म र्क्ष्म्य र्क्ष्य र्क्ष्व र्क्स र्ख्य र्ग्द र्ग्द्य र्ग्ध र्ग्ध्य र्ग्ध्व र्ग्न र्ग्न्य र्ग्ब र्ग्ब्र र्ग्भ र्ग्भ्य र्ग्म र्ग्म्य र्ग्य र्ग्र र्ग्र्य र्ग्ल र्ग्व
 र्घ्न र्घ्म र्घ्य र्घ्र र्ङ्क र्ङ्क्त र्ङ्क्ष र्ङ्क्ष्व र्ङ्ख र्ङ्ख्य र्ङ्ग र्ङ्ग्य र्ङ्ग्र र्ङ्ग्र्य र्ङ्घ र्ङ्घ्र र्ङ्ङ र्ङ्म र्ङ्स
 र्च्च र्च्छ र्च्छ्य र्च्छ्र र्च्छ्व र्च्म र्च्य र्छ्य र्छ्र र्छ्र्य र्छ्व र्ज्ज र्ज्ज्ञ र्ज्ज्व र्ज्झ र्ज्ञ र्ज्ञ्य र्ज्म र्ज्य र्ज्र र्ज्व
 र्ञ्च र्ञ्च्य र्ञ्छ र्ञ्छ्र र्ञ्ज र्ञ्ज्म र्ञ्ज्य र्ञ्झ र्ञ्श
 र्ट्क र्ट्ट र्ट्य र्ट्व र्ट्स र्ठ्य र्ठ्र र्ड्र र्ड्ड र्ड्य र्ढ्य र्ढ्र र्ढ्व र्ण्ट र्ण्ठ र्ण्ड र्ण्ढ र्ण्ण र्ण्म र्ण्य र्ण्व
 र्त्क र्त्त र्त्त्य र्त्त्र र्त्त्व र्त्थ र्त्न र्त्न्य र्त्प र्त्प्र र्त्प्र्य र्त्फ र्त्म र्त्म्य र्त्य र्त्र र्त्र्य र्त्व र्त्स र्त्स्र र्त्स्र्य र्त्स्य र्त्स्व
 र्थ्य र्द्ग र्द्ग्र र्द्द र्द्द्य र्द्द्र र्द्द्व् र्द्ध र्द्ध्य र्द्ध्व र्द्ध्व्य र्द्न र्द्न्य र्द्ब र्द्ब्र र्द्भ र्द्भ्य र्द्म र्द्य र्द्र र्द्र्य र्द्व् र्द्व्य र्द्व्र र्ध्न र्ध्म र्ध्य र्ध्र र्ध्व
 र्न्त र्न्त्य र्न्त्र र्न्त्स र्न्थ र्न्द र्न्द्ध र्न्द्र र्न्द्व् र्न्ध र्न्ध्य र्न्ध्र र्न्ध्व र्न्न र्न्न्य र्न्प र्न्प्र र्न्फ र्न्म र्न्य र्न्व र्न्स
 र्प्त र्प्त्य र्प्त्र र्प्न र्प्प र्प्फ र्प्म र्प्य र्प्र र्प्ल र्प्स र्फ्य र्ब्घ र्ब्ज र्ब्द र्ब्ध र्ब्ध्व र्ब्ब र्ब्भ र्ब्य र्ब्र र्ब्न र्भ्य र्ब्र र्भ्व
 र्म्न र्म्प र्म्प्र र्म्ब र्म्ब्य र्म्भ र्म्ब्र र्म्म र्म्य र्म्र र्म्ल
 र्य्य र्य्व र्ल्क र्ल्ग र्ल्प र्ल्म र्ल्य र्ल्ल र्ल्व र्ल्ह र्व्य र्व्र र्व्व र्श्च र्श्च्य र्श्न र्श्न्य र्श्म र्श्य र्श्र र्श्र्य र्श्ल र्श्व र्श्व्य र्श्श
 र्ष्क र्ष्क्र र्ष्ट र्ष्ट्य र्ष्ट्र र्ष्ट्र्य र्ष्ट्व र्ष्ठ र्ष्ठ्य र्ष्ठ्र र्ष्ण र्ष्ण्य र्ष्प र्ष्प्र र्ष्म र्ष्य र्ष्व
 र्स्क र्स्क्र र्स्ख र्स्त र्स्त्य र्स्त्र र्स्त्व र्स्थ र्स्थ्य र्स्र र्स्प र्स्प्र र्स्फ र्स्म र्स्म्य र्स्य र्स्र र्स्व र्स र्ह्न र्ह्म र्ह्य र्ह्र र्ह्ल र्ह्व
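The saving from handling clusters and signs separately can be checked with a quick count. The Python sketch below rebuilds the 44 sign combinations listed for ka (11 vowel options times 4 modifier options) and compares the number of recognition units needed when every cluster is combined with every sign combination against the number needed when the two are kept apart. The figure of 34 is simply the number of single consonants in the list above; the function names are hypothetical.

```python
# Bare consonant plus the ten maatraas used in the ka example.
VOWEL_SIGNS = ["", "\u093E", "\u093F", "\u0940", "\u0941", "\u0942",
               "\u0943", "\u0947", "\u0948", "\u094B", "\u094C"]
# No modifier, candrabindu, anusvara, visarga.
MODIFIERS = ["", "\u0901", "\u0902", "\u0903"]

def sign_combinations():
    """All vowel sign + modifier suffixes, including the empty one."""
    return [v + m for m in MODIFIERS for v in VOWEL_SIGNS]

def unit_counts(num_clusters):
    """Return (combined units, separate units) for a given number of clusters."""
    combos = sign_combinations()
    combined = num_clusters * len(combos)
    # Separate scheme: one unit per cluster plus one per non-empty suffix.
    separate = num_clusters + (len(combos) - 1)
    return combined, separate

print(len(sign_combinations()))  # 44
print(unit_counts(34))           # (1496, 77)
```

The combined count grows as a product and the separate count as a sum, so the gap widens quickly as more consonant clusters are added.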

Shreeshrii · 2017-06-15T07:22:29Z

Please see pages 48-75 of http://www.sanskritweb.net/itrans/itmanual2003.pdf
for the varied rendition of consonant cluster ligatures in different Devanagari fonts.

a1012 · 2017-10-27T07:17:09Z

Hey Ray,
Can you please explain the training process of tesseract-ocr with LSTM ?

xiongfeihtp · 2017-12-10T06:01:28Z

Hey Ray, I am confused about how data is prepared for Tesseract 4.0 training. Could you please explain it, and explain the training process of tesseract-ocr with LSTM?

ghost · 2018-06-02T11:57:44Z

@Shreeshrii Ray said:

render in a different combination of font and random degradation

How can I render my text at random degradation in training?

Shreeshrii · 2018-06-02T12:18:13Z

It is the default option in text2image.

  --degrade_image  Degrade rendered image with speckle noise, dilation/erosion and rotation  (type:bool default:true)
  --rotate_image  Rotate the image in a random way.  (type:bool default:true)

almas · 2018-07-04T03:41:56Z

Hello guys.
I want to add a new language script to Tesseract OCR.
So I would like to know the following.

Is there any automatic tool that makes the langdata training_text and wordlist files from a massive text?
Is there any documentation about preparing text data, with an explanation of the text data files? I just looked at the directory langdata/jpn/ and there are some files, but I have no idea what these files are or how to create files like them. What rules should I use to create langdata files?

kaomoneus · 2022-06-06T18:10:22Z

Hello! Is there any chance of getting the rendered ground truth data for English (eng.traineddata)?

AFAIK it was trained on a huge set of fonts, and some of them are not freely accessible, so I am not sure I can render it locally.

Thank you!

Shreeshrii mentioned this issue Jan 20, 2017

Suggest 'deva' for Devanagari tesseract-ocr/langdata#41

Closed

This was referenced Jan 24, 2017

Bihari training text not representative tesseract-ocr/langdata#39

Closed

Khmer Language Support #622

Closed

Shreeshrii mentioned this issue Jan 25, 2017

khmer - not working with --oem 1 tesseract-ocr/langdata#43

Closed

Shreeshrii mentioned this issue Jan 27, 2017

LSTM: khmer is not working with --oem 1 #682

Closed

Shreeshrii mentioned this issue Feb 14, 2017

Would like to help for Burmese/Myanmar language training? tesseract-ocr/langdata#13

Open

Shreeshrii mentioned this issue Apr 1, 2017

Sindhi Language resources for corpus (Arabic script) tesseract-ocr/langdata#42

Closed

Shreeshrii changed the title ~~LSTM: Indic - length of the compressed codes~~ Q&A: Indic - length of the compressed codes Sep 11, 2017

Shreeshrii closed this as completed Sep 11, 2017

Shreeshrii reopened this Mar 20, 2018

amitdo mentioned this issue Sep 15, 2018

Wordlists and training texts contain lots of errors tesseract-ocr/langdata_lstm#1

Open

Shreeshrii mentioned this issue Jan 23, 2020

Segmentation fault when processing large images with lstm.train #2860

Open

udibarzi mentioned this issue Feb 3, 2021

Creating training data using tesstrain.sh tesseract-ocr/tessdoc#39

Open

m-kafiyan mentioned this issue Oct 2, 2022

training tesseract for persian (fas) language tesseract-ocr/tesstrain#315

Closed
