Source training data for Tesseract for lots of languages
zdenop Merge pull request #123 from Shreeshrii/patch-1
remove 'tessedit_load_sublangs chi_tra' for korean
Latest commit 106c9b3 Apr 9, 2018
afr Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
amh Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
ara Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
asm Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
aze Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
aze_cyrl Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
bel Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
bel_tarask add lexical data (using same character data as bel, for now) Aug 15, 2014
ben Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
bih Merge pull request #113 from Shreeshrii/patch-5 Feb 22, 2018
bod Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
bos Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
bul Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
cat Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
ceb Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
ces Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
chi_sim Fixes extra intra-word spacing in Chinese for 4.0 Feb 20, 2018
chi_sim_vert Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
chi_tra Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
chi_tra_vert Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
chr Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
cym Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
dan Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
deu Merge pull request #56 from stweil/master Feb 20, 2017
div Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
dzo Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
ell Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
eng Add ï to desired English characters Feb 21, 2017
enm Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
epo Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
est Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
eus Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
fas Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
fin Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
fra Delete cube langdata Dec 14, 2016
frk Remove more German confusions Feb 19, 2017
frm Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
gle Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
gle_uncial Merge pull request #5 from tesseract-ocr/gle_uncial Feb 21, 2018
glg Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
grc Update grc to latest version upstream Dec 11, 2015
guj Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
hat Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
heb Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
hin Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
hrv Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
hun Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
iast add IAST version of san and hin training text Feb 22, 2018
iku Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
ind Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
isl Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
ita ita: Remove user words Jun 5, 2017
ita_old Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
jav Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
jpn Remove parameter textord_tabfind_vertical_horizontal_mix Mar 29, 2018
jpn_vert Remove parameter textord_tabfind_vertical_horizontal_mix Mar 29, 2018
kan Fix file mode (remove execute permission) Mar 29, 2018
kat Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
kat_old Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
kaz Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
khm Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
kir Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
kor remove 'tessedit_load_sublangs chi_tra' Apr 9, 2018
kur Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
kur_ara copy files from kur subdirectory Mar 23, 2018
lao Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
lat sort|uniq Feb 21, 2018
lav Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
lit Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
mal Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
mar Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
mkd Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
mlt Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
mri training_text Feb 21, 2018
msa Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
mya Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
nep Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
nld Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
nor Fixed issue 15 Jan 11, 2017
ori Update desired_characters Dec 8, 2017
pan Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
pol Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
por Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
pus Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
ron Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
rus Delete cube langdata Dec 14, 2016
rus_accent unicharset from rus; ligatures for accents todo May 14, 2015
san Merge pull request #15 from Shreeshrii/master Feb 21, 2018
sin Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
slk Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
slv Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
snd Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
spa Delete cube langdata Dec 14, 2016
spa_old Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
sqi Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
srp Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
srp_latn do not load Serbian Cyrillic for Serbian latin OCR Mar 8, 2016
swa Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
swe Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
syr Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
tam Fix file mode (remove execute permission) Mar 29, 2018
tel Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
tgk Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
tgl Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
tha Addresses extra spaces problem with 4.00 Feb 20, 2018
tir Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
tur Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
tyv Updated training text. Dec 30, 2015
uig Updates to desired/forbidden characters to include Arabic diacritcs, … Jan 13, 2017
ukr Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
urd urd.wordlist Jan 13, 2018
uzb Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
uzb_cyrl Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
vie Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
yid Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
zlm Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
Arabic.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Arabic.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Armenian.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Armenian.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Bengali.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Bengali.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Bopomofo.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Bopomofo.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Canadian_Aboriginal.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Canadian_Aboriginal.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Cherokee.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Cherokee.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Common.unicharset Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
Cyrillic.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Cyrillic.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Devanagari.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Devanagari.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Ethiopic.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Ethiopic.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Georgian.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Georgian.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Greek.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Greek.xheights Add Ancient Greek langdata Oct 29, 2015
Gujarati.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Gujarati.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Gurmukhi.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Gurmukhi.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Han.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Han.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Hangul.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Hangul.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Hebrew.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Hebrew.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Hiragana.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Hiragana.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Kannada.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Kannada.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Katakana.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Katakana.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Khmer.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Khmer.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Lao.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Lao.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Latin.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Latin.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Malayalam.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Malayalam.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Myanmar.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Myanmar.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Ogham.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Ogham.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Oriya.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Oriya.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
README.md Create README.md May 14, 2015
Runic.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Runic.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Sinhala.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Sinhala.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Syriac.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Syriac.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Tamil.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Tamil.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Telugu.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Telugu.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Thai.unicharset Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Thai.xheights Initial commit of *all* the language source data (87 langs) Aug 12, 2014
Tibetan.unicharset Updated all langdata with newly generated source training data for 3.04 Jun 24, 2015
common.punc Initial commit of *all* the language source data (87 langs) Aug 12, 2014
common.unicharambigs Fix file mode (remove execute permission) Mar 29, 2018
font_properties Merge pull request #19 from nickjwhite/addgrc Feb 21, 2018
forbidden_characters_default Initial commit of *all* the language source data (87 langs) Aug 12, 2014
radical-stroke.txt Changed stroke encoding to be based on wubi instead of radical stroke… Jul 25, 2017

README.md

langdata

Source training data for Tesseract for lots of languages

Want to retrain Tesseract for a specific language by modifying or augmenting the original training data? Then you have come to the right place!

If you are looking for a ready-made language data set to run Tesseract, see our tessdata repository instead.

To re-create the training of a single language, lang, you need the following:

  • All the data in the lang directory.
  • The corresponding unicharset/xheights files for the script(s) used by lang.
  • All the remaining non-lang-specific files in the top-level directory, such as font_properties.
  • The fonts needed to train the language. Some languages were trained with commercially available fonts, so to reproduce the training exactly you will need to buy them; otherwise, use substitutes.
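
The first three items in the list above can be gathered with a short script. What follows is a minimal sketch in Python, not part of this repository: the function name `collect_training_inputs` is hypothetical, and it assumes the caller already knows which script(s) a language uses (for example `Latin` for `eng`), since that mapping is not looked up here. It treats every top-level file that is not a `.unicharset` or `.xheights` file as non-lang-specific, which may be broader than what training actually reads. Fonts are not covered.

```python
from pathlib import Path


def collect_training_inputs(langdata_root, lang, scripts):
    """Return sorted paths of the files the README lists for one language.

    `scripts` is supplied by the caller (e.g. ["Latin"]); this sketch does
    not try to work out which script(s) a language uses.
    """
    root = Path(langdata_root)
    lang_dir = root / lang
    if not lang_dir.is_dir():
        raise FileNotFoundError(f"no language directory: {lang_dir}")

    # 1. All the data in the lang directory.
    lang_files = [p for p in lang_dir.rglob("*") if p.is_file()]

    # 2. The unicharset/xheights files for each script; absent ones are skipped.
    script_suffixes = (".unicharset", ".xheights")
    script_files = [
        root / f"{script}{suffix}"
        for script in scripts
        for suffix in script_suffixes
        if (root / f"{script}{suffix}").is_file()
    ]

    # 3. Remaining top-level files that are not per-script, such as
    #    font_properties. Hidden files (e.g. VCS metadata) are ignored.
    shared_files = [
        p
        for p in root.iterdir()
        if p.is_file()
        and p.suffix not in script_suffixes
        and not p.name.startswith(".")
    ]

    return sorted(set(lang_files + script_files + shared_files))
```

For instance, `collect_training_inputs("langdata", "eng", ["Latin"])` would return the contents of `eng/`, `Latin.unicharset` and `Latin.xheights` if present, and shared files such as `font_properties`. Add `"Common"` to the script list if `Common.unicharset` should be included as well.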