Data Files

Amit D edited this page Jan 2, 2018 · 62 revisions

Special Data Files

Lang Code Description 4.0/3.0x traineddata
osd Orientation and script detection osd.traineddata
equ Math / equation detection equ.traineddata

Note: These two data files are compatible with older versions of Tesseract. osd is compatible with version 3.01 and up, and equ is compatible with version 3.02 and up.

Updated Data Files for Version 4.00 (September 15, 2017)

We have three sets of .traineddata files on GitHub in three separate repositories.

Most users will want tessdata_fast and that is what will be shipped as part of Linux distributions. tessdata_best is for people willing to trade a lot of speed for slightly better accuracy. It is also better for certain retraining scenarios for advanced users.

The third set, in the tessdata repository, is the only one that supports the legacy recognizer. Its 4.00 files from November 2016 contain both LSTM and legacy models.

Note: When using the new models in the tessdata_best and tessdata_fast repositories, only the new LSTM-based OCR engine is supported. The legacy engine is not supported with these files, so the OCR engine modes --oem 0 (legacy only) and --oem 2 (legacy plus LSTM) will not work with them.
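The compatibility rule in the note above can be sketched as a small lookup. This is an illustrative helper, not part of Tesseract: the name check_oem and the table are assumptions that encode only what this page states (tessdata_fast and tessdata_best do not work with --oem 0 or --oem 2, while the November 2016 tessdata files work with --oem 0 and --oem 1).

```python
# Hypothetical helper: encodes only the compatibility statements made on this page.
# Modes the page does not mention (for example --oem 3) are deliberately left out.
KNOWN_OEM_SUPPORT = {
    "tessdata_fast": {0: False, 1: True, 2: False},
    "tessdata_best": {0: False, 1: True, 2: False},
    "tessdata": {0: True, 1: True},  # November 2016 files
}


def check_oem(repository, oem):
    """Return 'ok', 'unsupported', or 'unknown' for a repository/--oem pair."""
    support = KNOWN_OEM_SUPPORT.get(repository, {})
    if oem not in support:
        return "unknown"
    return "ok" if support[oem] else "unsupported"


for repo in ("tessdata_fast", "tessdata_best", "tessdata"):
    for oem in (0, 1, 2):
        print(f"{repo:14s} --oem {oem}: {check_oem(repo, oem)}")
```

A wrapper script could call such a check before invoking tesseract, so that a mismatch between the data set and the engine mode is reported early.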

Data Files for Version 4.00 (November 29, 2016)

This set of traineddata files has support for the legacy recognizer with --oem 0 and for LSTM models with --oem 1.

Note: The kur data file was not updated from 3.04. For Fraktur, see the section Fraktur Data Files, or use the newer data files from the tessdata_fast or tessdata_best repositories.

Lang Code Language 4.0 traineddata
afr Afrikaans afr.traineddata
amh Amharic amh.traineddata
ara Arabic ara.traineddata
asm Assamese asm.traineddata
aze Azerbaijani aze.traineddata
aze_cyrl Azerbaijani - Cyrillic aze_cyrl.traineddata
bel Belarusian bel.traineddata
ben Bengali ben.traineddata
bod Tibetan bod.traineddata
bos Bosnian bos.traineddata
bul Bulgarian bul.traineddata
cat Catalan; Valencian cat.traineddata
ceb Cebuano ceb.traineddata
ces Czech ces.traineddata
chi_sim Chinese - Simplified chi_sim.traineddata
chi_tra Chinese - Traditional chi_tra.traineddata
chr Cherokee chr.traineddata
cym Welsh cym.traineddata
dan Danish dan.traineddata
deu German deu.traineddata
dzo Dzongkha dzo.traineddata
ell Greek, Modern (1453-) ell.traineddata
eng English eng.traineddata
enm English, Middle (1100-1500) enm.traineddata
epo Esperanto epo.traineddata
est Estonian est.traineddata
eus Basque eus.traineddata
fas Persian fas.traineddata
fin Finnish fin.traineddata
fra French fra.traineddata
frk Frankish frk.traineddata
frm French, Middle (ca. 1400-1600) frm.traineddata
gle Irish gle.traineddata
glg Galician glg.traineddata
grc Greek, Ancient (-1453) grc.traineddata
guj Gujarati guj.traineddata
hat Haitian; Haitian Creole hat.traineddata
heb Hebrew heb.traineddata
hin Hindi hin.traineddata
hrv Croatian hrv.traineddata
hun Hungarian hun.traineddata
iku Inuktitut iku.traineddata
ind Indonesian ind.traineddata
isl Icelandic isl.traineddata
ita Italian ita.traineddata
ita_old Italian - Old ita_old.traineddata
jav Javanese jav.traineddata
jpn Japanese jpn.traineddata
kan Kannada kan.traineddata
kat Georgian kat.traineddata
kat_old Georgian - Old kat_old.traineddata
kaz Kazakh kaz.traineddata
khm Central Khmer khm.traineddata
kir Kirghiz; Kyrgyz kir.traineddata
kor Korean kor.traineddata
kur Kurdish kur.traineddata
lao Lao lao.traineddata
lat Latin lat.traineddata
lav Latvian lav.traineddata
lit Lithuanian lit.traineddata
mal Malayalam mal.traineddata
mar Marathi mar.traineddata
mkd Macedonian mkd.traineddata
mlt Maltese mlt.traineddata
msa Malay msa.traineddata
mya Burmese mya.traineddata
nep Nepali nep.traineddata
nld Dutch; Flemish nld.traineddata
nor Norwegian nor.traineddata
ori Oriya ori.traineddata
pan Panjabi; Punjabi pan.traineddata
pol Polish pol.traineddata
por Portuguese por.traineddata
pus Pushto; Pashto pus.traineddata
ron Romanian; Moldavian; Moldovan ron.traineddata
rus Russian rus.traineddata
san Sanskrit san.traineddata
sin Sinhala; Sinhalese sin.traineddata
slk Slovak slk.traineddata
slv Slovenian slv.traineddata
spa Spanish; Castilian spa.traineddata
spa_old Spanish; Castilian - Old spa_old.traineddata
sqi Albanian sqi.traineddata
srp Serbian srp.traineddata
srp_latn Serbian - Latin srp_latn.traineddata
swa Swahili swa.traineddata
swe Swedish swe.traineddata
syr Syriac syr.traineddata
tam Tamil tam.traineddata
tel Telugu tel.traineddata
tgk Tajik tgk.traineddata
tgl Tagalog tgl.traineddata
tha Thai tha.traineddata
tir Tigrinya tir.traineddata
tur Turkish tur.traineddata
uig Uighur; Uyghur uig.traineddata
ukr Ukrainian ukr.traineddata
urd Urdu urd.traineddata
uzb Uzbek uzb.traineddata
uzb_cyrl Uzbek - Cyrillic uzb_cyrl.traineddata
vie Vietnamese vie.traineddata
yid Yiddish yid.traineddata

Data Files for Version 3.04/3.05

Note: For Arabic and Hindi you need both the traineddata file and the cube data files.

Lang Code Language 3.04 traineddata
afr Afrikaans afr.traineddata
amh Amharic amh.traineddata
ara Arabic ara.traineddata
asm Assamese asm.traineddata
aze Azerbaijani aze.traineddata
aze_cyrl Azerbaijani - Cyrillic aze_cyrl.traineddata
bel Belarusian bel.traineddata
ben Bengali ben.traineddata
bod Tibetan bod.traineddata
bos Bosnian bos.traineddata
bul Bulgarian bul.traineddata
cat Catalan; Valencian cat.traineddata
ceb Cebuano ceb.traineddata
ces Czech ces.traineddata
chi_sim Chinese - Simplified chi_sim.traineddata
chi_tra Chinese - Traditional chi_tra.traineddata
chr Cherokee chr.traineddata
cym Welsh cym.traineddata
dan Danish dan.traineddata
deu German deu.traineddata
dzo Dzongkha dzo.traineddata
ell Greek, Modern (1453-) ell.traineddata
eng English eng.traineddata
enm English, Middle (1100-1500) enm.traineddata
epo Esperanto epo.traineddata
est Estonian est.traineddata
eus Basque eus.traineddata
fas Persian fas.traineddata
fin Finnish fin.traineddata
fra French fra.traineddata
frk Frankish frk.traineddata
frm French, Middle (ca. 1400-1600) frm.traineddata
gle Irish gle.traineddata
glg Galician glg.traineddata
grc Greek, Ancient (-1453) grc.traineddata
guj Gujarati guj.traineddata
hat Haitian; Haitian Creole hat.traineddata
heb Hebrew heb.traineddata
hin Hindi hin.traineddata
hrv Croatian hrv.traineddata
hun Hungarian hun.traineddata
iku Inuktitut iku.traineddata
ind Indonesian ind.traineddata
isl Icelandic isl.traineddata
ita Italian ita.traineddata
ita_old Italian - Old ita_old.traineddata
jav Javanese jav.traineddata
jpn Japanese jpn.traineddata
kan Kannada kan.traineddata
kat Georgian kat.traineddata
kat_old Georgian - Old kat_old.traineddata
kaz Kazakh kaz.traineddata
khm Central Khmer khm.traineddata
kir Kirghiz; Kyrgyz kir.traineddata
kor Korean kor.traineddata
kur Kurdish kur.traineddata
lao Lao lao.traineddata
lat Latin lat.traineddata
lav Latvian lav.traineddata
lit Lithuanian lit.traineddata
mal Malayalam mal.traineddata
mar Marathi mar.traineddata
mkd Macedonian mkd.traineddata
mlt Maltese mlt.traineddata
msa Malay msa.traineddata
mya Burmese mya.traineddata
nep Nepali nep.traineddata
nld Dutch; Flemish nld.traineddata
nor Norwegian nor.traineddata
ori Oriya ori.traineddata
pan Panjabi; Punjabi pan.traineddata
pol Polish pol.traineddata
por Portuguese por.traineddata
pus Pushto; Pashto pus.traineddata
ron Romanian; Moldavian; Moldovan ron.traineddata
rus Russian rus.traineddata
san Sanskrit san.traineddata
sin Sinhala; Sinhalese sin.traineddata
slk Slovak slk.traineddata
slv Slovenian slv.traineddata
spa Spanish; Castilian spa.traineddata
spa_old Spanish; Castilian - Old spa_old.traineddata
sqi Albanian sqi.traineddata
srp Serbian srp.traineddata
srp_latn Serbian - Latin srp_latn.traineddata
swa Swahili swa.traineddata
swe Swedish swe.traineddata
syr Syriac syr.traineddata
tam Tamil tam.traineddata
tel Telugu tel.traineddata
tgk Tajik tgk.traineddata
tgl Tagalog tgl.traineddata
tha Thai tha.traineddata
tir Tigrinya tir.traineddata
tur Turkish tur.traineddata
uig Uighur; Uyghur uig.traineddata
ukr Ukrainian ukr.traineddata
urd Urdu urd.traineddata
uzb Uzbek uzb.traineddata
uzb_cyrl Uzbek - Cyrillic uzb_cyrl.traineddata
vie Vietnamese vie.traineddata
yid Yiddish yid.traineddata

Cube Data Files for Version 3.04/3.05

In Tesseract 3.0x, Arabic and Hindi use the Cube OCR engine. You need to download the cube files and move them into the same folder as the ara.traineddata or hin.traineddata file.

In Tesseract 4.0 the Cube OCR engine was removed from the codebase, so if you are using 4.0 or a newer version these files are not needed.

Hindi:
hin.cube.bigrams, hin.cube.fold, hin.cube.lm, hin.cube.nn, hin.cube.params, hin.cube.word-freq, hin.tesseract_cube.nn

Arabic:
ara.cube.bigrams, ara.cube.fold, ara.cube.lm, ara.cube.nn, ara.cube.params, ara.cube.word-freq, ara.cube.size, ara.tesseract_cube.nn
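Because the cube files must sit next to the traineddata file, a quick presence check can catch an incomplete download. The following is a minimal sketch: the function name missing_cube_files is hypothetical, and the file names are taken directly from the two lists above.

```python
from pathlib import Path

# Suffixes copied from the file lists above; only Arabic has a "cube.size" file.
CUBE_SUFFIXES = {
    "hin": ["cube.bigrams", "cube.fold", "cube.lm", "cube.nn",
            "cube.params", "cube.word-freq", "tesseract_cube.nn"],
    "ara": ["cube.bigrams", "cube.fold", "cube.lm", "cube.nn",
            "cube.params", "cube.word-freq", "cube.size", "tesseract_cube.nn"],
}


def missing_cube_files(tessdata_dir, lang):
    """List the cube files for `lang` that are absent from `tessdata_dir`."""
    folder = Path(tessdata_dir)
    expected = [f"{lang}.{suffix}" for suffix in CUBE_SUFFIXES[lang]]
    return [name for name in expected if not (folder / name).is_file()]


# Example: check the current directory; replace "." with your tessdata folder.
print(missing_cube_files(".", "hin"))
```

An empty list means every cube file listed for that language was found in the folder. The check is only relevant for Tesseract 3.0x, since 4.0 no longer uses these files.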

Fraktur Data Files

These data files were prepared by @paalberti for some old versions of Tesseract. dan_frak, deu_frak and swe_frak were prepared for version 3.00, slk_frak was prepared for 3.01. Updates to these files are available at paalberti/tesseract-dan-fraktur.

Lang Code Language 3.0x traineddata
dan_frak Danish - Fraktur dan_frak.traineddata
deu_frak German - Fraktur deu_frak.traineddata
slk_frak Slovak - Fraktur slk_frak.traineddata
swe_frak Swedish - Fraktur swe-frak.traineddata

Data Files for Version 3.02

Lang Code Language 3.02 traineddata
afr Afrikaans tesseract-ocr-3.02.afr.tar.gz
ara Arabic tesseract-ocr-3.02.ara.tar.gz
aze Azerbaijani tesseract-ocr-3.02.aze.tar.gz
bel Belarusian tesseract-ocr-3.02.bel.tar.gz
ben Bengali tesseract-ocr-3.02.ben.tar.gz
bul Bulgarian tesseract-ocr-3.02.bul.tar.gz
cat Catalan; Valencian tesseract-ocr-3.02.cat.tar.gz
ces Czech tesseract-ocr-3.02.ces.tar.gz
chi_sim Chinese - Simplified tesseract-ocr-3.02.chi_sim.tar.gz
chi_tra Chinese - Traditional tesseract-ocr-3.02.chi_tra.tar.gz
chr Cherokee tesseract-ocr-3.02.chr.tar.gz
dan Danish tesseract-ocr-3.02.dan.tar.gz
deu German tesseract-ocr-3.02.deu.tar.gz
ell Greek, Modern (1453-) tesseract-ocr-3.02.ell.tar.gz
eng English tesseract-ocr-3.02.eng.tar.gz
enm English, Middle (1100-1500) tesseract-ocr-3.02.enm.tar.gz
epo Esperanto tesseract-ocr-3.02.epo.tar.gz
est Estonian tesseract-ocr-3.02.est.tar.gz
eus Basque tesseract-ocr-3.02.eus.tar.gz
fin Finnish tesseract-ocr-3.02.fin.tar.gz
fra French tesseract-ocr-3.02.fra.tar.gz
frk Frankish tesseract-ocr-3.02.frk.tar.gz
frm French, Middle (ca. 1400-1600) tesseract-ocr-3.02.frm.tar.gz
glg Galician tesseract-ocr-3.02.glg.tar.gz
grc Greek, Ancient (-1453) tesseract-ocr-3.02.grc.tar.gz
heb Hebrew tesseract-ocr-3.02.heb.tar.gz
hin Hindi tesseract-ocr-3.02.hin.tar.gz
hrv Croatian tesseract-ocr-3.02.hrv.tar.gz
hun Hungarian tesseract-ocr-3.02.hun.tar.gz
ind Indonesian tesseract-ocr-3.02.ind.tar.gz
isl Icelandic tesseract-ocr-3.02.isl.tar.gz
ita Italian tesseract-ocr-3.02.ita.tar.gz
ita_old Italian - Old tesseract-ocr-3.02.ita_old.tar.gz
jpn Japanese tesseract-ocr-3.02.jpn.tar.gz
kan Kannada tesseract-ocr-3.02.kan.tar.gz
kor Korean tesseract-ocr-3.02.kor.tar.gz
lav Latvian tesseract-ocr-3.02.lav.tar.gz
lit Lithuanian tesseract-ocr-3.02.lit.tar.gz
mal Malayalam tesseract-ocr-3.02.mal.tar.gz
mkd Macedonian tesseract-ocr-3.02.mkd.tar.gz
mlt Maltese tesseract-ocr-3.02.mlt.tar.gz
msa Malay tesseract-ocr-3.02.msa.tar.gz
nld Dutch; Flemish tesseract-ocr-3.02.nld.tar.gz
nor Norwegian tesseract-ocr-3.02.nor.tar.gz
pol Polish tesseract-ocr-3.02.pol.tar.gz
por Portuguese tesseract-ocr-3.02.por.tar.gz
ron Romanian; Moldavian; Moldovan tesseract-ocr-3.02.ron.tar.gz
rus Russian tesseract-ocr-3.02.rus.tar.gz
slk Slovak tesseract-ocr-3.02.slk.tar.gz
slv Slovenian tesseract-ocr-3.02.slv.tar.gz
spa Spanish; Castilian tesseract-ocr-3.02.spa.tar.gz
spa_old Spanish; Castilian - Old tesseract-ocr-3.02.spa_old.tar.gz
sqi Albanian tesseract-ocr-3.02.sqi.tar.gz
srp Serbian tesseract-ocr-3.02.srp.tar.gz
swa Swahili tesseract-ocr-3.02.swa.tar.gz
swe Swedish tesseract-ocr-3.02.swe.tar.gz
tam Tamil tesseract-ocr-3.02.tam.tar.gz
tel Telugu tesseract-ocr-3.02.tel.tar.gz
tgl Tagalog tesseract-ocr-3.02.tgl.tar.gz
tha Thai tesseract-ocr-3.02.tha.tar.gz
tur Turkish tesseract-ocr-3.02.tur.tar.gz
ukr Ukrainian tesseract-ocr-3.02.ukr.tar.gz
vie Vietnamese tesseract-ocr-3.02.vie.tar.gz

Data Files for Version 2.0x

Lang Code Language 2.0x traineddata
deu German tesseract-2.00.deu.tar.gz
deu-f German - Fraktur tesseract-2.01.deu-f.tar.gz
eng English tesseract-2.00.eng.tar.gz
eus Basque tesseract-2.04-eus.tar.gz
fra French tesseract-2.00.fra.tar.gz
ita Italian tesseract-2.00.ita.tar.gz
nld Dutch; Flemish tesseract-2.00.nld.tar.gz
por Portuguese tesseract-2.01.por.tar.gz
spa Spanish; Castilian tesseract-2.00.spa.tar.gz
vie Vietnamese tesseract-2.01.vie.tar.gz

Format of traineddata files

The traineddata file for each language is an archive file in a Tesseract-specific format. It contains several uncompressed component files which are needed by the Tesseract OCR process. The program combine_tessdata is used to create a traineddata file from the component files and can also extract them again, as in the following examples:

Pre 4.0.0 format from Nov 2016 (with both LSTM and Legacy models)

combine_tessdata -u eng.traineddata eng.
Extracting tessdata components from eng.traineddata
Wrote eng.unicharset
Wrote eng.unicharambigs
Wrote eng.inttemp
Wrote eng.pffmtable
Wrote eng.normproto
Wrote eng.punc-dawg
Wrote eng.word-dawg
Wrote eng.number-dawg
Wrote eng.freq-dawg
Wrote eng.cube-unicharset
Wrote eng.cube-word-dawg
Wrote eng.shapetable
Wrote eng.bigram-dawg
Wrote eng.lstm
Wrote eng.lstm-punc-dawg
Wrote eng.lstm-word-dawg
Wrote eng.lstm-number-dawg
Wrote eng.version
Version string:Pre-4.0.0
1:unicharset:size=7477, offset=192
2:unicharambigs:size=1047, offset=7669
3:inttemp:size=976552, offset=8716
4:pffmtable:size=844, offset=985268
5:normproto:size=13408, offset=986112
6:punc-dawg:size=4322, offset=999520
7:word-dawg:size=1082890, offset=1003842
8:number-dawg:size=6426, offset=2086732
9:freq-dawg:size=1410, offset=2093158
11:cube-unicharset:size=1511, offset=2094568
12:cube-word-dawg:size=1062106, offset=2096079
13:shapetable:size=63346, offset=3158185
14:bigram-dawg:size=16109842, offset=3221531
17:lstm:size=5390718, offset=19331373
18:lstm-punc-dawg:size=4322, offset=24722091
19:lstm-word-dawg:size=7143578, offset=24726413
20:lstm-number-dawg:size=3530, offset=31869991
23:version:size=9, offset=31873521

4.00.00alpha LSTM only format

combine_tessdata -u eng.traineddata eng.
Extracting tessdata components from eng.traineddata
Wrote eng.lstm
Wrote eng.lstm-punc-dawg
Wrote eng.lstm-word-dawg
Wrote eng.lstm-number-dawg
Wrote eng.lstm-unicharset
Wrote eng.lstm-recoder
Wrote eng.version
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
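In both listings above, each offset equals the previous offset plus the previous size, so the components appear to be stored back to back. The sketch below parses such a listing and checks that property. The function names parse_listing and is_contiguous are illustrative, the sample is copied from the LSTM-only listing, and whether every traineddata file is laid out this way is an assumption rather than something this page guarantees.

```python
import re

# Matches listing lines such as "17:lstm:size=11689099, offset=192".
ENTRY_RE = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_listing(text):
    """Return (index, name, size, offset) tuples from combine_tessdata -u output."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match:
            index, name, size, offset = match.groups()
            entries.append((int(index), name, int(size), int(offset)))
    return entries


def is_contiguous(entries):
    """True if every component starts exactly where the previous one ends."""
    return all(
        prev[3] + prev[2] == cur[3]
        for prev, cur in zip(entries, entries[1:])
    )


SAMPLE = """\
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
"""

entries = parse_listing(SAMPLE)
print(len(entries), "components, contiguous:", is_contiguous(entries))
# prints "7 components, contiguous: True"
last = entries[-1]
print("end of last component:", last[3] + last[2])
# prints "end of last component: 15400597"
```

Lines that do not match the entry pattern, such as the "Wrote ..." and "Version string" lines, are skipped, so the full console output can be passed in unchanged.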

Proposal for compressed traineddata files

There are proposals to replace the Tesseract archive format with a standard archive format that could also support compression. A discussion on the tesseract-dev forum proposed the ZIP format as early as 2014. In 2017, an experimental implementation was provided as a pull request.