Fast integer versions of trained models
zdenop Merge pull request #16 from Shreeshrii/master
correct name kur_ara to kmr - Kurmanji (Latin script)
Latest commit 7274cfa Apr 25, 2018
script Move trained data for scripts to new subdirectory Mar 10, 2018
COPYING Use the full Apache License text Sep 15, 2017
README.md Updated based on Ray's comment Mar 20, 2018
afr.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
amh.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
ara.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
asm.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
aze.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
aze_cyrl.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
bel.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
ben.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
bod.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
bos.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
bre.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
bul.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
cat.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
ceb.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
ces.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
chi_sim.traineddata Fix extra intra-word spaces by adding config file Feb 20, 2018
chi_sim_vert.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
chi_tra.traineddata Fix extra spaces in words for chi_tra Feb 20, 2018
chi_tra_vert.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
chr.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
cos.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
cym.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
dan.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
deu.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
div.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
dzo.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
ell.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
eng.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
enm.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
epo.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
est.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
eus.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
fao.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
fas.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
fil.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
fin.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
fra.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
frk.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
frm.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
fry.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
gla.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
gle.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
glg.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
grc.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
guj.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
hat.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
heb.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
hin.traineddata Add config files to fix auto PSM issue 1273 Feb 26, 2018
hrv.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
hun.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
hye.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
iku.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
ind.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
isl.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
ita.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
ita_old.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
jav.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
jpn.traineddata Fix extra intra-word spaces by adding config file Feb 20, 2018
jpn_vert.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
kan.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
kat.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
kat_old.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
kaz.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
khm.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
kir.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
kmr.traineddata correct name kur_ara to kmr - Kurmanji (Latin script) Apr 25, 2018
kor.traineddata Fix extra intra-word spaces by adding config file Feb 20, 2018
kor_vert.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
lao.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
lat.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
lav.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
lit.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
ltz.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
mal.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
mar.traineddata Add config files to fix auto PSM issue 1273 Feb 26, 2018
mkd.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
mlt.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
mon.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
mri.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
msa.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
mya.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
nep.traineddata Add config files to fix auto PSM issue 1273 Feb 26, 2018
nld.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
nor.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
oci.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
ori.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
osd.traineddata Use legacy Orientation Script Detector (OSD) because that is the only… Sep 15, 2017
pan.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
pol.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
por.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
pus.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
que.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
ron.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
rus.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
san.traineddata Fix config file for default oem mode, change to --oem 1 Sep 15, 2017
sin.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
slk.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
slv.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
snd.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
spa.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
spa_old.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
sqi.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
srp.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
srp_latn.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
sun.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
swa.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
swe.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
syr.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
tam.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
tat.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
tel.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
tgk.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
tha.traineddata Fix extra intra-word spaces by adding config file Feb 20, 2018
tir.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
ton.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
tur.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
uig.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
ukr.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
urd.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
uzb.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
uzb_cyrl.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
vie.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
yid.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017
yor.traineddata Initial import to github (on behalf of Ray) Sep 14, 2017

README.md

tessdata_fast – Fast integer versions of trained models

This repository contains fast integer versions of trained models for the Tesseract Open Source OCR Engine.

  • These models are a speed/accuracy compromise: they use the network configuration that offered the best "value for money" in speed versus accuracy.
  • For some languages these are still the best models available, but for most they are not.
  • The "best value for money" network configuration was then integerized for further speed.
  • Most users will want to use these traineddata files for OCR, and they will be shipped as part of Linux distributions, e.g. Ubuntu 18.04.
  • Fine-tuning/incremental training will NOT be possible from these fast models, as they are 8-bit integer.
  • When using the models in this repository, only the new LSTM-based OCR engine is supported. The legacy tesseract engine is not supported with these files, so Tesseract's oem modes '0' and '2' won't work with them.
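As a minimal sketch of the engine-mode restriction above, the following Python helper builds a `tesseract` command line and refuses the legacy modes. The helper name `build_tesseract_command` and the file names are hypothetical; only the `-l`, `--oem` and `--tessdata-dir` options of the Tesseract 4 command line are assumed.

```python
import shlex

# Engine modes of the Tesseract 4 CLI:
# 0 = legacy only, 1 = LSTM only, 2 = legacy + LSTM, 3 = default.
# Modes 0 and 2 need the legacy engine, which these files do not support.
LEGACY_OEM_MODES = {0, 2}


def build_tesseract_command(image, output_base, lang="eng", oem=1, tessdata_dir=None):
    """Return an argv list for the tesseract CLI (hypothetical helper)."""
    if oem in LEGACY_OEM_MODES:
        raise ValueError(f"--oem {oem} requires the legacy engine")
    if oem not in (1, 3):
        raise ValueError(f"unknown engine mode: {oem}")
    cmd = ["tesseract", image, output_base, "-l", lang, "--oem", str(oem)]
    if tessdata_dir is not None:
        # Directory that holds the *.traineddata files from this repository.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # even where Tesseract is not installed.
    print(shlex.join(build_tesseract_command("page.png", "page", lang="deu")))
```

The returned list can be passed to `subprocess.run` on a machine where Tesseract and the chosen traineddata file are installed.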

Two types of models

The repository contains two types of models:

  • those for a single language and
  • those for a single script supporting one or more languages.

Most of the script models include English training data in addition to the script itself. The Cyrillic model does not, as adding English to it would cause a major ambiguity problem.

On Debian and Ubuntu, the language-based traineddata packages are named tesseract-ocr-LANG, where LANG is the three-letter language code, e.g. tesseract-ocr-eng (English language), tesseract-ocr-hin (Hindi language), etc.

On Debian and Ubuntu, the script-based traineddata packages are named tesseract-ocr-script-SCRIPT, where SCRIPT is the four-letter script code, e.g. tesseract-ocr-script-latn (Latin script), tesseract-ocr-script-deva (Devanagari script), etc.
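The naming scheme can be sketched as a small Python function. The function name `debian_package_name` is hypothetical, and replacing underscores with hyphens for codes such as chi_sim is an assumption based on Debian package names not allowing underscores; check the distribution's package index for the exact name.

```python
def debian_package_name(code, is_script=False):
    """Build the Debian/Ubuntu package name for a traineddata code.

    code      -- language code (e.g. "eng") or script code (e.g. "Latn")
    is_script -- True for the script-based packages
    """
    # Assumption: package names are lowercase and use "-" instead of "_".
    normalized = code.lower().replace("_", "-")
    prefix = "tesseract-ocr-script-" if is_script else "tesseract-ocr-"
    return prefix + normalized


if __name__ == "__main__":
    for code, is_script in [("eng", False), ("hin", False), ("Latn", True), ("Deva", True)]:
        print(debian_package_name(code, is_script))
```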

Data files for a particular script

Initial capitals in the filename indicate the one model for all languages in that script. These files are now available under the script subdirectory.

  • Latin is all Latin-based languages, except vie.
  • Vietnamese is for the Latin-based Vietnamese language.
  • Fraktur is basically a combination of all the Latin-based languages that have an 'old' variant.
  • Devanagari is for hin+san+mar+nep+eng.
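The filename convention above can be checked mechanically. The sketch below is illustrative only: the function `classify_traineddata` is hypothetical, and it treats every lowercase name as a language model, so the special osd.traineddata file would also be reported as "language".

```python
from pathlib import PurePosixPath


def classify_traineddata(path):
    """Return (kind, name), where kind is "script" or "language".

    The decision uses only the convention that script models start
    with a capital letter (e.g. script/Latin.traineddata).
    """
    p = PurePosixPath(path)
    if p.suffix != ".traineddata":
        raise ValueError(f"not a traineddata file: {path}")
    name = p.stem
    kind = "script" if name[:1].isupper() else "language"
    return kind, name


if __name__ == "__main__":
    for path in ["eng.traineddata", "chi_sim_vert.traineddata",
                 "script/Latin.traineddata", "script/Devanagari.traineddata"]:
        print(path, "->", classify_traineddata(path))
```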

LSTM training details for different languages and scripts

For Latin-based languages, the existing model data provided has been trained on about 400000 textlines spanning about 4500 fonts. For other scripts, far fewer fonts are available, but the models have still been trained on a similar number of textlines, e.g. Latin ~4500 fonts, Devanagari ~50 fonts, Kannada ~15 fonts.

On the theory that poor accuracy on test data and over-fitting on training data were caused by the lack of fonts, the training data has been mixed with English, so that some of the font diversity might generalize to the other script. The overall effect was slightly positive, hence the script models also include the English language.
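A rough calculation shows how uneven the font coverage is. It assumes exactly 400000 textlines for every script, which is a simplification of "a similar number of textlines"; the font counts are the approximate figures quoted above.

```python
# Assumption: the same 400000 textlines for every script.
TEXTLINES = 400_000

# Approximate font counts quoted in the text.
FONTS = {"Latin": 4500, "Devanagari": 50, "Kannada": 15}


def textlines_per_font(textlines, fonts):
    """Average number of training textlines rendered in each font."""
    return textlines / fonts


if __name__ == "__main__":
    for script, fonts in FONTS.items():
        ratio = textlines_per_font(TEXTLINES, fonts)
        print(f"{script:<11}{fonts:>5} fonts  ~{ratio:,.0f} textlines per font")
```

Under these assumptions each Latin font contributes roughly 89 textlines, each Devanagari font 8,000 and each Kannada font about 26,667, which is consistent with the concern that a model may over-fit to the few fonts it sees.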

Example - jpn and Japanese

'jpn' contains whatever appears on the web that is labelled as Japanese, trained only with fonts that can render Japanese.

Japanese contains all the languages that use that script (in this case just the one) PLUS English. The resulting model is trained with a mix of both training sets, with the expectation that some of the generalization from the 4500 English training fonts will also apply to the other script, which has far fewer fonts.

'jpn_vert' is trained on text rendered vertically (but the image is rotated so the long edge is still horizontal).

'jpn' loads 'jpn_vert' as a secondary language so it can try it in case the text is rendered vertically. This seems to work most of the time as a reasonable solution.


See the Tesseract wiki for additional information.

All data in the repository is licensed under the Apache-2.0 License; see the file COPYING.