Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Clone this wiki locally
How to use the tools provided to train Tesseract 4.00
Have questions about the training process? If you had some problems during the training process and you need help, use tesseract-ocr mailing-list to ask your question(s). PLEASE DO NOT report your problems and ask questions about training as issues!
- Before You Start
- Additional Libraries Required
- Building the Training Tools
- Hardware-Software Requirements
- Overview of Training Process
- Understanding the Various Files Used During Training
- Creating the Training Data
Tutorial Guide to lstmtraining
- Creating the starter traineddata
- LSTMTraining Command Line
- Training From Scratch
- Fine Tuning for Impact(new-font-style)
- Fine Tuning for ± a few characters
- Training Just a Few Layers
- Error Messages From Training
- Combining the Output Files
- The Hallucination Effect
Tesseract 4.00 includes a new neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages however, it may actually be faster than base Tesseract.
Neural networks require significantly more training data and train a lot slower than base Tesseract. For Latin-based languages, the existing model data provided has been trained on about 400000 textlines spanning about 4500 fonts. For other scripts, not so many fonts are available, but they have still been trained on a similar number of textlines. Instead of taking a few minutes to a couple of hours to train, Tesseract 4.00 takes a few days to a couple of weeks. Even with all this new training data, you might find it inadequate for your particular problem, and therefore you are here wanting to retrain it.
There are multiple options for training:
- Fine tune. Starting with an existing trained language, train on your specific additional data. This may work for problems that are close to the existing training data, but different in some subtle way, like a particularly unusual font. May work with even a small amount of training data.
- Cut off the top layer (or some arbitrary number of layers) from the network and retrain a new top layer using the new data. If fine tuning doesn't work, this is most likely the next best option. Cutting off the top layer could still work for training a completely new language or script, if you start with the most similar looking script.
- Retrain from scratch. This is a daunting task, unless you have a very representative and sufficiently large training set for your problem. If not, you are likely to end up with an over-fitted network that does really well on the training data, but not on the actual data.
While the above options may sound different, the training steps are actually almost identical, apart from the command line, so it is relatively easy to try it all ways, given the time or hardware to run them in parallel.
For 4.00 at least, the old recognition engine is still present, and can also be trained, but is deprecated, and, unless good reasons materialize to keep it, may be deleted in a future release.
Before You Start
You don't need any background in neural networks to train Tesseract 4.00, but it may help in understanding the difference between the training options. Please read the Implementation introduction before delving too deeply into the training process, and the same note as for training Tesseract 3.04 applies:
Important note: Before you invest time and effort on training Tesseract, it is highly recommended to read the ImproveQuality page.
Additional Libraries Required
Beginning with 3.03, additional libraries are required to build the training tools.
sudo apt-get install libicu-dev sudo apt-get install libpango1.0-dev sudo apt-get install libcairo2-dev
Building the Training Tools
Beginning with 3.03, if you're compiling Tesseract from source you need to make and install the training tools with separate make commands. Once the above additional libraries have been installed, run the following from the Tesseract source directory:
make make training sudo make training-install
It is also useful, but not required, to build ScrollView.jar:
make ScrollView.jar export SCROLLVIEW_PATH=$PWD/java
At time of writing, training only works on Linux? Windows yet? As for running Tesseract 4.00, it is useful, but not essential to have a multi-core (4 is good) machine, with OpenMP and Intel Intrinsics support for SSE/AVX extensions. Basically it will still run on anything with enough memory, but the higher-end your processor is, the faster it will go. No GPU is needed. (No support.) Memory use can be controlled via the --max_image_MB command-line option, but you are likely to need at least 1GB of memory over and above what is taken by your OS.
Training Text Requirements
For Latin-based languages, the existing model data provided has been trained on about 400000 textlines spanning about 4500 fonts. For other scripts, not so many fonts are available, but they have still been trained on a similar number of textlines.
Note that it is beneficial to have more training text and make more pages though, as neural nets don't generalize as well and need to train on something similar to what they will be running on. If the target domain is severely limited, then all the dire warnings about needing a lot of training data may not apply, but the network specification may need to be changed.
Overview of Training Process
The overall training process is similar to training 3.04.
Conceptually the same:
- Prepare training text.
- Render text to image + box file. (Or create hand-made box files for existing image data.)
- Make unicharset file. (Can be partially specified, ie created manually).
- Make a starter traineddata from the unicharset and optional dictionary data.
- Run tesseract to process image + box file to make training data set.
- Run training on training data set.
- Combine data files.
The key differences are:
- The boxes only need to be at the textline level. It is thus far easier to make training data from existing image data.
- The .tr files are replaced by .lstmf data files.
- Fonts can and should be mixed freely instead of being separate.
- The clustering steps (mftraining, cntraining, shapeclustering) are replaced with a single slow lstmtraining step.
The training cannot be quite as automated as the training for 3.04 for several reasons:
- The slow training step isn't good to run from the middle of a script as it can be restarted if stopped, and it is hard to tell automatically when it is finished.
- There are multiple options for how to train the network (see above).
- The language models and unicharset are allowed to be different from those used by base Tesseract, but don't have to be.
- It isn't necessary to have a base Tesseract of the same language as the neural net Tesseract.
The process of Creating the training data is
documented below, followed by a Tutorial guide to
lstmtraining which gives an introduction to
the main training process, with command-lines that have been tested for real. On
Linux at least, you should be able to just copy-paste the command lines into
your terminal. To make the
tesstrain.sh script work, it will be necessary to
PATH to include your local
api directories, or use
Understanding the Various Files Used During Training
As with base Tesseract, the completed LSTM model and everything else it needs is
collected in the
traineddata file. Unlike base Tesseract, a starter
traineddata file is given during training, and has to be setup in advance. It
- Config file providing control parameters.
- Unicharset defining the character set.
- Unicharcompress, aka the recoder, which maps the unicharset further to the codes actually used by the neural network recognizer.
- Punctuation pattern dawg, with patterns of punctuation allowed around words.
- Word dawg. The system word-list language model.
- Number dawg, with patterns of numbers that are allowed.
Bold elements must be provided. Others are optional, but if any of the dawgs
are provided, the punctuation dawg must also be provided. A new tool:
combine_lang_data is provided to make a starter
traineddata from a
unicharset and optional wordlists.
During training, the trainer writes checkpoint files, which is a standard
behavior for neural network trainers. This allows training to be stopped and
continued again later if desired. Any checkpoint can be converted to a full
traineddata for recognition by using the
--stop_training command-line flag.
The trainer also periodically writes checkpoint files at new bests achieved during training.
It is possible to modify the network and retrain just part of it, or fine tune
for specific training data (even with a modified unicharset!) by telling the
--continue_from either an existing checkpoint file, or from a naked
LSTM model file that has been extracted from an existing
combine_tessdata provided it has not been converted to integer.
If the unicharset is changed in the
--traineddata flag, compared to the one
that was used in the model provided via
--continue_from, then the
--old_traineddata flag must be provided with the corresponding
file that holds the
recoder. This enables the trainer to
compute the mapping between the character sets.
The training data is provided via
.lstmf files, which are serialized
DocumentData They contain an image and the corresponding UTF8 text
transcription, and can be generated from tif/box file pairs using Tesseract in a
similar manner to the way
.tr files were created for the old engine.
Creating Training Data
Making Box Files
As with base Tesseract, there is a choice between rendering synthetic training data from fonts, or labeling some pre-existing images (like ancient manuscripts for example).
In either case, the required format is still the tiff/box file pair, except that the boxes only need to cover a textline instead of individual characters.
Each line in the box file matches a 'character' (glyph) in the tiff image.
<symbol> <left> <bottom> <right> <top> <page>
To mark an end-of-textline, a special line must be inserted after a series of lines.
<tab> <left> <bottom> <right> <top> <page>
Note that in all cases, even for right-to-left languages, such as Arabic, the text transcription for the line, should be ordered left-to-right. In other words, the network is going to learn from left-to-right regardless of the language, and the right-to-left/bidi handling happens at a higher level inside Tesseract.
These instructions only cover the case of rendering from fonts, so the needed fonts must be installed first.
The setup for running tesstrain.sh is the
same as for base Tesseract. Use
--linedata_only option for LSTM training.
Note that it is beneficial to have more training
text and make more pages though, as neural nets don't generalize as well and
need to train on something similar to what they will be running on. If the
target domain is severely limited, then all the dire warnings about needing a
lot of training data may not apply, but the network specification may need to be
Training data is created using tesstrain.sh as follows: Note that your fonts location may vary.
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \ --noextract_font_properties --langdata_dir ../langdata \ --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
The above command makes LSTM training data equivalent to the data used to train base Tesseract for English. For making a general-purpose LSTM-based OCR engine, it is woefully inadequate, but makes a good tutorial demo.
Now try this to make eval data for the 'Impact' font:
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \ --noextract_font_properties --langdata_dir ../langdata \ --tessdata_dir ./tessdata \ --fontlist "Impact Condensed" --output_dir ~/tesstutorial/engeval
We will use that data later to demonstrate tuning.
Tutorial Guide to lstmtraining
Creating Starter Traineddata
NOTE: This is a new step!
Instead of a
lstmtraining now takes a
traineddata file on its command-line, to obtain all the information it needs
on the language to be learned. The
traineddata must contain at least an
lstm-recoder component, and may also contain the three
lstm-punc-dawg lstm-word-dawg lstm-number-dawg A
config file is
also optional. The other components, if present, will be ignored and unused.
There is no tool to create the
lstm-recoder directly. Instead there is a new
combine_lang_model which takes as input an
script_dir points to the
langdata directory) and
lang is the
language being used) and optional word list files. It creates the
input_unicharset and creates all the dawgs, if wordlists are provided,
putting everything together into a
LSTMTraining Command Line
The lstmtraining program is a multi-purpose tool for training neural networks. The following table describes its command-line options:
||none||Path to the starter traineddata file that contains the unicharset, recoder and optional language model.|
||none||Specifies the topology of the network.|
||none||Base path of output model files/checkpoints.|
||Maximum amount of memory to use for caching images.|
||Set to true for sequential training. Default is to process all training data in round-robin fashion.|
||When the network gets good, only backprop a perfect sample after this many imperfect samples have been seen since the last perfect sample was allowed through.|
||If non-zero, show visual debugging every this many iterations.|
||Range of random values to initialize weights.|
||Momentum for alpha smoothing gradients.|
||Smoothing factor squared gradients in ADAM algorithm.|
||Stop training after this many iterations.|
||Stop training if the mean percent error rate gets below this value.|
||none||Path to previous checkpoint from which to continue training or fine tune.|
||Convert the training checkpoint in
||Cut the head off the network at the given index and append
||none||Filename of a file listing training data files.|
||none||Filename of a file listing evaluation data files to be used in evaluating the model independently of the training data.|
Most of the flags work with defaults, and several are only required for particular operations listed below, but first some detailed comments on the more complex flags:
LSTMs are great at learning sequences, but slow down a lot when the number of
states is too large. There are empirical results that suggest it is better to
ask an LSTM to learn a long sequence than a short sequence of many classes, so
for the complex scripts, (Han, Hangul, and the Indic scripts) it is better to
recode each symbol as a short sequence of codes from a small number of classes
than have a large set of classes. The
combine_lang_model command has this
feature on by default. It encodes each Han character as a variable-length
sequence of 1-5 codes, Hangul using the Jamo encoding as a sequence of 3 codes,
and other scripts as a sequence of their unicode components. For the scripts
that use a virama character to generate conjunct consonants, (All the Indic
scripts plus Myanmar and Khmer) the function
pairs the virama with an appropriate neighbor to generate a more glyph-oriented
encoding in the unicharset. To make full use of this improvement, the
--pass_through_recoder flag should be set for
combine_lang_model for these
Randomized Training Data and sequential_training
For Stochastic Gradient Descent to work properly, the training data is supposed to be randomly shuffled across all the sample files, so the trainer can read its way through each file in turn and go back to the first one when it reaches the end. This is entirely contrary to the way base Tesseract was trained!
If using the rendering code, (via
tesstrain.sh) then it will shuffle the
sample text lines within each file, but you will get a set of files, each
containing training samples from a single font. To add a more even mix, the
default is to process one sample from each file in turn aka 'round robin' style.
If you have generated training data some other way, or it is all from the same
style (a handwritten manuscript book for instance) then you can use the
--sequential_training flag for
lstmtraining. This is more memory efficient
since it will load data from only two files at a time, and process them in
sequence. (The second file is read-ahead so it is ready when needed.)
The trainer saves checkpoints periodically using
--model_output as a basename.
It is therefore possible to stop training at any point, and restart it, using
the same command line, and it will continue. To force a restart, use a different
--model_output or delete all the files.
Net Mode and Optimization
128 flag turns on Adam optimization, which seems to work a lot better than
64 flag enables automatic layer-specific learning rate. When progress
stalls, the trainer investigates which layer(s) should have their learning rate
reduced independently, and may lower one or more learning rates to continue
The default value of
192 enables both Adam and layer-specific
Perfect Sample Delay
Training on "easy" samples isn't necessarily a good idea, as it is a waste of
time, but the network shouldn't be allowed to forget how to handle them, so it
is possible to discard some easy samples if they are coming up too often. The
--perfect_sample_delay argument discards perfect samples if there haven't been
that many imperfect ones seen since the last perfect sample. The current default
value of zero uses all samples. In practice the value doesn't seem to have a
huge effect, and if training is allowed to run long enough, zero produces the
Debug Interval and Visual Debugging
With zero (default)
--debug_interval, the trainer outputs a progress report
every 100 iterations.
--debug_interval -1, the trainer outputs verbose text debug for every
--debug_interval > 0, the trainer displays several windows of debug
information on the layers of the network. In the special case of
--debug_interval 1 it waits for a click in the
LSTMForward window before
continuing to the next iteration, but for all others it just continues and draws
information at the frequency requested.
The text debug information includes the truth text, the recognized text, the iteration number, the training sample id (file and page) and the mean value of several error metrics.
The visual debug information includes:
A forward and backward window for each network layer. Most are just random
noise, but the
ConvNL windows are worth viewing.
Output shows the output of the final Softmax, which starts out as a yellow
line for the null character, and gradually develops yellow marks at each point
where it thinks there is a character. (The x-axis is the image x-coordinate, and
the y-axis is character class.) The
Output-back window shows the difference
between the actual output and the target using the same layout, but with yellow
for "give me more of this" and blue for "give me less of this". As the network
ConvNL window develops the typical edge detector results that you
expect from the bottom layer.
LSTMForward shows the output of the whole network on the training image.
LSTMTraining shows the training target on the training image. In both, green
lines are drawn to show the peak output for each character, and the character
itself is drawn to the right of the line.
The other two windows worth looking at are
CTC Outputs and
These show the current output of the network and the targets as a line graph of
strength of output against image x-coordinate. Instead of a heatmap, like the
Output window, a different colored line is drawn for each character class and
the y-axis is strength of output.
Training From Scratch
The following example shows the command line for training from scratch. Try it with the default training data created with the command-lines above.
mkdir -p ~/tesstutorial/engoutput training/lstmtraining --debug_interval 100 \ --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \ --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \ --model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \ --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \ --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \ --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
In a separate window monitor the log file:
tail -f ~/tesstutorial/engoutput/basetrain.log
(If you tried this tutorial before, you might notice that the numbers have changed. This is a result of a slightly smaller network, and the addition of the ADAM optimizer, which enables a higher learning rate.)
You should observe that by 600 iterations, the spaces (white) are starting to
show on the
CTC Outputs window and by 1300 iterations green lines appear on
LSTMForward window where there are spaces in the image.
By 1300 iterations, there are noticeable non-space bumps in the
Note that the
CTC Targets, which started at all the same height are now varied
in height because of the definite output for spaces and some and the tentative
outputs for other characters. At the same time, the characters and positioning
of the green lines in the
LSTMTraining window are not as accurate as they were
initially, because the partial output from the network confuses the CTC
algorithm. (CTC assumes statistical independence between the different
x-coordinates, but they are clearly not independent.)
By 2000 iterations, it should be clear on the
Output window that some faint
yellow marks are appearing to indicate that there is some growing output for
non-null and non-space, and characters are starting to appear in the
The character error rate falls below 50% just after 3700 iterations, and by 5000 to about 13%, where it will terminate. (In about 20 minutes on a current high-end machine with AVX.)
Note that this engine is trained on the same amount of training data as used by the legacy Tesseract engine, but its accuracy on other fonts is probably very poor. Run an independent test on the 'Impact' font:
training/lstmeval --model ~/tesstutorial/engoutput/base_checkpoint \ --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \ --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
85% character error rate? Not so good!
Now base Tesseract doesn't do very well on 'Impact', but it is included in the 4500 or so fonts used to train the new LSTM version, so if you can run on that for a comparison:
training/lstmeval --model tessdata/best/eng.traineddata \ --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
2.45% character error rate? Much better!
For reference in the next section, also run a test of the full model on the training set that we have been using:
training/lstmeval --model tessdata/best/eng.traineddata \ --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
Char error rate=0.25047642, Word error rate=0.63389585
(If you ran this before, and notice that the error rates are a lot higher than the previous alpha version, this is due to a change in the use of shaped quotes. It didn't count errors in quote shape before, but now it does.)
You can train for another 5000 iterations, and get the error rate on the
training set a lot lower, but it doesn't help the
Impact font much:
mkdir -p ~/tesstutorial/engoutput training/lstmtraining \ --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \ --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \ --model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \ --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \ --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \ --max_iterations 10000 &>>~/tesstutorial/engoutput/basetrain.log
Character error rate on
Impact now >100%, even as the error rate on the
training set has fallen to 2.68% character / 10.01% word:
training/lstmeval --model ~/tesstutorial/engoutput/base_checkpoint \ --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \ --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
This shows that the model has completely over-fitted to the supplied training set! It is an excellent illustration of what happens when the training set doesn't cover the desired variation in the target data.
In summary, training from scratch needs either a very constrained problem, a lot
of training data, or you need to shrink the network by reducing some of the
sizes of the layers in the
--net_spec above. Alternatively, you could try fine
Fine Tuning for Impact
Fine tuning is the process of training an existing model on new data without changing any part of the network, although you can now add characters to the character set. (See Fine Tuning for ± a few characters).
training/lstmtraining --model_output /path/to/output [--max_image_MB 6000] \ --continue_from /path/to/existing/model \ --traineddata /path/to/original/traineddata \ [--perfect_sample_delay 0] [--debug_interval 0] \ [--max_iterations 0] [--target_error_rate 0.01] \ --train_listfile /path/to/list/of/filenames.txt
Note that the
--continue_from arg can point to a training checkpoint
or a recognition model, even though the file formats are different.
Training checkpoints are the files that begin with
--model_output and end
checkpoint. A recognition model can be extracted from an existing
traineddata file, using
combine_tessdata. Note that it is also necessary to
supply the original traineddata file as well, as that contains the unicharset
and recoder. Let's start by fine tuning the model we built earlier, and see if
we can make it work for 'Impact':
mkdir -p ~/tesstutorial/impact_from_small training/lstmtraining --model_output ~/tesstutorial/impact_from_small/impact \ --continue_from ~/tesstutorial/engoutput/base_checkpoint \ --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \ --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \ --max_iterations 1200
This has character/word error at 22.36%/50.0% after 100 iterations and gets down to 0.3%/1.2% at 1200. Now a stand-alone test:
training/lstmeval --model ~/tesstutorial/impact_from_small/impact_checkpoint \ --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \ --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
That shows a better result of 0.0086%/0.057% because the trainer is averaging
over 1000 iterations, and it has been improving. This isn't a representative
result for the
Impact font though, as we are testing on the training data!
That was a bit of a toy example. The idea of fine tuning is really to apply it to one of the fully-trained existing models:
mkdir -p ~/tesstutorial/impact_from_full training/combine_tessdata -e tessdata/best/eng.traineddata \ ~/tesstutorial/impact_from_full/eng.lstm training/lstmtraining --model_output ~/tesstutorial/impact_from_full/impact \ --continue_from ~/tesstutorial/impact_from_full/eng.lstm \ --traineddata tessdata/best/eng.traineddata \ --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \ --max_iterations 400
After 100 iterations, it has 1.35%/4.56% char/word error and gets down to 0.533%/1.633% at 400. Again, the stand-alone test gives a better result:
training/lstmeval --model ~/tesstutorial/impact_from_full/impact_checkpoint \ --traineddata tessdata/best/eng.traineddata \ --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
Char error 0.017%, word 0.120% What is more interesting though, is the effect on the other fonts, so run a test on the base training set that we have been using:
training/lstmeval --model ~/tesstutorial/impact_from_full/impact_checkpoint \ --traineddata tessdata/best/eng.traineddata \ --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
Char error rate=0.25548592, Word error rate=0.82523491
It is only slightly worse, despite having reached close to zero error on the training set, and achieved it in only 400 iterations. Note that further training beyond 400 iterations makes the error on the base set higher.
In summary, the pre-trained model can be fine-tuned or adapted to a small data set, without doing a lot of harm to its general accuracy. It is still very important however, to avoid over-fitting.
Fine Tuning for ± a few characters
New feature It is possible to add a few new characters to the character set and train for them by fine tuning, without a large amount of training data.
The training requires a new unicharset/recoder, optional language models, and the old traineddata file containing the old unicharset/recoder.
training/lstmtraining --model_output /path/to/output [--max_image_MB 6000] \ --continue_from /path/to/existing/model \ --traineddata /path/to/traineddata/with/new/unicharset \ --old_traineddata /path/to/existing/traineddata \ [--perfect_sample_delay 0] [--debug_interval 0] \ [--max_iterations 0] [--target_error_rate 0.01] \ --train_listfile /path/to/list/of/filenames.txt
Let's try adding the plus-minus sign (±) to the existing English model. Modify
langdata/eng/eng.training_text to include some samples of ±. I inserted 14 of
them, as shown below:
grep ± ../langdata/eng/eng.training_text alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11) VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS ﬁrm Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6) Oberﬂachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 € netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED
Now generate new training and eval data:
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \ --noextract_font_properties --langdata_dir ../langdata \ --tessdata_dir ./tessdata --output_dir ~/tesstutorial/trainplusminus src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \ --noextract_font_properties --langdata_dir ../langdata \ --tessdata_dir ./tessdata \ --fontlist "Impact Condensed" --output_dir ~/tesstutorial/evalplusminus
Run fine tuning on the new training data. This requires more iterations, as it only has a few samples of the new target character to go on:
training/combine_tessdata -e tessdata/best/eng.traineddata \ ~/tesstutorial/trainplusminus/eng.lstm training/lstmtraining --model_output ~/tesstutorial/trainplusminus/plusminus \ --continue_from ~/tesstutorial/trainplusminus/eng.lstm \ --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \ --old_traineddata tessdata/best/eng.traineddata \ --train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \ --max_iterations 3600
After 100 iterations, it has 1.26%/3.98% char/word error and gets down to 0.041%/0.185% at 3600. Again, the stand-alone test gives a better result:
training/lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \ --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \ --eval_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt
Char error 0.0326%, word 0.128%. What is more interesting though, is whether the new character can be recognized in the 'Impact' font, so run a test on the impact eval set:
training/lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \ --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \ --eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt
Char error rate=2.3767074, Word error rate=8.3829474
This compares very well against the original test of the original model on the impact data set. Furthermore, if you check the errors:
training/lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \ --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \ --eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt 2>&1 | grep ±
...you should see that it gets all the ± signs correct! (Every truth line that contains a ± also contains a ± on the corresponding OCR line, and there are no truth lines that don't have a matching OCR line in the grep output.)
This is excellent news! It means that one or more new characters can be added without impacting existing accuracy, and the ability to recognize the new character will, to some extent at least, generalize to other fonts!
NOTE: When fine tuning, it is important to experiment with the number of iterations, since excessive training on a small data set will cause over-fitting. ADAM, is great for finding the feature combinations necessary to get that rare class correct, but it does seem to overfit more than simpler optimizers.
Training Just a Few Layers
Fine tuning is OK if you only want to add a new font style or need a couple of
new characters, but what if you want to train for Klingon? You are unlikely to
have much training data and it is unlike anything else, so what do you do? You
can try removing some of the top layers of an existing network model, replace
some of them with new randomized layers, and train with your data. The
command-line is mostly the same as Training from
scratch, but in addition you have to provide a model to
--append_index argument tells it to remove all layers above the layer
with the given index, (starting from zero, in the outermost series) and then
append the given
--net_spec argument to what remains. Although this indexing
system isn't a perfect way of referring to network layers, it is a consequence
of the greatly simplified network specification language. The builder will
output a string corresponding to the network it has generated, making it
reasonably easy to check that the index referred to the intended layer.
A new feature of 4.00 alpha is that combine_tessdata can list the content of a traineddata file and its version string. In most cases, the version string includes the net_spec that was used to train:
training/combine_tessdata -d tessdata/best/heb.traineddata Version string:4.00.00alpha:heb:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1] 17:lstm:size=3022651, offset=192 18:lstm-punc-dawg:size=3022651, offset=3022843 19:lstm-word-dawg:size=673826, offset=3024221 20:lstm-number-dawg:size=625, offset=3698047 21:lstm-unicharset:size=1673826, offset=3703368 22:lstm-recoder:size=4023, offset=3703368 23:version:size=80, offset=3703993
and for chi_sim:
training/combine_tessdata -d tessdata/best/chi_sim.traineddata Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1] 0:config:size=1966, offset=192 17:lstm:size=12152851, offset=2158 18:lstm-punc-dawg:size=282, offset=12155009 19:lstm-word-dawg:size=590634, offset=12155291 20:lstm-number-dawg:size=82, offset=12745925 21:lstm-unicharset:size=258834, offset=12746007 22:lstm-recoder:size=72494, offset=13004841 23:version:size=84, offset=13077335
Note that the number of layers is the same, but only the sizes differ. Therefore
in these models, the following values of
--append_index will keep the
associated last layer, and append above:
The weights in the remaining part of the existing model are unchanged initially, but allowed to be modified by the new training data.
As an example, let's try converting the existing chi_sim model to eng. We will cut off the last LSTM layer (which was bigger for chi_sim than the one used to train the eng model) and the softmax, replacing with a smaller LSTM layer and a new softmax:
mkdir -p ~/tesstutorial/eng_from_chi training/combine_tessdata -e tessdata/best/chi_sim.traineddata \ ~/tesstutorial/eng_from_chi/eng.lstm training/lstmtraining --debug_interval 100 \ --continue_from ~/tesstutorial/eng_from_chi/eng.lstm \ --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \ --append_index 5 --net_spec '[Lfx256 O1c111]' \ --model_output ~/tesstutorial/eng_from_chi/base \ --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \ --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \ --max_iterations 3000 &>~/tesstutorial/eng_from_chi/basetrain.log
Since the lower layers are already trained, this learns somewhat faster than training from scratch. At 600 iterations, it suddenly starts producing output and by 800, it is already getting most characters correct. By the time it stops at 3000 iterations, it should be at 6.00% character/22.42% word.
Try the usual tests on the full training set:
training/lstmeval --model ~/tesstutorial/eng_from_chi/base_checkpoint \ --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \ --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
and independent test on the
training/lstmeval --model ~/tesstutorial/eng_from_chi/base_checkpoint \ --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \ --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
On the full training set, we get 5.557%/20.43% and on
which is much better than the from-scratch training, but is still badly
In summary, it is possible to cut off the top layers of an existing network and train, as if from scratch, but a fairly large amount of training data is still required to avoid over-fitting.
Error Messages From Training
There are various error messages that can occur when running the training, some of which can be important, and others not so much:
Encoding of string failed! results when the text string for a training image
cannot be encoded using the given unicharset. Possible causes are:
- There is an un-represented character in the text, say a British Pound sign that is not in your unicharset.
- A stray unprintable character (like tab or a control character) in the text.
- There is an un-represented Indic grapheme/aksara in the text.
In any case it will result in that training image being ignored by the trainer. If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.
Unichar xxx is too long to encode!! (Most likely Indic only). There is an
upper limit to the length of unicode characters that can be used in the recoder,
which simplifies the unicharset for the LSTM engine. It will just continue and
leave that Aksara out of the recognizable set, but if there are a lot, then you
are in trouble.
Bad box coordinates in boxfile string! The LSTM trainer only needs bounding
box information for a complete textline, instead of at a character level, but if
you put spaces in the box string, like this:
<text for line including spaces> <left> <bottom> <right> <top> <page>
the parser will be confused and give you the error message.
Deserialize header failed occurs when a training input is not in LSTM format
or the file is not readable. Check your filelist file to see if it contains
No block overlapping textline: occurs when layout analysis fails to correctly
segment the image that was given as training data. The textline is dropped. Not
much problem if there aren't many, but if there are a lot, there is probably
something wrong with the training text or rendering process.
<Undecodable> can occur in either the ALIGNED_TRUTH or OCR TEXT output early
in training. It is a consequence of unicharset compression and CTC training.
(See Unicharset Compression and train_mode above). This should be harmless and
can be safely ignored. Its frequency should fall as training progresses.
Combining the Output Files
The lstmtraining program outputs two kinds of checkpoint files:
<model_base>_checkpointis the latest model file.
<model_base><char_error>_<iteration>.checkpointis periodically written as the model with the best training error. It is a training dump just like the checkpoint, but is smaller because it doesn't have a backup model to be used if the training runs into divergence.
Either of these files can be converted to a standard traineddata file as follows:
training/lstmtraining --stop_training \ --continue_from ~/tesstutorial/eng_from_chi/base_checkpoint \ --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \ --model_output ~/tesstutorial/eng_from_chi/eng.traineddata
This will extract the recognition model from the training dump, and insert it into the --traineddata argument, along with the unicharset, recoder, and any dawgs that were provided during training.
NOTE Tesseract 4.00 will now run happily with a traineddata file that
lstm-*-dawgs are optional, and none of the other components are required or
used with OEM_LSTM_ONLY as the OCR engine mode. No bigrams, unichar ambigs or
any of the other components are needed or even have any effect if present. The
only other component that does anything is the
lang.config, which can affect
layout analysis, and sub-languages.
If added to an existing Tesseract traineddata file, the
doesn't have to match the Tesseract
unicharset, but the same unicharset must
be used to train the LSTM and build the
The Hallucination Effect
If you notice that your model is misbehaving, for example by:
- Adding a
Capitalletter instead of a
Smallletter at the beginning of certain words.
Spacewhere it should not do that.