A training workflow for Tesseract 4, implemented as a Makefile that handles dependency tracking and builds the required software from source.
Alternatively, you can build leptonica and tesseract within this project and install them into the
./usr subdirectory of the repository:
make leptonica tesseract
Tesseract is built from its git repository, which requires CMake, autotools (including autoconf-archive), and some additional libraries for the training tools. See the installation notes in the tesseract repository.
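Before starting a source build, it can be worth checking that the build tools are on the PATH. The following is only a sketch: missing_tools is a hypothetical helper that is not part of this repository, and the command names in the example line are an assumption about what "CMake" and "autotools" translate to on a typical system. The tesseract installation notes remain the authoritative list.

```shell
#!/usr/bin/env bash
# missing_tools: hypothetical helper, not part of this repository.
# Prints the name of every given command that cannot be found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example call (the tool list is an assumption, adjust it to your platform):
#   missing_tools git make cmake autoconf automake libtool pkg-config
```

An empty result means every listed command was found; it says nothing about the additional libraries the training tools need.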
Provide ground truth
Place ground truth consisting of line images and transcriptions in the folder
data/ground-truth. This list of files will be split into training and
evaluation data; the ratio is defined by the RATIO_TRAIN variable.
Images must be TIFF and have the extension .tif.
Transcriptions must be single-line plain text and have the same name as the
line image but with
.tif replaced by
The repository contains a ZIP archive with sample ground truth, see
ocrd-testset.zip. Extract it to
./data/ground-truth and run
make training MODEL_NAME=name-of-the-resulting-model
which is basically a shortcut for
make unicharset lists proto-model training
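Before starting a run, it can help to confirm that the ground truth follows the rules described above: each .tif line image has a transcription with the same base name, and each transcription is a single line. The sketch below is an illustration, not part of the Makefile. check_ground_truth is a hypothetical name, and the default transcription suffix .gt.txt is an assumption; pass a different suffix as the second argument if your files use another one.

```shell
#!/usr/bin/env bash
# check_ground_truth: hypothetical helper, not part of this repository.
# Usage: check_ground_truth [DIR] [SUFFIX]
# ASSUMPTION: SUFFIX defaults to ".gt.txt"; override it if your
# transcriptions use a different extension.
check_ground_truth() {
  local dir="${1:-data/ground-truth}"
  local suffix="${2:-.gt.txt}"
  local img txt status=0
  for img in "$dir"/*.tif; do
    [ -e "$img" ] || continue            # the glob matched no files
    txt="${img%.tif}${suffix}"           # same name, .tif swapped for SUFFIX
    if [ ! -f "$txt" ]; then
      printf 'missing transcription: %s\n' "$txt"
      status=1
    elif [ "$(awk 'END { print NR }' "$txt")" -gt 1 ]; then
      printf 'more than one line: %s\n' "$txt"
      status=1
    fi
  done
  return "$status"
}
```

Calling check_ground_truth data/ground-truth prints one line per problem and returns a non-zero status if any were found, so it can gate a later make training call.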
Run make help to see all the possible targets and variables:
  Targets

    unicharset       Create unicharset
    lists            Create lists of lstmf filenames for training and eval
    training         Start training
    proto-model      Build the proto model
    leptonica        Build leptonica
    tesseract        Build tesseract
    tesseract-langs  Download tesseract-langs
    clean            Clean all generated files

  Variables

    MODEL_NAME         Name of the model to be built. Default: foo
    START_MODEL        Name of the model to continue from. Default: ''
    PROTO_MODEL        Name of the proto model. Default: 'data/foo/foo.traineddata'
    CORES              No of cores to use for compiling leptonica/tesseract. Default: 4
    LEPTONICA_VERSION  Leptonica version. Default: 1.78.0
    TESSERACT_VERSION  Tesseract commit. Default: 4.1.0
    TESSDATA_REPO      Tesseract model repo to use. Default: _best
    GROUND_TRUTH_DIR   Ground truth directory. Default: data/ground-truth
    OUTPUT_DIR         Output directory for generated files. Default: data/MODEL_NAME
    MAX_ITERATIONS     Max iterations. Default: 10000
    NET_SPEC           Network specification. Default: [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c1]
    NORM_MODE          Normalization Mode - see src/training/language_specific.sh for details. Default: 2
    PSM                Page segmentation mode. Default: 6
    RANDOM_SEED        Random seed for shuffling of the training data. Default: 0
    RATIO_TRAIN        Ratio of train / eval training data. Default: 0.90
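To make the effect of RATIO_TRAIN concrete, the sketch below splits an already shuffled list of file names into a training part and an evaluation part. It is an illustration under stated assumptions, not the Makefile's actual implementation: split_list and its arguments are hypothetical, the count is truncated to a whole number, and the seeded shuffle controlled by RANDOM_SEED is left out.

```shell
#!/usr/bin/env bash
# split_list: hypothetical helper, not the Makefile's own code.
# Usage: split_list LIST RATIO TRAIN_OUT EVAL_OUT
# Writes the first floor(N * RATIO) lines of LIST to TRAIN_OUT and the
# remaining lines to EVAL_OUT. LIST is assumed to be shuffled already.
split_list() {
  local list="$1" ratio="$2" train_out="$3" eval_out="$4"
  local total n_train
  total=$(wc -l < "$list")
  # awk does the floating-point multiplication; %d truncates the result.
  n_train=$(awk -v t="$total" -v r="$ratio" 'BEGIN { printf "%d", t * r }')
  head -n "$n_train" "$list" > "$train_out"
  tail -n +"$((n_train + 1))" "$list" > "$eval_out"
}
```

With the default ratio of 0.90, a list of 10 entries would yield 9 training entries and 1 evaluation entry.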
Software is provided under the terms of the
Apache 2.0 license.