TrainingTesseract 4.00 Finetune

Please read TrainingTesseract 4.00 before trying the following.

Have copies of the 4.0.0 alpha langdata, tessdata and tesseract-ocr repositories in the following directory structure.

./langdata
./tessdata
./tesseract-ocr
./tesseract-ocr/tessdata
./tesseract-ocr/tessdata/configs/
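
One way to get this layout is to clone the three tesseract-ocr repositories side by side (a sketch only; pick the branch or tag that matches the 4.0.0 alpha code and data you are training against):

# Clone the repositories into the layout above; adjust branches/tags as needed.
git clone https://github.com/tesseract-ocr/langdata.git  ./langdata
git clone https://github.com/tesseract-ocr/tessdata.git  ./tessdata
git clone https://github.com/tesseract-ocr/tesseract.git ./tesseract-ocr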

Make a copy of the English and Arabic 4.0.0 alpha traineddata files in ./tesseract-ocr/tessdata.

Check that lstm.train is available under configs.

Set up the appropriate TESSDATA_PREFIX directory.
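
For example (a minimal sketch; whether TESSDATA_PREFIX should point to the tessdata directory itself or to its parent has varied across Tesseract versions, so adjust for your build):

# Confirm the lstm.train config is present in the tesseract-ocr checkout.
ls ./tesseract-ocr/tessdata/configs/lstm.train

# Point TESSDATA_PREFIX at the tessdata directory that will hold the
# traineddata files; some builds expect the parent directory of tessdata/.
export TESSDATA_PREFIX=$(pwd)/tesseract-ocr/tessdata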

cp ./tessdata/eng.traineddata ./tesseract-ocr/tessdata
cp ./tessdata/ara.traineddata ./tesseract-ocr/tessdata

Change to the tesseract-ocr directory and then run the commands given below.

cd ./tesseract-ocr

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang ara  --linedata_only \
  --training_text ../langdata/ara/arabic1.txt \
  --langdata_dir ../langdata --tessdata_dir ./tessdata \
  --fontlist "Times New Roman," \
  --output_dir ~/tesstutorial/aratest

This creates the .lstmf files in the output directory using the given training_text. The box/tiff pairs are created in a temporary /tmp/<tmpdir>/ara/ directory and are not copied to the output directory.
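
A quick way to confirm that the line data was written (the exact set of files depends on the training text and fonts used):

# Expect .lstmf files, the training file list and the unicharset/dawg files.
ls ~/tesstutorial/aratest/
cat ~/tesstutorial/aratest/ara.training_files.txt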

mkdir -p ~/tesstutorial/aratuned_from_ara 

combine_tessdata -e ../tessdata/ara.traineddata \
  ~/tesstutorial/aratuned_from_ara/ara.lstm
  
lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned \
  --continue_from ~/tesstutorial/aratuned_from_ara/ara.lstm \
  --train_listfile ~/tesstutorial/aratest/ara.training_files.txt \
  --target_error_rate 0.01 

The above commands extract the existing LSTM model for Arabic from ../tessdata/ara.traineddata and finetune it using the .lstmf files created earlier, listed in the train_listfile.
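
While it runs, lstmtraining writes checkpoints next to the --model_output prefix; the aratuned_checkpoint file used in the next step should appear there. A quick sanity check (not part of the original commands):

# The aratuned_checkpoint file is what --continue_from consumes below.
ls ~/tesstutorial/aratuned_from_ara/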

lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned.lstm \
  --continue_from ~/tesstutorial/aratuned_from_ara/aratuned_checkpoint \
  --stop_training

The above command creates the new LSTM model from the finetuning output.

combine_tessdata -o ./tessdata/ara.traineddata \
  ~/tesstutorial/aratuned_from_ara/aratuned.lstm \
  ~/tesstutorial/aratest/ara.lstm-number-dawg \
  ~/tesstutorial/aratest/ara.lstm-punc-dawg \
  ~/tesstutorial/aratest/ara.lstm-word-dawg 

Finally, the new LSTM model and the new dawg files are combined with the existing Arabic traineddata in ./tesseract-ocr/tessdata. The old ara.traineddata in ./tesseract-ocr/tessdata is overwritten in place, so rename or back it up first if you want to keep the original.
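
As a verification step (not part of the original workflow), the components packed into the updated traineddata can be listed with combine_tessdata's -d option:

# List the components now packed into the updated ara.traineddata.
combine_tessdata -d ./tessdata/ara.traineddata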

training/lstmeval --model ~/tesstutorial/aratuned_from_ara/ara.lstm \
  --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt  
  
training/lstmeval --model ~/tesstutorial/aratuned_from_ara/aratuned_checkpoint \
  --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt  
  
training/lstmeval --model ~/tesstutorial/aratuned_from_ara/aratuned.lstm \
  --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt  

The above three commands evaluate the LSTM models: first the original Arabic LSTM model, second the checkpoint model produced during finetuning, and third the final finetuned Arabic model.

time tesseract --tessdata-dir ../tessdata /tmp/<tmpdir>/ara/ara.Times_New_Roman.exp0.tif out-4alpha -l ara
time tesseract --tessdata-dir ./tessdata /tmp/<tmpdir>/ara/ara.Times_New_Roman.exp0.tif out-tuned -l ara

The above commands run OCR on the .tif file created during training, first with the original traineddata and then with the finetuned traineddata.
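
To compare the two results side by side, a simple diff of the output files is enough (this comparison step is an addition, not part of the original tutorial; tesseract appends .txt to the given output base names):

# Compare OCR output from the original and the finetuned traineddata.
diff out-4alpha.txt out-tuned.txt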