Cannot produce final khm.traineddata using lstmtraining from the scratch #1216

phyrumsk · 2017-11-22T03:25:55Z

I am new to Tesseract and LSTM. I read the Tesseract training tutorial on the wiki and tried to train with lstmtraining from scratch using the following commands:

training/tesstrain.sh --fonts_dir /usr/share/fonts/truetype/khmeros-ttf --fontlist "Khmer OS" --langdata_dir langdata --lang khm --linedata_only --noextract_font_properties --tessdata_dir /home/phyrum/tesseract/tessdata --output_dir khmtrain

training/lstmtraining --debug_interval -1 \
  --traineddata khmtrain/khm/khm.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output khmoutput/base --learning_rate 20e-4 \
  --train_listfile khmtrain/khm.training_files.txt \
  --max_iterations 10000 &> khmoutput/basetrain.log
[basetrain.log](url)
After lstmtraining finished, I had only "basetrain.log" and "base_checkpoint". I cannot find the final "khm.traineddata", and I don't know why. Could you please help me?

Environment

Tesseract Version: tesseract 4.00.00dev-691-gfb359fc
Platform: Linux phyrum 4.10.0-38-generic #42~16.04.1-Ubuntu SMP Tue Oct 10 16:32:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Note: I also posted a detailed question to the tesseract-ocr mailing list under the title "LSTMTRAINING from the scratch for khmer language - Legacy Limon Fonts". Could you please check it and give me some advice?

Shreeshrii · 2017-11-22T03:58:18Z

Please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files regarding how to create the traineddata file.

To create the 'fast' version, use the convert_to_int flag:

convert_to_int (bool, default false): With stop_training, convert to 8-bit integer for greater speed, with slightly less accuracy.
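
As a rough sketch of the step described on that wiki page, the commands below pack the last checkpoint into a traineddata file, once unchanged and once with `--convert_to_int` for the 'fast' variant. The checkpoint and starter paths are taken from the original post; the two output file names and the `stop_training_cmd` helper are assumptions made for illustration. The script only prints the commands unless the binary and the checkpoint are actually present.

```shell
# Paths from the original post; the output names below are assumptions.
LSTMTRAINING="training/lstmtraining"
CHECKPOINT="khmoutput/base_checkpoint"
STARTER="khmtrain/khm/khm.traineddata"

# Print the lstmtraining command that packs a checkpoint into a traineddata file.
# $1 = output file, $2 = "int" to request the 8-bit integer ('fast') variant.
stop_training_cmd() {
  cmd="$LSTMTRAINING --stop_training --continue_from $CHECKPOINT --traineddata $STARTER --model_output $1"
  if [ "${2:-}" = "int" ]; then
    cmd="$cmd --convert_to_int"
  fi
  echo "$cmd"
}

FLOAT_CMD=$(stop_training_cmd khmoutput/khm.traineddata)
INT_CMD=$(stop_training_cmd khmoutput/khm_fast.traineddata int)

# Run only when the binary and a checkpoint exist; otherwise just show the commands.
for c in "$FLOAT_CMD" "$INT_CMD"; do
  if [ -x "$LSTMTRAINING" ] && [ -f "$CHECKPOINT" ]; then
    sh -c "$c"
  else
    echo "would run: $c"
  fi
done
```

Keeping both outputs side by side makes it easy to compare accuracy and speed of the float and integer models on the same test pages.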

You cannot use the tesstrain.sh/text2image process with non-Unicode legacy fonts. For those, you would have to create image files from a document or PDF, run tesseract with the makebox config file to create the box files, and then manually edit the box files to correct their content and convert them to the 4.0 format.

Training from images is NOT supported for 4.0 LSTM.

Your best bet will be to find some Unicode fonts that look similar to the legacy fonts and train using those.

phyrumsk · 2017-12-04T03:27:56Z

@Shreeshrii, I am very grateful for your clear explanation; it helped me a lot. I can now train from scratch, generate the traineddata file, and also fine-tune.

What we have done:

Generated new fonts, i.e. Khmer legacy-Unicode fonts, by copying only the glyphs of a legacy font and pasting them into one of the Khmer Unicode fonts (the substitution and other rules of the Khmer Unicode font are preserved).
Trained with the same langdata [from the Tesseract GitHub] and the net_spec of the Tesseract tessdata "best/khm.traineddata", "[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx384 O1c1]", and then fine-tuned from the Tesseract tessdata "best/khm.traineddata".
We get the expected accuracy of around 90% for some legacy fonts, but the accuracy for each Khmer Unicode font drops by 2% compared with the accuracy we measured using the Tesseract tessdata "fast/khm.traineddata".
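
For readers following along, the fine-tuning step mentioned above might look roughly like the sketch below, which follows the pattern in the Tesseract 4.00 training wiki: extract the LSTM model from the existing traineddata, then continue training from it. The `khmtune/` output layout, the `legacy` model prefix, and the iteration count are assumptions, and passing the best traineddata as `--traineddata` assumes the character set is unchanged. The script only prints the commands unless the tools and the input file are present.

```shell
# BEST is the file named in the post; everything under khmtune/ is an assumed layout.
COMBINE="training/combine_tessdata"
LSTMTRAINING="training/lstmtraining"
BEST="best/khm.traineddata"
OUT_DIR="khmtune"
LISTFILE="khmtrain/khm.training_files.txt"
MAX_ITER=400   # placeholder value; adjust for your data

# Step 1: extract the LSTM model from the existing traineddata.
EXTRACT_CMD="$COMBINE -e $BEST $OUT_DIR/khm.lstm"

# Step 2: continue training from that model on the new line data.
TUNE_CMD="$LSTMTRAINING --model_output $OUT_DIR/legacy --continue_from $OUT_DIR/khm.lstm --traineddata $BEST --train_listfile $LISTFILE --max_iterations $MAX_ITER"

# Run only when the tools and the input exist; otherwise just show the commands.
for c in "$EXTRACT_CMD" "$TUNE_CMD"; do
  if [ -x "$COMBINE" ] && [ -x "$LSTMTRAINING" ] && [ -f "$BEST" ]; then
    sh -c "$c" || break
  else
    echo "would run: $c"
  fi
done
```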

So I have a few questions:

a. Do you have any recommendations on our approach, and on the 2% drop in accuracy for Khmer Unicode fonts?

b. Is there any way to improve the quality of my fine-tuned .traineddata, e.g. through khm.config (my khm.config is attached: khm.config.txt)?
Could you please share your experience from training khm.traineddata and Khmer.traineddata?

c. What is "radical-stroke.txt", and what does a line in that file such as "19886 3 23 6 3" mean?
I tried to find a tutorial on "radical-stroke.txt" but had no luck. I noticed, however, that if the file does not exist, the starter traineddata "khm.traineddata" is not created.
Could you please explain this to me?

Thank you very much for your quick and constant support.
Best regards,
phyrum

amitdo · 2017-12-04T08:55:59Z

What is the "radical-stroke.txt"?

It is used for Chinese.

phyrumsk · 2017-12-04T09:18:00Z

@amitdo Could you please explain why it is used for Chinese? I am training for the Khmer language, and "radical-stroke.txt" is still required when I use the tesstrain.sh command.
Thanks in advance for your explanation.
Best regards,
Phyrum

amitdo · 2017-12-04T11:33:34Z

The fact that the training tool required the file 'radical-stroke.txt' for non-Han scripts is a bug.
In practice, it is not relevant at all for other scripts (including Khmer), and will not influence their accuracy.

Shreeshrii · 2018-01-12T15:57:59Z

@phyrumsk

You have taken an interesting approach, creating a unicode font with glyphs from legacy font.

I would suggest that while fine-tuning you also include the other unicode fonts and see if that helps.
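
A minimal sketch of that suggestion, assuming tesstrain.sh accepts several font names after `--fontlist` and that all fonts live in the same fonts directory. The font names other than "Khmer OS", the `khmtune_data` output directory, and the `quote_fonts` helper are hypothetical. The script only prints the command it builds.

```shell
TESSTRAIN="training/tesstrain.sh"
FONTS_DIR="/usr/share/fonts/truetype/khmeros-ttf"

# Quote each font name so names containing spaces stay single arguments.
quote_fonts() {
  out=""
  for f in "$@"; do
    out="$out \"$f\""
  done
  echo "${out# }"
}

# "Khmer OS" is from the original post; the other two names are hypothetical.
FONTLIST=$(quote_fonts "Khmer OS" "Khmer OS Battambang" "Khmer Legacy Unicode")

CMD="$TESSTRAIN --fonts_dir $FONTS_DIR --fontlist $FONTLIST --langdata_dir langdata --lang khm --linedata_only --noextract_font_properties --tessdata_dir /home/phyrum/tesseract/tessdata --output_dir khmtune_data"

# Dry run: print the command; paste it into a shell (or use eval) to execute it.
echo "$CMD"
```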

The langdata training text has not been updated for 4.0. You may want to modify it based on the kinds of errors you are seeing, add more samples of those cases, and then fine-tune.

You can name the traineddata after fine-tuning as khmer-legacy and use it for pages in legacy font and use original traineddata for other unicode fonts.
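
One way to picture this split, as a small sketch: choose the traineddata name per page and pass it to `tesseract` with `-l`. The helper names, the image file names, and the assumption that `khmer-legacy.traineddata` sits in the tessdata directory are all illustrative. The script prints the commands rather than running OCR.

```shell
# Pick the traineddata name for a page kind; "khmer-legacy" follows the naming suggested above.
model_for() {
  case "$1" in
    legacy) echo "khmer-legacy" ;;
    *)      echo "khm" ;;
  esac
}

# Print the tesseract command for one page image.
# $1 = image file, $2 = page kind ("legacy", or anything else for Unicode fonts).
ocr_cmd() {
  echo "tesseract $1 ${1%.*} -l $(model_for "$2")"
}

# Hypothetical file names, for illustration only.
ocr_cmd scan_legacy.png legacy
ocr_cmd scan_unicode.png unicode
```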

Shreeshrii · 2018-04-08T16:36:10Z

@amitdo Thanks for the following info:

Khmer OCR fine tune engine for Unicode and legacy fonts using Tesseract 4.0 with Deep Neural Network

http://ona2017.khmernlp.org/wp-content/uploads/paper/papers/paper/ONA_2017_paper_Khmer_OCR_FineTune_Engine.pdf

The OP is one of the authors of this paper.

Shreeshrii · 2018-04-08T16:36:43Z

@zdenop Please close the issue.

zdenop closed this as completed Apr 9, 2018
