Cannot produce final khm.traineddata using lstmtraining from the scratch #1216

Closed
phyrumsk opened this issue Nov 22, 2017 · 8 comments

@phyrumsk

phyrumsk commented Nov 22, 2017

I am new to Tesseract and LSTM. I read the Tesseract training tutorial on the wiki and tried to train with lstmtraining from scratch using the following commands:

training/tesstrain.sh --fonts_dir /usr/share/fonts/truetype/khmeros-ttf --fontlist "Khmer OS" --langdata_dir langdata --lang khm --linedata_only --noextract_font_properties --tessdata_dir /home/phyrum/tesseract/tessdata --output_dir khmtrain

training/lstmtraining --debug_interval -1 \
  --traineddata khmtrain/khm/khm.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output khmoutput/base --learning_rate 20e-4 \
  --train_listfile khmtrain/khm.training_files.txt \
  --max_iterations 10000 &> khmoutput/basetrain.log
Attached files: basetrain.log, tesseract_version, training_lstmtraining_tesseract

After lstmtraining finished, I got only "basetrain.log" and "base_checkpoint"; I cannot find the final "khm.traineddata", and I do not know why. Could you please help me?

Environment

  • Tesseract Version: tesseract 4.00.00dev-691-gfb359fc
  • Platform: Linux phyrum 4.10.0-38-generic #42~16.04.1-Ubuntu SMP Tue Oct 10 16:32:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Note: I also posted a detailed question to the tesseract-ocr mailing list with the title "LSTMTRAINING from the scratch for khmer language - Legacy Limon Fonts". Could you please check it and give me some advice?

@Shreeshrii
Collaborator

  1. Please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files regarding how to create the traineddata file.

To create the 'fast' version, use the convert_to_int flag, which is documented as:

convert_to_int  bool  false  With stop_training, convert to 8-bit integer for greater speed, with slightly less accuracy.

  2. You cannot use the tesstrain.sh/text2image process with non-Unicode legacy fonts. For those you would have to create image files from a document/PDF, run tesseract with the makebox config file to create the box files, and then manually edit the box files so that they contain the correct text and follow the 4.0 format.

Training from images is NOT supported for 4.0 LSTM.

Your best bet will be to find some Unicode fonts that look similar to the legacy fonts and train using those.
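As a rough sketch of the step described in the linked wiki section, the checkpoint from the original post can be converted into a traineddata file with --stop_training. The input paths are taken from the commands in the question; the two output file names are assumptions, and the `run` helper is purely illustrative (it prints each command and executes it only when DRY_RUN=0 is set).

```shell
#!/bin/sh
# Sketch: turn an lstmtraining checkpoint into a final traineddata file.
# Input paths follow the original post; the output names are assumptions.
CHECKPOINT="khmoutput/base_checkpoint"
STARTER="khmtrain/khm/khm.traineddata"
FINAL="khmoutput/khm.traineddata"
FINAL_FAST="khmoutput/khm_fast.traineddata"

# Illustrative helper: print the command, run it only when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# 'Best' (floating-point) model.
run training/lstmtraining --stop_training \
    --continue_from "$CHECKPOINT" \
    --traineddata "$STARTER" \
    --model_output "$FINAL"

# 'Fast' (8-bit integer) model: same call plus --convert_to_int.
run training/lstmtraining --stop_training --convert_to_int \
    --continue_from "$CHECKPOINT" \
    --traineddata "$STARTER" \
    --model_output "$FINAL_FAST"
```

Run it once as-is to inspect the printed commands, then with DRY_RUN=0 from the Tesseract source directory to actually write the files.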

@phyrumsk
Author

phyrumsk commented Dec 4, 2017

@Shreeshrii, I am very grateful for your clear explanation; it helped me a lot. I can now train from scratch, generate the traineddata file, and also fine-tune.

What we have done:

  1. Generated new fonts, i.e. Khmer legacy-Unicode fonts, by copying only the glyphs of a legacy font and pasting them into one of the Khmer Unicode fonts (the substitution and other rules of the Khmer Unicode font are preserved).

  2. Trained with the same langdata [from the Tesseract GitHub repository] and the net_spec of the Tesseract tessdata "best/khm.traineddata", "[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx384 O1c1]", and then fine-tuned from the Tesseract tessdata "best/khm.traineddata".

  3. We get the expected accuracy of around 90% for some legacy fonts, but the accuracy for each Khmer Unicode font dropped by 2% compared with the accuracy we measured using the Tesseract tessdata "fast/khm.traineddata".

Thus, I have a few questions:

a. Do you have any recommendations on what I have done, and on the 2% drop in accuracy for the Khmer Unicode fonts?

b. Is there any way to improve the quality of my fine-tuned .traineddata, e.g. through khm.config (my khm.config is attached: khm.config.txt)?
Could you please share your experience from training khm.traineddata and Khmer.traineddata?

c. What is "radical-stroke.txt", and what does a line inside the file such as "19886 3 23 6 3" mean?
I tried to find a tutorial on "radical-stroke.txt" but had no luck. I noticed, however, that if the file does not exist, the starter traineddata "khm.traineddata" is not created.
Could you please explain?

Thanks very much for your quick and constant support.
Best regards,
phyrum

@amitdo
Collaborator

amitdo commented Dec 4, 2017

What is the "radical-stroke.txt"?

It is used for Chinese.

@phyrumsk
Author

phyrumsk commented Dec 4, 2017

@amitdo Could you please explain why it is used for Chinese? I am training for the Khmer language, and "radical-stroke.txt" is required when I use the tesstrain.sh command.
Thanks in advance for your explanation.
Best regards,
Phyrum

@amitdo
Collaborator

amitdo commented Dec 4, 2017

The fact that the training tool required the file 'radical-stroke.txt' for non-Han scripts is a bug.
In practice, it is not relevant at all for other scripts (including Khmer), and will not influence their accuracy.
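Until that bug is fixed, a simple pre-flight check can catch the missing file before tesstrain.sh runs. The sketch below assumes the file is expected directly inside the directory passed as --langdata_dir ("langdata" in the original command); `check_langdata` is a hypothetical helper, not part of Tesseract.

```shell
#!/bin/sh
# Sketch: verify that the langdata directory contains radical-stroke.txt
# before starting tesstrain.sh. check_langdata is a hypothetical helper.
check_langdata() {
    dir="$1"
    if [ -f "$dir/radical-stroke.txt" ]; then
        echo "ok: $dir/radical-stroke.txt found"
        return 0
    fi
    echo "missing: $dir/radical-stroke.txt (starter traineddata will not be created)" >&2
    return 1
}

# "langdata" matches --langdata_dir in the original command;
# override it with the LANGDATA_DIR environment variable if needed.
check_langdata "${LANGDATA_DIR:-langdata}" \
    || echo "copy radical-stroke.txt into the langdata directory first"
```

The file's contents do not matter for Khmer; it only needs to be present so that the starter traineddata gets built.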

@Shreeshrii
Collaborator

@phyrumsk

You have taken an interesting approach, creating a Unicode font with glyphs from a legacy font.

I would suggest that while fine-tuning you also include the other Unicode fonts and see if that helps.

The langdata training text has not been updated for 4.0. You may want to modify it based on the kinds of errors you are seeing, add more samples that cover them, and then fine-tune.

After fine-tuning, you can name the traineddata khmer-legacy and use it for pages in legacy fonts, while using the original traineddata for pages in Unicode fonts.
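One possible way to follow these suggestions is sketched below: render training lines for the hybrid font together with ordinary Unicode fonts, fine-tune from the stock best model, and package the result under its own name. Several things here are assumptions rather than details from this thread: "Khmer Legacy Unicode" is a placeholder name for the hybrid font (which would have to be installed in the fonts directory), the tessdata_best and khmtune paths are made up, and the iteration count is arbitrary. The `run` helper is illustrative and only prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: fine-tune the stock Khmer model on legacy-glyph AND Unicode fonts.
# Assumed/hypothetical: the "Khmer Legacy Unicode" font name, the
# tessdata_best and khmtune paths, and the iteration count.
FONTS_DIR="/usr/share/fonts/truetype/khmeros-ttf"
BEST_DIR="tessdata_best"
OUT_DIR="khmtune"

# Illustrative helper: print the command, run it only when DRY_RUN=0.
# (The printed form drops the quotes around multi-word font names.)
DRY_RUN="${DRY_RUN:-1}"
run() {
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# 1. Render training lines for all fonts in one pass.
run training/tesstrain.sh --fonts_dir "$FONTS_DIR" \
    --fontlist "Khmer Legacy Unicode" "Khmer OS" "Khmer OS Battambang" \
    --langdata_dir langdata --lang khm --linedata_only \
    --noextract_font_properties --tessdata_dir "$BEST_DIR" \
    --output_dir "$OUT_DIR"

# 2. Extract the LSTM network from the stock best model.
run training/combine_tessdata -e "$BEST_DIR/khm.traineddata" "$OUT_DIR/khm.lstm"

# 3. Fine-tune that network on the mixed-font data.
run training/lstmtraining --model_output "$OUT_DIR/khmer-legacy" \
    --continue_from "$OUT_DIR/khm.lstm" \
    --traineddata "$BEST_DIR/khm.traineddata" \
    --train_listfile "$OUT_DIR/khm.training_files.txt" \
    --max_iterations 3000

# 4. Package the checkpoint under a separate name.
run training/lstmtraining --stop_training \
    --continue_from "$OUT_DIR/khmer-legacy_checkpoint" \
    --traineddata "$BEST_DIR/khm.traineddata" \
    --model_output "$OUT_DIR/khmer-legacy.traineddata"
```

Keeping the fine-tuned model under a separate name leaves the original khm.traineddata untouched, so both can sit side by side in the same tessdata directory.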

@Shreeshrii
Collaborator

Shreeshrii commented Apr 8, 2018

@amitdo Thanks for the following info:

Khmer OCR fine tune engine for Unicode and legacy fonts using Tesseract 4.0 with Deep Neural Network

http://ona2017.khmernlp.org/wp-content/uploads/paper/papers/paper/ONA_2017_paper_Khmer_OCR_FineTune_Engine.pdf

The OP is one of the authors of this paper.

@Shreeshrii
Collaborator

@zdenop Please close the issue.

@zdenop zdenop closed this as completed Apr 9, 2018