Cannot produce final khm.traineddata using lstmtraining from the scratch #1216
Comments
To create the 'fast' version, use
Training from images is NOT supported for 4.0 LSTM. Your best bet will be to find some Unicode fonts that look similar to the legacy fonts and train using those.
@Shreeshrii, I am very grateful for your clear explanation; it helped me a lot. I can now train from scratch, generate a traineddata file, and also fine-tune. What we have done:
Thus, I have a few doubts:
a. Do you have any recommendations on what I have done, and on the accuracy for Khmer Unicode, which dropped by 2%?
b. Is there any way to improve the quality of my fine-tuned .traineddata, e.g. khm.config (my khm.config is attached:
c. What is "radical-stroke.txt"? And inside the file, what does "19886 3 23 6 3" mean?
Thanks very much for your quick and constant support.
It is used for Chinese.
@amitdo Could you please explain why it is used for Chinese? I trained for the Khmer language, and "radical-stroke.txt" is required when I use the tesstrain.sh command.
The fact that the training tool required the file 'radical-stroke.txt' for non-Han scripts is a bug.
You have taken an interesting approach, creating a Unicode font with glyphs from a legacy font. I would suggest that while fine-tuning you also include the other Unicode fonts and see if that helps. The langdata training text has not been updated for 4.0. You may want to modify it based on the kind of errors you are seeing, include some more samples of those, and then fine-tune. You can name the traineddata after fine-tuning khmer-legacy and use it for pages in the legacy font, and use the original traineddata for other Unicode fonts.
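The suggestion above could be sketched roughly as follows. This is an outline under stated assumptions, not a tested recipe: the second font name, the `khmtune` and `khmlegacy` directories, and the iteration count are hypothetical placeholders, and the legacy-derived Unicode font would need to be added to the font list under its own name. The sketch extracts the LSTM model from the existing traineddata, regenerates line data with more than one font, and continues training from the extracted model. Fine-tuning generally needs the float ('best') model as the starting point. The commands are only printed here so they can be reviewed before running.

```shell
TESSDATA=/home/phyrum/tesseract/tessdata   # path taken from this issue
OUT=khmlegacy                              # hypothetical output directory

# Three steps: extract the LSTM model, rebuild line data, continue training.
# "Khmer OS Battambang" and --max_iterations 400 are illustrative assumptions.
PLAN=$(cat <<EOF
training/combine_tessdata -e $TESSDATA/khm.traineddata $OUT/khm.lstm
training/tesstrain.sh --fonts_dir /usr/share/fonts/truetype/khmeros-ttf --fontlist "Khmer OS" "Khmer OS Battambang" --langdata_dir langdata --lang khm --linedata_only --noextract_font_properties --tessdata_dir $TESSDATA --output_dir khmtune
training/lstmtraining --continue_from $OUT/khm.lstm --traineddata $TESSDATA/khm.traineddata --model_output $OUT/khm-legacy --train_listfile khmtune/khm.training_files.txt --max_iterations 400
EOF
)
echo "$PLAN"
```

The resulting checkpoint would still need to be converted to a traineddata file in a separate step before it can be used for recognition.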
@amitdo Thanks for the following info:
@zdenop Please close the issue.
I am new to Tesseract and LSTM. I read the Tesseract tutorial on the wiki and tried to train from scratch with lstmtraining using the following commands:
training/tesstrain.sh --fonts_dir /usr/share/fonts/truetype/khmeros-ttf --fontlist "Khmer OS" --langdata_dir langdata --lang khm --linedata_only --noextract_font_properties --tessdata_dir /home/phyrum/tesseract/tessdata --output_dir khmtrain
training/lstmtraining --debug_interval -1 \
  --traineddata khmtrain/khm/khm.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output khmoutput/base --learning_rate 20e-4 \
  --train_listfile khmtrain/khm.training_files.txt \
  --max_iterations 10000 &> khmoutput/basetrain.log
[basetrain.log](url)
=> After lstmtraining finished, I got only "basetrain.log" and "base_checkpoint", but I cannot find the final "khm.traineddata", and I don't know why. Could you please help me?
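A note for readers who hit the same symptom: during training, lstmtraining writes only checkpoint files, and a separate invocation with `--stop_training` is needed to turn a checkpoint into a traineddata file. The sketch below assumes the paths from the commands above; the output name `khmoutput/khm.traineddata` is an assumed choice, not something stated in this issue. It prints the command and runs it only if the binary is present at the relative path used above.

```shell
# Paths follow the commands quoted in this issue; FINAL_MODEL is an assumed name.
CHECKPOINT=khmoutput/base_checkpoint
START_MODEL=khmtrain/khm/khm.traineddata
FINAL_MODEL=khmoutput/khm.traineddata

# --stop_training converts the checkpoint instead of training further.
CMD="training/lstmtraining --stop_training --continue_from $CHECKPOINT --traineddata $START_MODEL --model_output $FINAL_MODEL"
echo "$CMD"

# Run only when the binary is actually present at this relative path.
if [ -x training/lstmtraining ]; then
  $CMD
fi
```

Adding `--convert_to_int` to the same command should produce the smaller integer ('fast') model rather than the float one.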
Environment
Note: I also submitted a detailed question to the tesseract-ocr mailing list with the title "LSTMTRAINING from the scratch for khmer language - Legacy Limon Fonts". Could you please check it and give me some advice?