High error rate on training with Impact on RTL language (Kur) Persian-Arabic script #151 #157
Comments
Start from script/Arabic rather than ara.
Currently your unicharset is increasing from 85 to over 100. This is not
suitable for fine-tuning.
|
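One way to see why the unicharset grows is to list the characters that occur in the ground truth but not in the start model's unicharset. The following is only a minimal sketch: it assumes the usual unicharset layout (first line is the entry count, each later line starts with the symbol), the helper names are hypothetical, and the toy data is made up. It compares single code points, so multi-character unicharset entries are handled only approximately.

```python
def read_unicharset_symbols(text):
    """Return the set of symbols listed in the text of a unicharset file.

    Assumption: the first line holds the entry count and every following
    line starts with the symbol, followed by space-separated properties.
    """
    symbols = set()
    for line in text.splitlines()[1:]:
        fields = line.split(" ")
        if fields and fields[0]:
            symbols.add(fields[0])
    return symbols


def new_symbols(start_symbols, ground_truth_lines):
    """Characters used in the ground truth that the start model lacks."""
    used = {ch for line in ground_truth_lines for ch in line if not ch.isspace()}
    return sorted(used - start_symbols)


# Toy, simplified unicharset: NULL, ALEF (U+0627), BEH (U+0628), KEHEH (U+06A9).
toy_unicharset = (
    "4\n"
    "NULL 0 Common 0\n"
    "\u0627 1 Arabic 1\n"
    "\u0628 1 Arabic 2\n"
    "\u06a9 1 Arabic 3\n"
)
start = read_unicharset_symbols(toy_unicharset)

# Toy ground truth: the second word uses U+0695, which the toy unicharset lacks.
missing = new_symbols(start, ["\u0628\u0627 \u0695", "\u06a9\u0627"])
print([f"U+{ord(ch):04X}" for ch in missing])
```

Every symbol reported here would enlarge the merged unicharset, which is the growth described above.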
Thank you for your support. Is the link below the correct script for LSTM training? I am using PSM 6. How can I solve this issue (Normalization failed for string)? |
Yes, tessdata_best/script/Arabic is preferable. For single lines, I suggest using --psm 13. Please make sure that correct RTL processing is happening in reversal of text for box files. |
See #137 |
Thank you, will do that. I changed generate_wordstr_box.py as follows: ` create WordStr line boxes for Indic & RTL. Is this the correct modification? |
@sam-kurdi I will upload a new training for ckb that I have done and you can check whether results are as expected on real life images. It gives over 95% accuracy with lstmeval on single line images similar to those used for training. |
@Shreeshrii |
Please see new PR https://github.com/tesseract-ocr/tesstrain/pull/159/commits |
My first experience with training Arabic handwriting is documented here. The training is still running. I used the old |
@stweil How is the EPOCH defined? Are you using a custom version of Makefile? |
1 epoch = 1 iteration over all training data. It is commonly used for training of neural networks, but up to now not for Tesseract training. Yes, this is currently a local custom version of Makefile which calculates MAX_ITERATIONS from EPOCHS:
|
I updated https://github.com/tesseract-ocr/tesstrain/wiki/Arabic-Handwriting#training to explain what epochs means in the context of that training. |
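The relation between epochs and iterations can be sketched as simple arithmetic. This is not the custom Makefile mentioned above (its code is not shown in this thread). It is a hypothetical helper that assumes one training iteration processes one line image, and the numbers are illustrative.

```python
def max_iterations_from_epochs(epochs, num_training_lines):
    """Iterations needed to pass over all training lines `epochs` times.

    Assumption: one iteration consumes exactly one training line.
    """
    if epochs < 1 or num_training_lines < 1:
        raise ValueError("epochs and num_training_lines must be positive")
    return epochs * num_training_lines


# Illustrative values only: 10 epochs over 500 training lines.
iterations = max_iterations_from_epochs(10, 500)
print(iterations)
```

The resulting number would then be passed as MAX_ITERATIONS.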
@stweil Thanks. Calculating MAX_ITERATIONS from EPOCHS is a good addition. Since you are testing for RTL, it will be interesting to see Tesseract results for https://github.com/OpenITI/OCR_GS_Data - maybe you can do a run for those too. I tried a test earlier, but I changed too many things for it to be a valid comparison to their results. |
@stweil Please check that your custom Makefile is using
|
I had called |
I have tested with the modified Makefile version and it works fine. The final error rate with my own data set is 2.33. @Shreeshrii @stweil Is it mandatory that the image lines and the corresponding ground truth use the same font? |
Ground truth should be in Unicode text format and can be rendered in any Unicode font. So the font does not really matter for ground truth, as long as it is not a legacy non-Unicode font. The test dataset was extracted from synthetic training data generated using Unicode text and fonts. I think rtltest.tgz has images in the Unikurd-Jino font. |
Is ZWNJ being used in certain character combinations? |
yes, it has been used in many gt files |
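To check which character combinations involve ZWNJ (U+200C), the ground-truth lines can be scanned for the characters on either side of it. This is a minimal sketch with a hypothetical helper name and made-up Latin sample strings. In practice the lines would be read from the `*.gt.txt` files.

```python
from collections import Counter

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def zwnj_contexts(lines):
    """Count (previous_char, next_char) pairs around every ZWNJ."""
    contexts = Counter()
    for line in lines:
        for i, ch in enumerate(line):
            if ch == ZWNJ:
                before = line[i - 1] if i > 0 else ""
                after = line[i + 1] if i + 1 < len(line) else ""
                contexts[(before, after)] += 1
    return contexts


# Made-up sample lines; Latin letters are used only to keep the example readable.
sample = ["abc", "a\u200cb\u200cc", "x\u200cy", "a\u200cb"]
contexts = zwnj_contexts(sample)
for (before, after), count in contexts.most_common():
    print(repr(before), repr(after), count)
```

Combinations that appear only a few times in this count are candidates for additional training lines.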
Your training data has a very limited number of lines. Try with more training data and include more samples of the characters which are in error. |
I will prepare more training data. How about ZWNJ and WAN? |
@Shreeshrii |
Yes, the error rate will depend on the number of iterations as well as the number of lines of training data. How many lines of text are there in your training set? |
550 image lines. |
@Shreeshrii @stweil @theraysmith |
Clarify what you mean by WAN - is it 0-9 or Farsi numbers? EAN, I assume, is numbers in Arabic script? |
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I am getting a very high error rate (85) after training with Impact.
I started the training with the following configuration:
tesseract version:
tesseract 4.0.0-beta.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
1: I used the workflow from https://github.com/tesseract-ocr/tesstrain . The command line used to run the Makefile is (make training MODEL_NAME=krd START_MODEL=ara LANG_TYPE=RTL FINETUNETYPE=Impact)
2: I added inherited.unicharset, ara.config, kur langdata, Arabic.unicharset, and Latin.unicharset provided by https://github.com/tesseract-ocr/langdata_lstm
3: I used ara.traineddata from https://github.com/tesseract-ocr/tessdata_best as the start model
4: (1304 / 2) image lines + ground truth transcriptions
Could you please tell me whether there is any misconfiguration?
How can I improve the accuracy rate?
Training Log :
pc1@pc:~/Desktop/tesstrain-master$ make training MODEL_NAME=krd START_MODEL=ara LANG_TYPE=RTL FINETUNETYPE=Impact
find data/krd-ground-truth -name '*.gt.txt' | xargs cat | sort | uniq > "data/krd/all-gt"
combine_tessdata -u /home/pc1/Desktop/tesstrain-master/usr/share/tessdata/ara.traineddata data/ara/krd
Extracting tessdata components from /home/pc1/Desktop/tesstrain-master/usr/share/tessdata/ara.traineddata
Wrote data/ara/krd.config
Wrote data/ara/krd.lstm
Wrote data/ara/krd.lstm-punc-dawg
Wrote data/ara/krd.lstm-word-dawg
Wrote data/ara/krd.lstm-number-dawg
Wrote data/ara/krd.lstm-unicharset
Wrote data/ara/krd.lstm-recoder
Wrote data/ara/krd.version
Version string:4.00.00alpha:ara:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=545, offset=192
17:lstm:size=11582395, offset=737
18:lstm-punc-dawg:size=1986, offset=11583132
19:lstm-word-dawg:size=999442, offset=11585118
20:lstm-number-dawg:size=13250, offset=12584560
21:lstm-unicharset:size=5061, offset=12597810
22:lstm-recoder:size=769, offset=12602871
23:version:size=80, offset=12603640
unicharset_extractor --output_unicharset "data/krd/my.unicharset" --norm_mode 3 "data/krd/all-gt"
Bad box coordinates in boxfile string!
Extracting unicharset from plain text file data/krd/all-gt
Wrote unicharset file data/krd/my.unicharset
merge_unicharsets data/ara/krd.lstm-unicharset data/krd/my.unicharset "data/krd/unicharset"
Loaded unicharset of size 85 from file data/ara/krd.lstm-unicharset
Loaded unicharset of size 73 from file data/krd/my.unicharset
Wrote unicharset file data/krd/unicharset.
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/krd-ground-truth/17.7.png" -t "data/krd-ground-truth/17.7.gt.txt" > "data/krd-ground-truth/17.7.box"
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/krd-ground-truth/65.3.png" -t "data/krd-ground-truth/65.3.gt.txt" > "data/krd-ground-truth/65.3.box"
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/krd-ground-truth/46.10.png" -t "data/krd-ground-truth/46.10.gt.txt" > "data/krd-ground-truth/46.10.box"
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/krd-ground-truth/30.3.png" -t "data/krd-ground-truth/30.3.gt.txt" > "data/krd-ground-truth/30.3.box"
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/krd-ground-truth/16.6.png" -t "data/krd-ground-truth/16.6.gt.txt" > "data/krd-ground-truth/16.6.box"
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/krd-ground-truth/24.24.png" -t "data/krd-ground-truth/24.24.gt.txt" > "data/krd-ground-truth/24.24.box"
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
(REMOVED...... the log)
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/krd-ground-truth/63.10.png" -t "data/krd-ground-truth/63.10.gt.txt" > "data/krd-ground-truth/63.10.box"
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
find data/krd-ground-truth -name '*.lstmf' | python3 shuffle.py 0 > "data/krd/all-lstmf"
combine_lang_model
--input_unicharset data/krd/unicharset
--script_dir data
--numbers data/krd/krd.numbers
--puncs data/krd/krd.punc
--words data/krd/krd.wordlist
--output_dir data
--pass_through_recoder --lang_is_rtl
--lang krd
Loaded unicharset of size 107 from file data/krd/unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 16 = َ
Warning: properties incomplete for index 20 = ُ
Warning: properties incomplete for index 44 = ٍ
Warning: properties incomplete for index 48 = ّ
Warning: properties incomplete for index 65 = ِ
Warning: properties incomplete for index 66 = ْ
Warning: properties incomplete for index 69 = ً
Warning: properties incomplete for index 71 = ٌ
Warning: properties incomplete for index 87 =
Config file is optional, continuing...
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
lstmtraining
--debug_interval 0
--traineddata data/krd/krd.traineddata
--old_traineddata /home/pc1/Desktop/tesstrain-master/usr/share/tessdata/ara.traineddata
--continue_from data/ara/krd.lstm
--model_output data/krd/checkpoints/krd
--train_listfile data/krd/list.train
--eval_listfile data/krd/list.eval
--max_iterations 20000
Loaded file data/ara/krd.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 85 to 107!
Num (Extended) outputs,weights in Series:
1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys64:64, 20736
Lfx96:96, 61824
Lrx96:96, 74112
Lfx512:512, 1247232
Fc107:107, 54891
Total weights = 1458955
Previous null char=2 mapped to 2
Continuing from data/ara/krd.lstm
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/38.3.lstmf
(LOG REMOVED....)
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/101.2.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/57.8.lstmf
2 Percent improvement time=100, best error was 100 @ 0
At iteration 100/100/100, Mean rms=6.343%, delta=52.311%, char train=85.617%, word train=98.507%, skip ratio=0%, New best char error = 85.617 wrote checkpoint.
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/28.4.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/39.1.lstmf
(LOG REMOVED....)
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/45.12.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/85.3.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/12.6.lstmf
At iteration 200/200/200, Mean rms=6.631%, delta=58.281%, char train=92.803%, word train=99.254%, skip ratio=0%, New worst char error = 92.803 wrote checkpoint.
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/72.3.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/38.4.lstmf
(LOG REMOVED...)
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/92.5.lstmf
At iteration 300/300/300, Mean rms=6.674%, delta=60.127%, char train=95.198%, word train=99.502%, skip ratio=0%, New worst char error = 95.198 wrote checkpoint.
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/89.1.lstmf
(LOG REMOVED...)
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/25.3.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/88.2.lstmf
At iteration 400/400/400, Mean rms=6.706%, delta=61.546%, char train=96.399%, word train=99.627%, skip ratio=0%, New worst char error = 96.399 wrote checkpoint.
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/48.8.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/29.11.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/43.5.lstmf
(LOG REMOVED...)
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/46.4.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/28.13.lstmf
At iteration 500/500/500, Mean rms=6.73%, delta=62.598%, char train=97.117%, word train=99.701%, skip ratio=0%, New worst char error = 97.117 wrote checkpoint.
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/26.3.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/87.5.lstmf
(LOG REMOVED)
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/94.1.lstmf
At iteration 600/600/600, Mean rms=6.746%, delta=63.418%, char train=97.598%, word train=99.751%, skip ratio=0%, New worst char error = 97.598 wrote checkpoint.
At iteration 700/700/700, Mean rms=6.757%, delta=63.932%, char train=97.941%, word train=99.787%, skip ratio=0%, New worst char error = 97.941 wrote checkpoint.
At iteration 800/800/800, Mean rms=6.758%, delta=64.207%, char train=98.198%, word train=99.813%, skip ratio=0%, New worst char error = 98.198 wrote checkpoint.
At iteration 900/900/900, Mean rms=6.753%, delta=64.287%, char train=98.398%, word train=99.834%, skip ratio=0%, New worst char error = 98.398 wrote checkpoint.
At iteration 1000/1000/1000, Mean rms=6.748%, delta=64.338%, char train=98.559%, word train=99.851%, skip ratio=0%, New worst char error = 98.559 wrote checkpoint.
At iteration 1100/1100/1100, Mean rms=6.78%, delta=65.498%, char train=99.997%, word train=100%, skip ratio=0%, New worst char error = 99.997 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/85.4.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/97.7.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/35.8.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/42.8.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/77.1.lstmf
At iteration 1200/1200/1200, Mean rms=6.772%, delta=65.877%, char train=99.998%, word train=100%, skip ratio=0%, New worst char error = 99.998 wrote checkpoint.
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/24.15.lstmf
(LOG REMOVED...)
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/53.1.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1300/1300/1300, Mean rms=6.764%, delta=65.956%, char train=99.999%, word train=100%, skip ratio=0%, New worst char error = 99.999At iteration 1100, stage 0, Eval Char error rate=100, Word error rate=100 wrote checkpoint.
At iteration 1400/1400/1400, Mean rms=6.753%, delta=65.951%, char train=99.999%, word train=100%, skip ratio=0%, wrote checkpoint.
(LOG REMOVED....)
At iteration 19817/19900/19900, Mean rms=5.893%, delta=41.14%, char train=93.046%, word train=99.661%, skip ratio=0%, wrote checkpoint.
At iteration 19917/20000/20000, Mean rms=5.891%, delta=41.096%, char train=92.917%, word train=99.605%, skip ratio=0%, wrote checkpoint.
Finished! Error rate = 85.617
lstmtraining
--stop_training
--continue_from data/krd/checkpoints/krd_checkpoint
--traineddata data/krd/krd.traineddata
--model_output data/krd.traineddata
Loaded file data/krd/checkpoints/krd_checkpoint, unpacking...
pc1@pc:~/Desktop/tesstrain-master$
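A log like the one above is easier to judge when the `char train` values are pulled out per iteration. The sketch below is a hypothetical helper, not part of tesstrain. It assumes the "At iteration a/b/c, ... char train=N%" line shape seen in this log and uses the middle iteration number. The two sample lines are copied from the log above.

```python
import re

# Matches the progress lines printed during training in the log above.
PROGRESS_RE = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+), .*?char train=([\d.]+)%"
)


def char_train_errors(log_text):
    """Return (iteration, char_train_error) pairs found in the log text."""
    return [
        (int(m.group(2)), float(m.group(4)))
        for m in PROGRESS_RE.finditer(log_text)
    ]


def best_point(errors):
    """Return the (iteration, error) pair with the lowest error."""
    return min(errors, key=lambda pair: pair[1])


sample_log = (
    "At iteration 100/100/100, Mean rms=6.343%, delta=52.311%, "
    "char train=85.617%, word train=98.507%, skip ratio=0%, "
    "New best char error = 85.617 wrote checkpoint.\n"
    "At iteration 200/200/200, Mean rms=6.631%, delta=58.281%, "
    "char train=92.803%, word train=99.254%, skip ratio=0%, "
    "New worst char error = 92.803 wrote checkpoint.\n"
)
errors = char_train_errors(sample_log)
print(errors)
print(best_point(errors))
```

In this run the lowest value is the very first one, which matches the final "Error rate = 85.617" and shows that training never improved on the starting point.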