Can't encode transcription #147

Closed
peterbence3 opened this issue Oct 7, 2019 · 3 comments

@peterbence3

I am unable to fine-tune the Arabic model for the font 'Andalus'; lstmtraining reports this error:

Encoding of string failed! Failure bytes: 26 26
Can't encode transcription: 'و ىدتنملا ىدتنم الإ دق عيضاوملا ؟؟ عيقوتلا ليجستلا &&' in language ''
Encoding of string failed! Failure bytes: 3d 3d 20 ffffffd9 ffffff89 ffffffd9 ffffff81 20 ffffffd9 ffffff88 ffffffd8 ffffffa3 20 ffffffd9 ffffff84 ffffffd8 ffffffa8 ffffffd9 ffffff82 20 ffffffd9 ffffff89 ffffffd8 ffffffaf ffffffd8 ffffffaa ffffffd9 ffffff86 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff86 ffffffd9 ffffff85 20 ffffffd9 ffffff86 ffffffd9 ffffff88 ffffffd9 ffffff83 ffffffd8 ffffffaa 20 ffffffd8 ffffffa9 ffffffd8 ffffffad ffffffd9 ffffff81 ffffffd8 ffffffb5 ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd8 ffffffa9 ffffffd9 ffffff83 ffffffd8 ffffffb1 ffffffd8 ffffffa7 ffffffd8 ffffffb4 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7
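The "Failure bytes" dumps can be turned back into text to see where encoding stopped. The sketch below assumes that each token is one UTF-8 byte printed in hex, and that tokens such as `ffffffd9` are sign-extended bytes in which only the last two hex digits matter; `decode_failure_bytes` is a hypothetical helper, not part of Tesseract.

```python
def decode_failure_bytes(dump: str) -> str:
    """Turn a 'Failure bytes' dump from the log back into text.

    Assumption: every whitespace-separated token is a single byte in hex,
    possibly sign-extended (e.g. 'ffffffd9' stands for 0xd9).
    """
    # Keep only the low byte of each token, then decode the result as UTF-8.
    raw = bytes(int(token[-2:], 16) for token in dump.split())
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # First dump from the log above.
    print(decode_failure_bytes("26 26"))
    # Beginning of the second dump from the log above.
    print(decode_failure_bytes("3d 3d 20 ffffffd9 ffffff89 ffffffd9 ffffff81"))
```

Under those assumptions the first dump decodes to `&&` and the second one starts with `==`, which suggests (but does not prove) that these symbols are what the model's character set cannot encode.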

Please note that the line causing the error is the second-to-last line of the ara.training_text file, which contains:
&& التسجيل التوقيع ؟؟ المواضيع قد إلا منتدى المنتدى و

I'm using langdata_lstm to generate my training data and ara.traineddata as the model to continue from.

generating data:

../tesseract/src/training/tesstrain.sh --fonts_dir fonts/win7df \
	     --fontlist 'Andalus' \
	     --lang ara \
	     --linedata_only \
	     --langdata_dir ../langdata_lstm \
	     --tessdata_dir ../tesseract/tessdata \
	     --save_box_tiff \
	     --maxpages 10 \
	     --output_dir train

extracting old lstm:
combine_tessdata -e ../tesseract/tessdata/ara.traineddata ara.lstm

fine-tuning:

rm -rf output/*
OMP_THREAD_LIMIT=8 lstmtraining \
	--continue_from ara.lstm \
	--model_output output/araNewModel \
	--traineddata ../tesseract/tessdata/ara.traineddata \
	--train_listfile train/ara.training_files.txt \
	--max_iterations 400

I checked the generated training data and everything seems fine: the TIFF files include all the training text lines, including the line that triggers the error. I also tried generating training data and fine-tuning with other fonts, such as 'Arial' and 'Tahoma', but I still get the same error.

I was thinking about removing the offending line from the training text file, but I don't know whether that is safe. Besides, I think that 80 lines for training Arabic models is very small, isn't it? If I decide to train on more lines of data, what should I do, and which files would be affected?
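Before removing anything, it may help to list which training text lines contain characters outside a given character set. The following is a minimal sketch, assuming the set of encodable characters has already been collected somehow (for example from the model's unicharset); `split_by_charset` and the toy data are hypothetical. The per-character check is only an approximation, since unicharset entries may span several code points, and it says nothing about whether dropping lines is safe for training quality.

```python
def split_by_charset(lines, allowed_chars):
    """Split lines into (encodable, rejected).

    `rejected` holds (line, missing_chars) pairs so the offending
    characters can be inspected. Whitespace is never treated as missing.
    """
    allowed = set(allowed_chars)
    encodable, rejected = [], []
    for line in lines:
        # Collect every non-space character the set cannot represent.
        missing = sorted(
            {ch for ch in line if ch not in allowed and not ch.isspace()}
        )
        if missing:
            rejected.append((line, missing))
        else:
            encodable.append(line)
    return encodable, rejected


if __name__ == "__main__":
    # Toy data standing in for a real character set and training text.
    ok, bad = split_by_charset(["ab c", "a && b"], "abc")
    print(ok)
    print(bad)
```

With a real character set, the `rejected` list would show each problematic line together with the characters to look for.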

Regards

@amitdo

amitdo commented Oct 7, 2019

Besides, I think that 80 lines for training Arabic models is very small, isn't it?!!!

tesseract-ocr/langdata_lstm#6

@peterbence3
Author

@amitdo Is there any tutorial or documentation on how to generate new langdata? I could contribute by preparing the Arabic version.

@stweil
Contributor

stweil commented Oct 9, 2019

This is a duplicate of tesseract-ocr/tesseract#2695.

@stweil stweil closed this as completed Oct 9, 2019