Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't encode transcription #2695

Closed
peterbence3 opened this issue Oct 7, 2019 · 24 comments
Closed

Can't encode transcription #2695

peterbence3 opened this issue Oct 7, 2019 · 24 comments

Comments

@peterbence3
Copy link

Unable to fine-tune Arabic model for font 'Andalus', getting this error:

Encoding of string failed! Failure bytes: 26 26
Can't encode transcription: 'و ىدتنملا ىدتنم الإ دق عيضاوملا ؟؟ عيقوتلا ليجستلا &&' in language ''
Encoding of string failed! Failure bytes: 3d 3d 20 ffffffd9 ffffff89 ffffffd9 ffffff81 20 ffffffd9 ffffff88 ffffffd8 ffffffa3 20 ffffffd9 ffffff84 ffffffd8 ffffffa8 ffffffd9 ffffff82 20 ffffffd9 ffffff89 ffffffd8 ffffffaf ffffffd8 ffffffaa ffffffd9 ffffff86 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff86 ffffffd9 ffffff85 20 ffffffd9 ffffff86 ffffffd9 ffffff88 ffffffd9 ffffff83 ffffffd8 ffffffaa 20 ffffffd8 ffffffa9 ffffffd8 ffffffad ffffffd9 ffffff81 ffffffd8 ffffffb5 ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd8 ffffffa9 ffffffd9 ffffff83 ffffffd8 ffffffb1 ffffffd8 ffffffa7 ffffffd8 ffffffb4 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7

Please note that the line making the error is the pre-last line in the ara.training_txt file, that contains:
&& التسجيل التوقيع ؟؟ المواضيع قد إلا منتدى المنتدى و

I'm using langdata_lstm for generating my training data and the ara.traineddata to continue from.

generating data:

../tesseract/src/training/tesstrain.sh --fonts_dir fonts/win7df \
	     --fontlist 'Andalus' \
	     --lang ara \
	     --linedata_only \
	     --langdata_dir ../langdata_lstm \
	     --tessdata_dir ../tesseract/tessdata \
	     --save_box_tiff \
	     --maxpages 10 \
	     --output_dir train

extracting old lstm:
combine_tessdata -e ../tesseract/tessdata/ara.traineddata ara.lstm

fine-tuning:

rm -rf output/*
OMP_THREAD_LIMIT=8 lstmtraining \
	--continue_from ara.lstm \
	--model_output output/araNewModel \
	--traineddata ../tesseract/tessdata/ara.traineddata \
	--train_listfile train/ara.training_files.txt \
	--max_iterations 400

I'd checked the generated train data, where everything seems to be good, and tiff files includes all the train_text lines including the line making the error. I'd also tried to generate train data and fine tune for different fonts like 'Arial' and 'Tahoma' but still getting the same error.

I was thinking about removing the error line from the train_text file, but I don't know if it is safe or not. Besides, I think that 80 lines for training Arabic models is very small, isn't it?!!! So what if I decided to train for more lines of data, what should I do, and what files would be affected in such case?

Regards

@stweil stweil added the bug label Oct 7, 2019
@stweil
Copy link
Contributor

stweil commented Oct 7, 2019

That error message also occurs with other languages, for example Greek.

Until there is a fix for this problem, I suggest to remove the characters which trigger it.

@peterbence3
Copy link
Author

peterbence3 commented Oct 7, 2019

@stweil is removing those characters safe?
I tried removing those characters but I got the same problem on every other line.

@Shreeshrii
Copy link
Collaborator

I tested just now on Ubuntu, using Andalus font and I get the following errors (sorted from log)

Can't encode transcription: 'ةحفصلا ـ اهل ةصاخ ىلعألا ءاضعألا 3 جمانرب } [ - ديدج' in language ''
Can't encode transcription: 'ةملك == ىف وأ لبق ىدتنملا نم نوكت ةحفصلا ةكراشملا' in language ''
Can't encode transcription: 'ثحب انه %100 ةحاتم باعلا عيطتست ( > زكرملا لاسرإ ةحفصلا' in language ''
Can't encode transcription: 'ددع ىدتنم %0 ىف : ةيئاصحإ :در هفلم ةقطنم رخآ( ةشدرد ةروص' in language ''
Can't encode transcription: 'دمحأ جمارب مسا تامولعم , = ؟تانايبلا لاسرإ ىتح يف » يف' in language ''
Can't encode transcription: 'نإ ةديدج ةديدجلا لك ةلطعم ةطساوب - }} وه بطلا ةكراشم' in language ''
Can't encode transcription: 'ناك [ ]انه سب هللا : يهو نآلا صاخلا انه &&& يهو فيشرألا' in language ''
Can't encode transcription: 'و ىدتنملا ىدتنم الإ دق عيضاوملا ؟؟ عيقوتلا ليجستلا &&' in language ''
Can't encode transcription: 'ىلعو .3+ : نب ةكراشملا خيرات 50 %0.00 عيطتست ىلعألا' in language ''
Can't encode transcription: 'يصخشلا رخآ 13 مويلا & و دقف ظفح - يصخشلا انا ةبوتكملا' in language ''
Can't encode transcription: '% ـه هنأ لاوجلا قافرإ :3+ هنأ عقوم عيمج مسق ةطساوب نآلا' in language ''
Can't encode transcription: ':: ." عيضاوملا ةصاخ ــ ؟ }.. لاؤس دمحم .+. امأ ةروصلا' in language ''
Can't encode transcription: '= تقولا ىلإ وأ تايدتنم ةحفصلا تاودأ هيف ةدهاشم عيطتست' in language ''
Encoding of string failed! Failure bytes: 25 20 d9 87 20 d9 87 d9 86 d8 a3 20 d9 84 d8 a7 d9 88 d8 ac d9 84 d8 a7 20 d9 82 d8 a7 d9 81 d8 b1 d8 a5 20 3a 33 2b 20 d9 87 d9 86 d8 a3 20 d8 b9 d9 82 d9 88 d9 85 20 d8 b9 d9 8a d9 85 d8 ac 20 d9 85 d8 b3 d9 82 20 d8 a9 d8 b7 d8 b3 d8 a7 d9 88 d8 a8 20 d9 86 d8 a2 d9 84 d8 a7
Encoding of string failed! Failure bytes: 25 30 20 d9 89 d9 81 20 3a 20 d8 a9 d9 8a d8 a6 d8 a7 d8 b5 d8 ad d8 a5 20 3a d8 af d8 b1 20 d9 87 d9 81 d9 84 d9 85 20 d8 a9 d9 82 d8 b7 d9 86 d9 85 20 d8 b1 d8 ae d8 a2 28 20 d8 a9 d8 b4 d8 af d8 b1 d8 af 20 d8 a9 d8 b1 d9 88 d8 b5
Encoding of string failed! Failure bytes: 25 30 2e 30 30 20 d8 b9 d9 8a d8 b7 d8 aa d8 b3 d8 aa 20 d9 89 d9 84 d8 b9 d8 a3 d9 84 d8 a7
Encoding of string failed! Failure bytes: 25 31 30 30 20 d8 a9 d8 ad d8 a7 d8 aa d9 85 20 d8 a8 d8 a7 d8 b9 d9 84 d8 a7 20 d8 b9 d9 8a d8 b7 d8 aa d8 b3 d8 aa 20 28 20 3e 20 d8 b2 d9 83 d8 b1 d9 85 d9 84 d8 a7 20 d9 84 d8 a7 d8 b3 d8 b1 d8 a5 20 d8 a9 d8 ad d9 81 d8 b5 d9 84 d8 a7
Encoding of string failed! Failure bytes: 26 20 d9 88 20 d8 af d9 82 d9 81 20 d8 b8 d9 81 d8 ad 20 2d 20 d9 8a d8 b5 d8 ae d8 b4 d9 84 d8 a7 20 d8 a7 d9 86 d8 a7 20 d8 a9 d8 a8 d9 88 d8 aa d9 83 d9 85 d9 84 d8 a7
Encoding of string failed! Failure bytes: 26 26
Encoding of string failed! Failure bytes: 26 26 26 20 d9 8a d9 87 d9 88 20 d9 81 d9 8a d8 b4 d8 b1 d8 a3 d9 84 d8 a7
Encoding of string failed! Failure bytes: 3d 20 d8 9f d8 aa d8 a7 d9 86 d8 a7 d9 8a d8 a8 d9 84 d8 a7 20 d9 84 d8 a7 d8 b3 d8 b1 d8 a5 20 d9 89 d8 aa d8 ad 20 d9 8a d9 81 20 c2 bb 20 d9 8a d9 81
Encoding of string failed! Failure bytes: 3d 20 d8 aa d9 82 d9 88 d9 84 d8 a7 20 d9 89 d9 84 d8 a5 20 d9 88 d8 a3 20 d8 aa d8 a7 d9 8a d8 af d8 aa d9 86 d9 85 20 d8 a9 d8 ad d9 81 d8 b5 d9 84 d8 a7 20 d8 aa d8 a7 d9 88 d8 af d8 a3 20 d9 87 d9 8a d9 81 20 d8 a9 d8 af d9 87 d8 a7 d8 b4 d9 85 20 d8 b9 d9 8a d8 b7 d8 aa d8 b3 d8 aa
Encoding of string failed! Failure bytes: 3d 3d 20 d9 89 d9 81 20 d9 88 d8 a3 20 d9 84 d8 a8 d9 82 20 d9 89 d8 af d8 aa d9 86 d9 85 d9 84 d8 a7 20 d9 86 d9 85 20 d9 86 d9 88 d9 83 d8 aa 20 d8 a9 d8 ad d9 81 d8 b5 d9 84 d8 a7 20 d8 a9 d9 83 d8 b1 d8 a7 d8 b4 d9 85 d9 84 d8 a7
Encoding of string failed! Failure bytes: 7d 20 5b 20 2d 20 d8 af d9 8a d8 af d8 ac
Encoding of string failed! Failure bytes: 7d 2e 2e 20 d9 84 d8 a7 d8 a4 d8 b3 20 d8 af d9 85 d8 ad d9 85 20 2e 2b 2e 20 d8 a7 d9 85 d8 a3 20 d8 a9 d8 b1 d9 88 d8 b5 d9 84 d8 a7
Encoding of string failed! Failure bytes: 7d 7d 20 d9 88 d9 87 20 d8 a8 d8 b7 d9 84 d8 a7 20 d8 a9 d9 83 d8 b1 d8 a7 d8 b4 d9 85

So it seems to be related to the characters % & = }.

@stweil
Copy link
Contributor

stweil commented Oct 7, 2019

Encoding of string failed! Failure bytes: 3d 3d 20 ffffffd9 ffffff89 [...]

ffffffd9 indicates that this was not the latest Tesseract code (which fixed that error message).

@Shreeshrii
Copy link
Collaborator

Shreeshrii commented Oct 7, 2019

Besides, I think that 80 lines for training Arabic models is very small, isn't it?!!!

This is a known issue. langdata and langdata_lstm have not been updated with new (tess4) language training data for Arabic.

it seems to be related to the characters % & =

While these characters are there in the current training_text they are not there in the ara.lstm-unicharset extracted from tessdata_best/ara.traineddata, that is the reason for error while doing the impact finetuning.

The error will not be there for plusminus type of training.

@stweil
Copy link
Contributor

stweil commented Oct 7, 2019

That does not explain my encoding errors, because I trained from scratch. Nevertheless the critical characters are somehow missing in the unicharset:

Some characters can be encoded in different ways, notably characters with diacritical marks. For example the character ñ can be encoded as U+00F1, but also as an n with U+0303 (tilde). The first form is in the unicharset, the second form is in the ground truth text and causes encoding errors for texts like "das leben in ein beſſers werd gekert. Vñ als".

@Shreeshrii
Copy link
Collaborator

Shreeshrii commented Oct 7, 2019

Some characters can be encoded in different ways, notably characters with diacritical marks.

I have seen similar errors for Devanagari for letters with nukta -https://r12a.github.io/scripts/devanagari/block#char093C

There are both precomposed and decomposed form of these letters in Unicode.
Unicode NFC normalization produces the decomposed sequence.

However the training_texts in langdata and langdata_lstm repos have the precomposed forms.

Ray added normalization checks for training but probably didn't update the language training data in repos.

@stweil
Copy link
Contributor

stweil commented Oct 7, 2019

I wonder whether that characters are handled correctly when calculating error rates. Programs like diff or wdiff don't know that both forms are identical.

@stweil
Copy link
Contributor

stweil commented Oct 7, 2019

It looks like unicharset_extractor normalizes those characters, but tesseract ... lstm.train does not. So the created lstmf files use the same encoding as the ground truth texts which can differ from the encoding used for the unicharset. So the solution would require lstmf files with normalized encoding.

@Shreeshrii
Copy link
Collaborator

@stweil Please see comments by Ray in this thread - tesseract-ocr/langdata#63

@stweil
Copy link
Contributor

stweil commented Oct 7, 2019

I think this is a duplicate of issue #1012.

@Shreeshrii
Copy link
Collaborator

There are other related issues too - e.g. see #2267
This should now be reopened since ocrd-train is part of tesseract now.

@stweil
Copy link
Contributor

stweil commented Oct 8, 2019

I marked that issue as a duplicate of issue #1012 now, so no need to reopen it.

@peterbence3
Copy link
Author

The error will not be there for plusminus type of training.

@Shreeshrii so you believe this is a solution for now?

@Shreeshrii
Copy link
Collaborator

Do you need the characters % & = } with Arabic, if so try plusminus training?

However, when I tried it yesterday the unicharset size reduced from 80+ to 70+, so some characters got dropped. I didn't compare the unicharsets or investigate further. However accuracy of finetuned data on the Andalus font training set was better. I didn't try separately for an eval set.

If you don't need those characters, then remove just those chars from the training_text and try the Impact style training again.

Please know that there are issues while training RTL languages so fine-tuning with minimal changes to model works best.

Couple of things that I am aware of,

  • RTL languages are treated as LTR for training. This is intentional, as per Ray. However, the validation routines check that no word begin with a combining mark. But when RTL text is treated in reverse order, a word ending combining mark will come first.
  • the other issue is related to use of punctuation and numbers, which are treated as LTR or directionally neutral in RTL languages. LSTM training has a problem with these too.

FYI, I do not know Arabic or any RTL language for that matter.

@peterbence3
Copy link
Author

peterbence3 commented Oct 8, 2019

@Shreeshrii, I decided to start fine-tuning with plusminus training so I read this tutorial Fine Tuning for ± a few characters ad did the following:

  1. generated the train data files for 'Andalus' Font as usual
  2. getting a new unicharset file langdata_lstm/Arabic.unicharset, (I don't know why this unicharset contains much more chars than langdata_lstm/ara/ara.unicharset)
  3. getting the tessdata_best/ara.traineddata
  4. combining new unicharset file with the above ara.traineddata (but changing unicharset name from ara.unicharset to ara.lstm-unicharset to match the name of the unicharset in the found in the traineddata file)
  5. extracting ara.lstm from the ara.traineddata file to continue training from it
  6. executing the fine-tuning like:
OMP_THREAD_LIMIT=8 lstmtraining \
	--continue_from starter/old_traineddata_with_new_unicharset_combined/ara.lstm \
	--model_output output/araNewModel \
	--old_traineddata starter/old_traineddata/ara.traineddata \
	--traineddata starter/old_traineddata_with_new_unicharset_combined/ara.traineddata \
	--train_listfile train/ara.training_files.txt \
	--max_iterations 1000

Output:

Encoding of string failed! Failure bytes: ffffffd9 ffffff8a ffffffd9 ffffff86 ffffffd8 ffffffa7 ffffffd8 ffffffba ffffffd8 ffffffa7 20 ffffffd9 ffffff86 ffffffd8 ffffffa3 20 32 30 30 37 20 ffffffd8 ffffffaf ffffffd8 ffffffa8 ffffffd8 ffffffb9 20 3a ffffffd8 ffffffa9 ffffffd9 ffffff8a ffffffd9 ffffff88 ffffffd8 ffffffb6 ffffffd8 ffffffb9 ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff85 ffffffd8 ffffffa7 ffffffd8 ffffffb9 20 ffffffd9 ffffff88 ffffffd9 ffffff87 20 ffffffc2 ffffffbb ffffffc2 ffffffab 20 ffffffd8 ffffffb9 ffffffd9 ffffff8a ffffffd9 ffffff85 ffffffd8 ffffffac 20 2c 2c 20 ffffffd8 ffffffa7 ffffffd9 ffffff87 ffffffd9 ffffff8a ffffffd9 ffffff81 20 ffffffd8 ffffffa8 ffffffd8 ffffffa7 ffffffd8 ffffffb9 ffffffd9 ffffff84 ffffffd8 ffffffa7
Can't encode transcription: 'يناغا نأ 2007 دبع :ةـيوضعلا ماع وه »« عيمج ,, اهيف باعلا' in language ''
...
...

I tried combining the new unicharset with ara.traineddata generated by the tesstrain.sh and the using this ara.traineddata file in the --traineddata option of the fine-tuning but still facing the same problem.

I believe that this isn't a bug, I must be doing something wrong!! any help please.
Thanks

@Shreeshrii
Copy link
Collaborator

@peterbence3 The unicharset is extracted from the training_text file. You don't need so many steps.

I used the following script:

#!/bin/bash

time ~/tesseract/src/training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --lang ara --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata \
  --tessdata_dir ~/tessdata \
  --fontlist "Andalus" \
  --training_text ~/langdata/ara/ara.training_text \
  --workspace_dir ~/tmp/ \
  --save_box_tiff \
  --output_dir ~/tesstutorial/araeval
  
echo "/n ****** Finetune one of the fully-trained existing models: ***********"

mkdir -p ~/tesstutorial/ara_from_full

combine_tessdata -e ~/tessdata_best/ara.traineddata \
  ~/tesstutorial/ara_from_full/ara.lstm
  
lstmtraining \
  --model_output ~/tesstutorial/ara_from_full/PLUS \
   --continue_from ~/tesstutorial/ara_from_full/ara.lstm \
   --traineddata ~/tesstutorial/araeval/ara/ara.traineddata \
   --old_traineddata ~/tessdata_best/ara.traineddata \
   --train_listfile ~/tesstutorial/araeval/ara.training_files.txt \
   --debug_interval -1 \
   --max_iterations 3600
  
echo -e "\n****************************  ******\n"

lstmeval \
  --model ~/tessdata_best/ara.traineddata \
  --eval_listfile ~/tesstutorial/araeval/ara.training_files.txt
  
echo -e "\n****************************  ******\n"

lstmeval \
  --model ~/tesstutorial/ara_from_full/PLUS_checkpoint \
   --traineddata ~/tesstutorial/araeval/ara/ara.traineddata \
  --eval_listfile ~/tesstutorial/araeval/ara.training_files.txt

echo -e "\n****************************  ******\n"

time lstmtraining \
  --stop_training \
  --continue_from ~/tesstutorial/ara_from_full/PLUS_checkpoint \
  --traineddata ~/tessdata_best/ara.traineddata \
  --model_output ~/tesstutorial/ara_from_full/ara.Andalus.PLUS.traineddata

@peterbence3
Copy link
Author

@Shreeshrii that was so useful, but there is one idea that I didn't get it yet. You said

The unicharset is extracted from the training_text file

Does the lstmtraining tool extract it automatically from the training_text or I should do it my self, if yes how?

Regards

@Shreeshrii
Copy link
Collaborator

It is generated by tesstrain.sh - see https://github.com/tesseract-ocr/tesseract/blob/master/src/training/tesstrain_utils.sh#L346

@peterbence3
Copy link
Author

peterbence3 commented Oct 8, 2019

@Shreeshrii thanks a lot, now its working super fine, steps I followed:

1. clone tesseract, followed installation instructions to build and install from source

2. clone langdata-lstm

3. getting the tessdata_best/ara.traineddata and place it at tessertact/tessdata inside the project cloned in step one

4. edit langdata_lstm/ara/ara.traning_text by adding few more lines

5. download a set of Arabic fonts that i need to fine tune for (place them in any folder)

6. generating the train data files as follows:

tesstrain.sh --fonts_dir fonts/win7df \
	     --fontlist 'Courier New' 'Segoe UI' 'Tahoma' 'Times New Roman' 'Arial' 'Andalus' 'Microsoft Sans Serif' 'Adobe Arabic' \
	     --lang ara \
	     --linedata_only \
	     --langdata_dir ../langdata_lstm \
	     --tessdata_dir ../tesseract/tessdata \
	     --save_box_tiff \
	     --maxpages 10 \
	     --output_dir train

this will generate a new train data, including the ara.training_files.txt plus a folder named 'ara' that contains a starter ara.traineddata (actually not trained yet) containing your unicharset file that was automatically generated by tesstrain.sh for your custom ara.training_text.

7. extracting ara.lstm from the ara.traineddata file to continue training from it later using:
combine_tessdata -e ../tesseract/tessdata/ara.traineddata ara.lstm

8. now everything is ready, execute the fine-tuning like:

OMP_THREAD_LIMIT=8 lstmtraining \
	--continue_from starter/old_extracted/ara.lstm \
	--model_output output/araNewModel \
	--old_traineddata starter/old_traineddata/ara.traineddata \
	--traineddata train/ara/ara.traineddata \
	--train_listfile train/ara.training_files.txt \
	--max_iterations 1000

9. enjoy with no encoding errors

Thanks all

@Shreeshrii
Copy link
Collaborator

@peterbence3 for extending the training_text for finetuning - you can try to reengineer the files from the trainnedata.

# unpack best traineddata file
combine_tessdata  -u ~/tessdata_best/ara.traineddata  ara.

# create wordlist from dawg file - use files extracted from best traineddata
dawg2wordlist ara.lstm-unicharset ara.lstm-word-dawg  ara.lstm-wordlist

# copy wordlist and word-bigrams from langdata/ara
cp ~/langdata/ara/ara.wordlist ./
cp ~/langdata/ara/ara.word.bigrams ./

# concatenate various wordlists, shuffle and convert to text lines
cat ara.wordlist ara.word.bigrams ara.lstm-wordlist | sort | uniq > ara.lstm-wordlist-sorted
shuf  ara.lstm-wordlist-sorted > ara.lstm-wordlist-shuffled
par 150l < ara.lstm-wordlist-shuffled > ara.lstm-wordlist-lines

# concatenate with existing training text
cat ~/langdata/ara/ara.training_text ara.lstm-wordlist-lines > ara.extended.training_text

@talha1503
Copy link

@peterbence3 Can you please tell how I should run the 6th point in your issue? How should I run tesstrain.sh ? There is no tesstrain.sh present in both the repositories.

@wolfassi123
Copy link

wolfassi123 commented Feb 27, 2022

@Shreeshrii I understand that when calling lstmeval, the words will be reversed in arabic as it is intentional.
I have tried what have been mentioned here, as I am training my model for a new arabic font and I am aiming at training on my own generated data containing dates and numbers in arabic, as they will be vital in my training.
I have added several lines of training in the "training_text" file and proceeded as mentioned above, I had the exact same pipeline approximetly.
Yet I still got the "Can't encode transcription" and"Encoding of string failed" error whenever an arabic number appears. Sometimes I even get a "Compute CTC targets failed" error, which I don't know what it means.
When I call eval on my newly generated traineddata, it is performing extremely poorly when it comes to arabic numbers.
I tried to test it on some images that I have, it performed extremely bad concerning numbers.
Any tips or ideas for help?

@AhmadHakami
Copy link

any updates?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants
@Shreeshrii @stweil @amitdo @talha1503 @peterbence3 @wolfassi123 @AhmadHakami and others