LSTM: Training - Error msg - Encoding of string failed! #549

Shreeshrii · 2016-12-09T04:00:10Z

$   training/lstmtraining --model_output ~/tesstutorial/sanskrit2003_from_full/sanskrit2003 \
>   --continue_from ~/tesstutorial/sanskrit2003_from_full/san.lstm \
>   --train_listfile ~/tesstutorial/santrain/san.training_files.txt \
>   --target_error_rate 0.01
Loaded file /home/shree/tesstutorial/sanskrit2003_from_full/sanskrit2003_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/sanskrit2003_from_full/sanskrit2003_checkpoint
Loaded 1746/1746 pages (0-1746) of document /home/shree/tesstutorial/santrain/san.Chandas.exp0.lstmf
Loaded 345/1760 pages (1415-1760) of document /home/shree/tesstutorial/santrain/san.Uttara.exp0.lstmf
Loaded 1814/1814 pages (0-1814) of document /home/shree/tesstutorial/santrain/san.Gargi.exp0.lstmf
Found AVX
Found SSE
At iteration 1808/17200/17229, Mean rms=0.336%, delta=0.129%, char train=0.41%, word train=1.751%, skip ratio=0.2%,  New worst char error = 0.41 wrote checkpoint.

Encoding of string failed! Failure bytes: ffffffc2 ffffffa3 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5
Can't encode transcription: व्यतर्कि १४. भवति ३७॥ £ सर्व्व
At iteration 1818/17300/17330, Mean rms=0.334%, delta=0.13%, char train=0.404%, word train=1.632%, skip ratio=0.3%,  wrote checkpoint.


Shreeshrii · 2016-12-28T04:48:51Z

I am still getting the errors with the following version:


 tesseract -v
tesseract 4.00.00alpha-219-gc124f87
 leptonica-1.74
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8


Can't encode transcription: सगुनल उठैलका देउता नेउता लवरना लोहमान कुदार
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb9 ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffff85 ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff80 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffac ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbe
Can't encode transcription: बिसहरी सड़िया हड़िया लादना अधसेरी सुबुकना
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffa8 20 ffffffe0 ffffffa4 ffffffac ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffbf 20 ffffffe0 ffffffa4 ffffff97 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa4 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb6 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa5 ffffff87 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffff9c ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffffa4 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb0 20 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffff97 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffff81
Can't encode transcription: चूड़ियन बुद्धि गुप्ता शासनमे सुद्धा जँतसार निगुनियाँ
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffff87 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa5 ffffff82 ffffffe0 ffffffa4 ffffff81 20 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffac ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9b ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffff81 20 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9f ffffffe0 ffffffa5 ffffff80 20 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffa8
Can't encode transcription: दौड़इलूँ पोथा बोथा मोथा स्वेच्छासँ पार्टी लड़कियन

Shreeshrii · 2016-12-31T06:48:39Z

Also seen when fine-tuning Arabic:


lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned \
  --continue_from ~/tesstutorial/aratuned_from_ara/ara.lstm \
  --train_listfile ~/tesstutorial/ara/ara.training_files.txt \
  --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt \
  --target_error_rate 0.0001
Loaded file /home/shree/tesstutorial/aratuned_from_ara/aratuned_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/aratuned_from_ara/aratuned_checkpoint
Loaded 229/229 pages (1-229) of document /home/shree/tesstutorial/ara/ara.Amiri.exp0.lstmf
Loaded 232/232 pages (1-232) of document /home/shree/tesstutorial/ara/ara.Arial.exp0.lstmf
Loaded 4/4 pages (1-4) of document /home/shree/tesstutorial/aratest/ara.Times_New_Roman.exp0.lstmf
Encoding of string failed! Failure bytes: ffffffd9 ffffff8e ffffffd9 ffffff8a ffffffd9 ffffff82 ffffffd9 ffffff90 ffffffd8 ffffffaf ffffffd9 ffffff90 ffffffd8 ffffffa7 ffffffd8 ffffffb5 ffffffd9 ffffff8e 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd8 ffffffaa ffffffd9 ffffff8f ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd9 ffffff83 ffffffd9 ffffff8f 20 ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd8 ffffffa5 ffffffd9 ffffff90 20 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd9 ffffff84 ffffffd9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff87 ffffffd9 ffffff90 20 ffffffd9 ffffff86 ffffffd9 ffffff90 ffffffd9 ffffff88 ffffffd8 ffffffaf ffffffd9 ffffff8f 20 ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd9 ffffff85 ffffffd9 ffffff90 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff83 ffffffd9 ffffff8f ffffffd8 ffffffa1 ffffffd9 ffffff8e ffffffd8 ffffffa7 ffffffd8 ffffffaf ffffffd9 ffffff8e ffffffd9 ffffff87 ffffffd9 ffffff8e ffffffd8 ffffffb4 ffffffd9 ffffff8f
Can't encode transcription: نَيقِدِاصَ مْتُنْكُ نْإِ اللَّهِ نِودُ نْمِ مْكُءَادَهَشُ
Loaded 231/231 pages (1-231) of document /home/shree/tesstutorial/ara/ara.Arial_Unicode_MS.exp0.lstmf
Encoding of string failed! Failure bytes: ffffffd9 ffffff8e ffffffd9 ffffff88 ffffffd8 ffffffb1 ffffffd9 ffffff8f ffffffd8 ffffffb5 ffffffd9 ffffff90 ffffffd8 ffffffa8 ffffffd9 ffffff92 ffffffd9 ffffff8a ffffffd9 ffffff8f 20 ffffffd9 ffffff84 ffffffd9 ffffff8e ffffffd8 ffffffa7 20 ffffffd8 ffffffaa ffffffd9 ffffff8d ffffffd8 ffffffa7 ffffffd9 ffffff85 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd9 ffffff8f ffffffd8 ffffffb8 ffffffd9 ffffff8f 20 ffffffd9 ffffff8a ffffffd9 ffffff81 ffffffd9 ffffff90 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff8f ffffffd9 ffffff83 ffffffd9 ffffff8e ffffffd8 ffffffb1 ffffffd9 ffffff8e ffffffd8 ffffffaa ffffffd9 ffffff8e ffffffd9 ffffff88 ffffffd9 ffffff8e 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffffd9 ffffff90 ffffffd9 ffffff88 ffffffd9 ffffff86 ffffffd9 ffffff8f ffffffd8 ffffffa8 ffffffd9 ffffff90
Can't encode transcription: نَورُصِبْيُ لَا تٍامَلُظُ يفِ مْهُكَرَتَوَ مْهِرِونُبِ
Encoding of string failed! Failure bytes: ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff90

theraysmith · 2017-01-11T23:34:22Z

See new section in trainingtesseract-4.00

Shreeshrii · 2017-01-12T08:49:43Z

The wiki does not seem to have this section: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 (the page shows "Stefan Weil edited this page 28 days ago · 9 revisions"). We have a GitHub outage in India just now, so I am not sure whether this is related to that or whether the wiki update is still to be done.


Brian51 · 2017-01-12T09:32:48Z

It is working correctly in Spain. Thank you all for the incredible amount of work that you have done.

amitdo · 2017-01-12T10:36:44Z

I don't see the changes either.

The wiki can be cloned as a git repo. Ray probably did some edits locally, but didn't 'push' them yet.

theraysmith · 2017-01-12T17:33:46Z

Changes are pushed now. I got called away yesterday before I was able to do it.


Shreeshrii · 2017-01-21T14:15:19Z


Encoding of string failed! Failure bytes: 9 31 32 30 30 45 6d 69 6c 69 65 2c 68 61 6e 73 4b 6f 6e 65 2e
Can't encode transcription: Møller.     1200Emilie,hansKone.

This occurred when trying to train frk.

theraysmith · 2017-01-23T19:22:51Z

The tab character (9) at the beginning of the list of failure bytes is a dead giveaway.
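Reading the hex dump by eye gets harder for non-ASCII text, where each byte is printed sign-extended (for example `ffffffe0`). The following is a minimal Python sketch, not part of Tesseract, that turns a "Failure bytes" dump back into text and names each code point. The function names `decode_failure_bytes` and `describe` are hypothetical, and the sketch assumes the dump is on a single line (a terminal-wrapped dump must be re-joined first, since a wrap can split one hex token in two).

```python
import unicodedata


def decode_failure_bytes(log_line: str) -> str:
    """Turn a 'Failure bytes:' hex dump back into text."""
    # Accept either the bare hex dump or the full log line.
    hex_part = log_line.split("Failure bytes:")[-1]
    # Bytes >= 0x80 appear sign-extended (ffffffe0); keep only the low 8 bits.
    raw = bytes(int(token, 16) & 0xFF for token in hex_part.split())
    # The dump can start in the middle of a character, so do not fail on that.
    return raw.decode("utf-8", errors="replace")


def describe(text: str) -> list[str]:
    """List each code point with its Unicode name (control characters have none)."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed control>')}"
        for ch in text
    ]


if __name__ == "__main__":
    line = ("Encoding of string failed! Failure bytes: "
            "9 31 32 30 30 45 6d 69 6c 69 65 2c 68 61 6e 73 4b 6f 6e 65 2e")
    decoded = decode_failure_bytes(line)
    print(repr(decoded))
    print(describe(decoded)[0])
```

For the frk example above, the first code point reported is U+0009, the tab that Ray points out.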


harinath141 · 2017-02-02T08:41:13Z

@Shreeshrii
Is this issue resolved? I'm getting the same error when training the Telugu language.

Shreeshrii · 2017-02-02T09:38:04Z

Please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training

"Encoding of string failed!" results when the text string for a training image cannot be encoded using the given unicharset.

Possible causes are:

- There is an un-represented character in the text, say a British Pound sign that is not in your unicharset.

- A stray unprintable character (like tab or a control character) in the text.

- There is an un-represented Indic grapheme/aksara in the text.

In any case it will result in that training image being ignored by the trainer.

If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.
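The first two causes can be checked before training with a rough Python sketch like the one below. It is not a Tesseract tool: `parse_unicharset` and `uncovered_characters` are hypothetical helpers, and the parsing rests on assumptions about the unicharset text format (first line is the entry count, the first space-separated field of each later line is the unichar, `NULL` stands for the space character, and `Joined` / `|Broken|...` are special entries). It only tests code-point coverage, so a string can pass this check and still fail to encode if a whole grapheme/aksara is missing.

```python
def parse_unicharset(lines):
    """Collect the unichar strings from the lines of a unicharset file."""
    unichars = set()
    for line in lines[1:]:  # assumption: line 0 holds the number of entries
        if not line.strip():
            continue
        unichar = line.split(" ", 1)[0]
        if unichar == "NULL":  # assumption: NULL is the stored form of a space
            unichars.add(" ")
        elif unichar == "Joined" or unichar.startswith("|Broken"):
            continue  # assumption: special entries, not real text
        else:
            unichars.add(unichar)
    return unichars


def uncovered_characters(text, unichars):
    """Return code points in text that occur in no unicharset entry."""
    covered = set("".join(unichars))
    return sorted({ch for ch in text if ch != "\n" and ch not in covered})


if __name__ == "__main__":
    sample = [
        "4",
        "NULL 0 NULL 0",
        "Joined 7 0,69,188,255,486,1218,0,30,486,1188 Latin 1 0 1 Joined",
        "य 1 63,64,192,192,114,142,0,0,111,133 Devanagari 8 0 8 य",
        "a 3 0,0,0,0,0,0,0,0,0,0 Latin 3 0 3 a",
    ]
    unichars = parse_unicharset(sample)
    for ch in uncovered_characters("a य £\t", unichars):
        print(f"not covered: U+{ord(ch):04X}")
```

In practice the lines would come from the real file, for example `open(path, encoding="utf-8").read().splitlines()`, and the text from the training text or ground-truth files.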

Shreeshrii · 2017-02-02T13:58:20Z

@harinath141 If you are getting a lot of these errors during fine-tuning, try "replace top layer" training instead. You can reuse the box/tiff pairs generated for fine-tuning. The commands will be similar to the following:

mkdir -p ~/tesstutorial/tellayer_from_tel 

combine_tessdata -e ../tessdata/tel.traineddata \
  ~/tesstutorial/tellayer_from_tel/tel.lstm
  
lstmtraining -U ~/tesstutorial/tel/tel.unicharset \
  --script_dir ../langdata  --debug_interval 0 \
  --continue_from ~/tesstutorial/tellayer_from_tel/tel.lstm \
  --append_index 5 --net_spec '[Lfx256 O1c105]' \
  --model_output ~/tesstutorial/tellayer_from_tel/tellayer \
  --train_listfile ~/tesstutorial/tel/tel.training_files.txt \
  --target_error_rate 0.01

Shreeshrii · 2017-02-02T14:01:13Z

~/tesstutorial/tel/ should have your .lstmf files.

harinath141 · 2017-02-02T14:08:47Z

Thank you, @Shreeshrii. I'll try replacing the top layer.

Shreeshrii · 2017-02-03T05:32:57Z

@harinath141

When you use --debug_interval 0 you will see messages every 100 iterations like the following:

At iteration 45909/58500/58569, Mean rms=0.639%, delta=0.621%, char train=1.861%, word train=13.302%, skip ratio=0%,  wrote checkpoint.

At iteration 45960/58600/58669, Mean rms=0.64%, delta=0.616%, char train=1.844%, word train=12.933%, skip ratio=0%,  wrote checkpoint.

2 Percent improvement time=14052, best error was 3.697 @ 31958
At iteration 46010/58700/58769, Mean rms=0.634%, delta=0.561%, char train=1.686%, word train=12.343%, skip ratio=0%,  New best char error = 1.686 wrote best model:/home/shree/tesstutorial/khmlayer1_from_khm/khm1.686_46010.lstm wrote checkpoint.

When you use --debug_interval -1 , messages such as the following will be shown for every iteration:


Iteration 59400: ALIGNED TRUTH : មានរូបឆ្មាំ អេស៊ីលីដា
Iteration 59400: BEST OCR TEXT : មានរូបឆ្មាំ អេស៊ីលីដា
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Noto_Serif_Khmer_Bold.exp0.lstmf page 53 (Perfect):
Mean rms=0.646%, delta=0.553%, train=1.878%(13.168%), skip ratio=0.1%
Iteration 59401: ALIGNED TRUTH : ឆ្កៀលយកភ្នែក ជួនឆ្លងវគ្គ ចាប់ពីពេលនោះមក របស់គាត់ កុំធេ្វសគំនិត។ អូនហ្អើយ =
Iteration 59401: BEST OCR TEXT : ឆ្លៀលយកភ្នែក ជួនឆ្លងវគត ចាប់ពីពេលនោះមក របស់គាត់ កុំធេ្វសគំនិត។ អូនហ្អើយ =
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Noto_Serif_Khmer.exp0.lstmf page 1 :
Mean rms=0.647%, delta=0.555%, train=1.881%(13.157%), skip ratio=0.1%
Iteration 59402: ALIGNED TRUTH : សឹងមានះរឹងត្អឹងមហិមា គុណ នៅប៉ែកឦសាននៃភ្នំ ទុលល្យូ ខេត្តស្ទឺងត្រែង,
Iteration 59402: BEST OCR TEXT : សឹងមានះរឹងត្អឹងមហិមា គុណ នៅប៉ែកឦសាននៃភ្នំ ទុលល្យូ ខេត្តស្ទឺងត្រែង,
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Leelawadee_UI_Bold.exp0.lstmf page 56 :
Mean rms=0.647%, delta=0.556%, train=1.881%(13.157%), skip ratio=0.1%
Iteration 59403: ALIGNED TRUTH : រឺគៃបន្លំបាន។ (រឿងអាខ្វាក់អាខ្វិន) អន្នំលោកង្សិ = ឧទាហរណ៍់៖តំបន់ខ្លះ ផ្ទះសម្បែង
Iteration 59403: BEST OCR TEXT : រឺគៃបន្លំបាន។ (រឿងអាខ្វាក់អាខ្វិន) អន្នំលោកង្សិ = ឧទាហរណ៍៖តំបន់ខ្លះ ផ្ទះសម្បែង
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Leelawadee_UI.exp0.lstmf page 51 :

Intermediate checkpoint and .lstm files will be written to the output directory, e.g. ~/tesstutorial/tellayer_from_tel.
You can also see visual debugging output with ScrollView.

Shreeshrii · 2017-06-14T11:50:18Z

@theraysmith

I am still getting this error in a new "replace top layer" training for Devanagari script, where the eval_listfile is based on a different training text. For example:

Encoding of string failed! Failure bytes: ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff88 ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa5 ffffff8b 20 ffffffe0 ffffffa4 ffffff9c ffffffe0 ffffffa5 ffffff80 ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa4 ffffffa8
Can't encode transcription: वैशाख साल देखि साथै यो साँच्चैको जीवन

Encoding of string failed! Failure bytes: ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa4 ffffffbe
Can't encode transcription: रूपांतरित जैबुन्निसा केंद्रित छँदा

While each Unicode character (स ा ँ) is present in the Devanagari unicharset, the combined akshara (साँ, छँ) is not part of the training text/unicharset, but it is part of the eval text/unicharset.

The training unicharset is of the following format:

3784
NULL 0 NULL 0
Joined 7 0,69,188,255,486,1218,0,30,486,1188 Latin 1 0 1 Joined	# Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,69,186,255,892,2138,0,80,892,2058 Common 3625 10 3625 |Broken|0|1	# Broken
र्ध्रु 1 0,64,61,197,280,356,0,0,280,356 Devanagari 18 0 18 र्ध्रु	# र्ध्रु [930 94d 927 94d 930 941 ]x
र्बृ 1 3,64,61,197,181,236,0,0,181,236 Devanagari 18 0 18 र्बृ	# र्बृ [930 94d 92c 943 ]x
श्चु 1 0,64,61,197,251,303,0,12,251,291 Devanagari 240 0 240 श्चु	# श्चु [936 94d 91a 941 ]x
श्चौ 1 3,65,61,255,294,367,0,12,294,355 Devanagari 240 0 240 श्चौ	# श्चौ [936 94d 91a 94c ]x
श्च् 1 3,64,61,197,251,303,0,12,251,291 Devanagari 240 0 240 श्च्	# श्च् [936 94d 91a 94d ]x
य 1 63,64,192,192,114,142,0,0,111,133 Devanagari 8 0 8 य	# य [92f ]x
श्रीः 1 3,74,61,253,295,412,0,12,295,400 Devanagari 240 0 240 श्रीः	# श्रीः [936 94d 930 940 903 ]x
ष्ठु 1 0,75,61,197,204,243,0,0,204,243 Devanagari 241 0 241 ष्ठु	# ष्ठु [937 94d 920 941 ]x
ष्ठौ 1 3,75,61,255,247,307,0,0,247,307 Devanagari 241 0 241 ष्ठौ	# ष्ठौ [937 94d 920 94c ]x
स्रैः 1 3,76,61,255,243,449,0,0,243,449 Devanagari 280 0 280 स्रैः	# स्रैः [938 94d 930 948 903 ]x
...

Does this mean that the training text needs to be expanded to include all possible akshara combinations?

zc813 · 2018-02-02T04:01:43Z

@Shreeshrii Thanks for your help yesterday.
I encountered the same error (Encoding of string failed! Failure bytes: ffffffe0...) when training langdata/bod (Tibetan). It seems that most of the Unicode characters are mis-decoded. I tried replacing the top layer but still encountered the same error.
Since I'm already using the latest langdata, is there anything I can do to correct the encoding? Could you help me?
Thanks very much!

Shreeshrii · 2018-02-02T05:06:33Z

As per @theraysmith

There is an un-represented Indic grapheme/aksara in the text.
In any case it will result in that training image being ignored by the trainer.
If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.

@zc813

tesstrain.sh has a limit of max_pages 3; you should change that so that the complete training_text is used.

You can review the training_text to check that it is a correct representation of bod (Tibetan).

Also test OCR with the 'Tibetan' script traineddata from both the 'tessdata_best' and 'tessdata_fast' repos.

An authoritative answer can only be provided by @theraysmith.

zc813 · 2018-02-02T05:19:53Z

@Shreeshrii Thanks a lot for the reply! I'll try the solution.

By the way, I tried to decode the error messages and found that most of them start with

ffffffe0 ffffffbc ffffff8c ffffffe0 ffffffbc ffffff8d

i.e. ༌། (U+0F0C U+0F0D)
The ༌ (U+0F0C) and ། (U+0F0D) are already stored separately in my Tibetan.unicharset, so I am confused about why they cannot be encoded when they appear together.

Shreeshrii · 2018-02-02T05:25:02Z

This is the same problem that I mentioned in one of my earlier comments:

While each Unicode character (स ा ँ) is present in the Devanagari unicharset, the combined akshara (साँ, छँ) is not.

There is no answer from @theraysmith yet. He has also marked this issue as closed.

Shreeshrii · 2018-07-02T20:17:06Z

@zdenop Ray closed this issue, so I cannot reopen it.

Please reopen this issue, because the problem is still there. It appears to be related to utf-8/utf-16/utf-32 conversion.

Example:

Encoding of string failed! Failure bytes: cc 84 67 6e 65
Can't encode transcription: 'mamāgne' in language ''
utf8
6D 61 6D 61 CC 84 67 6E 65
utf16
006D 0061 006D 0061 0304 0067 006E 0065
hex
006D 0061 006D 0061 0304 0067 006E 0065

The error is related to 'CC 84' in UTF-8, which is '0304' (U+0304, a combining macron) in UTF-16 or hex.

The string was converted using the converter at https://r12a.github.io/app-conversion/
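The same byte listings can be reproduced locally without an online converter. This is a small Python sketch using only the standard library; the helper names `utf8_hex`, `utf16_units` and `code_points` are made up for illustration, and the sample string assumes the transcription contains a plain "a" followed by U+0304, as the byte listing above indicates.

```python
import unicodedata


def utf8_hex(text: str) -> str:
    """UTF-8 bytes of text as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


def utf16_units(text: str) -> str:
    """UTF-16 code units of text as space-separated 4-digit hex."""
    data = text.encode("utf-16-be")
    return " ".join(data[i:i + 2].hex().upper() for i in range(0, len(data), 2))


def code_points(text: str) -> list[tuple[str, str]]:
    """Each code point with its Unicode name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


if __name__ == "__main__":
    sample = "mama\u0304gne"  # 'a' followed by a combining macron
    print(utf8_hex(sample))     # 6D 61 6D 61 CC 84 67 6E 65
    print(utf16_units(sample))  # 006D 0061 006D 0061 0304 0067 006E 0065
    for cp, name in code_points(sample):
        print(cp, name)
```

The failure bytes in the log (`cc 84 67 6e 65`) start exactly at the combining macron, which suggests that this code point, rather than the byte conversion itself, is what the unicharset cannot represent.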

Shreeshrii · 2018-07-03T04:36:32Z

https://stackoverflow.com/questions/42012563/convert-unicode-code-points-to-utf-8-and-utf-32

Shreeshrii · 2018-07-03T04:41:33Z

tesseract/src/lstm/lstmtrainer.cpp

Line 785 in a80a8f1

tprintf("Encoding of string failed! Failure bytes:");

Shreeshrii · 2018-07-03T04:49:34Z

@ivanzz1001 Any ideas?

xhuvom · 2018-10-19T13:42:22Z

Can't encode transcription: 'ঢাকা মেটো-গ' in language ''
Encoding of string failed! Failure bytes: ffffffe0 ffffffa6 ffffffbe ffffffe0 ffffffa6 ffffff95 ffffffe0 ffffffa6 ffffffbe 20 ffffffe0 ffffffa6 ffffffae ffffffe0 ffffffa7 ffffff87 ffffffe0 ffffffa6 ffffff9f ffffffe0 ffffffa7 ffffff87 ffffffe0 ffffffa6 ffffff97
Can't encode transcription: '|ঢাকা মেটেগ' in language ''
^Cmake: *** Deleting file 'data/checkpoints/banglaLPRNew_checkpoint'
Makefile:129: recipe for target 'data/checkpoints/banglaLPRNew_checkpoint' failed

stweil · 2019-10-09T17:54:26Z

It looks like this was the first report of the encoding problem, so I re-open it until it is (hopefully soon) solved.

stweil · 2019-10-09T17:57:57Z

See also later errors with "Encoding of string failed".

Shreeshrii · 2019-10-10T11:25:50Z

@stweil After this initial error report, Ray changed the LSTM training process, so some of the comments will not be applicable to the current code. Regardless, the issue is still there.


stweil · 2019-10-11T16:53:11Z

I could fix the encoding errors for tesstrain by normalizing the ground truth texts, see tesseract-ocr/tesstrain#111.

Shreeshrii · 2019-10-12T05:32:25Z

@stweil If I understand the change correctly, this normalizes the ground-truth text within the box file, so errors will be avoided during LSTM training.

However, any comparisons against the original ground-truth files using diff, wdiff or evaluation tools may still show errors for the normalized characters.

Also, this does not address the case when training is done using training_text and fonts.

I suggest adding a new script, normalize.py, which can be used to normalize any training text before beginning the training process, and also adding normalization to the wiki as part of the process of creating the training text.

Also, it may be helpful to normalize all existing training_text files in the langdata_lstm and langdata repos.
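As an illustration of the kind of normalization pass being suggested, here is a minimal Python sketch. It is not taken from any tesstrain script; the function names are hypothetical, and it assumes NFC is the wanted normalization form, which is an assumption rather than something this thread states.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFC") -> str:
    """Return text in the given Unicode normalization form."""
    return unicodedata.normalize(form, text)


def unnormalized_lines(lines, form: str = "NFC"):
    """Return 1-based numbers of lines that normalization would change."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if unicodedata.normalize(form, line) != line
    ]


if __name__ == "__main__":
    text = ["plain ascii", "mama\u0304gne", "\u095c"]
    print(unnormalized_lines(text))
    for line in text:
        fixed = normalize_text(line)
        print([f"U+{ord(ch):04X}" for ch in fixed])
```

Note that NFC does not always shorten a string: "a" plus U+0304 composes to U+0101, while the precomposed Devanagari letter U+095C is replaced by U+0921 U+093C. Either way, the training text and the unicharset need to agree on one form.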

stweil · 2019-10-12T08:05:21Z

See tesseract-ocr/tesstrain#111. I just added a normalize.py.

stweil · 2019-10-12T08:22:27Z

See tesseract-ocr/langdata#148 and tesseract-ocr/langdata_lstm#26 which normalize the training texts. I noticed that more files (mostly *.unicharset) also contain unnormalized unicode, but I am not sure what to do with those.

Shreeshrii · 2019-10-12T08:22:54Z

Thanks, @stweil.

Shreeshrii · 2019-10-13T05:04:37Z

Possible causes as per Ray:

- There is an un-represented character in the text, say a British Pound sign that is not in your unicharset.
- A stray unprintable character (like tab or a control character) in the text.
- There is an un-represented Indic grapheme/aksara in the text.

Additional cause:

- Training text not being normalized.

SOLUTIONS:

- There is an un-represented Indic grapheme/aksara in the text. - FIXED by Ray with new norm_mode, combine_lang_model and other related changes
- Training text not being normalized. - FIXED by @stweil via tesseract-ocr/tesstrain@6c88fb3, tesseract-ocr/tesstrain@0dd3bcd, tesseract-ocr/tesstrain@1b15bf3 and tesseract-ocr/langdata_lstm@5bc4732, tesseract-ocr/langdata@3bf26eb
- A stray unprintable character (like tab or a control character) in the text. - SUGGESTION - Python script - Can't encode transcription #1012 (comment)
- There is an un-represented character in the text, say a British Pound sign that is not in your unicharset. - EXAMPLE issue - Can't encode transcription #2695 (comment)

stweil · 2019-10-13T07:53:10Z

normalize.py can now also be used to show which files contain unnormalized unicode: ./normalize.py -n .... I used that to examine all unpacked traineddata (dawg converted to wordlist) and found that some of it is not normalized.

Shreeshrii · 2019-10-13T08:36:22Z

(dawg converted to wordlist) and found that some of it is not normalized.

Which languages? tessdata_best or tessdata_fast?

stweil · 2019-10-13T10:57:16Z

Here is the list of all unnormalized components (extracted from traineddata):

tessdata/osd/osd.pffmtable
tessdata/osd/osd.unicharset
tessdata/osd/osd.normproto
tessdata/script/Arabic/Arabic.lstm-word-dawg.wordlist
tessdata/heb/heb.unicharambigs
tessdata/uig/uig.lstm-word-dawg.wordlist
tessdata_best/osd/osd.pffmtable
tessdata_best/osd/osd.unicharset
tessdata_best/osd/osd.normproto
tessdata_best/script/Arabic/Arabic.lstm-word-dawg.wordlist
tessdata_best/uig/uig.lstm-word-dawg.wordlist
tessdata_fast/osd/osd.pffmtable
tessdata_fast/osd/osd.unicharset
tessdata_fast/osd/osd.normproto
tessdata_fast/script/Arabic/Arabic.lstm-word-dawg.wordlist
tessdata_fast/uig/uig.lstm-word-dawg.wordlist

johnlockejrr · 2024-09-05T20:32:48Z

This happens when the text contains control or invisible characters such as CHARACTER TABULATION, CARRIAGE RETURN, RIGHT-TO-LEFT MARK [RLM], LEFT-TO-RIGHT MARK [LRM] or NO-BREAK SPACE; they are mostly not visible to the naked eye.

So with sed (or a Python/Perl script, whatever you prefer) you can remove or replace them. In UTF-8 text, LRM (U+200E) and RLM (U+200F) are the byte sequences e2 80 8e and e2 80 8f:

s/\x09//g
s/\x0d//g
s/\xc2\xa0/ /g
s/\xe2\x80\x8e//g
s/\xe2\x80\x8f//g
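For those who prefer Python over sed, the same cleanup can be sketched on decoded text, which avoids having to spell out UTF-8 byte sequences. The name `clean_line` is hypothetical, and the set of characters handled is just the five listed above (tab, carriage return, LRM and RLM removed; no-break space replaced by a normal space); other invisible characters would need to be added as they are found.

```python
# Translation table: None deletes the character, a string replaces it.
CLEANUP_TABLE = str.maketrans({
    "\t": None,       # CHARACTER TABULATION
    "\r": None,       # CARRIAGE RETURN
    "\u200e": None,   # LEFT-TO-RIGHT MARK
    "\u200f": None,   # RIGHT-TO-LEFT MARK
    "\u00a0": " ",    # NO-BREAK SPACE -> ordinary space
})


def clean_line(text: str) -> str:
    """Remove or replace the invisible characters listed above."""
    return text.translate(CLEANUP_TABLE)


if __name__ == "__main__":
    dirty = "M\u00f8ller.\t1200\u00a0Emilie\u200f\r"
    print(repr(clean_line(dirty)))
```

As with the sed expressions, tabs are deleted rather than turned into spaces; if the tab separated two words, replacing it with a space may be the better choice.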

theraysmith closed this as completed Jan 11, 2017

anonynamja mentioned this issue Oct 5, 2018

LSTM Training: Can't encode transcription ERROR (for jpn language) #1227

Closed

stweil reopened this Oct 9, 2019

stweil added the bug label Oct 9, 2019

amitdo added the training label Mar 18, 2021

amitdo added the encoding failed label Jun 23, 2022
