LSTM: Training - Error msg - Encoding of string failed! #549

Open
Shreeshrii opened this issue Dec 9, 2016 · 38 comments

Comments

@Shreeshrii
Collaborator

$   training/lstmtraining --model_output ~/tesstutorial/sanskrit2003_from_full/sanskrit2003 \
>   --continue_from ~/tesstutorial/sanskrit2003_from_full/san.lstm \
>   --train_listfile ~/tesstutorial/santrain/san.training_files.txt \
>   --target_error_rate 0.01
Loaded file /home/shree/tesstutorial/sanskrit2003_from_full/sanskrit2003_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/sanskrit2003_from_full/sanskrit2003_checkpoint
Loaded 1746/1746 pages (0-1746) of document /home/shree/tesstutorial/santrain/san.Chandas.exp0.lstmf
Loaded 345/1760 pages (1415-1760) of document /home/shree/tesstutorial/santrain/san.Uttara.exp0.lstmf
Loaded 1814/1814 pages (0-1814) of document /home/shree/tesstutorial/santrain/san.Gargi.exp0.lstmf
Found AVX
Found SSE
At iteration 1808/17200/17229, Mean rms=0.336%, delta=0.129%, char train=0.41%, word train=1.751%, skip ratio=0.2%,  New worst char error = 0.41 wrote checkpoint.

Encoding of string failed! Failure bytes: ffffffc2 ffffffa3 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5
Can't encode transcription: व्यतर्कि १४. भवति ३७॥ £ सर्व्व
At iteration 1818/17300/17330, Mean rms=0.334%, delta=0.13%, char train=0.404%, word train=1.632%, skip ratio=0.3%,  wrote checkpoint.


@Shreeshrii
Collaborator Author

Still getting the errors with the following version -


 tesseract -v
tesseract 4.00.00alpha-219-gc124f87
 leptonica-1.74
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8


Can't encode transcription: सगुनल उठैलका देउता नेउता लवरना लोहमान कुदार
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb9 ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffff85 ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff80 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffac ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbe
Can't encode transcription: बिसहरी सड़िया हड़िया लादना अधसेरी सुबुकना
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffa8 20 ffffffe0 ffffffa4 ffffffac ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffbf 20 ffffffe0 ffffffa4 ffffff97 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa4 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb6 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa5 ffffff87 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffff9c ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffffa4 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb0 20 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffff97 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffff81
Can't encode transcription: चूड़ियन बुद्धि गुप्ता शासनमे सुद्धा जँतसार निगुनियाँ
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffff87 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa5 ffffff82 ffffffe0 ffffffa4 ffffff81 20 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffac ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9b ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffff81 20 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9f ffffffe0 ffffffa5 ffffff80 20 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffa8
Can't encode transcription: दौड़इलूँ पोथा बोथा मोथा स्वेच्छासँ पार्टी लड़कियन

@Shreeshrii
Collaborator Author

Shreeshrii commented Dec 31, 2016

Also seen when fine-tuning Arabic:


lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned \
  --continue_from ~/tesstutorial/aratuned_from_ara/ara.lstm \
  --train_listfile ~/tesstutorial/ara/ara.training_files.txt \
  --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt \
  --target_error_rate 0.0001
Loaded file /home/shree/tesstutorial/aratuned_from_ara/aratuned_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/aratuned_from_ara/aratuned_checkpoint
Loaded 229/229 pages (1-229) of document /home/shree/tesstutorial/ara/ara.Amiri.exp0.lstmf
Loaded 232/232 pages (1-232) of document /home/shree/tesstutorial/ara/ara.Arial.exp0.lstmf
Loaded 4/4 pages (1-4) of document /home/shree/tesstutorial/aratest/ara.Times_New_Roman.exp0.lstmf
Encoding of string failed! Failure bytes: ffffffd9 ffffff8e ffffffd9 ffffff8a ffffffd9 ffffff82 ffffffd9 ffffff90 ffffffd8 ffffffaf ffffffd9 ffffff90 ffffffd8 ffffffa7 ffffffd8 ffffffb5 ffffffd9 ffffff8e 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd8 ffffffaa ffffffd9 ffffff8f ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd9 ffffff83 ffffffd9 ffffff8f 20 ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd8 ffffffa5 ffffffd9 ffffff90 20 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd9 ffffff84 ffffffd9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff87 ffffffd9 ffffff90 20 ffffffd9 ffffff86 ffffffd9 ffffff90 ffffffd9 ffffff88 ffffffd8 ffffffaf ffffffd9 ffffff8f 20 ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd9 ffffff85 ffffffd9 ffffff90 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff83 ffffffd9 ffffff8f ffffffd8 ffffffa1 ffffffd9 ffffff8e ffffffd8 ffffffa7 ffffffd8 ffffffaf ffffffd9 ffffff8e ffffffd9 ffffff87 ffffffd9 ffffff8e ffffffd8 ffffffb4 ffffffd9 ffffff8f
Can't encode transcription: نَيقِدِاصَ مْتُنْكُ نْإِ اللَّهِ نِودُ نْمِ مْكُءَادَهَشُ
Loaded 231/231 pages (1-231) of document /home/shree/tesstutorial/ara/ara.Arial_Unicode_MS.exp0.lstmf
Encoding of string failed! Failure bytes: ffffffd9 ffffff8e ffffffd9 ffffff88 ffffffd8 ffffffb1 ffffffd9 ffffff8f ffffffd8 ffffffb5 ffffffd9 ffffff90 ffffffd8 ffffffa8 ffffffd9 ffffff92 ffffffd9 ffffff8a ffffffd9 ffffff8f 20 ffffffd9 ffffff84 ffffffd9 ffffff8e ffffffd8 ffffffa7 20 ffffffd8 ffffffaa ffffffd9 ffffff8d ffffffd8 ffffffa7 ffffffd9 ffffff85 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd9 ffffff8f ffffffd8 ffffffb8 ffffffd9 ffffff8f 20 ffffffd9 ffffff8a ffffffd9 ffffff81 ffffffd9 ffffff90 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff8f ffffffd9 ffffff83 ffffffd9 ffffff8e ffffffd8 ffffffb1 ffffffd9 ffffff8e ffffffd8 ffffffaa ffffffd9 ffffff8e ffffffd9 ffffff88 ffffffd9 ffffff8e 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffffd9 ffffff90 ffffffd9 ffffff88 ffffffd9 ffffff86 ffffffd9 ffffff8f ffffffd8 ffffffa8 ffffffd9 ffffff90
Can't encode transcription: نَورُصِبْيُ لَا تٍامَلُظُ يفِ مْهُكَرَتَوَ مْهِرِونُبِ
Encoding of string failed! Failure bytes: ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff90 

@theraysmith
Contributor

See new section in trainingtesseract-4.00

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 12, 2017 via email

@Brian51

Brian51 commented Jan 12, 2017

It is working correctly in Spain. Thank you all for the incredible amount of work that you have done.

@amitdo
Collaborator

amitdo commented Jan 12, 2017

I don't see the changes either.

The wiki can be cloned as a git repo. Ray probably did some edits locally, but didn't 'push' them yet.

@theraysmith
Contributor

theraysmith commented Jan 12, 2017 via email

@Shreeshrii
Collaborator Author


Encoding of string failed! Failure bytes: 9 31 32 30 30 45 6d 69 6c 69 65 2c 68 61 6e 73 4b 6f 6e 65 2e
Can't encode transcription: Møller.     1200Emilie,hansKone.

when trying to train frk

@theraysmith
Contributor

theraysmith commented Jan 23, 2017 via email

@harinath141

@Shreeshrii
Is this issue resolved? I'm getting the same error when training the Telugu language.

@Shreeshrii
Collaborator Author

Please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training

"Encoding of string failed!" results when the text string for a training image cannot be encoded using the given unicharset.

Possible causes are:

- There is an un-represented character in the text, say a British Pound sign that is not in your unicharset.

- A stray unprintable character (like tab or a control character) in the text.

- There is an un-represented Indic grapheme/aksara in the text.

In any case it will result in that training image being ignored by the trainer. 

If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.
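
To find out which characters trigger the message for a given line of training text, a rough check against the unicharset can help. The sketch below is only an approximation: it assumes the unicharset layout shown later in this thread (entry count on the first line, then one entry per line starting with the unichar), treats spaces as always encodable, and uses a greedy longest-match scan, which is not necessarily how Tesseract itself encodes text. The names `load_unichars` and `find_unencodable` are made up for illustration.

```python
def load_unichars(lines):
    """Collect the unichar strings from the lines of a unicharset file.

    Assumption: line 1 is the entry count and every later line starts
    with the unichar, followed by a space and its properties.
    """
    entries = set()
    for line in list(lines)[1:]:
        line = line.rstrip("\n")
        if not line:
            continue
        unichar = line.split(" ", 1)[0]
        if unichar != "NULL":  # placeholder entry, not a literal string
            entries.add(unichar)
    return entries


def find_unencodable(text, unichars):
    """Return (index, character) pairs that no unicharset entry covers.

    Greedy longest match: at each position, try the longest entry first.
    """
    longest = max((len(u) for u in unichars), default=1)
    failures = []
    i = 0
    while i < len(text):
        if text[i] == " ":  # assumption: spaces are always encodable
            i += 1
            continue
        for size in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + size] in unichars:
                i += size
                break
        else:
            failures.append((i, text[i]))
            i += 1
    return failures


# Tiny hypothetical unicharset with one multi-character entry.
demo = load_unichars(["3\n", "NULL 0 NULL 0\n", "a 3 ...\n", "bc 3 ...\n"])
print(find_unencodable("a bc £", demo))
```

With a real file, `load_unichars(open(path, encoding="utf-8"))` would be the equivalent call. A pound sign, a tab, or an akshara with no matching entry would each show up in the returned list.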

@Shreeshrii
Collaborator Author

Shreeshrii commented Feb 2, 2017

@harinath141 If you are getting a lot of these errors during fine-tuning, try replace-top-layer training instead. You can reuse the box/tiff pairs generated for fine-tuning. The commands will be similar to the following:

mkdir -p ~/tesstutorial/tellayer_from_tel 

combine_tessdata -e ../tessdata/tel.traineddata \
  ~/tesstutorial/tellayer_from_tel/tel.lstm
  
lstmtraining -U ~/tesstutorial/tel/tel.unicharset \
  --script_dir ../langdata  --debug_interval 0 \
  --continue_from ~/tesstutorial/tellayer_from_tel/tel.lstm \
  --append_index 5 --net_spec '[Lfx256 O1c105]' \
  --model_output ~/tesstutorial/tellayer_from_tel/tellayer \
  --train_listfile ~/tesstutorial/tel/tel.training_files.txt \
  --target_error_rate 0.01

@Shreeshrii
Collaborator Author

~/tesstutorial/tel/ should have your .lstmf files.

@harinath141

Thank you @Shreeshrii I'll try to replace top layer

@Shreeshrii
Collaborator Author

Shreeshrii commented Feb 3, 2017

@harinath141

When you use --debug_interval 0 you will see messages every 100 iterations like the following:

At iteration 45909/58500/58569, Mean rms=0.639%, delta=0.621%, char train=1.861%, word train=13.302%, skip ratio=0%,  wrote checkpoint.

At iteration 45960/58600/58669, Mean rms=0.64%, delta=0.616%, char train=1.844%, word train=12.933%, skip ratio=0%,  wrote checkpoint.

2 Percent improvement time=14052, best error was 3.697 @ 31958
At iteration 46010/58700/58769, Mean rms=0.634%, delta=0.561%, char train=1.686%, word train=12.343%, skip ratio=0%,  New best char error = 1.686 wrote best model:/home/shree/tesstutorial/khmlayer1_from_khm/khm1.686_46010.lstm wrote checkpoint.

When you use --debug_interval -1 , messages such as the following will be shown for every iteration:


Iteration 59400: ALIGNED TRUTH : មានរូបឆ្មាំ អេស៊ីលីដា
Iteration 59400: BEST OCR TEXT : មានរូបឆ្មាំ អេស៊ីលីដា
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Noto_Serif_Khmer_Bold.exp0.lstmf page 53 (Perfect):
Mean rms=0.646%, delta=0.553%, train=1.878%(13.168%), skip ratio=0.1%
Iteration 59401: ALIGNED TRUTH : ឆ្កៀលយកភ្នែក ជួនឆ្លងវគ្គ ចាប់ពីពេលនោះមក របស់គាត់ កុំធេ្វសគំនិត។ អូនហ្អើយ =
Iteration 59401: BEST OCR TEXT : ឆ្លៀលយកភ្នែក ជួនឆ្លងវគត ចាប់ពីពេលនោះមក របស់គាត់ កុំធេ្វសគំនិត។ អូនហ្អើយ =
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Noto_Serif_Khmer.exp0.lstmf page 1 :
Mean rms=0.647%, delta=0.555%, train=1.881%(13.157%), skip ratio=0.1%
Iteration 59402: ALIGNED TRUTH : សឹងមានះរឹងត្អឹងមហិមា គុណ នៅប៉ែកឦសាននៃភ្នំ ទុលល្យូ ខេត្តស្ទឺងត្រែង,
Iteration 59402: BEST OCR TEXT : សឹងមានះរឹងត្អឹងមហិមា គុណ នៅប៉ែកឦសាននៃភ្នំ ទុលល្យូ ខេត្តស្ទឺងត្រែង,
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Leelawadee_UI_Bold.exp0.lstmf page 56 :
Mean rms=0.647%, delta=0.556%, train=1.881%(13.157%), skip ratio=0.1%
Iteration 59403: ALIGNED TRUTH : រឺគៃបន្លំបាន។ (រឿងអាខ្វាក់អាខ្វិន) អន្នំលោកង្សិ = ឧទាហរណ៍់៖តំបន់ខ្លះ ផ្ទះសម្បែង
Iteration 59403: BEST OCR TEXT : រឺគៃបន្លំបាន។ (រឿងអាខ្វាក់អាខ្វិន) អន្នំលោកង្សិ = ឧទាហរណ៍៖តំបន់ខ្លះ ផ្ទះសម្បែង
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Leelawadee_UI.exp0.lstmf page 51 :

Intermediate checkpoint and .lstm files will be written to the output directory, e.g. ~/tesstutorial/tellayer_from_tel.
You can also see visual debugging output with scrollview.
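
To follow the error rates over a long run, the progress lines can be pulled out of a saved log. This is a minimal sketch that assumes the line format stays exactly as in the samples above. The function name `parse_progress` and the dictionary keys are made up for illustration, and the three iteration counters are kept as an unlabeled tuple because their meaning is not explained in this thread.

```python
import re

# Matches the "At iteration ..." progress lines quoted above.
LOG_LINE = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+), "
    r"Mean rms=([\d.]+)%, delta=([\d.]+)%, "
    r"char train=([\d.]+)%, word train=([\d.]+)%, "
    r"skip ratio=([\d.]+)%"
)


def parse_progress(line):
    """Return the numbers from one progress line, or None if it is not one."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    return {
        "iterations": tuple(int(g) for g in match.group(1, 2, 3)),
        "mean_rms": float(match.group(4)),
        "delta": float(match.group(5)),
        "char_train": float(match.group(6)),
        "word_train": float(match.group(7)),
        "skip_ratio": float(match.group(8)),
    }


sample = ("At iteration 45909/58500/58569, Mean rms=0.639%, delta=0.621%, "
          "char train=1.861%, word train=13.302%, skip ratio=0%,  wrote checkpoint.")
print(parse_progress(sample))
```

A rising skip ratio in these lines is one hint that many training lines are being dropped, which would be consistent with frequent encoding failures.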

@Shreeshrii
Collaborator Author

@theraysmith

I am still getting this error in a new replace-top-layer training for Devanagari script, where the eval_listfile is based on a different training text, e.g.:

Encoding of string failed! Failure bytes: ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff88 ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa5 ffffff8b 20 ffffffe0 ffffffa4 ffffff9c ffffffe0 ffffffa5 ffffff80 ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa4 ffffffa8
Can't encode transcription: वैशाख साल देखि साथै यो साँच्चैको जीवन

Encoding of string failed! Failure bytes: ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa4 ffffffbe
Can't encode transcription: रूपांतरित जैबुन्निसा केंद्रित छँदा

While each unicode character (स ा ँ ) is there in the Devanagari unicharset, the combined akshara (साँ, छँ) is not there as part of training text/unicharset, but is there as part of eval text/unicharset.

The training unicharset is of the following format:

3784
NULL 0 NULL 0
Joined 7 0,69,188,255,486,1218,0,30,486,1188 Latin 1 0 1 Joined	# Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,69,186,255,892,2138,0,80,892,2058 Common 3625 10 3625 |Broken|0|1	# Broken
र्ध्रु 1 0,64,61,197,280,356,0,0,280,356 Devanagari 18 0 18 र्ध्रु	# र्ध्रु [930 94d 927 94d 930 941 ]x
र्बृ 1 3,64,61,197,181,236,0,0,181,236 Devanagari 18 0 18 र्बृ	# र्बृ [930 94d 92c 943 ]x
श्चु 1 0,64,61,197,251,303,0,12,251,291 Devanagari 240 0 240 श्चु	# श्चु [936 94d 91a 941 ]x
श्चौ 1 3,65,61,255,294,367,0,12,294,355 Devanagari 240 0 240 श्चौ	# श्चौ [936 94d 91a 94c ]x
श्च् 1 3,64,61,197,251,303,0,12,251,291 Devanagari 240 0 240 श्च्	# श्च् [936 94d 91a 94d ]x
य 1 63,64,192,192,114,142,0,0,111,133 Devanagari 8 0 8 य	# य [92f ]x
श्रीः 1 3,74,61,253,295,412,0,12,295,400 Devanagari 240 0 240 श्रीः	# श्रीः [936 94d 930 940 903 ]x
ष्ठु 1 0,75,61,197,204,243,0,0,204,243 Devanagari 241 0 241 ष्ठु	# ष्ठु [937 94d 920 941 ]x
ष्ठौ 1 3,75,61,255,247,307,0,0,247,307 Devanagari 241 0 241 ष्ठौ	# ष्ठौ [937 94d 920 94c ]x
स्रैः 1 3,76,61,255,243,449,0,0,243,449 Devanagari 280 0 280 स्रैः	# स्रैः [938 94d 930 948 903 ]x
...

Does this mean that the training text needs to be expanded to include all possible akshara combinations?

@zc813

zc813 commented Feb 2, 2018

@Shreeshrii Thanks for your help yesterday.
I encountered the same error (Encoding of string failed! Failure bytes: ffffffe0...) when training langdata/bod(Tibetan). It seemed most of the unicode characters are mis-decoded. I tried replacing top layers but still encountered the same error.
Since I'm already using the latest langdata, is there anything I can do to correct the encoding? Could you help me?
Thanks very much!

@Shreeshrii
Collaborator Author

As per @theraysmith

  • There is an un-represented Indic grapheme/aksara in the text.
    In any case it will result in that training image being ignored by the trainer.
    If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.

@zc813

tesstrain.sh has a limit of max_pages 3; you should change that so that the complete training_text is used.

You can review the training_text to check that it is a correct representation of bod (Tibetan).

Also test OCR with the 'Tibetan' script traineddata from both the 'tessdata_best' and 'tessdata_fast' repos.

An authoritative answer can only be provided by @theraysmith.

@zc813

zc813 commented Feb 2, 2018

@Shreeshrii Thanks a lot for the reply! I'll try the solution.

btw I tried to decode the error message and found most of them started with

ffffffe0 ffffffbc ffffff8c ffffffe0 ffffffbc ffffff8d

i.e. ༌། (0xf0c 0xf0d)
The (0xf0c) and (0xf0d) are already stored separately in my Tibetan.unicharset, so I am confused about why they cannot be encoded when they appear together.
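
The manual decoding described above can be scripted. The "Failure bytes" are UTF-8 bytes printed as sign-extended hex, so only the low 8 bits of each token matter. The sketch below is an illustration only; `decode_failure_bytes` and `describe` are hypothetical helper names, and truncated dumps are decoded with replacement characters rather than raising an error.

```python
import unicodedata


def decode_failure_bytes(dump):
    """Turn a 'Failure bytes' hex dump back into text.

    Each token is masked to its low 8 bits, so 'ffffffe0' becomes 0xE0.
    """
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8", errors="replace")


def describe(text):
    """List each code point with its Unicode name (if it has one)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


text = decode_failure_bytes("ffffffe0 ffffffbc ffffff8c ffffffe0 ffffffbc ffffff8d")
for entry in describe(text):
    print(entry)
```

In the very first log of this thread the dump starts with `c2 a3`, the pound sign, which suggests the dump begins at the point where encoding failed. Unnamed code points in the output usually point to control characters such as a tab.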

@Shreeshrii
Collaborator Author

Same problem as I had mentioned in one of my earlier comments -

While each unicode character (स ा ँ ) is there in the Devanagari unicharset, the combined akshara (साँ, छँ) is not there.

No answer from @theraysmith yet. He has also marked this as a closed issue.

@Shreeshrii
Collaborator Author

Shreeshrii commented Jul 2, 2018

@zdenop Ray had closed this, so I cannot reopen it.

Please reopen this issue, because the problem is still there. It is related to utf-8/utf-16/utf-32 conversion.

Example:

Encoding of string failed! Failure bytes: cc 84 67 6e 65
Can't encode transcription: 'mamāgne' in language ''
utf8
6D 61 6D 61 CC 84 67 6E 65
utf16
006D 0061 006D 0061 0304 0067 006E 0065
hex
006D 0061 006D 0061 0304 0067 006E 0065

The error is related to 'CC 84' in UTF-8, which is '0304' (U+0304 COMBINING MACRON) in UTF-16 or hex.

The string was converted using the converter at https://r12a.github.io/app-conversion/

@Shreeshrii
Collaborator Author

@Shreeshrii
Collaborator Author

tprintf("Encoding of string failed! Failure bytes:");

@Shreeshrii
Collaborator Author

@ivanzz1001 Any ideas?

@xhuvom

xhuvom commented Oct 19, 2018

Can't encode transcription: 'ঢাকা মেটো-গ' in language ''
Encoding of string failed! Failure bytes: ffffffe0 ffffffa6 ffffffbe ffffffe0 ffffffa6 ffffff95 ffffffe0 ffffffa6 ffffffbe 20 ffffffe0 ffffffa6 ffffffae ffffffe0 ffffffa7 ffffff87 ffffffe0 ffffffa6 ffffff9f ffffffe0 ffffffa7 ffffff87 ffffffe0 ffffffa6 ffffff97
Can't encode transcription: '|ঢাকা মেটেগ' in language ''
^Cmake: *** Deleting file 'data/checkpoints/banglaLPRNew_checkpoint'
Makefile:129: recipe for target 'data/checkpoints/banglaLPRNew_checkpoint' failed

@stweil stweil reopened this Oct 9, 2019
@stweil
Contributor

stweil commented Oct 9, 2019

It looks like this was the first report of the encoding problem, so I re-open it until it is (hopefully soon) solved.

@stweil stweil added the bug label Oct 9, 2019
@stweil
Contributor

stweil commented Oct 9, 2019

@Shreeshrii
Collaborator Author

Shreeshrii commented Oct 10, 2019 via email

@stweil
Contributor

stweil commented Oct 11, 2019

I could fix the encoding errors for tesstrain by normalizing the ground truth texts, see tesseract-ocr/tesstrain#111.
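
As an illustration of what normalization does to the 'mamāgne' example reported earlier in this thread, the sketch below composes the base letter and the combining macron into a single code point. It assumes NFC is the form wanted; the thread does not say which form the tesstrain change uses, and whether the composed character is in a given unicharset is a separate question. `hex_bytes` is a hypothetical helper.

```python
import unicodedata


def hex_bytes(text):
    """UTF-8 bytes of text as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


# 'a' followed by U+0304 COMBINING MACRON, as in the failing transcription.
decomposed = "mama\u0304gne"

# NFC merges the pair into U+0101 LATIN SMALL LETTER A WITH MACRON.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), hex_bytes(decomposed))
print(len(composed), hex_bytes(composed))
```

Both strings render identically, which is why the difference is easy to miss when looking at the ground truth in an editor.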

@Shreeshrii
Collaborator Author

@stweil If I understand the change correctly, this normalizes the ground-truth text within the box file so that errors are avoided during LSTM training.

So any comparison against the original ground-truth files using diff, wdiff, or evaluation tools may still show errors for the normalized characters.

Also, this does not address the case when training is done using training_text and fonts.

I suggest adding a new script, normalize.py, which can be used to normalize any training text before beginning the training process, and also adding normalization to the wiki instructions for creating the training text.

Also, it may be helpful to normalize all existing training_text files in the langdata_lstm and langdata repos.

@stweil
Contributor

stweil commented Oct 12, 2019

See tesseract-ocr/tesstrain#111. I just added a normalize.py.

@stweil
Contributor

stweil commented Oct 12, 2019

See tesseract-ocr/langdata#148 and tesseract-ocr/langdata_lstm#26 which normalize the training texts. I noticed that more files (mostly *.unicharset) also contain unnormalized unicode, but I am not sure what to do with those.

@Shreeshrii
Collaborator Author

Thanks, @stweil.

@Shreeshrii
Collaborator Author

Possible causes as per Ray:

  • There is an un-represented character in the text, say a British Pound sign that is not in your unicharset.

  • A stray unprintable character (like tab or a control character) in the text.

  • There is an un-represented Indic grapheme/aksara in the text.

Additional cause:

  • Training text not being normalized

SOLUTIONS:

@stweil
Contributor

stweil commented Oct 13, 2019

normalize.py can now also be used to show which files contain unnormalized unicode: ./normalize.py -n .... I used that to examine all unpacked traineddata (dawg converted to wordlist) and found that some of it is not normalized.
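
For readers without the tesstrain script at hand, a report-only check of this kind can be approximated in a few lines. This is not the code of normalize.py, just a sketch with made-up names (`unnormalized_lines`, `scan`) that assumes UTF-8 text files and NFC as the target form.

```python
import sys
import unicodedata


def unnormalized_lines(lines, form="NFC"):
    """Return 1-based numbers of the lines that change under normalization."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if unicodedata.normalize(form, line) != line
    ]


def scan(paths, form="NFC"):
    """Map each file that contains unnormalized text to its line numbers."""
    report = {}
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            hits = unnormalized_lines(handle, form)
        if hits:
            report[path] = hits
    return report


if __name__ == "__main__":
    # Files are only read, never modified.
    for path, hits in scan(sys.argv[1:]).items():
        print(f"{path}: lines {hits}")
```

Saved under a name of your choice, it can be run with a list of ground-truth or training_text files as arguments.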

@Shreeshrii
Collaborator Author

(dawg converted to wordlist) and found that some of it is not normalized.

Which languages? tessdata_best or tessdata_fast?

@stweil
Contributor

stweil commented Oct 13, 2019

Here is the list of all unnormalized components (extracted from traineddata):

tessdata/osd/osd.pffmtable
tessdata/osd/osd.unicharset
tessdata/osd/osd.normproto
tessdata/script/Arabic/Arabic.lstm-word-dawg.wordlist
tessdata/heb/heb.unicharambigs
tessdata/uig/uig.lstm-word-dawg.wordlist
tessdata_best/osd/osd.pffmtable
tessdata_best/osd/osd.unicharset
tessdata_best/osd/osd.normproto
tessdata_best/script/Arabic/Arabic.lstm-word-dawg.wordlist
tessdata_best/uig/uig.lstm-word-dawg.wordlist
tessdata_fast/osd/osd.pffmtable
tessdata_fast/osd/osd.unicharset
tessdata_fast/osd/osd.normproto
tessdata_fast/script/Arabic/Arabic.lstm-word-dawg.wordlist
tessdata_fast/uig/uig.lstm-word-dawg.wordlist

@johnlockejrr

johnlockejrr commented Sep 5, 2024

This happens when invisible control or formatting characters are present, such as:
CHARACTER TABULATION, CARRIAGE RETURN, RIGHT-TO-LEFT MARK [RLM], LEFT-TO-RIGHT MARK [LRM], NO-BREAK SPACE. They are mostly not visible to the naked eye.

So with sed (or a Python/Perl script, whatever) you can remove or replace them. LRM and RLM are U+200E and U+200F, i.e. the bytes E2 80 8E and E2 80 8F in UTF-8:

s/\x09//g
s/\x0d//g
s/\xc2\xa0/ /g
s/\xe2\x80\x8e//g
s/\xe2\x80\x8f//g
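
The same cleanup can be written in Python, where working with code points avoids having to spell out UTF-8 byte sequences. This is a sketch limited to the five characters named above; the names `INVISIBLE` and `clean` are made up, and other format characters such as ZERO WIDTH JOINER are deliberately left alone because they can be meaningful in Indic and Arabic-script text.

```python
# Translation table: code point -> replacement (None deletes the character).
INVISIBLE = {
    0x0009: None,  # CHARACTER TABULATION
    0x000D: None,  # CARRIAGE RETURN
    0x200E: None,  # LEFT-TO-RIGHT MARK
    0x200F: None,  # RIGHT-TO-LEFT MARK
    0x00A0: " ",   # NO-BREAK SPACE becomes a normal space
}


def clean(text):
    """Remove or replace the invisible characters listed in INVISIBLE."""
    return text.translate(INVISIBLE)


print(repr(clean("M\u00f8ller.\t1200\u00a0Emilie\r\n")))
```

Deleting a tab outright joins the words on either side of it, exactly as the sed version does. Mapping `0x0009` to `" "` instead may be preferable for some ground truth.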
