LSTM: Training - Error msg - Encoding of string failed! #549
Still getting the errors with the following version -
|
@also seen in finetune of Arabic
|
See new section in trainingtesseract-4.00 |
The wiki does not seem to have this section:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
TrainingTesseract 4.00
Stefan Weil edited this page 28 days ago · 9 revisions
There is a GitHub outage in India just now; I am not sure whether this is related to that or whether the wiki update is still on the to-do list.
ShreeDevi
|
It is working correctly in Spain. Thank you all for the incredible amount of work that you have done. |
I don't see the changes either. The wiki can be cloned as a git repo. Ray probably did some edits locally, but didn't 'push' them yet. |
Changes are pushed now. I got called away yesterday before I was able to do
it.
--
Ray.
|
when trying to train frk |
The tab character (9) at the beginning of the list of failure bytes is a
dead giveaway.
On Sat, Jan 21, 2017 at 6:15 AM, Shreeshrii wrote:
Encoding of string failed! Failure bytes: 9 31 32 30 30 45 6d 69 6c 69 65 2c 68 61 6e 73 4b 6f 6e 65 2e
Can't encode transcription: Møller. 1200Emilie,hansKone.
when trying to train frk
--
Ray.
|
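The failure bytes in the message are a space-separated list of hexadecimal byte values without zero padding, so they can be turned back into text to see which character stopped the encoder. The sketch below is not part of Tesseract; the helper name `decode_failure_bytes` is made up for illustration, and it assumes the bytes are UTF-8.

```python
def decode_failure_bytes(hex_list: str) -> str:
    """Decode a 'Failure bytes:' list (unpadded hex values) as UTF-8 text."""
    # Tokens such as "9" are not zero padded, so parse each one separately
    # instead of relying on bytes.fromhex().
    data = bytes(int(token, 16) for token in hex_list.split())
    return data.decode("utf-8", errors="replace")


failure = "9 31 32 30 30 45 6d 69 6c 69 65 2c 68 61 6e 73 4b 6f 6e 65 2e"
decoded = decode_failure_bytes(failure)

# repr() makes invisible characters such as the leading tab visible.
print(repr(decoded))
```

For the frk example, the decoded string starts with a tab followed by `1200Emilie,hansKone.`, which matches the remark that byte 9 at the start of the list is the giveaway.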
@Shreeshrii |
Please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training
|
@harinath141 If you are getting a lot of these errors during finetune, try replace top layer training. You can use the box/tiff pairs generated for finetune. Commands will be similar to the following:
|
~/tesstutorial/tel/ should have your .lstmf files. |
Thank you @Shreeshrii I'll try to replace top layer |
When you use
When you use
intermediate checkpoint and .lstm files will be written to the output directory, e.g. ~/tesstutorial/tellayer_from_tel |
I am still getting this error for a new replace top layer training for Devanagari script, where the eval_listfile is based on a different training text, e.g.
While each unicode character (स ा ँ ) is there in the Devanagari unicharset, the combined akshara (साँ, छँ) is not there as part of training text/unicharset, but is there as part of eval text/unicharset. The training unicharset is of the following format:
Does this mean that the training text needs to be expanded to include all possible akshara combinations? |
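To make the distinction between single characters and a combined akshara concrete, the sketch below lists the code points that make up साँ and checks them against a toy set. The set `toy_unicharset` is a hypothetical stand-in, not the real unicharset file format, and the helper name `code_points` is made up for illustration.

```python
import unicodedata


def code_points(text: str) -> list:
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


akshara = "\u0938\u093e\u0901"  # साँ written as three separate code points
parts = code_points(akshara)
for cp, name in parts:
    print(cp, name)

# Hypothetical toy set holding only the individual characters.
toy_unicharset = {"\u0938", "\u093e", "\u0901"}

# Every component is present, but the combined unit is not an entry.
all_parts_present = all(ch in toy_unicharset for ch in akshara)
whole_present = akshara in toy_unicharset
print(all_parts_present, whole_present)
```

This only illustrates the situation described in the comment (components present, combination absent); whether the training text must cover every combination is the open question above.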
@Shreeshrii Thanks for your help yesterday. |
As per @theraysmith
tesstrain.sh has a limit of max_pages 3; you should change that so that the complete training_text is used. You can review the training_text to check that it is a correct representation of bod (Tibetan). Also test OCR with the 'Tibetan' script traineddata from both the 'tessdata_best' and 'tessdata_fast' repos. An authoritative answer can only be provided by @theraysmith. |
@Shreeshrii Thanks a lot for the reply! I'll try the solution. btw I tried to decode the error message and found most of them started with
i.e. ༌། (0xf0c 0xf0d) |
Same problem as I had mentioned in one of my earlier comments -
No answer from @theraysmith yet. He has also marked this as a closed issue. |
@zdenop Ray closed this, so I cannot reopen it. Please reopen this issue, because the problem is still there. It is related to UTF-8/UTF-16/UTF-32 conversion. Example: Encoding of string failed! Failure bytes: cc 84 67 6e 65. The error is related to 'CC 84' in UTF-8, which is '0304' in UTF-16 (hex). The string was converted using the converter at https://r12a.github.io/app-conversion/
tesseract/src/lstm/lstmtrainer.cpp Line 785 in a80a8f1
|
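The conversion done with the online converter can also be reproduced locally. The sketch below decodes the failure bytes `cc 84 67 6e 65` and shows the first character in UTF-8, UTF-16 and UTF-32; the helper name `describe` is made up for illustration, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Summarize one character: code point, name, and its encoded forms."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "utf8": ch.encode("utf-8").hex(" "),
        "utf16": ch.encode("utf-16-be").hex(" "),
        "utf32": ch.encode("utf-32-be").hex(" "),
        # A non-zero combining class means the character attaches to the
        # character before it.
        "combining": unicodedata.combining(ch) != 0,
    }


text = bytes.fromhex("cc 84 67 6e 65").decode("utf-8")
first = describe(text[0])
for key, value in first.items():
    print(f"{key}: {value}")
```

The first character is U+0304 COMBINING MACRON, a combining mark, so the failure starts at a mark that belongs to the character just before the failure bytes.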
@ivanzz1001 Any ideas? |
Can't encode transcription: 'ঢাকা মেটো-গ' in language '' |
It looks like this was the first report of the encoding problem, so I re-open it until it is (hopefully soon) solved. |
@stweil After this initial error report, Ray changed the LSTM training
process so some of the comments will not be applicable with current code.
Regardless, the issue is still there.
On Wed, Oct 9, 2019 at 11:29 PM, Stefan Weil wrote:
See also later errors with "Encoding of string failed"
<https://github.com/tesseract-ocr/tesseract/issues?utf8=%E2%9C%93&q=%22Encoding+of+string+failed%22>
|
I could fix the encoding errors for tesstrain by normalizing the ground truth texts, see tesseract-ocr/tesstrain#111. |
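As a rough illustration of what normalizing ground truth text can mean, the sketch below applies Unicode normalization with Python's standard `unicodedata` module. The choice of NFC, the helper name `normalize_ground_truth` and the sample string are assumptions for this example; see tesseract-ocr/tesstrain#111 for what was actually changed.

```python
import unicodedata


def normalize_ground_truth(text: str, form: str = "NFC") -> str:
    """Return text in the given Unicode normalization form (NFC assumed here)."""
    return unicodedata.normalize(form, text)


# Hypothetical sample: 'a' followed by COMBINING MACRON (U+0304), then "gne".
decomposed = "a\u0304gne"
composed = normalize_ground_truth(decomposed)

print([f"U+{ord(ch):04X}" for ch in decomposed])
print([f"U+{ord(ch):04X}" for ch in composed])
```

After NFC normalization the base letter and the combining macron become the single precomposed character U+0101, so the text no longer contains a standalone U+0304.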
@stweil If I understand the change correctly, this normalizes the ground-truth text within the box file so that errors will be avoided during LSTM training. so any comparisons using the original ground truth files using Also, this does not address the case when training is done using training_text and fonts. I will suggest adding a new script Also, it may be helpful to normalize all existing training_text files in the langdata_lstm and langdata repos. |
See tesseract-ocr/tesstrain#111. I just added a |
See tesseract-ocr/langdata#148 and tesseract-ocr/langdata_lstm#26 which normalize the training texts. I noticed that more files (mostly |
Thanks, @stweil. |
Possible causes as per Ray:
Additional cause:
SOLUTIONS:
|
|
Which languages? tessdata_best or tessdata_fast? |
Here is the list of all unnormalized components (extracted from traineddata):
|
This happens when some control characters are present, like: So with sed (or a Python/Perl script, whatever) you can remove or replace them.
|
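A minimal sketch of the Python route mentioned above, assuming each transcription is handled as a single line without its trailing newline. The helper name `clean_line` is made up, and replacing tabs with a space is an assumption. Only characters in Unicode category Cc are dropped, so format characters such as ZWJ and ZWNJ, which matter in some scripts, are left alone.

```python
import unicodedata


def clean_line(line: str) -> str:
    """Replace tabs with a space and drop other control (Cc) characters."""
    cleaned = []
    for ch in line:
        if ch == "\t":
            cleaned.append(" ")   # keep word separation where a tab was used
        elif unicodedata.category(ch) == "Cc":
            continue              # drop other control characters
        else:
            cleaned.append(ch)
    return "".join(cleaned)


sample = "Møller.\t1200Emilie,hansKone."
print(repr(clean_line(sample)))
```

Running this over training text or ground truth lines before generating training data would remove the tab seen in the frk failure bytes earlier in the thread.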