Can't encode transcription: adding 0 automatically at start of ground truth file. #172
Comments
@jayawantkarale Without sample data, we cannot assist you, since we cannot reproduce the erroneous behavior. Could you provide us with a minimal example that causes your issue? |
@wrznr Thanks for your prompt reply. This erroneous behavior appears while training on a large dataset. It is not possible to upload the whole dataset, so I am trying to reproduce the same behavior on a small sample. Once I get the error on the small dataset, I will upload it. |
Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''
Try to find the above text in your large dataset. Then, using uniview or a similar utility, check whether the line contains any unprintable/invisible characters. |
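The check for unprintable or invisible characters can also be scripted. The following is a minimal sketch, not part of tesstrain or Tesseract; the function name `find_invisible` and the choice of Unicode categories to flag are assumptions made for illustration.

```python
import unicodedata

# Unicode general categories treated as "suspicious" here (an assumption):
# Cc = control, Cf = format (BOM, ZWNJ, ZWJ, ...), Co = private use, Cn = unassigned.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}


def find_invisible(line):
    """Return (index, code point, name) for each suspicious character in line."""
    hits = []
    for i, ch in enumerate(line):
        if ch in "\n\t":
            continue  # ordinary whitespace controls are not reported
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits


if __name__ == "__main__":
    # Hypothetical sample line with a leading BOM and an embedded ZWNJ.
    sample = "\ufeffअंश\u200cक्लृप्ति"
    for index, code_point, name in find_invisible(sample):
        print(index, code_point, name)
```

Running it over each ground truth line lists the position and code point of every format or control character, which a viewer may not render at all.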
I am able to reproduce the same behaviour on a smaller dataset. I am attaching the dataset along with its log file.
Can't encode transcription: ' 0MahāBhā. xii. 47. 42; 6 A ii 1/124 of a day द्यु हेयं पर्व चेत् पादे पादर्स्रिशत्तु' in language ''
Can't encode transcription: ' 0MahāBhā. ii. 8. 14; विजिती वीतिहोत्रोंऽशः [ v. 1 ºञश्च ] MahāBhā. i. 1. 173.' in language ''
Can't encode transcription: ' 0day as a meaning found in L. 9 adv. (aṁśena aṁśe) in part, partly,' in language ''
Can't encode transcription: ' 0नुकूलता Kād. 159.1; उभयाश्रयत्वात् पथ्यालक्षणस्य विपुलायास्तत्रांशेनापि प्रवेशो' in language ''
Can't encode transcription: ' 0221. 16 (aṁśāni=probably aṁśvālyāni=vālyāṁśāḥ) सर्वेत्र चाष्टंसमशं कल्कस्य' in language ''
Can't encode transcription: ' 0तत्समं तोरणं चान्यन्न्यस्येद् भूमौ द्विजांशकम् PauṣkS. 4. 123; F v part or' in language ''
Can't encode transcription: ' 0तन्मूले व्ययहर्म्यनामसहिते भक्ते त्रिभिस्त्वंशकः ।. स्यादिन्द्रो यमभूपतिक्रमवशात् Rāja-' in language ''
Can't encode transcription: ' 0द्रव्यत्वेऽप्यंशकल्पनमाकाशस्य TattvPradī. ( A.) 198. 7 (on 2. 48) .' in language ''
Can't encode transcription: ' 0विभजेल्लब्धं भवेद्वर्गः ĀryaSi. 15. 16 (147)' in language '' |
I have checked it with BabelPad Viewer, but it does not show any unprintable/invisible characters. |
Thank you for sharing the test dataset. Please also share the command you used to invoke the Makefile and your Tesseract version. |
From your log file:
unicharset of size 3 from file data/san9_test/my.unicharset
THIS IS THE ISSUE. I tried with just a few lines from your dataset with the following command and it seems to work.
|
https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L191
When the single-line ground truth texts are concatenated, they become one huge line, so if there is an error in even one file, the unicharset generation fails. I changed the above in the Makefile to add a line break after each ground truth file.
Now the unicharset is being generated from the training text.
|
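The actual change was made in the Makefile and is not reproduced here. As an illustration of the idea only, the following Python sketch joins ground truth texts so that every file ends with exactly one line break; the function names and the `.gt.txt` suffix are assumptions for the example.

```python
from pathlib import Path


def join_ground_truth(texts):
    """Join single-line ground truth strings, one per output line."""
    # Strip any trailing line break first, then add exactly one back,
    # so files with and without a final newline are treated alike.
    lines = [text.rstrip("\r\n") for text in texts]
    return "".join(line + "\n" for line in lines)


def build_training_text(gt_dir, suffix=".gt.txt"):
    """Read every ground truth file under gt_dir and join them line by line."""
    paths = sorted(Path(gt_dir).glob(f"*{suffix}"))
    return join_ground_truth(p.read_text(encoding="utf-8") for p in paths)
```

With one transcription per line, a problematic file affects only its own line instead of turning the whole training text into a single unusable line.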
The generated unicharset has a line for feff (BOM) - I see that it is part of many of the gt lines. This can also cause the error.
You can search for it in your groundtruth with
@kba @wrznr @stweil Can the normalization process be used to strip the BOM from the ground truth? |
@Shreeshrii This could indeed be a problem! According to https://stackoverflow.com/questions/8898294/convert-utf-8-with-bom-to-utf-8-with-no-bom-in-python, there is a special encoding for UTF-8 with BOM. @jayawantkarale Could you try to remove the BOMs and check whether the problem persists? |
Decoding as utf-8-sig would not hurt to do by default IIUC. If it's a UTF-8 string without BOM, the behavior should be the same as decoding as utf-8. |
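The behavior described here can be checked with Python's built-in `utf-8-sig` codec. This is a minimal sketch; the helper names `decode_ground_truth` and `strip_all_boms` are hypothetical and not part of tesstrain.

```python
import codecs


def decode_ground_truth(raw):
    """Decode bytes as UTF-8, dropping a leading BOM if one is present."""
    return raw.decode("utf-8-sig")


def strip_all_boms(text):
    """Remove every U+FEFF, including ones left mid-text after concatenation."""
    return text.replace("\ufeff", "")


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF8 + "अंश".encode("utf-8")
    without_bom = "अंश".encode("utf-8")
    # Both inputs decode to the same string under utf-8-sig.
    print(decode_ground_truth(with_bom) == decode_ground_truth(without_bom))
    # Plain utf-8 keeps the BOM as a U+FEFF character at the start.
    print(with_bom.decode("utf-8").startswith("\ufeff"))
```

Note that `utf-8-sig` only removes a BOM at the very start of the input. If several files with BOMs have already been concatenated, the later BOMs remain in the text, which is why the sketch also includes a plain replacement of U+FEFF.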
@Shreeshrii As suggested, I made the changes in the Makefile.
Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8c 75 2e 20 32 35 33 2e 20 31
Also, I don't actually understand how to remove the BOM from a file. Thanks for the quick help @Shreeshrii @wrznr @kba |
@Shreeshrii I removed the BOM from the ground-truth files.
Can't encode transcription: ' ( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaMu. 253. 1' in language ''
Thanks a lot, it solved my problem @Shreeshrii @wrznr @kba |
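The failure bytes in this log can be decoded to see which character could not be encoded. The sketch below assumes that the `ffffff` prefixes are an artifact of printing signed bytes, so only the low 8 bits of each value are kept; the function name is hypothetical.

```python
def decode_failure_bytes(dump):
    """Turn a whitespace-separated hex dump into the text it encodes."""
    # Values such as ffffffe2 are assumed to be sign-extended bytes,
    # so mask each one down to its low 8 bits before decoding.
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8")


if __name__ == "__main__":
    text = decode_failure_bytes(
        "ffffffe2 ffffff80 ffffff8c 75 2e 20 32 35 33 2e 20 31"
    )
    # Print each character's code point next to its visible form.
    for ch in text:
        print(f"U+{ord(ch):04X}", repr(ch))
```

Under that assumption, the first three bytes (E2 80 8C) are the UTF-8 encoding of U+200C (ZERO WIDTH NON-JOINER), and the remaining bytes spell `u. 253. 1`.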
@jayawantkarale please share the error log from training. It might be helpful in finding out why certain lines are still getting the error 'can't encode transcription' even though the unicharset is generated from the training text. |
Should we implement decoding from utf-8-sig to prevent BOM issues in the future? |
Good question. I am really not sure. If I got you correctly it wouldn't hurt but somehow I am still not a huge fan of it. @Shreeshrii @stweil What do you think? |
It could be implemented in the Tesseract code. But maybe just failing with a reasonable error message would be better. That helps to get uniform GT texts (instead of supporting many variants, which also cause problems elsewhere). |
I have attached the log file from training. |
The log shows two text lines getting the error. Using https://r12a.github.io/uniview/, it seems to me that the problem may be caused by ZWNJ.
and
Please try again after removing ZWNJ (200C) from the ground truth and see if it works. @stweil I think the Tesseract normalization process removes ZWNJ from the text. Could that be causing this issue? |
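Removing ZWNJ (U+200C) from a ground truth line can be sketched as follows. The helper names are hypothetical, and since ZWNJ can be intentional in some lines, it is probably safer to count and review occurrences before removing them.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def count_zwnj(text):
    """Return how many ZWNJ characters the text contains."""
    return text.count(ZWNJ)


def strip_zwnj(text):
    """Return the text with every ZWNJ removed."""
    return text.replace(ZWNJ, "")


if __name__ == "__main__":
    # Hypothetical line with a ZWNJ before the citation, as in the failing log line.
    line = "कार्यमित्रेषु Manva\u200cMu. 253. 1"
    print(count_zwnj(line))
    print(strip_zwnj(line))
```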
@jayawantkarale Please also share the generated unicharset. |
Is ZWNJ being used to create old style ligatures? Please share the images and groundtruth for the two lines in error:
|
@Shreeshrii In these lines ZWNJ is not required. After removing it, I no longer get any error. I will share images where we use ZWNJ but do not get the error. |
@Shreeshrii These sample files contain the ZWNJ character, but I still do not get an error. |
@jayawantkarale Sorry it took me so long to look at the files. I see that ZWNJ is being used for explicit halant (virama) in the middle of words. I am curious to know the results of your training. Please do share them when it is completed. One suggestion: the ground truth needs to be reviewed carefully, otherwise the training results will be wrong. E.g., in the sample that you shared, 'small u maatraa' is used for 'roopam' two times, when it actually needs to be 'uu maatraa'.
needs to have ब्रह्मरूपं जगद्रूपं instead of ब्रह्मरुपं जगद्रुपं. |
Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''
The above error is displayed during the training process. In the ground truth text we gave the input as
'अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' but during the training process a 0 is added at the start of the line.
As a result, a 0 is also added to the output when performing OCR with the newly generated traineddata.
Why is a 0 being added at the start of the ground truth line?
Please help me resolve the issue.