
Can't encode transcription: adding 0 automatically at start of ground truth file. #172

Closed
jayawantkarale opened this issue Jun 30, 2020 · 25 comments
Assignees
Labels
bug Something isn't working

Comments

@jayawantkarale

Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''

The above error is displayed during the training process. In the ground-truth text we provided the input
'अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-', but during training a 0 is appended at the start of the line.
This 0 then also appears after performing OCR extraction with the newly generated traineddata.
Why is a 0 being added at the start of the ground-truth line?
Please help me resolve this issue.

@wrznr
Collaborator

wrznr commented Jun 30, 2020

@jayawantkarale Without sample data, it is not possible to assist you, since we cannot reproduce the erroneous behavior. Could you provide us with a minimal example that causes your issue?

@wrznr wrznr self-assigned this Jun 30, 2020
@wrznr wrznr added the bug Something isn't working label Jun 30, 2020
@jayawantkarale
Author

@wrznr Thanks for your prompt reply. This erroneous behavior appears while training on a large dataset. It is not possible to upload the whole dataset, so I am trying to reproduce the same behavior on a small sample; once I get the error on a small dataset, I will upload the sample data.

@Shreeshrii
Collaborator

Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''

Try to find the above text in your large dataset. Then, using uniview or a similar utility, check whether there is any unprintable/invisible character in the line.

@jayawantkarale
Author

I was able to reproduce the same behaviour on a smaller dataset; I am attaching the dataset along with its log file.
Now it is adding a 0 at the start of most of the lines and gives the error "can't encode transcription".

Can't encode transcription: ' 0MahāBhā. xii. 47. 42; 6 A ii 1/124 of a day द्यु हेयं पर्व चेत् पादे पादर्स्रिशत्तु' in language ''
page no : vol1_101_1-029.exp0.tif

Can't encode transcription: ' 0MahāBhā. ii. 8. 14; विजिती वीतिहोत्रोंऽशः [ v. 1 ºञश्च ] MahāBhā. i. 1. 173.' in language ''
page no : vol1_101_1-048.exp0.tif

Can't encode transcription: ' 0day as a meaning found in L. 9 adv. (aṁśena aṁśe) in part, partly,' in language ''
page no : vol1_101_2-004.exp0.tif

Can't encode transcription: ' 0नुकूलता Kād. 159.1; उभयाश्रयत्वात् पथ्यालक्षणस्य विपुलायास्तत्रांशेनापि प्रवेशो' in language ''
page no : vol1_101_2-006.exp0.tif

Can't encode transcription: ' 0221. 16 (aṁśāni=probably aṁśvālyāni=vālyāṁśāḥ) सर्वेत्र चाष्टंसमशं कल्कस्य' in language ''
page no : vol1_101_2-019.exp0.tif

Can't encode transcription: ' 0तत्समं तोरणं चान्यन्न्यस्येद् भूमौ द्विजांशकम् PauṣkS. 4. 123; F v part or' in language ''
page no : vol1_102_2-026.exp0.tif

Can't encode transcription: ' 0तन्मूले व्ययहर्म्यनामसहिते भक्ते त्रिभिस्त्वंशकः ।. स्यादिन्द्रो यमभूपतिक्रमवशात् Rāja-' in language ''
page no : vol1_102_2-033.exp0.tif

Can't encode transcription: ' 0द्रव्यत्वेऽप्यंशकल्पनमाकाशस्य TattvPradī. ( A.) 198. 7 (on 2. 48) .' in language ''
page no : vol1_103_1-029.exp0.tif

Can't encode transcription: ' 0विभजेल्लब्धं भवेद्वर्गः ĀryaSi. 15. 16 (147)' in language ''
page no : vol1_103_2-003.exp0.tif

san9_test-ground-truth.zip
san9_test.log

@jayawantkarale
Author

Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''

Try to find the above text in your large dataset. Then using uniview or similar utility check whether there is any unprintable/invisible character in the line.

I have checked it with BabelPad Viewer, but it does not show any unprintable/invisible character.

@Shreeshrii
Collaborator

Thank you for sharing the test dataset. Please also share the command you used to invoke the makefile, and your tesseract version.

@Shreeshrii
Collaborator

From your log file:

Wrote unicharset file data/san9_test/my.unicharset
merge_unicharsets data/san/san9_test.lstm-unicharset data/san9_test/my.unicharset  "data/san9_test/unicharset"
Loaded unicharset of size 225 from file data/san/san9_test.lstm-unicharset
Loaded unicharset of size 3 from file data/san9_test/my.unicharset
Wrote unicharset file data/san9_test/unicharset.

unicharset of size 3 from file data/san9_test/my.unicharset

THIS IS THE ISSUE.

I tried with just a few lines from your dataset with the following command and it seems to work.

 make training MODEL_NAME=san9_mwe LANG_TYPE=Indic START_MODEL=san TESSDATA=/home/ubuntu/tessdata_best

@Shreeshrii
Collaborator

https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L191

	find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs cat | sort | uniq > "$@"

When the single-line ground-truth texts are concatenated, they become one huge line, so if there is an error in even one file, the unicharset generation fails. I changed the above in the Makefile to add a line break after each ground-truth file.

	find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs -I{} sh -c "cat {}; echo ''" > "$@"

Now the unicharset is being generated from the training text.
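For illustration, here is a minimal sketch (using hypothetical temporary files, not the actual dataset) of why files lacking a trailing newline run together under the original recipe:

```shell
dir=$(mktemp -d)
printf 'line one' > "$dir/a.gt.txt"   # note: no trailing newline
printf 'line two' > "$dir/b.gt.txt"   # note: no trailing newline

# Original recipe: file contents are concatenated with nothing in between
broken=$(find "$dir" -name '*.gt.txt' | sort | xargs cat)
echo "$broken"   # -> line oneline two

# Fixed recipe: an explicit newline is echoed after each file
fixed=$(find "$dir" -name '*.gt.txt' | sort | xargs -I{} sh -c "cat {}; echo ''")
echo "$fixed"    # -> line one and line two on separate lines
```

With the newline in place, each ground-truth line reaches the unicharset extraction step as a separate line.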

unicharset_extractor --output_unicharset "data/san9_test/my.unicharset" --norm_mode 2 "data/san9_test/all-gt"
Bad box coordinates in boxfile string! नाम् । रसगुणबलिभिर्विधाय RaseCi. 2. 10; त्वक्पथ्ययोः समावंशौ शशिभागार्धसं-
Extracting unicharset from plain text file data/san9_test/all-gt
Invalid start of grapheme sequence:M=0x943
Normalization failed for string 'iii. 3.65; 8 C a quarter [अंशः] पादार्द्धयोदेृष्टः ParyāRaMā. 1539; 8 D a half;'
Other case I of i is not in unicharset
Other case O of o is not in unicharset
Other case Ṁ of ṁ is not in unicharset
Other case Ṇ of ṇ is not in unicharset
Other case Ḍ of ḍ is not in unicharset
Other case Ṛ of ṛ is not in unicharset
Other case Z of z is not in unicharset
Other case Q of q is not in unicharset
Other case X of x is not in unicharset
Other case Ū of ū is not in unicharset
Other case Ḥ of ḥ is not in unicharset
Other case Ṅ of ṅ is not in unicharset
Other case Ṭ of ṭ is not in unicharset
Other case Ḷ of ḷ is not in unicharset
Other case Ñ of ñ is not in unicharset
Wrote unicharset file data/san9_test/my.unicharset
merge_unicharsets data/Devanagari/san9_test.lstm-unicharset data/san9_test/my.unicharset  "data/san9_test/unicharset"
Loaded unicharset of size 217 from file data/Devanagari/san9_test.lstm-unicharset
Loaded unicharset of size 162 from file data/san9_test/my.unicharset
Wrote unicharset file data/san9_test/unicharset.

@Shreeshrii
Collaborator

@jayawantkarale

I have checked it with BabelPad Viewer but is not showing any unprintable/invisible character

The generated unicharset has a line for feff (BOM); I see that it is part of many of the gt lines. This can also cause the error.

 0 0,255,0,255,0,0,0,0,0,0 Common 3 18 3 	#  [feff ]

You can search for it in your groundtruth with

grep -rl $'\xEF\xBB\xBF' .

@kba @wrznr @stweil Can the normalize process be used to strip the BOM from the groundtruth?

@wrznr
Collaborator

wrznr commented Jul 8, 2020

@Shreeshrii This could indeed be a problem! According to https://stackoverflow.com/questions/8898294/convert-utf-8-with-bom-to-utf-8-with-no-bom-in-python, there is a special encoding for UTF-8 with BOM, utf-8-sig. However, I do not think that we can add a corresponding conversion per default.

@jayawantkarale Could you try to remove the BOMs and check whether the problem persists?
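For reference, one possible way to do this is sketched below on a temporary file (it assumes UTF-8-encoded ground-truth files and GNU sed for in-place editing; the paths and file names are placeholders):

```shell
bom=$(printf '\357\273\277')                      # the UTF-8 BOM bytes EF BB BF
dir=$(mktemp -d)
printf '%ssome text\n' "$bom" > "$dir/a.gt.txt"   # a file starting with a BOM

grep -rl "$bom" "$dir"                            # detect: lists the affected files
grep -rl "$bom" "$dir" | xargs -I{} sed -i "1s/^$bom//" {}   # strip the BOM

head -n 1 "$dir/a.gt.txt"                         # -> some text
```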

@kba
Collaborator

kba commented Jul 8, 2020

However, I do not think that we can add a corresponding conversion per default.

Decoding as utf-8-sig would not hurt to do by default IIUC. If it's a UTF-8 string without BOM, the behavior should be the same as decoding as utf-8.

@jayawantkarale
Author

@Shreeshrii As suggested, I made the change in the Makefile:
find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs -I{} sh -c "cat {}; echo ''" > "$@"
Now it works on the test data I sent, but while running on the full dataset it again shows a 0 at the start of some lines.

Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8c 75 2e 20 32 35 33 2e 20 31
Can't encode transcription: ' 0( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''

And I actually don't understand how to remove the BOM from a file.
I need to check again after removing the BOM from the files.

Thanks for the quick help @Shreeshrii @wrznr @kba


@jayawantkarale
Author

@Shreeshrii I removed the BOM from the ground-truth files.
Now I am not getting a 0 at the start of the line; for those lines I am getting the following error instead.

Can't encode transcription: ' ( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''

Thanks a lot, it solved my problem. @Shreeshrii @wrznr @kba

@wrznr wrznr closed this as completed Jul 10, 2020
@Shreeshrii
Collaborator

@jayawantkarale please share the error log from training. It might be helpful in finding out why certain lines are still getting the error 'can't encode transcription' even though the unicharset is generated from the training text.

@kba
Collaborator

kba commented Jul 10, 2020

Should we implement decoding from utf-8-sig to prevent BOM issues in the future?

@wrznr
Collaborator

wrznr commented Jul 10, 2020

Good question. I am really not sure. If I understood you correctly it wouldn't hurt, but somehow I am still not a huge fan of it.

@Shreeshrii @stweil What do you think?

@stweil
Collaborator

stweil commented Jul 10, 2020

It could be implemented in the Tesseract code. But maybe just failing with a reasonable error message would be better. That helps to get uniform GT texts (instead of supporting many variants which make also problems elsewhere).

@jayawantkarale
Author

I have attached the log file from training.
san9_test.log

@Shreeshrii
Collaborator

    Line 23161: Can't encode transcription: 'द‌शा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम्  ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''
    Line 25108: Can't encode transcription: '( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''
    Line 27514: Can't encode transcription: 'द‌शा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम्  ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''
    Line 27556: Can't encode transcription: '( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''
    Line 27625: Can't encode transcription: 'द‌शा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम्  ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''

The log shows two text lines getting the error.

Using https://r12a.github.io/uniview/, it seems to me that the problem may be caused by ZWNJ.

 ‎004D LATIN CAPITAL LETTER M
 ‎0061 LATIN SMALL LETTER A
 ‎006E LATIN SMALL LETTER N
 ‎0076 LATIN SMALL LETTER V
 ‎0061 LATIN SMALL LETTER A
 ‎004D LATIN CAPITAL LETTER M
 ‎200C ZERO WIDTH NON-JOINER
 ‎0075 LATIN SMALL LETTER U

and

 ‎0926 DEVANAGARI LETTER DA
 ‎200C ZERO WIDTH NON-JOINER
 ‎0936 DEVANAGARI LETTER SHA
 ‎093E DEVANAGARI VOWEL SIGN AA
 ‎0020 SPACE

Please try after removing ZWNJ (200C) from the groundtruth and see if it works.
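As a sketch (shown on a temporary file rather than the real dataset), the affected lines can be located by grepping for the raw UTF-8 bytes of U+200C:

```shell
zwnj=$(printf '\342\200\214')               # ZERO WIDTH NON-JOINER, UTF-8 bytes E2 80 8C
dir=$(mktemp -d)
printf 'plain line\nda%ssha line\n' "$zwnj" > "$dir/b.gt.txt"

grep -rn "$zwnj" "$dir"                     # prints file:line:content for each match

hits=$(grep -c "$zwnj" "$dir/b.gt.txt")     # count matching lines in one file
echo "$hits"                                # -> 1
```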

@stweil I think tesseract normalization process removes ZWNJ from the text. Would that be causing this issue?

@Shreeshrii Shreeshrii reopened this Jul 12, 2020
@Shreeshrii
Collaborator

@jayawantkarale Please also share the generated unicharset.

@Shreeshrii
Collaborator

@Shreeshrii ZWNJ is required, as it is used in prachin (ancient) Sanskrit books.

Is ZWNJ being used to create old style ligatures?

Please share the images and groundtruth for the two lines in error:

    Line 23161: Can't encode transcription: 'द‌शा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम्  ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''
    Line 25108: Can't encode transcription: '( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''

@jayawantkarale
Author

@Shreeshrii In these lines ZWNJ is not required; after removing it, I am not getting any error.
But in certain lines ZWNJ is required, and those lines are not producing the error, so we cannot remove ZWNJ from all the lines.

I will share images where we use ZWNJ but are not getting the error.

@jayawantkarale
Author

@Shreeshrii These sample files contain the ZWNJ character, but I am still not getting the error.

ZWNJ.zip

@Shreeshrii
Collaborator

@jayawantkarale Sorry it took me so long to look at the files. I see that ZWNJ is being used for an explicit halant (virama) in the middle of words.

I am curious to know about the results of your training. Please do share when completed.

One suggestion:

The groundtruth needs to be reviewed carefully, otherwise the training results will be wrong. E.g. in the sample that you shared, the 'small u maatraa' is used for 'roopam' two times, when it actually needs to be the 'uu maatraa'.

vol1_106_1-002 exp0

aspects अस्ति भाति प्रियं रूपं नाम चेत्यंशपञ्चकम् । आद्यत्रयं ब्रह्मरुपं जगद्‌रुपं

needs to have ब्रह्मरूपं जगद्‌रूपं

instead of ब्रह्मरुपं जगद्‌रुपं.
