
Can't encode transcription: adding 0 automatically at start of ground truth file. #172

Closed
jayawantkarale opened this issue Jun 30, 2020 · 25 comments
Assignees
Labels
bug Something isn't working

Comments

@jayawantkarale

Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''

The above error is displayed during the training process. In the ground-truth text we provided the input
'अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-', but during training a 0 is appended at the start of the line.
This 0 then also appears after performing OCR extraction with the newly generated traineddata.
Why is a 0 being added at the start of the ground-truth line?
Please help me resolve this issue.

@wrznr
Collaborator

wrznr commented Jun 30, 2020

@jayawantkarale Without sample data, it is not possible to assist you, since we cannot reproduce the erroneous behavior. Could you provide us with a minimal example that causes your issue?

@wrznr wrznr self-assigned this Jun 30, 2020
@wrznr wrznr added the bug Something isn't working label Jun 30, 2020
@jayawantkarale
Author

@wrznr Thanks for your prompt reply. This erroneous behavior appears while training on a large dataset. It is not possible to upload the whole dataset, so I am trying to reproduce the same behavior on a small sample; once I get the error on a small dataset, I will upload the sample data.

@Shreeshrii
Collaborator

Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''

Try to find the above text in your large dataset. Then, using uniview or a similar utility, check whether there is any unprintable/invisible character in the line.

@jayawantkarale
Author

I was able to reproduce the same behaviour on a smaller dataset; I am attaching the dataset along with its log file.
Now it is adding a 0 at the start of most of the lines and gives the error "can't encode transcription".

Can't encode transcription: ' 0MahāBhā. xii. 47. 42; 6 A ii 1/124 of a day द्यु हेयं पर्व चेत् पादे पादर्स्रिशत्तु' in language ''
page no : vol1_101_1-029.exp0.tif

Can't encode transcription: ' 0MahāBhā. ii. 8. 14; विजिती वीतिहोत्रोंऽशः [ v. 1 ºञश्च ] MahāBhā. i. 1. 173.' in language ''
page no : vol1_101_1-048.exp0.tif

Can't encode transcription: ' 0day as a meaning found in L. 9 adv. (aṁśena aṁśe) in part, partly,' in language ''
page no : vol1_101_2-004.exp0.tif

Can't encode transcription: ' 0नुकूलता Kād. 159.1; उभयाश्रयत्वात् पथ्यालक्षणस्य विपुलायास्तत्रांशेनापि प्रवेशो' in language ''
page no : vol1_101_2-006.exp0.tif

Can't encode transcription: ' 0221. 16 (aṁśāni=probably aṁśvālyāni=vālyāṁśāḥ) सर्वेत्र चाष्टंसमशं कल्कस्य' in language ''
page no : vol1_101_2-019.exp0.tif

Can't encode transcription: ' 0तत्समं तोरणं चान्यन्न्यस्येद् भूमौ द्विजांशकम् PauṣkS. 4. 123; F v part or' in language ''
page no : vol1_102_2-026.exp0.tif

Can't encode transcription: ' 0तन्मूले व्ययहर्म्यनामसहिते भक्ते त्रिभिस्त्वंशकः ।. स्यादिन्द्रो यमभूपतिक्रमवशात् Rāja-' in language ''
page no : vol1_102_2-033.exp0.tif

Can't encode transcription: ' 0द्रव्यत्वेऽप्यंशकल्पनमाकाशस्य TattvPradī. ( A.) 198. 7 (on 2. 48) .' in language ''
page no : vol1_103_1-029.exp0.tif

Can't encode transcription: ' 0विभजेल्लब्धं भवेद्वर्गः ĀryaSi. 15. 16 (147)' in language ''
page no : vol1_103_2-003.exp0.tif

san9_test-ground-truth.zip
san9_test.log

@jayawantkarale
Author

Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''

Try to find the above text in your large dataset. Then using uniview or similar utility check whether there is any unprintable/invisible character in the line.

I have checked it with BabelPad Viewer, but it does not show any unprintable/invisible character.

@Shreeshrii
Collaborator

Thank you for sharing the test dataset. Please also share the command you used to invoke the makefile, and your tesseract version.

@Shreeshrii
Collaborator

From your log file:

Wrote unicharset file data/san9_test/my.unicharset
merge_unicharsets data/san/san9_test.lstm-unicharset data/san9_test/my.unicharset  "data/san9_test/unicharset"
Loaded unicharset of size 225 from file data/san/san9_test.lstm-unicharset
Loaded unicharset of size 3 from file data/san9_test/my.unicharset
Wrote unicharset file data/san9_test/unicharset.

unicharset of size 3 from file data/san9_test/my.unicharset

THIS IS THE ISSUE.

I tried with just a few lines from your dataset with the following command and it seems to work.

 make training MODEL_NAME=san9_mwe LANG_TYPE=Indic START_MODEL=san TESSDATA=/home/ubuntu/tessdata_best

@Shreeshrii
Collaborator

https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L191

	find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs cat | sort | uniq > "$@"

When the single-line ground-truth texts are concatenated, they become one huge line, so if there is an error in even one file, the unicharset generation fails. I changed the above in the Makefile to add a line break after each ground-truth file.

	find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs -I{} sh -c "cat {}; echo ''" > "$@"

Now the unicharset is being generated from the training text.
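For illustration, here is a minimal sketch (using hypothetical temporary files, not the actual dataset) of why files lacking a trailing newline run together under the original recipe:

```shell
dir=$(mktemp -d)
printf 'line one' > "$dir/a.gt.txt"   # note: no trailing newline
printf 'line two' > "$dir/b.gt.txt"   # note: no trailing newline

# Original recipe: file contents are concatenated with nothing in between
broken=$(find "$dir" -name '*.gt.txt' | sort | xargs cat)
echo "$broken"   # -> line oneline two

# Fixed recipe: an explicit newline is echoed after each file
fixed=$(find "$dir" -name '*.gt.txt' | sort | xargs -I{} sh -c "cat {}; echo ''")
echo "$fixed"    # -> line one and line two on separate lines
```

With the newline in place, each ground-truth line reaches the unicharset extraction step as a separate line.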

unicharset_extractor --output_unicharset "data/san9_test/my.unicharset" --norm_mode 2 "data/san9_test/all-gt"
Bad box coordinates in boxfile string! नाम् । रसगुणबलिभिर्विधाय RaseCi. 2. 10; त्वक्पथ्ययोः समावंशौ शशिभागार्धसं-
Extracting unicharset from plain text file data/san9_test/all-gt
Invalid start of grapheme sequence:M=0x943
Normalization failed for string 'iii. 3.65; 8 C a quarter [अंशः] पादार्द्धयोदेृष्टः ParyāRaMā. 1539; 8 D a half;'
Other case I of i is not in unicharset
Other case O of o is not in unicharset
Other case Ṁ of ṁ is not in unicharset
Other case Ṇ of ṇ is not in unicharset
Other case Ḍ of ḍ is not in unicharset
Other case Ṛ of ṛ is not in unicharset
Other case Z of z is not in unicharset
Other case Q of q is not in unicharset
Other case X of x is not in unicharset
Other case Ū of ū is not in unicharset
Other case Ḥ of ḥ is not in unicharset
Other case Ṅ of ṅ is not in unicharset
Other case Ṭ of ṭ is not in unicharset
Other case Ḷ of ḷ is not in unicharset
Other case Ñ of ñ is not in unicharset
Wrote unicharset file data/san9_test/my.unicharset
merge_unicharsets data/Devanagari/san9_test.lstm-unicharset data/san9_test/my.unicharset  "data/san9_test/unicharset"
Loaded unicharset of size 217 from file data/Devanagari/san9_test.lstm-unicharset
Loaded unicharset of size 162 from file data/san9_test/my.unicharset
Wrote unicharset file data/san9_test/unicharset.

@Shreeshrii
Collaborator

@jayawantkarale

I have checked it with BabelPad Viewer but is not showing any unprintable/invisible character

The generated unicharset has a line for feff (BOM); I see that it is part of many of the gt lines. This can also cause the error.

 0 0,255,0,255,0,0,0,0,0,0 Common 3 18 3 	#  [feff ]

You can search for it in your groundtruth with

grep -rl $'\xEF\xBB\xBF' .

@kba @wrznr @stweil Can the normalize process be used to strip the BOM from the groundtruth?

@wrznr
Collaborator

wrznr commented Jul 8, 2020

@Shreeshrii This could indeed be a problem! According to https://stackoverflow.com/questions/8898294/convert-utf-8-with-bom-to-utf-8-with-no-bom-in-python, there is a special encoding for UTF-8 with BOM, utf-8-sig. However, I do not think that we can add a corresponding conversion per default.

@jayawantkarale Could you try to remove the BOMs and check whether the problem persists?
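For reference, one possible way to do this is sketched below on a temporary file (it assumes UTF-8-encoded ground-truth files and GNU sed for in-place editing; the paths and file names are placeholders):

```shell
bom=$(printf '\357\273\277')                      # the UTF-8 BOM bytes EF BB BF
dir=$(mktemp -d)
printf '%ssome text\n' "$bom" > "$dir/a.gt.txt"   # a file starting with a BOM

grep -rl "$bom" "$dir"                            # detect: lists the affected files
grep -rl "$bom" "$dir" | xargs -I{} sed -i "1s/^$bom//" {}   # strip the BOM

head -n 1 "$dir/a.gt.txt"                         # -> some text
```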

@kba
Collaborator

kba commented Jul 8, 2020

However, I do not think that we can add a corresponding conversion per default.

Decoding as utf-8-sig would not hurt to do by default IIUC. If it's a UTF-8 string without BOM, the behavior should be the same as decoding as utf-8.

@jayawantkarale
Author

@Shreeshrii As suggested, I made the change in the Makefile:
find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs -I{} sh -c "cat {}; echo ''" > "$@"
Now it works on the test data I sent, but while running on the full dataset it again shows a 0 at the start of some lines.

Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8c 75 2e 20 32 35 33 2e 20 31
Can't encode transcription: ' 0( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''

And I actually don't understand how to remove the BOM from a file.
I need to check again after removing the BOM from the files.

Thanks for the quick help @Shreeshrii @wrznr @kba


@jayawantkarale
Author

@Shreeshrii I removed the BOM from the ground-truth files.
Now I am not getting a 0 at the start of the line; for those lines I am getting the following error instead.

Can't encode transcription: ' ( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''

Thanks a lot, it solved my problem. @Shreeshrii @wrznr @kba

@wrznr wrznr closed this as completed Jul 10, 2020
@Shreeshrii
Collaborator

@jayawantkarale please share the error log from training. It might be helpful in finding out why certain lines are still getting the error 'can't encode transcription' even though the unicharset is generated from the training text.

@kba
Collaborator

kba commented Jul 10, 2020

Should we implement decoding from utf-8-sig to prevent BOM issues in the future?

@wrznr
Collaborator

wrznr commented Jul 10, 2020

Good question. I am really not sure. If I understood you correctly it wouldn't hurt, but somehow I am still not a huge fan of it.

@Shreeshrii @stweil What do you think?

@stweil
Collaborator

stweil commented Jul 10, 2020

It could be implemented in the Tesseract code. But maybe just failing with a reasonable error message would be better. That helps to get uniform GT texts (instead of supporting many variants which make also problems elsewhere).

@jayawantkarale
Author

I have attached the log file from training.
san9_test.log

@Shreeshrii
Collaborator

    Line 23161: Can't encode transcription: 'द‌शा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम्  ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''
    Line 25108: Can't encode transcription: '( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''
    Line 27514: Can't encode transcription: 'द‌शा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम्  ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''
    Line 27556: Can't encode transcription: '( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''
    Line 27625: Can't encode transcription: 'द‌शा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम्  ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''

The log shows two text lines getting the error.

Using https://r12a.github.io/uniview/, it seems to me that the problem may be caused by ZWNJ.

 ‎004D LATIN CAPITAL LETTER M
 ‎0061 LATIN SMALL LETTER A
 ‎006E LATIN SMALL LETTER N
 ‎0076 LATIN SMALL LETTER V
 ‎0061 LATIN SMALL LETTER A
 ‎004D LATIN CAPITAL LETTER M
 ‎200C ZERO WIDTH NON-JOINER
 ‎0075 LATIN SMALL LETTER U

and

 ‎0926 DEVANAGARI LETTER DA
 ‎200C ZERO WIDTH NON-JOINER
 ‎0936 DEVANAGARI LETTER SHA
 ‎093E DEVANAGARI VOWEL SIGN AA
 ‎0020 SPACE

Please try after removing ZWNJ (200C) from the groundtruth and see if it works.
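As a sketch (shown on a temporary file rather than the real dataset), the affected lines can be located by grepping for the raw UTF-8 bytes of U+200C:

```shell
zwnj=$(printf '\342\200\214')               # ZERO WIDTH NON-JOINER, UTF-8 bytes E2 80 8C
dir=$(mktemp -d)
printf 'plain line\nda%ssha line\n' "$zwnj" > "$dir/b.gt.txt"

grep -rn "$zwnj" "$dir"                     # prints file:line:content for each match

hits=$(grep -c "$zwnj" "$dir/b.gt.txt")     # count matching lines in one file
echo "$hits"                                # -> 1
```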

@stweil I think tesseract normalization process removes ZWNJ from the text. Would that be causing this issue?

@Shreeshrii Shreeshrii reopened this Jul 12, 2020
@Shreeshrii
Collaborator

@jayawantkarale Please also share the generated unicharset.

@Shreeshrii
Collaborator

@Shreeshrii ZWNJ is required, as it is used in prachin (ancient) Sanskrit books.

Is ZWNJ being used to create old style ligatures?

Please share the images and groundtruth for the two lines in error:

    Line 23161: Can't encode transcription: 'द‌शा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम्  ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''
    Line 25108: Can't encode transcription: '( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''

@jayawantkarale
Author

@Shreeshrii In these lines ZWNJ is not required; after removing it, I am not getting any error.
But in certain lines ZWNJ is required, and those lines are not producing the error, so we cannot remove ZWNJ from all the lines.

I will share images where we use ZWNJ but are not getting the error.

@jayawantkarale
Author

@Shreeshrii These sample files contain the ZWNJ character, but I am still not getting the error.

ZWNJ.zip

@Shreeshrii
Collaborator

@jayawantkarale Sorry it took me so long to look at the files. I see that ZWNJ is being used for an explicit halant (virama) in the middle of words.

I am curious to know about the results of your training. Please do share when completed.

One suggestion:

The groundtruth needs to be reviewed carefully, otherwise the training results will be wrong. E.g. in the sample that you shared, the 'small u maatraa' is used for 'roopam' two times, when it actually needs to be the 'uu maatraa'.

vol1_106_1-002 exp0

aspects अस्ति भाति प्रियं रूपं नाम चेत्यंशपञ्चकम् । आद्यत्रयं ब्रह्मरुपं जगद्‌रुपं

needs to have ब्रह्मरूपं जगद्‌रूपं

instead of ब्रह्मरुपं जगद्‌रुपं.
