DO NOT MERGE: Fix reversal of numerals in Arabic script #2266

Shreeshrii · 2019-02-25T12:06:37Z

Fixes issue #2263

Also addresses these related issues:
- Add Indic numerals and missing punctuation to Arabic tesseract-ocr/langdata#131
- Arabic training data has room for improvement #2047

See https://github.com/Shreeshrii/tessdata_arabic
for the finetuned traineddata file and test image and OCR output.

The OCR output looks correct when viewed in Notepad++ in RTL view. However, I am not able to copy it into GitHub.

amitdo · 2019-02-25T12:36:00Z

> delete check of restrictive normalization rule

Why? Can you give an example where this rule is wrong?

> The OCR output looks correct when viewed in Notepad++ in RTL view. However, I am not able to copy it into GitHub.

What do you mean?

If the problem is that it looks different when you paste it to github, then you can overcome it like this:

<p dir=rtl>
שלום עולם!
</p>

=>

שלום עולם!

Shreeshrii · 2019-02-25T13:21:36Z

> delete check of restrictive normalization rule
>
> Why? Can you give an example where this rule is wrong?

E.g., for RTL languages the processing is done in LTR order, so a word-final combining mark is seen first and triggers the error that the word begins with a combining mark.
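
A minimal Python sketch of the effect described here, not Tesseract code. It assumes the text is reversed character by character before validation, and it uses the standard `unicodedata` module in place of ICU. The sample word and the helper name `starts_with_combiner` are illustrative assumptions.

```python
import unicodedata

# Unicode general categories for combining marks (Mn, Mc, Me).
MARK_CATEGORIES = {"Mn", "Mc", "Me"}


def starts_with_combiner(text: str) -> bool:
    """Return True if the first code point of text is a combining mark."""
    return bool(text) and unicodedata.category(text[0]) in MARK_CATEGORIES


# Logical (reading) order: meem, waw with hamza, kaf, dal, alef, fathatan.
word = "\u0645\u0624\u0643\u062f\u0627\u064b"

# Naive character-by-character reversal, used here as a stand-in for
# handling RTL text in LTR order.
reversed_word = word[::-1]

print(starts_with_combiner(word))           # the word itself starts with a letter
print(starts_with_combiner(reversed_word))  # the word-final mark now comes first
```

In logical order the fathatan (U+064B) is the last character of the word; after reversal it is the first, which is the situation a "word started with a combiner" check would reject.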

Shreeshrii · 2019-02-25T13:25:21Z

> If the problem is that it looks different when you paste it to github, then you can overcome it like this:

I tried with <div dir=rtl>, but the output loses its line feeds and becomes one paragraph.

When pasted into GitHub, the text also shows small red marks around the numbers in Arabic script, probably control characters marking a change of direction. However, no such marks are visible in Notepad++.

amitdo · 2019-02-25T13:32:33Z

Please save as .txt and upload.

Shreeshrii · 2019-02-25T13:43:21Z

It is already there as .txt at https://github.com/Shreeshrii/tessdata_arabic/blob/master/Arabic-TOC-ara-Amiri.txt

I was trying to edit it with div in github.

amitdo · 2019-02-25T14:06:27Z

https://raw.githubusercontent.com/Shreeshrii/tessdata_arabic/master/Arabic-TOC-ara-Amiri.txt

looks right in firefox.

amitdo · 2019-02-25T15:19:52Z

> a final combining mark

Is it legal for a combining mark to be the last char in a word?

Shreeshrii · 2019-02-25T15:43:18Z

I do not know enough about Arabic to answer that, but in Indic languages combining marks for dependent vowels are quite common as the last character in a word.

Shreeshrii · 2019-02-25T16:20:19Z

I have tested only with the TOC image since it was easy to verify. I still need to check cases where the numbers are in the middle of a sentence.

amitdo · 2019-02-25T16:37:56Z

if (prev_cc == CharClass::kWhitespace && is_combiner) {

    const bool is_combiner =
        cc == CharClass::kCombiner || cc == CharClass::kVirama;
    // Reject easily detected badly formed sequences.
    if (prev_cc == CharClass::kWhitespace && is_combiner) {
      if (report_errors_) tprintf("Word started with a combiner:0x%x\n", ch);
      return false;
    }

    Validator::CharClass ValidateGrapheme::UnicodeToCharClass(char32 ch) const {
      if (IsVedicAccent(ch)) return CharClass::kVedicMark;
      // The ZeroWidth[Non]Joiner characters are mapped to kCombiner as they
      // always combine with the previous character.
      if (u_hasBinaryProperty(ch, UCHAR_GRAPHEME_LINK)) return CharClass::kVirama;
      if (u_isUWhiteSpace(ch)) return CharClass::kWhitespace;
      // Workaround for Javanese Aksara's Taling, do not label it as a combiner
      if (ch == 0xa9ba) return CharClass::kConsonant;
      int char_type = u_charType(ch);
      if (char_type == U_NON_SPACING_MARK || char_type == U_ENCLOSING_MARK ||
          char_type == U_COMBINING_SPACING_MARK || ch == kZeroWidthNonJoiner ||
          ch == kZeroWidthJoiner)
        return CharClass::kCombiner;
      return CharClass::kOther;
    }

So 'is_combiner' is one of these:

- ZWJ
- ZWNJ
- Virama
- Combining Spacing Marks

amitdo · 2019-02-25T16:48:30Z

Bottom line, it seems to me that the commented code does not affect Arabic or any other RTL language. Therefore, I suggest undoing the first part.

Shreeshrii · 2019-02-25T17:05:12Z

> cases where the numbers are in the middle of a sentence.

These do not seem to be handled correctly. So this PR should NOT be merged.

> It seems to me that the commented code is not affecting Arabic or any other RTL language. Therefore, I suggest undoing the first part.

Thanks, @amitdo. I will undo and test.

I am closing this PR as it does not handle the reversal in all cases.

Shreeshrii · 2019-02-25T17:51:55Z

Errors on reverting the change

=== Phase UP: Generating unicharset and unichar properties files ===
[Mon Feb 25 17:47:09 UTC 2019] /home/ubuntu/tesseract/src/training/unicharset_extractor --output_unicharset /tmp/ara-2019-02-25.kza/ara.unicharset --norm_mode 2 /tmp/ara-2019-02-25.kza/ara.Amiri.exp0.box
Extracting unicharset from box file /tmp/ara-2019-02-25.kza/ara.Amiri.exp0.box
Word started with a combiner:0x64b
Normalization failed for string 'ًا'
Word started with a combiner:0x64e
Normalization failed for string 'َو'
Word started with a combiner:0x64e
Normalization failed for string 'َر'
Word started with a combiner:0x64e
Normalization failed for string 'َد'
Word started with a combiner:0x650
Normalization failed for string 'ِه'
Word started with a combiner:0x650
Normalization failed for string 'ِة'
Word started with a combiner:0x64e
Word started with a combiner:0x651
Normalization failed for string 'َّي'
Word started with a combiner:0x650
Normalization failed for string 'ِم'
Word started with a combiner:0x652
Normalization failed for string 'ْل'
Word started with a combiner:0x650
Word started with a combiner:0x651
Normalization failed for string 'ِّس'
Word started with a combiner:0x650
Normalization failed for string 'ِة'
Word started with a combiner:0x64e
Word started with a combiner:0x651
Normalization failed for string 'َّي'
Word started with a combiner:0x650
Normalization failed for string 'ِو'
Word started with a combiner:0x64e

amitdo · 2019-02-25T18:01:13Z

https://www.compart.com/en/unicode/U+064B
https://www.compart.com/en/unicode/U+0650

Shreeshrii · 2019-02-25T18:01:59Z

At iteration 75/100/101, Mean rms=1.173%, delta=2.748%, char train=11.154%, word train=18.172%, skip ratio=1%, New best char error = 11.154 wrote best model:./ara-Amiri-Revert-from-Arabic/ara-Amiri-Revert11.154_75.checkpoint wrote checkpoint.

Encoding of string failed! Failure bytes: d9 8b d8 a7 d8 af d9 83 d8 a4 d9 85 db 94 d8 a9 d9 8a d8 b1 d8 a7 d8 af d8 a7 d9 84 d8 a7 20 d8 a9 d8 a6 d9 8a d9 87 d9 84 d8 a7 20 d8 a1 d8 a7 d8 b6 d8 b9 d8 a3 d8 a8 20 d8 a9 d9 82 d8 ab d9 84 d8 a7 20 d8 af d9 8a d8 af d8 ac d8 aa
Can't encode transcription: 'نا ًادكؤم۔ةيرادالا ةئيهلا ءاضعأب ةقثلا ديدجت' in language ''
2 Percent improvement time=147, best error was 100 @ 0
At iteration 147/200/202, Mean rms=1.14%, delta=2.553%, char train=9.305%, word train=17.24%, skip ratio=1%, New best char error = 9.305 Transitioned to stage 1 wrote best model:./ara-Amiri-Revert-from-Arabic/ara-Amiri-Revert9.305_147.checkpoint wrote checkpoint.

Encoding of string failed! Failure bytes: d9 92 d9 86 d9 90 d9 85 20 db 94 db 8c da be d8 aa 20 d9 84 d8 b2 d8 ba 20 d9 88 d8 ac 20 d8 a7 da af d9 88 db 81 20 db 94 da ba db 8c d8 b1 da a9 20 db 94 db 92 db 81 20 d8 b4 db 8c d9 be 20 29 d9 a3 20 d9 86 d8 b3 d8 ad 20 22 20 d9 90 d8 b2 d9 85 d8 b1 20 d8 9b 29 20 d8 b1 da af d9 85 20 d8 aa d8 b1 d8 a7 da be d8 a8 d8 8c d9 86 d8 a7 d8 aa d8 b3 da a9 d8 a7 d9 be 20 da be d8 aa d8 a7 d8 b3 20 d8 8f
Can't encode transcription: 'ْنِم ۔یھت لزغ وج اگوہ ۔ںیرک ۔ےہ شیپ )٣ نسح " ِزمر ؛) رگم تراھب،ناتسکاپ ھتاس ؏' in language ''
Encoding of string failed! Failure bytes: d9 8f d9 8a 20 d9 87 d8 a6 d8 a7 d9 85 d8 aa d9 86 d8 a7 20 d8 ac d9 86 d8 a8 d9 84 d8 a7 20 d9 88 d9 87 d9 88 20 d8 b3 20 d9 87 20 d9 86 d8 a7 20 d8 a8 20 d8 b1 d9 83 d8 b0 d9 8a 20 d9 88 20 d8 a7 d9 87 20 d9 84 20 d8 a9 d8 b9 d8 a8 d8 a7 d8 aa d9 84 d8 a7 20 d8 b1 d8 a6 d8 a7 d9 88 d8 af d9 84 d8 a7 20 d9 88
Can't encode transcription: 'ةريزو دعاسم بتاورلا يلع و قوقح ه و مهتابلطب مدقتلا ةطشنلا ةلقعلا بحُي هئامتنا جنبلا وهو س ه نا ب ركذي و اه ل ةعباتلا رئاودلا و' in language ''
Encoding of string failed! Failure bytes: d9 8e d8 b9 d9 90 d8 a8 d9 92 d8 b1 d8 a3 d9 84 d8 a7
Can't encode transcription: '¡¡ عوضوملا ٤ نأ نكمي ةايحلا ؛ريغ ةصاخلا يصخشلا نأ ”۔ليلق دعب عامتجا اندنع“ءاَعِبْرألا' in language ''
Encoding of string failed! Failure bytes: d9 8b d8 a7 d8 a6 d9 8a d8 b4
Can't encode transcription: 'باحصأل ةبلغلاو ةوقلاب عداخلا وهزلا نم ًائيش' in language ''
2 Percent improvement time=141, best error was 11.154 @ 75
At iteration 216/300/306, Mean rms=1.099%, delta=2.449%, char train=8.069%, word train=15.997%, skip ratio=2%, New best char error = 8.069 wrote best model:./ara-Amiri-Revert-from-Arabic/ara-Amiri-Revert8.069_216.checkpoint wrote checkpoint.

Shreeshrii · 2019-02-25T18:04:36Z

U+064B	ً	d9 8b

Word started with a combiner:0x64b
Normalization failed for string 'ًا'

Encoding of string failed! Failure bytes: d9 8b d8 a7 d8 af d9 83 d8 a4 d9 85 db 94 d8 a9 d9 8a d8 b1 d8 a7 d8 af d8 a7 d9 84 d8 a7 20 d8 a9 d8 a6 d9 8a d9 87 d9 84 d8 a7 20 d8 a1 d8 a7 d8 b6 d8 b9 d8 a3 d8 a8 20 d8 a9 d9 82 d8 ab d9 84 d8 a7 20 d8 af d9 8a d8 af d8 ac d8 aa
Can't encode transcription: 'نا ًادكؤم۔ةيرادالا ةئيهلا ءاضعأب ةقثلا ديدجت' in language ''
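
To read "Failure bytes" dumps like the one above, the UTF-8 hex can be decoded back into code points. The following is a small sketch using only Python built-ins; the helper name `code_points` is an assumption, and the input is the first six characters of the byte dump quoted above.

```python
def code_points(hex_bytes: str) -> list:
    """Decode space-separated UTF-8 hex bytes into U+XXXX labels."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    return [f"U+{ord(ch):04X}" for ch in text]


# First bytes of the failure reported above.
failure_prefix = "d9 8b d8 a7 d8 af d9 83 d8 a4 d9 85"
print(code_points(failure_prefix))
# ['U+064B', 'U+0627', 'U+062F', 'U+0643', 'U+0624', 'U+0645']
```

The first code point is U+064B, the same combiner named in the "Word started with a combiner:0x64b" message.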

amitdo · 2019-02-25T18:05:52Z

I missed this part:

if (char_type == U_NON_SPACING_MARK || char_type == U_ENCLOSING_MARK ||
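
A rough Python approximation of that condition, for checking which code points it catches. This is a sketch under assumptions, not the Tesseract implementation: `unicodedata` general categories Mn, Me and Mc are used as stand-ins for ICU's `U_NON_SPACING_MARK`, `U_ENCLOSING_MARK` and `U_COMBINING_SPACING_MARK`, and the virama, Vedic accent and Javanese special cases from the C++ are left out.

```python
import unicodedata

ZWNJ, ZWJ = 0x200C, 0x200D

# General categories standing in for ICU's U_NON_SPACING_MARK,
# U_ENCLOSING_MARK and U_COMBINING_SPACING_MARK.
MARK_CATEGORIES = {"Mn", "Me", "Mc"}


def is_combiner(cp: int) -> bool:
    """Approximate the kCombiner test (virama and special cases omitted)."""
    return cp in (ZWNJ, ZWJ) or unicodedata.category(chr(cp)) in MARK_CATEGORIES


# Code points from the "Word started with a combiner" messages in this thread.
for cp in (0x64B, 0x64E, 0x650, 0x651, 0x652):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch), is_combiner(cp))
```

All five Arabic marks have general category Mn, so they match the non-spacing-mark branch even though they are neither ZWJ, ZWNJ, virama nor spacing marks.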

amitdo · 2019-02-25T18:13:03Z

> Bottom line, it seems to me that the commented code does not affect Arabic or any other RTL language. Therefore, I suggest undoing the first part.

I was wrong :-)

Shreeshrii · 2019-02-25T18:18:12Z

This was the easy part :-)
Now I need to find out why numbers are not handled correctly in the middle of sentences.
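
One way to picture the numeral problem is a toy model in Python. It is only an illustration under assumptions, not the code changed in this PR: it assumes a line is reversed character by character, and it treats characters whose Unicode bidirectional class is AN (Arabic number) or EN (European number) as digit runs that should keep their left-to-right order. The helper names are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_number_char(ch: str) -> bool:
    """True for characters with bidi class AN or EN."""
    return unicodedata.bidirectional(ch) in ("AN", "EN")


def naive_reverse(text: str) -> str:
    """Reverse every character, digits included."""
    return text[::-1]


def reverse_keeping_digit_runs(text: str) -> str:
    """Reverse the line but keep each run of digits in its original order."""
    runs = ["".join(group) for _, group in groupby(text, key=is_number_char)]
    return "".join(
        run if is_number_char(run[0]) else run[::-1] for run in reversed(runs)
    )


# "page 123 of" in Arabic, in logical order, with Arabic-Indic digits.
text = "\u0635\u0641\u062d\u0629 \u0661\u0662\u0663 \u0645\u0646"

print(ascii(naive_reverse(text)))               # digits come out as 3, 2, 1
print(ascii(reverse_keeping_digit_runs(text)))  # digits stay as 1, 2, 3
```

The naive reversal flips the digits along with the letters, while the second function reverses the order of runs and the letters inside them but leaves each digit run intact.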

amitdo · 2019-02-25T18:18:56Z

https://www.compart.com/en/unicode/category/Mn
https://www.fileformat.info/info/unicode/category/Me/list.htm

Shreeshrii · 2019-02-25T18:32:50Z

Thanks, @amitdo. These are useful reference links.

Shreeshrii added 2 commits February 25, 2019 12:01

delete check of restrictive normalization rule

ceb0b2e

fix reversal of numerals in Arabic script

846b87d

Shreeshrii mentioned this pull request Feb 25, 2019

Numbers in Arabic script are getting reversed #2263

Closed

Shreeshrii changed the title ~~Fix reversal of Arabic numerals~~ Fix reversal of numerals in Arabic script Feb 25, 2019

Shreeshrii changed the title ~~Fix reversal of numerals in Arabic script~~ DO NOT MERGE: Fix reversal of numerals in Arabic script Feb 25, 2019

Shreeshrii closed this Feb 25, 2019

Shreeshrii mentioned this pull request Feb 26, 2019

Treat U_ARABIC_NUMBER as LTR #2270

Merged

Shreeshrii deleted the arabic_numbers branch March 1, 2019 13:19

amitdo added the RTL label Mar 18, 2021
