DO NOT MERGE: Fix reversal of numerals in Arabic script #2266

Closed
wants to merge 2 commits into from

Shreeshrii
Collaborator

Fixes issue #2263.

Also addresses related issues:

  • Add Indic numerals and missing punctuation to Arabic (tesseract-ocr/langdata#131)
  • Arabic training data has room for improvement (#2047)

See https://github.com/Shreeshrii/tessdata_arabic for the fine-tuned traineddata file, the test image, and the OCR output.

The OCR output looks correct when viewed in Notepad++ in RTL view. However, I am not able to paste it into GitHub correctly.

Shreeshrii changed the title from "Fix reversal of Arabic numerals" to "Fix reversal of numerals in Arabic script" on Feb 25, 2019
@amitdo
Collaborator

amitdo commented Feb 25, 2019

> delete check of restrictive normalization rule

Why? Can you give an example where this rule is wrong?

> The OCR output looks correct when seen in notepad++ in RTL view. However I am not able to copy it in github.

What do you mean?

If the problem is that it looks different when you paste it to github, then you can overcome it like this:

<p dir=rtl>
שלום עולם!
</p>

=>

שלום עולם!

@Shreeshrii
Collaborator Author

> > delete check of restrictive normalization rule
>
> Why? Can you give an example where this rule is wrong?

E.g., for RTL languages the processing is done in LTR order, so a word-final combining mark is seen first and triggers the error that the word begins with a combining mark.

@Shreeshrii
Collaborator Author

Shreeshrii commented Feb 25, 2019

> If the problem is that it looks different when you paste it to github, then you can overcome it like this:

I tried with <div dir=rtl>, but the output loses its line feeds and becomes a single paragraph.

When pasted into GitHub, the text also shows little red marks around the numbers in Arabic script, probably some kind of directional control characters. No such marks are visible in Notepad++.

@amitdo
Collaborator

amitdo commented Feb 25, 2019

Please save as .txt and upload.

@Shreeshrii
Collaborator Author

It is already there as .txt at https://github.com/Shreeshrii/tessdata_arabic/blob/master/Arabic-TOC-ara-Amiri.txt

I was trying to format it with a div on GitHub.

@amitdo
Collaborator

amitdo commented Feb 25, 2019

> a final combining mark

Is it legal for a combining mark to be the last character in a word?

@Shreeshrii
Collaborator Author

I do not know enough about Arabic to answer that, but in Indic languages combining marks for dependent vowels are quite common as the last character in a word.

@Shreeshrii
Collaborator Author

Shreeshrii commented Feb 25, 2019

I have tested only with the TOC image, since it was easy to verify. I still need to check cases where the numbers are in the middle of a sentence.

@amitdo
Collaborator

amitdo commented Feb 25, 2019

  const bool is_combiner =
      cc == CharClass::kCombiner || cc == CharClass::kVirama;
  // Reject easily detected badly formed sequences.
  if (prev_cc == CharClass::kWhitespace && is_combiner) {
    if (report_errors_) tprintf("Word started with a combiner:0x%x\n", ch);
    return false;
  }

Validator::CharClass ValidateGrapheme::UnicodeToCharClass(char32 ch) const {
  if (IsVedicAccent(ch)) return CharClass::kVedicMark;
  // The ZeroWidth[Non]Joiner characters are mapped to kCombiner as they
  // always combine with the previous character.
  if (u_hasBinaryProperty(ch, UCHAR_GRAPHEME_LINK)) return CharClass::kVirama;
  if (u_isUWhiteSpace(ch)) return CharClass::kWhitespace;
  // Workaround for Javanese Aksara's Taling, do not label it as a combiner
  if (ch == 0xa9ba) return CharClass::kConsonant;
  int char_type = u_charType(ch);
  if (char_type == U_NON_SPACING_MARK || char_type == U_ENCLOSING_MARK ||
      char_type == U_COMBINING_SPACING_MARK || ch == kZeroWidthNonJoiner ||
      ch == kZeroWidthJoiner)
    return CharClass::kCombiner;
  return CharClass::kOther;
}

So 'is_combiner' is one of these:

  • ZWJ
  • ZWNJ
  • Virama
  • Combining Spacing Marks

@amitdo
Collaborator

amitdo commented Feb 25, 2019

Bottom line, it seems to me that the commented code does not affect Arabic or any other RTL language. Therefore, I suggest to undo the first part.

Shreeshrii changed the title from "Fix reversal of numerals in Arabic script" to "DO NOT MERGE: Fix reversal of numerals in Arabic script" on Feb 25, 2019
@Shreeshrii
Collaborator Author

> cases where the numbers are in middle of a sentence.

These do not seem to be handled correctly, so this PR should NOT be merged.

> it seems to me that the commented code is not affecting Arabic or any other RTL language. Therefore, I suggest to undo the first part.

Thanks, @amitdo. I will undo and test.

I am closing this PR as it does not handle the reversal in all cases.

Shreeshrii closed this on Feb 25, 2019
@Shreeshrii
Collaborator Author

Errors after reverting the change:

=== Phase UP: Generating unicharset and unichar properties files ===
[Mon Feb 25 17:47:09 UTC 2019] /home/ubuntu/tesseract/src/training/unicharset_extractor --output_unicharset /tmp/ara-2019-02-25.kza/ara.unicharset --norm_mode 2 /tmp/ara-2019-02-25.kza/ara.Amiri.exp0.box
Extracting unicharset from box file /tmp/ara-2019-02-25.kza/ara.Amiri.exp0.box
Word started with a combiner:0x64b
Normalization failed for string 'ًا'
Word started with a combiner:0x64e
Normalization failed for string 'َو'
Word started with a combiner:0x64e
Normalization failed for string 'َر'
Word started with a combiner:0x64e
Normalization failed for string 'َد'
Word started with a combiner:0x650
Normalization failed for string 'ِه'
Word started with a combiner:0x650
Normalization failed for string 'ِة'
Word started with a combiner:0x64e
Word started with a combiner:0x651
Normalization failed for string 'َّي'
Word started with a combiner:0x650
Normalization failed for string 'ِم'
Word started with a combiner:0x652
Normalization failed for string 'ْل'
Word started with a combiner:0x650
Word started with a combiner:0x651
Normalization failed for string 'ِّس'
Word started with a combiner:0x650
Normalization failed for string 'ِة'
Word started with a combiner:0x64e
Word started with a combiner:0x651
Normalization failed for string 'َّي'
Word started with a combiner:0x650
Normalization failed for string 'ِو'
Word started with a combiner:0x64e

@amitdo
Collaborator

amitdo commented Feb 25, 2019

@Shreeshrii
Collaborator Author

At iteration 75/100/101, Mean rms=1.173%, delta=2.748%, char train=11.154%, word train=18.172%, skip ratio=1%, New best char error = 11.154 wrote best model:./ara-Amiri-Revert-from-Arabic/ara-Amiri-Revert11.154_75.checkpoint wrote checkpoint.

Encoding of string failed! Failure bytes: d9 8b d8 a7 d8 af d9 83 d8 a4 d9 85 db 94 d8 a9 d9 8a d8 b1 d8 a7 d8 af d8 a7 d9 84 d8 a7 20 d8 a9 d8 a6 d9 8a d9 87 d9 84 d8 a7 20 d8 a1 d8 a7 d8 b6 d8 b9 d8 a3 d8 a8 20 d8 a9 d9 82 d8 ab d9 84 d8 a7 20 d8 af d9 8a d8 af d8 ac d8 aa
Can't encode transcription: 'نا ًادكؤم۔ةيرادالا ةئيهلا ءاضعأب ةقثلا ديدجت' in language ''
2 Percent improvement time=147, best error was 100 @ 0
At iteration 147/200/202, Mean rms=1.14%, delta=2.553%, char train=9.305%, word train=17.24%, skip ratio=1%, New best char error = 9.305 Transitioned to stage 1 wrote best model:./ara-Amiri-Revert-from-Arabic/ara-Amiri-Revert9.305_147.checkpoint wrote checkpoint.

Encoding of string failed! Failure bytes: d9 92 d9 86 d9 90 d9 85 20 db 94 db 8c da be d8 aa 20 d9 84 d8 b2 d8 ba 20 d9 88 d8 ac 20 d8 a7 da af d9 88 db 81 20 db 94 da ba db 8c d8 b1 da a9 20 db 94 db 92 db 81 20 d8 b4 db 8c d9 be 20 29 d9 a3 20 d9 86 d8 b3 d8 ad 20 22 20 d9 90 d8 b2 d9 85 d8 b1 20 d8 9b 29 20 d8 b1 da af d9 85 20 d8 aa d8 b1 d8 a7 da be d8 a8 d8 8c d9 86 d8 a7 d8 aa d8 b3 da a9 d8 a7 d9 be 20 da be d8 aa d8 a7 d8 b3 20 d8 8f
Can't encode transcription: 'ْنِم ۔یھت لزغ وج اگوہ ۔ںیرک ۔ےہ شیپ )٣ نسح " ِزمر ؛) رگم تراھب،ناتسکاپ ھتاس ؏' in language ''
Encoding of string failed! Failure bytes: d9 8f d9 8a 20 d9 87 d8 a6 d8 a7 d9 85 d8 aa d9 86 d8 a7 20 d8 ac d9 86 d8 a8 d9 84 d8 a7 20 d9 88 d9 87 d9 88 20 d8 b3 20 d9 87 20 d9 86 d8 a7 20 d8 a8 20 d8 b1 d9 83 d8 b0 d9 8a 20 d9 88 20 d8 a7 d9 87 20 d9 84 20 d8 a9 d8 b9 d8 a8 d8 a7 d8 aa d9 84 d8 a7 20 d8 b1 d8 a6 d8 a7 d9 88 d8 af d9 84 d8 a7 20 d9 88
Can't encode transcription: 'ةريزو دعاسم بتاورلا يلع و قوقح ه و مهتابلطب مدقتلا ةطشنلا ةلقعلا بحُي هئامتنا جنبلا وهو س ه نا ب ركذي و اه ل ةعباتلا رئاودلا و' in language ''
Encoding of string failed! Failure bytes: d9 8e d8 b9 d9 90 d8 a8 d9 92 d8 b1 d8 a3 d9 84 d8 a7
Can't encode transcription: '¡¡ عوضوملا ٤ نأ نكمي ةايحلا ؛ريغ ةصاخلا يصخشلا نأ ”۔ليلق دعب عامتجا اندنع“ءاَعِبْرألا' in language ''
Encoding of string failed! Failure bytes: d9 8b d8 a7 d8 a6 d9 8a d8 b4
Can't encode transcription: 'باحصأل ةبلغلاو ةوقلاب عداخلا وهزلا نم ًائيش' in language ''
2 Percent improvement time=141, best error was 11.154 @ 75
At iteration 216/300/306, Mean rms=1.099%, delta=2.449%, char train=8.069%, word train=15.997%, skip ratio=2%, New best char error = 8.069 wrote best model:./ara-Amiri-Revert-from-Arabic/ara-Amiri-Revert8.069_216.checkpoint wrote checkpoint.

@Shreeshrii
Collaborator Author

U+064B ً d9 8b

Word started with a combiner:0x64b
Normalization failed for string 'ًا'

Encoding of string failed! Failure bytes: d9 8b d8 a7 d8 af d9 83 d8 a4 d9 85 db 94 d8 a9 d9 8a d8 b1 d8 a7 d8 af d8 a7 d9 84 d8 a7 20 d8 a9 d8 a6 d9 8a d9 87 d9 84 d8 a7 20 d8 a1 d8 a7 d8 b6 d8 b9 d8 a3 d8 a8 20 d8 a9 d9 82 d8 ab d9 84 d8 a7 20 d8 af d9 8a d8 af d8 ac d8 aa
Can't encode transcription: 'نا ًادكؤم۔ةيرادالا ةئيهلا ءاضعأب ةقثلا ديدجت' in language ''

@amitdo
Collaborator

amitdo commented Feb 25, 2019

I missed this part:

if (char_type == U_NON_SPACING_MARK || char_type == U_ENCLOSING_MARK ||

@amitdo
Collaborator

amitdo commented Feb 25, 2019

> Bottom line, it seems to me that the commented code does not affect Arabic or any other RTL language. Therefore, I suggest to undo the first part.

I was wrong :-)

@Shreeshrii
Collaborator Author

This was the easy part :-)
Now I need to find out why numbers are not handled correctly in the middle of sentences.

@Shreeshrii
Collaborator Author

Thanks, @amitdo. These are useful reference links.

Shreeshrii deleted the arabic_numbers branch on March 1, 2019 13:19
amitdo added the RTL label on Mar 18, 2021