
Box File disorder, Arabic Language #648

Open
ghost opened this issue Jan 10, 2017 · 77 comments

Comments

@ghost

ghost commented Jan 10, 2017

@theraysmith @amitdo @Shreeshrii

  • Box file disorder
    The Arabic box file generated using Tesseract 4.x is ordered LTR (left to right), which is reversed; Arabic is an RTL (right-to-left) language, so the first box should start from the right side.
    ( Have a look at the wrong, disordered Tesseract 4.0x Arabic box file )
    ara.Traditional_Arabic.exp0.zip

Tesseract 4.0 LSTM puts the spaces between words into boxes, as you know.
This causes a problem: since the boxes are mistakenly ordered LTR (left to right) for Arabic, which is wrong, there are jumps from (the end of the first line) to (the end of the last letter of the line after it).
See the image attached
box disorder

( Now have a look at the attached correctly ordered Arabic example tif/box from Tesseract 3.05. )
Arabic example 1.zip
Example 1, correct box order:
right order
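For anyone who needs the 3.0x-style order back, a rough post-processing sketch (hypothetical, not a Tesseract feature) could regroup a 4.x box file into text lines and sort each line right to left, assuming the standard `char x1 y1 x2 y2 page` box format:

```python
# Hypothetical cleanup pass, NOT part of Tesseract: group box entries into
# text lines by vertical overlap, then sort each line right-to-left.
def reorder_boxes_rtl(box_lines):
    entries = []
    for line in box_lines:
        parts = line.split()
        if len(parts) == 6:  # char x1 y1 x2 y2 page
            ch, x1, y1, x2, y2, page = parts
            entries.append((ch, int(x1), int(y1), int(x2), int(y2), int(page)))

    # Group into text lines: boxes whose vertical ranges overlap the row's
    # first box are assumed to share a line (crude, but enough for a sketch).
    rows = []
    for e in sorted(entries, key=lambda e: -e[2]):  # top lines first (y grows upward)
        for row in rows:
            if e[2] < row[0][4] and e[4] > row[0][2]:
                row.append(e)
                break
        else:
            rows.append([e])

    out = []
    for row in rows:
        row.sort(key=lambda e: -e[3])  # rightmost box first = RTL reading order
        out.extend(f"{ch} {x1} {y1} {x2} {y2} {p}"
                   for ch, x1, y1, x2, y2, p in row)
    return out
```

This ignores diacritic boxes and bidirectional runs of embedded Latin text, which a real tool would have to handle.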

@Shreeshrii
Collaborator

While tesstrain.sh takes RTL languages into account while creating the DAWG files, the text2image process does not seem to have any specific RTL processing.

@ghost
Author

ghost commented Jan 11, 2017

Will there be modifications to text2image?

@amitdo
Collaborator

amitdo commented Jan 11, 2017

What you show here is 'by design'. This should not cause any problem in the training process or in character recognition for RTL languages.

@theraysmith
Contributor

theraysmith commented Jan 11, 2017 via email

@Shreeshrii
Collaborator

Ray,
There seems to be a bug. I have tried training a couple of times down to a 2-3% character error rate, but the OCRed text seems to be way off. During training, it seems that the diacritics are being recognized well, e.g.

Iteration 1702: ALIGNED TRUTH : انَدبْعَ ىلَعَ انَلْنَ امَّمِ بٍيْرَ يفِ مْتُنْكُ نْإِوَ نَومُلَعْتَ مْتُنْأوَ ادًادَنْأ هِللِ اولُعَجْتَ الَفَ مْكُل
Iteration 1702: BEST OCR TEXT : انَيبْعَ ىلَعَ انَلْنَ اهَمِ بِيْرَ يفِ مْتُنَ نْإوَ نَوهُلَمْتَ مْتُأوَ اذَادَنأ هِلَلِ اولُعَجْتَ الَفَ مْكُلَ

But the OCRed text does not seem to include any diacritics.

ara.Arial_Unicode_MS.exp0.txt
ara.Calibri.exp0.txt
ara.Arial.exp0.txt

@Christophered and @bmwmy can provide Arabic specific details.

@Shreeshrii
Collaborator

Shreeshrii commented Jan 11, 2017

https://github.com/tesseract-ocr/tesseract/files/696122/ara.TRAINING.zip has the box tif pairs for the above training.

https://github.com/tesseract-ocr/tesseract/files/696184/traininglog-mid.txt shows the debug messages from during training.

@amitdo
Collaborator

amitdo commented Jan 11, 2017

I wonder if the bidi integration is working correctly for LSTM, as the accuracy with Arabic is unsatisfactory.

Ray,

According to your tests, how does Hebrew (another RTL language) perform?

Do you have accuracy reports for various languages that you can share with us, other than the one in the DAS2016 slides?

@Shreeshrii
Collaborator

@theraysmith Hope you have seen the comments by Chris on the other thread as well - #642

I was merging the letter extender with the Arabic letter into one single box, and putting the Arabic letter as the character of the box. Basically, I was trying to train the engine to recognize an Arabic letter in its multiple positions; as you know, Arabic letters have multiple forms based on their position in the word ( beginning, middle, ending, isolated )
Example:
( كـ ) is not ( ك + ـ ) in the box file, it should be ( ك )
also ( ـكـ ) and ( ـك ) are the same single character ( ك ) in different positions; this is important in the box file.

Which also means that ( كَـ ) is not ( ك + ـَ ), it is ( كَ )

@theraysmith
Contributor

theraysmith commented Jan 11, 2017 via email

@theraysmith
Contributor

@amitdo Hebrew seems to be OK. It is certainly ahead of 3.05.
I have some detailed results, but they aren't very meaningful without being able to look at the actual errors to see how many are actually due to the ground truth, or some strange disagreement on whitespace.
The gist is that 4.00 is less good than it could be on:
Arabic (all langs)
Indic (all langs)
Chinese, Japanese.
The problems with Arabic may be explained by this thread.
Chinese and Japanese are troubled by the use of radical-stroke encoding. I need to switch to a better scheme.
Indic may be troubled by the length of the compressed codes used. I need to switch to a better scheme.

@Shreeshrii
Collaborator

@theraysmith

  • The diacritics are currently excluded from the unicharset, probably
    because they are only rarely used, but need to be included. There may not
    be enough text with them included in the text corpora.

Ray,
Please see #552 and tesseract-ocr/langdata#35

Arabic Diacritics are included in the Arabic.unicharset.

@bmwmy had offered to provide additional training text with diacritics - see #552 (comment)

I was able to get the diacritics recognized during training by adding the following line to ara.config; however, for some fonts it seems to treat the diacritics as a separate line. I don't know whether this is related to the x-height of the fonts.

#Diacritics
textord_min_linesize 2.5
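If editing ara.config is inconvenient, the same variable can also be passed at run time with Tesseract's standard `-c var=value` syntax (the value 2.5 is just the one tried above, not an official recommendation; image and output names are placeholders):

```shell
# Pass the layout-analysis hint on the command line instead of via ara.config.
tesseract image.tif output -l ara -c textord_min_linesize=2.5
```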

Related question: for LSTM training, I am using

--noextract_font_properties

while creating the box/tiff pairs and lstmf files, since @amitdo mentioned that font_properties are not needed for LSTM training. Please confirm.

@Shreeshrii
Collaborator

Shreeshrii commented Jan 12, 2017

Please also see #318 (comment) and other comments regarding unicharset.

Are the glyph metrics updated based on the fonts used for training?
Are glyph metrics used for LSTM training?

Answer:

No. Most of the unicharset fields are irrelevant to LSTM training and recognition.
The mirror and normalized string fields ARE important though.

@bmwmy

bmwmy commented Jan 12, 2017

I examined @Shreeshrii's training set and I appreciate the effort, but the text in the generated TIFF image files seems much smaller than it should be. It is hard to read even for humans. The vowel diacritics look like noise, as do some letter glyphs such as ( فـ / ـمـ ). I suggest using a 16-22pt font.
Also this:
Iteration 1702: ALIGNED TRUTH : انَدبْعَ ىلَعَ انَلْنَ امَّمِ بٍيْرَ يفِ مْتُنْكُ نْإِوَ نَومُلَعْتَ مْتُنْأوَ ادًادَنْأ هِللِ اولُعَجْتَ الَفَ مْكُل
Iteration 1702: BEST OCR TEXT : انَيبْعَ ىلَعَ انَلْنَ اهَمِ بِيْرَ يفِ مْتُنَ نْإوَ نَوهُلَمْتَ مْتُأوَ اذَادَنأ هِلَلِ اولُعَجْتَ الَفَ مْكُلَ
are in reversed order (the RTL issue)

a comparison between the noisy text example used in training and a good one:
testdia

I think using bigger text in the input images will result in a very large improvement. I am satisfied with this result considering the noisy input, but the RTL issue should be solved.

@Christophered's ara.Traditional_Arabic.exp0.zip was a good input image file,
but @Shreeshrii's https://github.com/tesseract-ocr/tesseract/files/696122/ara.TRAINING.zip is a noisy input image.

@amitdo
Collaborator

amitdo commented Jan 12, 2017

About --noextract_font_properties: Ray confirmed it here:
#634 (comment)

@amitdo
Collaborator

amitdo commented Jan 12, 2017

Are glyph metrics used for LSTM training?

I believe the answer is 'No'.

@theraysmith, can you confirm that?

@Shreeshrii
Collaborator

Shreeshrii commented Jan 12, 2017

@amitdo I have been training using --noextract_font_properties since you brought it to my notice.

However, I am wondering whether some type of fontinfo / xheights is still required for LSTM training.

e.g., in order to avoid the diacritics being discarded as noise, I had to add textord_min_linesize 2.6 in ara.config. But different fonts have different sizes, even at the same point size. I played with different values, but couldn't find one that works across multiple fonts.

So, I am trying the training again with only one font, Traditional Arabic font at 32 point, as suggested by @bmwmy.

Ray might have a different solution - will wait till the changes in wiki for training are updated.

@amitdo
Collaborator

amitdo commented Jan 12, 2017

textord_min_linesize is a hint for the layout analysis step in Tesseract.

If the layout analysis step does not 'cut' the lines properly, the next step, recognition of the lines' text, will suffer.

@amitdo
Collaborator

amitdo commented Jan 12, 2017

Tesseract release notes July 11 2015 - V3.04.00

Major change to improve layout analysis for heavily diacritic languages: Thai, Vietnamese, Kannada, Telugu etc.

From DAS2016 slide 5 - 'Page Layout Analysis':

Tesseract's existing text-line finding is also weak wrt diacritics,
especially for Arabic and Thai.

@Shreeshrii
Collaborator

Shreeshrii commented Jan 12, 2017

There is Bidi processing inside the post-recognition processing of
Tesseract that reprocesses/re-orders the text for output, so it appears in
the correct order.

Iteration 1702: ALIGNED TRUTH : انَدبْعَ ىلَعَ انَلْنَ امَّمِ بٍيْرَ يفِ مْتُنْكُ نْإِوَ نَومُلَعْتَ مْتُنْأوَ ادًادَنْأ هِللِ اولُعَجْتَ الَفَ مْكُل
Iteration 1702: BEST OCR TEXT : انَيبْعَ ىلَعَ انَلْنَ اهَمِ بِيْرَ يفِ مْتُنَ نْإوَ نَوهُلَمْتَ مْتُأوَ اذَادَنأ هِلَلِ اولُعَجْتَ الَفَ مْكُلَ
are reversed order (RTL issue)

@theraysmith

Ray, can the bidi post-processing be applied, as an experiment, to these debug messages? Then we can very easily see whether it is working.

Answer: No. That would be very difficult. They are intended to be displayed completely without any RTL smarts.
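Whether a debug string is simply the logical text in reversed (visual) order can still be checked offline with a one-liner. This is illustrative only; combining diacritics make a plain string reversal imperfect, since a mark should stay attached to its base letter:

```python
# Illustrative check: if the trainer's debug strings are in visual
# (left-to-right) order, reversing each line should recover logical order.
def visual_to_logical(line: str) -> str:
    return line[::-1]

debug = "مالس"  # visual-order rendering of the logical word "سلام"
assert visual_to_logical(debug) == "سلام"
```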

@amitdo
Collaborator

amitdo commented Jan 12, 2017

Shree, you might want to use this text2image option with Arabic:
--leading Inter-line space (in pixels) (type:int default:12)

As a minimum it should equal the point size. For Arabic, you can try increasing it (20-50 percent bigger than ptsize).
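A hypothetical invocation following this advice (font, text, and output names are placeholders; --leading is raised to roughly 1.5x the 14pt size so stacked diacritics do not collide with the neighbouring line):

```shell
# Illustrative only: render Arabic training text with extra inter-line space.
text2image --text=ara.training_text --outputbase=ara.Arial.exp0 \
  --font="Arial" --ptsize=14 --leading=21
```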

@amitdo
Collaborator

amitdo commented Jan 12, 2017

IMO, a 32pt size is too big. Try 14 or 16.

@Shreeshrii
Collaborator

Shreeshrii commented Jan 12, 2017

Arabic also has presentation forms i.e. spacing forms of Arabic diacritics, and contextual letter forms.

Please see
http://www.alanwood.net/unicode/arabic_presentation_forms_a.html
http://www.alanwood.net/unicode/arabic_presentation_forms_b.html
and
https://github.com/w3c/alreq/wiki/Should-I-use-the-Arabic-Presentation-Forms-provided-in-Unicode%3F

Though it is not recommended to use these for content under newer versions of Unicode, I am wondering whether OCR would be easier if they were used...

There could be a post-processing step to convert them to regular Unicode code points later.
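That post-processing step can be done with standard Unicode normalization, since the presentation forms carry compatibility decompositions back to the regular code points. A sketch (note that NFKC also changes many other compatibility characters, so a targeted mapping may be safer in practice):

```python
import unicodedata

# U+FEB3 is ARABIC LETTER SEEN INITIAL FORM; NFKC folds it back to the
# generic letter U+0633 (seen). The other presentation forms work the same way.
presentation = "\uFEB3"
regular = unicodedata.normalize("NFKC", presentation)
assert regular == "\u0633"
```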

@amitdo
Collaborator

amitdo commented Jan 13, 2017

Are glyph metrics used for LSTM training?

No. Confirmed by Ray here: tesseract-ocr/langdata#31 (comment)

... the glyph metrics aren't used.

@Shreeshrii
Collaborator

Shree, you might want to use this text2image option with Arabic:
--leading Inter-line space (in pixels) (type:int default:12)

As a minimum it should equal to ptsize. For Arabic, you can try to increase it (20-50 percent bigger than ptsize)

The tesstrain.sh process uses the default --ptsize, which is 12.

language_specific.sh sets --leading to 32 by default, and to 48 for Thai fonts.

@ghost
Author

ghost commented Jan 28, 2017

@Shreeshrii here is a sample text for you; please test and post the findings
Arabic sample variation.zip

@theraysmith Most if not all languages related to Arabic (for example Farsi, Urdu, etc.) use such diacritics.
Arabic diacritics are often, but not always, used in Arabic text: sometimes throughout the whole text, and sometimes on one letter in each word; but believe me, the diacritics are frequently used.
Have a look at the Quran.

@ghost
Author

ghost commented Jan 28, 2017

@Shreeshrii your sample box/tif pairs had some errors that I've noticed:

  1. The U+0640 (tatweel) issue
    example: بِسمِ
    wrong: ب ـِ س ـِ م
    correct: ب س م (only 3 letters: بسم)
    @theraysmith got it right: Tesseract should have the capability to render it in the .tif, but not consider it a single character in the .box file. The correct thing would be for the box of U+0640 (tatweel) to be combined with the box of the adjacent letter, while setting the box's character to the single letter, never even mentioning U+0640 (tatweel) in the .box file, ever.
    U+0640 (tatweel) is a special case; people don't usually use it while writing text, so either don't use it, or merge the boxes and remove it.

  2. As @bmwmy mentioned earlier, in your .tif the characters of the text are separated; that's wrong.
    wrong: بـ ـسـ ـم
    correct: بسم

I tracked the problem down, and the cause was the txt file you were using: it contained the first U+0640 (tatweel) mistake that I mentioned earlier.
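The merge described in point 1 could be sketched as a hypothetical cleanup pass over parsed box entries (assuming `(char, x1, y1, x2, y2, page)` tuples in file order; this is not an existing Tesseract tool):

```python
TATWEEL = "\u0640"

def merge_tatweel(entries):
    """entries: list of (char, x1, y1, x2, y2, page) tuples in file order.
    Returns a new list with every tatweel box absorbed into the previous box."""
    out = []
    for ch, x1, y1, x2, y2, page in entries:
        if ch == TATWEEL and out:
            pch, px1, py1, px2, py2, pp = out[-1]
            # Union of the two rectangles; keep the letter as the box character.
            out[-1] = (pch, min(px1, x1), min(py1, y1),
                       max(px2, x2), max(py2, y2), pp)
        else:
            out.append((ch, x1, y1, x2, y2, page))
    return out
```

Merging into the preceding entry matches the "earlier glyph" behaviour discussed later in this thread; a tatweel at the start of a run would need extra handling.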

@Shreeshrii
Collaborator

Shreeshrii commented Jan 28, 2017 via email

@ghost
Author

ghost commented Jan 28, 2017

@theraysmith I suggest adding the capability to convert Tesseract 3.0x box files to Tesseract 4.0x format, since many of us have tif/box files based on the older 3.0x version.

@theraysmith Also, Ubuntu has released the Snaps project, giving the ability to distribute software as a universal Linux package. Would it be possible to release a Snap version of Tesseract 4.0x? This would save us a lot of time and effort by skipping the build process and its issues.
https://www.ubuntu.com/desktop/snappy
http://snapcraft.io/

@ghost
Author

ghost commented Jul 30, 2017

Question
@theraysmith I understand that for training, Tesseract 4.x reorders Arabic text into Tesseract's reading order, meaning it converts RTL to LTR and then normalizes it.
For normalization, which form does it use, NFD or NFC?
Example:
000000
GDT: آمنا بالله إن شئتم الآخرة هم بمؤمنين يا أيها
NFD: اهئا اي نينمٔومب مه ةرخٓالا متٔيش نٕا هللاب انمٓا
NFC: اهيأ اي نينمؤمب مه ةرخآلا متئش نإ هللاب انمآ

The difference between NFD and NFC is that after the text is reordered to LTR for training, NFD pushes the marks/diacritics to the right side, while in NFC the marks/diacritics remain on the characters themselves.
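The NFD/NFC difference can be checked directly with Python's unicodedata module (illustrative; this does not answer which form Tesseract itself uses):

```python
import unicodedata

# U+0622 (alef with madda above) is a precomposed letter: NFD splits it into
# U+0627 (alef) + U+0653 (madda above), and NFC recombines the pair.
composed = "\u0622"
decomposed = unicodedata.normalize("NFD", composed)
assert decomposed == "\u0627\u0653"
assert unicodedata.normalize("NFC", decomposed) == composed
```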

@amitdo
Collaborator

amitdo commented Jul 30, 2017

text2image should have an option to randomly add tatweel every n lines.

https://en.wikipedia.org/wiki/Kashida

Kashida is generally only used in one word per line and applied to one letter per word.

Furthermore, kashida is recommended only between certain combinations of letters (typically those which cannot form a ligature).
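A rough sketch of what such an option could do on the text side (hypothetical; a real implementation would also have to respect Arabic joining classes, since tatweel only makes sense between letters that connect):

```python
import random

TATWEEL = "\u0640"

def add_tatweel_every_n_lines(lines, n, rng=random):
    """Insert one tatweel after a random interior position of a random word
    in every n-th line. Purely illustrative; ignores joining rules."""
    out = []
    for i, line in enumerate(lines):
        if i % n == 0 and line.strip():
            words = line.split(" ")
            w = rng.randrange(len(words))
            if len(words[w]) >= 2:
                pos = rng.randrange(1, len(words[w]))
                words[w] = words[w][:pos] + TATWEEL + words[w][pos:]
            out.append(" ".join(words))
        else:
            out.append(line)
    return out
```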

@ghost ghost closed this as completed Jul 30, 2017
@ghost ghost reopened this Jul 30, 2017
@ghost
Author

ghost commented Jul 30, 2017

@amitdo No, no...! That is very bad.
Tatweel reduces the recognition rate and, theoretically speaking, makes the model unstable and confused.
Ray stated earlier that he has fixed the tatweel problem; he even gave the option to render it if you want to, but it will be removed as a standalone character from the training data.
I think he solved this issue by merging the tatweel box with the preceding glyph's box and removing the tatweel as a character; the final result is one box with the preceding glyph.

@amitdo
Collaborator

amitdo commented Jul 30, 2017

For normalization, which form does it use, NFD or NFC?

It seems it uses NFKC.

@amitdo
Collaborator

amitdo commented Jul 30, 2017

Hi @Christophered, I don't think you understood my meaning.

I re-read what Ray said in this issue.

it needs to be preserved in the training text, so it gets rendered

I understand that tatweel is a rendering artifact that should be rendered for training, but should not occur in the output text (or in the language model).

The tatweel and ligature problem are fixed and will be corrected in the new traineddatas coming soon

So it seems he has already implemented (but not yet pushed) what I just suggested.

@amitdo
Collaborator

amitdo commented Jul 30, 2017

text2image is the program that renders images from ground truth.

text2image should have an option to randomly add tatweel every n lines.

'add' here means 'render'.

@Shreeshrii
Collaborator

3e63918

Please test with the latest source from GitHub. The commit by @theraysmith fixes the issue.

@Shreeshrii
Collaborator

@theraysmith

Thanks for these updates.

Does the complete fix for RTL languages also require new traineddata created with these fixes?

Will you be uploading a new version of traineddata?

@hanikh

hanikh commented Sep 12, 2017 via email

@Shreeshrii
Collaborator

@hanikh You are right, Ray had uploaded the best traineddata files on August 1.

However, I think that the wordlists in the traineddata for RTL languages still had some errors in the order of characters in ligatures.

I am hoping that @theraysmith will upload fixed versions of traineddata to the new repos for LSTM traineddata -

https://github.com/tesseract-ocr/tessdata_best
and
https://github.com/tesseract-ocr/tessdata_fast

@Fahad-Alsaidi

If you are looking for an Arabic diacritization corpus, here is a good one

@AbdelsalamHaa

Hi guys, I'm using Tesseract 4.00 and I wrote my program to recognize English; it worked very well. Now I'm trying to include Arabic in my program, but it gives very weird characters even though I used ara.traineddata instead of eng.traineddata.

@Shreeshrii
Collaborator

Shreeshrii commented Apr 30, 2018 via email

@amitdo
Collaborator

amitdo commented Apr 30, 2018

@AbdelsalamHaa

@amitdo
sorry, I thought each one was different; my apologies.

@Shreeshrii
this is the result when I use both the fast ara.traineddata and the best ara.traineddata; both give the same output

image

the image I'm trying to read is this one
image

however, in the prompt window the result is also different from the one before printing
image

is it because of the language I use for my laptop, or does that have nothing to do with this?

@Shreeshrii
Collaborator

Seems like a locale issue. Output to a file and then open it in a Unicode text editor.

@Shreeshrii
Collaborator

You may need to preprocess the image for better result.

648-arabic

عبدالسلام حمدي عبدالعزيز

tesseract 648-arabic.png - -l ara --tessdata-dir ./tessdata_best
عبدالسلام حمدي عبدالعزيز

@AbdelsalamHaa

AbdelsalamHaa commented May 2, 2018

@Shreeshrii
I found out why they are different: my computer uses Chinese to encode any non-Unicode text, both in the Tesseract output and in the prompt window.

But the result is still the same.

I downloaded ara.traineddata from here:

https://github.com/tesseract-ocr/tessdata_fast
and also tried from here, but with the same results:
https://github.com/tesseract-ocr/tessdata_best

and I used this website to check the Unicode as you mentioned:
https://r12a.github.io/app-conversion/

I think I found the problem, but I'm not sure how to solve it.
The first line is the UTF-8 code for "عبدالسلام حمدي عبدالعزيز";
when I convert it by pressing "Hex code points", I get the same string that I got when using Tesseract.
image

but I'm still not sure how to solve the problem
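What the screenshots show is classic mojibake: correct UTF-8 bytes being reinterpreted in the console's legacy code page. A minimal reproduction (GBK stands in here for the Chinese system code page mentioned above):

```python
text = "عبدالسلام"
raw = text.encode("utf-8")  # the bytes Tesseract actually returns

# Decoded as UTF-8 the text round-trips; decoded as a Chinese legacy
# code page the same bytes turn into unrelated characters.
assert raw.decode("utf-8") == text
garbled = raw.decode("gbk", errors="replace")
assert garbled != text
```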

@AbdelsalamHaa

image
this is the part of my code for Tesseract

I also used the image you just sent to test, but still got the same weird characters

@AbdelsalamHaa

image

here is where I initialized all the .traineddata files

@AbdelsalamHaa

To state my problem more clearly: Tesseract reads the image correctly. It returns the correct UTF-8 bytes, but when those bytes are treated as hex code points the characters come out wrong.
I hope someone can help me with this.

@AbdelsalamHaa

AbdelsalamHaa commented May 3, 2018

Okay, I finally found the problem. Well, I fixed a few things, so I'm not sure which one exactly was the mistake, but I think the main one is:
in the Advanced save options, I changed it to be like this
image

I tested it first using this simple code; even though it's not Tesseract's fault, it might be useful for others:

```cpp
// Note: the header names after #include were stripped by the page rendering;
// <iostream>, <fstream>, and <string> are restored here since the code needs them.
#include <stdio.h>
#include <windows.h>
#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int main() {
    ofstream writer("file3.txt");

    // Set console code page to UTF-8 so the console knows how to interpret string data
    //SetConsoleOutputCP(CP_UTF8);

    // Enable buffering to prevent VS from chopping up UTF-8 byte sequences
    //setvbuf(stdout, nullptr, _IOFBF, 1000);

    string test1 = "عبدالسلام حمدي عبدالعزيز\n";

    cout << test1 << std::endl;

    writer << "\t na  " << test1.c_str() << endl;

    getchar();
    return 0;
}
```
Please note that the result is only correct when you print to a file, not in the prompt window, not even in the Watch window of Visual Studio.

@mhdsRahnama

Hello,
I have the tatweel (kashida) problem in Persian too. Can I use text2image to fix it?

9 participants