Arabic Language output is reversed #169

ghost · 2015-12-11T01:12:31Z

Hi there,
I have created my own Arabic traineddata, but when I use it the recognized text comes out reversed (in the opposite direction). Note that Arabic and Hebrew are written and read from right to left (RTL).
People keep suggesting Cube for training Arabic, but I don't think anyone really knows how to train with Cube. Yes, I have read the tesseract extra Cube documentation, and it seems that they purposely don't want anyone to use Cube.
How can I make a tesseract traineddata that correctly recognizes RTL languages such as Arabic?
Waiting for your reply.

amitdo · 2015-12-11T09:34:53Z

Hi @Christophered,

How did you generate the tif images for training? Did you use the text2image tool?

Did you use set_unicharset_properties?
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#set_unicharset_properties-new-in-303

From https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#dictionary-data-optional

For right-to-left languages (RTL) use option "-r 1".
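
As a rough sketch of where that option goes: in Tesseract 3.x the "-r 1" flag is passed to the wordlist2dawg tool, ahead of the word list, output DAWG and unicharset arguments. The Python helper below only assembles such a command line; the function name and the ara.* file names are hypothetical and should be replaced with your own files.

```python
import shlex

def wordlist2dawg_command(word_list, dawg_out, unicharset, rtl=False):
    """Build the argument list for Tesseract 3.x's wordlist2dawg tool.

    With rtl=True, '-r 1' is placed before the file arguments, as the
    training wiki advises for right-to-left languages.
    """
    cmd = ["wordlist2dawg"]
    if rtl:
        cmd += ["-r", "1"]
    cmd += [word_list, dawg_out, unicharset]
    return cmd

# Hypothetical file names for an Arabic training run.
cmd = wordlist2dawg_command(
    "ara.words_list", "ara.word-dawg", "ara.unicharset", rtl=True
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess.run once the training tools are installed.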

You can also try to use tesstrain.sh
https://github.com/tesseract-ocr/tesseract/wiki/tesstrain.sh

In general, the right place to ask questions like this is here:
http://groups.google.com/group/tesseract-ocr

See also:
https://groups.google.com/forum/#!topic/tesseract-ocr/HdT8V1nFTtY

ghost · 2015-12-11T17:10:23Z

Thank you for the reply.
I have just used "wordlist2dawg -r 1" as you suggested, and it has solved my "reversed words" problem.
But now I have a new problem: the recognized words are run together, with no spacing between them. Tesseract seems to recognize all the words as a single word.
I need help. Waiting for your reply.

amitdo · 2015-12-13T13:24:37Z

Try using this config file:
https://github.com/tesseract-ocr/langdata/blob/master/ara/ara.config

Remove this line:

tessedit_ocr_engine_mode 1
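
A minimal sketch of that edit in Python, assuming the config file is plain text with one "name value" pair per line. The helper name and the sample text are made up for illustration; the real ara.config contains other settings. As far as I know, in Tesseract 3.x engine mode 1 selects the Cube engine, which a self-trained traineddata without Cube components cannot use.

```python
def drop_engine_mode(config_text):
    """Return config_text without any 'tessedit_ocr_engine_mode' line."""
    kept = [
        line for line in config_text.splitlines()
        if not line.split() or line.split()[0] != "tessedit_ocr_engine_mode"
    ]
    return "\n".join(kept) + "\n"

# Made-up stand-in for a config file; 'example_variable' is a placeholder,
# not a real Tesseract parameter, and the real ara.config contents differ.
sample = "example_variable 0\ntessedit_ocr_engine_mode 1\n"
print(drop_engine_mode(sample), end="")
```

All other lines, including blank ones, are passed through unchanged.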

ghost · 2015-12-13T18:43:55Z

The problem has been solved, thanks to the user amitdo! The solution was:
Use "wordlist2dawg.exe -r 1" when building the dictionaries from "frequent_words_list" and "words_list".
Use "ara.config" after removing the line "tessedit_ocr_engine_mode 1" from it.

This solved both of my problems: the reversed Arabic words and the Arabic words being run together.
Thank you

roozgar · 2016-02-27T09:59:57Z

@Christophered
Did you get any good results from training? My best accuracy is about 40-50% on a 300 dpi scanned document.

ghost · 2016-03-07T09:57:55Z

roozgar, I will run some tests and reply back in a couple of days.

roozgar · 2016-03-07T10:15:23Z

@Christophered
I tried to find official Arabic resources to build a better training file, but had no luck.
So, if you need it, I can help by providing an Arabic word list or some scanned pages.
Just send me an email: roozgar@gmail.com

ghost · 2016-03-07T10:23:40Z

Thank you roozgar, I appreciate it.
I am currently running some tests on Arabic Tesseract, and I am exhausting all the resources available to me to find the best method for Arabic recognition.
Don't worry, I will run some tests and reply back to you (by God).
By the way, please send me the scanned Arabic document that you have been testing the accuracy on.
Waiting for your reply.

roozgar · 2016-03-07T10:25:03Z

@Christophered
sure
please tell me your email address

ghost · 2016-03-07T10:47:13Z

christopher.edward@outlook.com

ghost · 2016-03-18T15:41:01Z

I have tested tesseract 3.02, 3.04 and 3.05dev; all of them failed at Arabic OCR.
Somehow I get the feeling that Arabic was purposely neglected and rejected.

roozgar · 2016-03-26T20:45:20Z

@Christophered do you have any plans to work more on this subject?
The official trained data for Arabic works really well on the 'times' font,
so I think it's possible to get good accuracy on other fonts too!

amitdo · 2016-03-26T21:26:55Z

The official trained data uses the 'Cube' engine. There is no documented way to train 'Cube' with other fonts.

roozgar · 2016-03-26T22:31:39Z

@amitdo Oops! I found this:

https://code.google.com/archive/p/tesseract-ocr-extradocs/wikis/Cube.wiki

It really is undocumented!
But then how did they build the current Arabic file?

amitdo · 2016-03-26T22:44:35Z

I suppose they have a program for that task...

roozgar · 2016-03-26T22:55:14Z

@amitdo who are they? Is there any way to find out who built each trained file?

amitdo · 2016-03-26T23:31:25Z

Developers from Google.

ghost · 2016-08-04T06:19:50Z

@Christophered
Hey Christopher, can you please tell me how you created your own .traineddata file for Arabic, or send me a link to a tutorial that I can follow?
I have been trying to use the Tesseract ara.traineddata file, but for some reason the app that I made with Android Studio gets stuck.

ghost · 2016-08-05T16:57:57Z

Hi @AREEBAKAMIL, I have replied by email. Here is the tutorial that you requested:
https://www.youtube.com/watch?v=vohgRChtRck

Here is also one method to improve the recognition, just for testing:
https://www.youtube.com/watch?v=tLJvHWhX_JA

Please remember that this is for Arabic, where the recognition rate is low to moderate.
Training tesseract for English reaches over 90%, but sadly not for Arabic.

in jtessboxeditor:
Arabic use ara
Urdu use urd
English use eng

Shreeshrii · 2016-09-14T04:34:40Z

@Christophered
How do you take into account the different forms - isolated, initial, medial and final forms of the same letter, during training?

https://tsl620atnaz.wikispaces.com/file/view/arabic.gif/130745645/arabic.gif

ghost · 2016-12-10T20:50:56Z

@Shreeshrii
One of the methods that I use in training is to type the joining stroke (ـ) next to the letter. On my keyboard it is typed with Shift+J (the ت key).
Example:
Isolated: (ك)
Initial: type (ك), then press Shift+J, so the result is (كـ)
Medial: press Shift+J, then type (ك), then press Shift+J again, so the result is (ـكـ)
Final: press Shift+J, then type (ك), so the result is (ـك)
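
The same strings can be generated rather than typed by hand. The sketch below assumes the joining stroke is the Unicode tatweel character (U+0640) and that the training text is rendered by a shaping-aware renderer; the function name is made up for illustration.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, the joining stroke

def contextual_forms(letter):
    """Return isolated/initial/medial/final training strings for a letter.

    Tatweel is added on each side where the letter should join a neighbour,
    which makes a shaping-aware renderer draw that positional form.
    """
    return {
        "isolated": letter,
        "initial": letter + TATWEEL,
        "medial": TATWEEL + letter + TATWEEL,
        "final": TATWEEL + letter,
    }

forms = contextual_forms("\u0643")  # ARABIC LETTER KAF
for name, text in forms.items():
    print(name, text)
```

This only makes sense for letters that join on both sides; letters such as alef, which never join to the following letter, have no initial or medial form.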

harinadhkota · 2017-01-28T10:02:44Z

Hi,
Can you please point me to a version of tesseract-ocr that supports Arabic?
I tried tesseract-ocr 3.02, and it does not support Arabic.
If a newer or older version supports Arabic, please let me know.

If there is a supported version, please send me the tesseract-ocr software and all the supporting configuration files,
or, if a download is available, send the download link to "saimuralikrishna005@gmail.com".
If required, you can also use my business mail id: "saikrishna.yalakala@wissen.com".

ghost · 2017-06-03T17:49:54Z

@amitdo should this work for Hebrew as well? Do I need to create the training data myself (i.e. "frequent_words_list" + "words_list" etc.)?

10x

amitdo · 2017-06-04T11:13:29Z

Hi Uri !

There is a tessdata package for Hebrew.
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

Try to use it before you start training Hebrew.

Also, read this page:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

"should this work for hebrew as well?"

By 'this' you mean

'Tesseract'?
Yes. Tesseract supports Hebrew.
The provided tessdata does not support Hebrew diacritics (nikud).

From https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#dictionary-data-optional
For right-to-left languages (RTL) use option "-r 1".

Yes, it should be used for Hebrew too.

If you have further questions please use the forum (I'm not participating there).

For Hebrew OCR questions / discussions you can also try here:
https://github.com/amitdo/Hebrew-OCR-Discussions

ghost · 2017-06-05T15:59:16Z

@amitdo thank you very much! I will go through your suggestions.
I meant: should the solution for the reversed Arabic output apply to Hebrew as well?
One of the participants mentions reversing the strings in the training data; I wasn't sure if this is something I need to do...?

adinetoiu · 2017-06-19T11:37:12Z

Hi,
I am also having problems with tesseract OCR for Arabic, and I need your help.

Can you please send me a trained data file for Arabic for tesseract 3.0.2?
My email is adinetoiu@yahoo.com.

Thank you in advance,
Adrian

ghost · 2017-06-19T14:13:44Z

@adinetoiu I suggest that you skip Tesseract 3.x for Arabic and use Tesseract 4 instead.
A binary is also available at http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-4.00.00dev.exe

adinetoiu · 2017-06-19T14:17:27Z

Thank you very much!

adinetoiu · 2017-06-19T14:36:54Z

Do you have a sample project or link that uses tesseract 4?

Shreeshrii · 2017-06-19T15:38:16Z

Both gimagereader and vietocr have versions which use tesseract 4. ShreeDevi

ghost · 2017-06-19T16:07:12Z

@adinetoiu I have contacted the developer of jtessboxeditor. He said that it might take time until we see an automated LSTM trainer; until then, you must train manually.
Secondly, the steps and examples are available in the Wiki, along with some test box files.
Note: please edit your replies and leave only your own text; you're adding unnecessary information.
From: chris <notifications@github.com ......... delete that

nsoud · 2017-09-21T23:09:19Z

Hello man, can you please send me the ara.traineddata so I can test it? I don't know how to train tesseract 3.x on iOS to recognize Arabic well.
Also, is tesseract 4 available for iOS? I didn't find an example for iPhone or iPad with tesseract 4. And if there is one, how can I update my old tesseract to the new one? Thank you very much.

ghost · 2018-08-13T14:32:00Z

Hello everyone, I have an issue with the Urdu language data. If any expert here can help me, please mail me.
My email is moen.eqbal@gmail.com

zdenop closed this as completed Dec 14, 2015

lisaied mentioned this issue Feb 6, 2016: Arabic Language output is reversed #212 (Closed)

amitdo mentioned this issue Feb 27, 2016: Arabic language (right to left in writing) stored (left to right) after create PDF Searchable #238 (Open)

amitdo mentioned this issue May 12, 2016: Arabic Language output is reversed tesseract-ocr/tessdata#12 (Closed)

amitdo added the question label May 27, 2016

Shreeshrii mentioned this issue Mar 26, 2019: Add test code for fuzzing #2350 (Merged)

amitdo added the RTL label Sep 25, 2022
