Arabic Language output is reversed #169

Closed
ghost opened this issue Dec 11, 2015 · 33 comments

@ghost

ghost commented Dec 11, 2015

Hi there,
I have created my own Arabic traineddata, but when I use it the recognized text comes out reversed (in the opposite direction). Note that Arabic and Hebrew are written and read from right to left (RTL).
People keep suggesting that I use Cube for training Arabic, but I don't think anyone really knows how to train Cube. Yes, I have read the tesseract extra Cube documentation, and it seems that they purposely don't want anyone to use Cube.
How can I make a Tesseract traineddata that recognizes RTL languages such as Arabic correctly?
Waiting for your reply.

@amitdo
Collaborator

amitdo commented Dec 11, 2015

Hi @Christophered,

How did you generate the tif images for training? Did you use the text2image tool?

Did you use set_unicharset_properties?
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#set_unicharset_properties-new-in-303

From https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#dictionary-data-optional

For right-to-left languages (RTL) use option "-r 1".

You can also try to use tesstrain.sh
https://github.com/tesseract-ocr/tesseract/wiki/tesstrain.sh

In general, the right place to ask questions like this is here:
http://groups.google.com/group/tesseract-ocr

See also:
https://groups.google.com/forum/#!topic/tesseract-ocr/HdT8V1nFTtY

@ghost
Author

ghost commented Dec 11, 2015

Thank you for the reply.
I have just used "wordlist2dawg -r 1" as you suggested, and it has solved my "reversed words" problem.
But now I have a new problem: the recognized words are run together, with no spaces between them. Tesseract seems to recognize all the words as a single word.
I need help; waiting for your reply.

@amitdo
Collaborator

amitdo commented Dec 13, 2015

Try using this config file:
https://github.com/tesseract-ocr/langdata/blob/master/ara/ara.config

Remove this line:

tessedit_ocr_engine_mode 1
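
For readers following along, here is a minimal sketch of the edit described above: dropping the `tessedit_ocr_engine_mode` setting from a config file. In Tesseract 3.x, mode 1 selects the Cube engine, which a self-trained traineddata without Cube components cannot use. The helper name `drop_config_key` and the sample lines are hypothetical; the second sample setting is a placeholder, not the real contents of ara.config.

```python
from typing import Iterable, List


def drop_config_key(lines: Iterable[str],
                    key: str = "tessedit_ocr_engine_mode") -> List[str]:
    """Return the config lines with every setting named `key` removed."""
    kept = []
    for line in lines:
        tokens = line.split()
        # Config lines have the form "name value"; skip the unwanted name.
        if tokens and tokens[0] == key:
            continue
        kept.append(line)
    return kept


# Illustrative input only (placeholder second setting).
sample = [
    "tessedit_ocr_engine_mode 1\n",
    "some_other_setting 0\n",
]
cleaned = drop_config_key(sample)
print("".join(cleaned), end="")
```

To apply it to a real file, read the lines of your local copy of ara.config, pass them through the function, and write the result back before building the traineddata.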

@ghost
Author

ghost commented Dec 13, 2015

The problem has been solved, thanks to the user amitdo! The solution was:
  1. Use "wordlist2dawg.exe -r 1" when building the DAWG files from "frequent_words_list" and "words_list".
  2. Use "ara.config", with the line "tessedit_ocr_engine_mode 1" removed from it.

This solved both of my problems: the reversed Arabic words and the Arabic words being run together.
Thank you
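
The first step of the solution above could be scripted roughly as follows. This is only a sketch: it builds the `wordlist2dawg` command lines and prints them rather than running anything. The argument order (word list, output DAWG, unicharset) follows the Tesseract 3.x training wiki; the file names `ara.freq-dawg`, `ara.word-dawg` and `ara.unicharset` are assumptions based on that wiki's `lang.*` naming, and the helper `wordlist2dawg_cmd` is hypothetical.

```python
import shlex
from typing import List


def wordlist2dawg_cmd(word_list: str, dawg_out: str, unicharset: str,
                      rtl: bool = True) -> List[str]:
    """Build the argument list for one wordlist2dawg call."""
    cmd = ["wordlist2dawg"]
    if rtl:
        # "-r 1" is the option the wiki gives for right-to-left languages.
        cmd += ["-r", "1"]
    cmd += [word_list, dawg_out, unicharset]
    return cmd


# Assumed file names; adjust them to your own training directory.
commands = [
    wordlist2dawg_cmd("frequent_words_list", "ara.freq-dawg", "ara.unicharset"),
    wordlist2dawg_cmd("words_list", "ara.word-dawg", "ara.unicharset"),
]
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each printed line can be pasted into a shell, or the lists can be handed to `subprocess.run` once the paths are correct.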

@roozgar

roozgar commented Feb 27, 2016

@Christophered
Did you get any good results from training? My best accuracy is about 40-50% on a 300 dpi scanned document.

@ghost
Author

ghost commented Mar 7, 2016

roozgar, I will run some tests and reply in a couple of days.

@roozgar

roozgar commented Mar 7, 2016

@Christophered
I tried to find official Arabic resources to build a better training file, but had no luck.
So if you need it, I can help by providing an Arabic word list or some scanned pages.
Just send me an email: roozgar@gmail.com

@ghost
Author

ghost commented Mar 7, 2016

Thank you roozgar, I appreciate it.
I am currently running tests on Arabic Tesseract and exhausting all the resources available to me to find the best method for Arabic recognition.
Don't worry, I will run some tests and reply back to you (by GOD).
By the way, please send me the scanned Arabic document that you have been testing the accuracy on.
Waiting for your reply.

@roozgar

roozgar commented Mar 7, 2016

@Christophered
Sure, please tell me your email address.

@ghost
Author

ghost commented Mar 7, 2016

@ghost
Author

ghost commented Mar 18, 2016

I have tested Tesseract 3.02, 3.04 and 3.05dev; all of them failed at Arabic OCR.
Somehow I get the feeling that the Arabic language was purposely neglected and rejected.

@roozgar

roozgar commented Mar 26, 2016

@Christophered do you have any plans to work more on this subject?
The official trained data for Arabic works really well on the 'Times' font,
so I think it's possible to get good accuracy on other fonts too!

@amitdo
Collaborator

amitdo commented Mar 26, 2016

The official trained data uses the 'Cube' engine. There is no documented way to train 'Cube' with other fonts.

@roozgar

roozgar commented Mar 26, 2016

@amitdo Oops! I found this:

https://code.google.com/archive/p/tesseract-ocr-extradocs/wikis/Cube.wiki

It's really undocumented!
But how did they build the current Arabic file?

@amitdo
Collaborator

amitdo commented Mar 26, 2016

I suppose they have a program for that task...

@roozgar

roozgar commented Mar 26, 2016

@amitdo Who are "they"? Is there any way to find out who built each trained data file?

@amitdo
Collaborator

amitdo commented Mar 26, 2016

Developers from Google.

@ghost
Author

ghost commented Aug 4, 2016

@Christophered
Hey Christopher, can you please tell me how you created your own .traineddata file for Arabic, or send me a link to a tutorial that I can follow?
I have been trying to use the Tesseract ara.traineddata file, but for some reason the app that I made with Android Studio gets stuck.

@ghost
Author

ghost commented Aug 5, 2016

Hi @AREEBAKAMIL, I have replied by email. Here is the tutorial that you requested:
https://www.youtube.com/watch?v=vohgRChtRck

also here is one method to improve the recognition, just for testing
https://www.youtube.com/watch?v=tLJvHWhX_JA

Please remember that this is for the Arabic language, where the recognition rate is low to moderate.
Training Tesseract for English reaches over 90% accuracy, but sadly not for Arabic.

In jTessBoxEditor:
Arabic use ara
Urdu use urd
English use eng

@Shreeshrii
Collaborator

Shreeshrii commented Sep 14, 2016

@Christophered
How do you take into account the different forms - isolated, initial, medial and final forms of the same letter, during training?

https://tsl620atnaz.wikispaces.com/file/view/arabic.gif/130745645/arabic.gif

@ghost
Author

ghost commented Dec 10, 2016

@Shreeshrii
Here is one of the methods that I use in training. Example:
Isolated: (ك)
Initial: type (ك), then press "Shift j or ت"; the result is (كـ)
Medial: press "Shift j or ت", then type (ك), then press "Shift j or ت" again; the result is (ـكـ)
Final: press "Shift j or ت", then type (ك); the result is (ـك)
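
The same keyboard trick can be sketched in code, assuming the "Shift j" key here produces the Arabic tatweel character (U+0640), which forces a neighbouring letter into its connected shape. The function name `contextual_forms` is hypothetical. Letters that never connect to the following letter (alef or dal, for instance) have no real initial or medial form, so the sketch is only meaningful for letters that join on both sides.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida); assumed to be what "Shift j" types


def contextual_forms(letter: str) -> dict:
    """Return strings that make a renderer show each positional form."""
    return {
        "isolated": letter,
        "initial": letter + TATWEEL,            # letter joined to what follows
        "medial": TATWEEL + letter + TATWEEL,   # letter joined on both sides
        "final": TATWEEL + letter,              # letter joined to what precedes
    }


forms = contextual_forms("\u0643")  # ARABIC LETTER KAF
for name, text in forms.items():
    # Print code points so the order is unambiguous regardless of RTL rendering.
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

Strings built this way could be placed in the training text so that each shape of a letter appears in the rendered training images.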

@harinadhkota

harinadhkota commented Jan 28, 2017

Hi,
Can you please point me to any version of tesseract-ocr that supports the Arabic language?
I tried tesseract-ocr version 3.02,
and it does not support Arabic.
If any newer or older version supports Arabic,
please let me know.

  1. If there is a supported version, please send me the tesseract-ocr software and all the supporting configuration files as well,
  2. or else, if a download is available, send the download link to "saimuralikrishna005@gmail.com".
  3. If required, you can also write to my business mail id: "saikrishna.yalakala@wissen.com".
    Please send mail to "saimuralikrishna005@gmail.com".

@ghost
Author

ghost commented Jun 3, 2017

@amitdo should this work for Hebrew as well? Do I need to create the training data myself (i.e. "frequent_words_list" + "words_list" etc.)?

10x

@amitdo
Collaborator

amitdo commented Jun 4, 2017

Hi Uri!

There is a tessdata package for Hebrew.
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

Try to use it before you start training Hebrew.

Also, read this page:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

should this work for hebrew as well?

By 'this' you mean

  1. 'Tesseract'?
    Yes. Tesseract supports Hebrew.
    The provided tessdata does not support Hebrew diacritics (nikud).

From https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#dictionary-data-optional
For right-to-left languages (RTL) use option "-r 1".

Yes, it should be used for Hebrew too.

If you have further questions please use the forum (I'm not participating there).

For Hebrew OCR questions / discussions you can also try here:
https://github.com/amitdo/Hebrew-OCR-Discussions

@ghost
Author

ghost commented Jun 5, 2017

@amitdo thank you very much! I will go through your suggestions.
I meant: should the solution for the reversed Arabic output apply to Hebrew as well?
One of the participants mentions reversing the strings in the training data; I wasn't sure if this is something I need to do...?

@adinetoiu

Hi,
I am also having problems with Tesseract OCR for Arabic and I need your help.

Can you please send me a trained data file for the Arabic language for tesseract 3.0.2?
My email is adinetoiu@yahoo.com.

Thank you in advance,
Adrian

@ghost
Author

ghost commented Jun 19, 2017

@adinetoiu I suggest that you skip Tesseract 3.x for Arabic and use Tesseract 4 instead.
A binary is also available at http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-4.00.00dev.exe

@adinetoiu

adinetoiu commented Jun 19, 2017 via email

@adinetoiu

adinetoiu commented Jun 19, 2017 via email

@Shreeshrii
Collaborator

Shreeshrii commented Jun 19, 2017 via email

@ghost
Author

ghost commented Jun 19, 2017

@adinetoiu I have contacted the developer of jTessBoxEditor; he stated that it might take time until we see an automated LSTM trainer. Until then, you must train manually.
Secondly, the steps and examples are available in the wiki, along with some test box files.
Note: please edit your replies and leave only your own text; you are adding unnecessary information.
From: chris <notifications@github.com ......... delete that

@nsoud

nsoud commented Sep 21, 2017

Hello, can you please send me the ara.traineddata so I can test it? I don't know how to train Tesseract 3.x on iOS to recognize Arabic well.
Also, is Tesseract 4 available for iOS? I didn't find an example for iPhone or iPad with Tesseract 4. And if there is one, how can I update my old Tesseract to the new one? Thank you very much.

@ghost

ghost commented Aug 13, 2018

Hello everyone, I have an issue with Urdu language data. If any expert here can help me, please mail me.
My email is moen.eqbal@gmail.com
