Add Extended Arabic-Indic Digits to Persian, Urdu and Sindhi #72

Closed
Shreeshrii opened this issue May 2, 2017 · 10 comments

Comments

@Shreeshrii
Contributor

Add 0-9 and the Perso-Arabic variants ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹ for Persian, Urdu and Sindhi.

Please see tesseract-ocr/tesseract#858
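
For reference, the requested digits sit in one contiguous Unicode range, U+06F0 to U+06F9 (Extended Arabic-Indic), so they can be derived from the ASCII digits by a code point offset. A minimal sketch; the function name `toExtendedArabicIndic` is hypothetical and is not part of Tesseract or langdata:

```javascript
// Hypothetical helper: map ASCII digits 0-9 to Extended Arabic-Indic digits
// (U+06F0..U+06F9). All other characters pass through unchanged.
function toExtendedArabicIndic(text) {
  return text.replace(/[0-9]/g, (d) =>
    String.fromCodePoint(0x06f0 + (d.charCodeAt(0) - 0x30))
  );
}

// Format a character's code point as U+XXXX.
function toUPlus(ch) {
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// The ten characters requested for Persian, Urdu and Sindhi.
const extendedDigits = toExtendedArabicIndic("0123456789");
console.log(extendedDigits);
console.log([...extendedDigits].map(toUPlus).join(" "));
```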

@Shreeshrii
Contributor Author

tesseract-ocr/tesseract#894

The rightmost column in the image contains two-digit numbers, but most of the time only one digit seems to be recognized.

@theraysmith
Contributor

I've added them to my copy of desired_characters. I'll push them to GitHub after testing.
Does anyone know which digits are needed for the other Arabic-script languages: kur_ara, pus, uig?

@ebraminio

ebraminio commented Aug 8, 2017

Kurdish in Arabic script (kur) uses Arabic-Indic digits (١٢٣٤٥٦٧٨٩). Pashto (pus) uses either the same digits as Persian (۱۲۳۴۵۶۷۸۹) or Western Arabic digits (a.k.a. European, 123456789). Uighur (uig) uses European digits.
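
The three digit sets mentioned here differ only in their Unicode ranges: European digits are U+0030 to U+0039, Arabic-Indic digits are U+0660 to U+0669, and Extended Arabic-Indic digits are U+06F0 to U+06F9. A rough sketch of telling them apart in a piece of text; the function name `digitSystems` and its labels are made up for illustration:

```javascript
// Hypothetical helper: report which digit systems occur in a string.
function digitSystems(text) {
  const found = new Set();
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp >= 0x0030 && cp <= 0x0039) found.add("european");
    else if (cp >= 0x0660 && cp <= 0x0669) found.add("arabic-indic");
    else if (cp >= 0x06f0 && cp <= 0x06f9) found.add("extended-arabic-indic");
  }
  return [...found].sort();
}

// Arabic-Indic 1 2 3, written with escapes to avoid look-alike confusion.
console.log(digitSystems("\u0661\u0662\u0663"));
// Extended Arabic-Indic 1 2 3 mixed with European digits.
console.log(digitSystems("\u06F1\u06F2\u06F3 123"));
```

Something along these lines could be run over a training corpus to see which digit forms it actually contains before deciding what to add for a language.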

You can also check for yourself which digits a language uses: open your browser console and enter each of the following lines separately. (These take browser locale codes, usually two letters, rather than the three-letter codes Tesseract uses.)

(123456.789).toLocaleString('ckb') // ١٢٣٬٤٥٦٫٧٨٩ (Arabic-Indic)
(123456.789).toLocaleString('ug') // 123,456.789
(123456.789).toLocaleString('ps') // Interesting that Safari gives "۱۲۳٬۴۵۶٫۷۸۹" (Extended Arabic-Indic similar to Persian) but Chrome "123,456.789"
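
The same check can be done without eyeballing the glyphs, by asking `Intl.NumberFormat` which numbering system it resolved for a locale. This is only a sketch: the answer comes from the locale data bundled with the browser or Node.js build, so results can differ between engines (as the Safari/Chrome difference for `ps` shows), and an unsupported locale falls back to the runtime's default. The helper name `numberingSystemFor` is an assumption for illustration.

```javascript
// Report the numbering system the runtime resolves for a locale.
// Typical values are "latn" (European), "arab" (Arabic-Indic) and
// "arabext" (Extended Arabic-Indic); actual results depend on the
// runtime's locale data.
function numberingSystemFor(locale) {
  return new Intl.NumberFormat(locale).resolvedOptions().numberingSystem;
}

for (const locale of ["fa", "ur", "sd", "ckb", "ps", "ug"]) {
  console.log(
    locale,
    numberingSystemFor(locale),
    (123456.789).toLocaleString(locale)
  );
}
```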

Please note that Urdu text may use digits with the same Unicode code points as Persian but a different appearance (although European-style digits seem to be used more often with Urdu nowadays). Open this in your browser to see the Urdu appearance of the Extended Arabic-Indic digits:

data:text/html;charset=utf8,<div lang="ur" style="font-family: Arial; font-size: 400%">۱۲۳۴۵۶۷۸۹

and compare it with the default (Persian) appearance of the Extended Arabic-Indic digits:

data:text/html;charset=utf8,<div style="font-family: Arial; font-size: 400%">۱۲۳۴۵۶۷۸۹

Same Unicode code points, different appearance. OpenType makes this work, or more precisely a font that supports the OpenType language tag feature does. Pango, which is used to create the training dataset for Tesseract, can handle this for you if the language code is passed correctly.

@roozgar

roozgar commented Aug 8, 2017

In Persian, zero to nine are listed correctly.
Also, "," is used for digit separation...

@Shreeshrii
Contributor Author

Thank you all for your helpful input.

@theraysmith
Contributor

theraysmith commented Aug 10, 2017 via email

@reza1615

@theraysmith
1- Here is a list of all Arabic-family characters.
I checked the table; besides the numbers, there are some other similar-looking characters that have different Unicode code points:

ۀ = \u06C0
ۂ =\u06C2
هٔ = \u0647 + \u0654

إ =\u0625
ٳ =\u0673

ٲ =\u0672
أ =\u0623
ٵ =\u0675

، =\u060C
٬ =\u066C
٫ =\u066B

064E
0659

ڼ =\u06BC
ڹ=\u06B9

06EC
06E0
06F0
0660
06DF
06EB
06EA
. = (dot)

0674
0655
0654
065F
0621

٭ =\u066D

  • = *

You can check their Unicode code points here.
2- At http://collation-charts.org/icu442/ there is a list of many languages and their official characters (you can find Persian, Pashto, Arabic, ... listed separately).
3- Vowel marks (main vowel marks: [\u064B-\u0650\u0652\u0670]) share a single set of Unicode code points across all members of the Arabic family.
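
As an illustration of point 3, the character class quoted there can be used as-is in a regular expression to find or strip the main vowel marks, and listing code points helps to tell apart the look-alike characters from point 1. A small sketch; `stripVowelMarks` and `codePoints` are hypothetical names, and whether stripping is appropriate for any given training text is a separate question:

```javascript
// Character class quoted in point 3: U+064B..U+0650, U+0652, U+0670.
const MAIN_VOWEL_MARKS = /[\u064B-\u0650\u0652\u0670]/g;

// Hypothetical helper: remove the main vowel marks from a string.
function stripVowelMarks(text) {
  return text.replace(MAIN_VOWEL_MARKS, "");
}

// Hypothetical helper: list the code points in a string, useful for telling
// apart look-alikes such as U+06C0 versus the sequence U+0647 U+0654.
function codePoints(text) {
  return [...text].map(
    (c) => "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(codePoints("\u06C0").join(" "));
console.log(codePoints("\u0647\u0654").join(" "));
// BEH + FATHA + BEH: the FATHA (U+064E) is removed, the letters stay.
console.log(codePoints(stripVowelMarks("\u0628\u064E\u0628")).join(" "));
```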

@gheyret

gheyret commented Aug 22, 2017

The Uyghur (Uighur) language uses the digits 0123456789.

@amitdo

amitdo commented Feb 26, 2021

This issue should be re-opened.


7 participants