Use a different set of fonts for Persian #26

ebraminio · 2016-05-27T11:10:01Z

(moved from tesseract-ocr/tesseract#294)

First of all, thanks for finally adding support to tesseract. From a quick look at the Persian-related code in tesseract, I found https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L520, which I suspect is not a good set of fonts for training on printed Persian text and could lead to poor OCR quality, since most Persian fonts don't have the style these fonts have. The article "Font recognition using Variogram fractal dimension" lists a good set of Persian fonts (second page, at the bottom), and as you can see there, it differs from the popular Arabic-language fonts, even though both languages use the Arabic script. So for training Persian OCR in tesseract, I suggest adding these free fonts to the current set, or replacing it with them: Nazli (i.e. Nazanin, as it is called in that article) and Titr from the Debian fonts-farsiweb package, and also XB Zar and XB Yaghut from the OFL-licensed xfonts. Thank you.
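To make the suggestion concrete, here is a minimal sketch of how training images could be rendered with the four fonts named above. It only builds `text2image` command lines and runs nothing. The font names are taken as written in this comment, and the names fontconfig reports on a given system may differ. The helper name, the `fas.training_text` file, the `out` directory, and the fonts directory are all hypothetical placeholders.

```python
import shlex

# Font names as written in the comment above. The exact names that
# fontconfig reports on a given system may differ (assumption).
PERSIAN_FONTS = ["Nazli", "Titr", "XB Zar", "XB Yaghut"]


def text2image_commands(text_file, out_dir, fonts=PERSIAN_FONTS,
                        fonts_dir="/usr/share/fonts"):
    """Build one text2image command line per font (nothing is executed)."""
    commands = []
    for font in fonts:
        # Output base name is a hypothetical naming scheme for illustration.
        base = f"{out_dir}/fas.{font.replace(' ', '_')}.exp0"
        commands.append([
            "text2image",
            f"--text={text_file}",
            f"--outputbase={base}",
            f"--font={font}",
            f"--fonts_dir={fonts_dir}",
        ])
    return commands


if __name__ == "__main__":
    # Print shell-quoted commands so they can be reviewed before running.
    for cmd in text2image_commands("fas.training_text", "out"):
        print(shlex.join(cmd))
```

Printing the commands instead of running them makes it easy to check which font names resolve on the local machine before any training data is generated.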

Shreeshrii · 2017-08-05T08:20:56Z

@ebraminio Please test with the latest BEST 4.0alpha traineddata and provide feedback at tesseract-ocr/tessdata#70

Shreeshrii · 2019-07-22T08:47:18Z

https://fontlibrary.org/en/font/nazli

https://github.com/behnam/fonts-farsiweb/tree/master/ttf

http://wiki.irmug.com/index.php/X_Series_2

ebraminio · 2019-07-22T09:54:18Z

Great! Thanks for the work. I haven't had a chance to test its quality, but since you know about the fonts now, I assume this is already done, so would you like to close this as fixed? Thank you!

Shreeshrii · 2019-07-22T10:16:00Z

Yes. You can close this, since the issue is being tracked at tesseract-ocr/tessdata#70.

> I assume this is already done

I don't know about that since the training was done by Ray at Google.

ebraminio · 2019-07-22T12:01:13Z

Thanks!

amitdo mentioned this issue Sep 14, 2016

Arabic language (right to left in writing) stored (left to right) after create PDF Searchable tesseract-ocr/tesseract#238

Open

Shreeshrii mentioned this issue Aug 5, 2017

Best Traineddata Feedback - Persian tesseract-ocr/tessdata#70

Open

ebraminio closed this as completed Jul 22, 2019
