Add support for Shan language (shn) #33

Closed
ronaldaug opened this issue Dec 7, 2019 · 8 comments
Labels
enhancement New feature or request

Comments

@ronaldaug

Could someone help me add the Shan language to Tesseract?

Shan language = https://en.wikipedia.org/wiki/Shan_language
Language code = shn
Shan Wiki = https://shn.wikipedia.org
All Shan words (including IPA) = jsonfile
Websites that are using Shan scripts = https://shannews.org/ , http://shanunicode.com/
Font = https://saosu-mp.github.io/font/PangLong/PangLong.ttf
Shan syllable break = https://github.com/kwarm/syllable-break

Some Shan characters, such as င သ တ ထ ပ မ ယ ရ လ ဝ ႉ း ွ ု ူ ိ ီ ် ၊ ။, are similar to those used in Myanmar (Burmese).

Thanks in advance
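
The overlap with Myanmar (Burmese) mentioned above can be checked at the code point level. The following is a minimal sketch, not part of the original request: it assumes the characters listed in this comment are encoded in the Unicode Myanmar block (U+1000 to U+109F), and the helper names `in_myanmar_block` and `describe` are hypothetical. It only reports code points and Unicode names; it says nothing about how Tesseract handles them.

```python
import unicodedata

# Unicode "Myanmar" block: U+1000..U+109F (upper bound of range is exclusive).
MYANMAR_BLOCK = range(0x1000, 0x10A0)


def in_myanmar_block(ch: str) -> bool:
    """Return True if the single character ch lies in the Myanmar block."""
    return ord(ch) in MYANMAR_BLOCK


def describe(text: str):
    """Return (code point, Unicode name, in-Myanmar-block) for each non-space character."""
    rows = []
    for ch in text:
        if ch.isspace():
            continue
        rows.append(
            (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), in_myanmar_block(ch))
        )
    return rows


# Characters copied from the comment above (assumed to be pasted unchanged).
sample = "င သ တ ထ ပ မ ယ ရ လ ဝ ႉ း ွ ု ူ ိ ီ ် ၊ ။"

for code_point, name, shared in describe(sample):
    # Print only ASCII fields so the output is safe on any terminal encoding.
    print(code_point, name, "Myanmar block" if shared else "outside Myanmar block")
```

Characters reported inside the Myanmar block share code points with Burmese text, which is one reason an existing Burmese setup may be a reasonable starting point for Shan.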

@ronaldaug
Author

It seems this repo isn't active or maintained.

@stweil
Contributor

stweil commented Jan 5, 2020

That's correct, this repo is for the old Tesseract 3.05 and the legacy OCR recognizer.
The more recent repository is https://github.com/tesseract-ocr/langdata_lstm. Should I move this issue to that repo?

@ronaldaug
Author

Yes please, thanks @stweil.

stweil reopened this on Jan 6, 2020
stweil transferred this issue from tesseract-ocr/langdata on Jan 6, 2020
@stweil
Contributor

stweil commented Jan 6, 2020

@ronaldaug, do you want to prepare a pull request which adds shn, maybe based on https://github.com/tesseract-ocr/langdata_lstm/tree/master/mya?
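
One way to start from the `mya` directory, as suggested above, is to copy it to a new `shn` directory and rename the files. This is a minimal sketch under stated assumptions: the function name `scaffold_language` is hypothetical, and it assumes files in a language directory are named with the language code as a prefix (for example `mya.wordlist`). The copied files would still contain Burmese data and would need to be replaced with Shan data before opening a pull request.

```python
import shutil
from pathlib import Path
from typing import List


def scaffold_language(repo_root: Path, source_code: str, target_code: str) -> List[Path]:
    """Copy <repo_root>/<source_code>/ to <repo_root>/<target_code>/.

    Files named '<source_code>.<suffix>' are renamed to '<target_code>.<suffix>'.
    Other files keep their names. Subdirectories are skipped.
    """
    src = repo_root / source_code
    dst = repo_root / target_code
    if not src.is_dir():
        raise FileNotFoundError(f"source language directory not found: {src}")
    # Refuse to overwrite an existing target directory.
    dst.mkdir(exist_ok=False)

    created = []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue
        name = path.name
        if name.startswith(source_code + "."):
            # Swap only the leading language code, keep the rest of the name.
            name = target_code + name[len(source_code):]
        target = dst / name
        shutil.copyfile(path, target)
        created.append(target)
    return created
```

Calling `scaffold_language(Path("langdata_lstm"), "mya", "shn")` from the parent of a local clone would create `langdata_lstm/shn/`; the path here is an assumption about where the clone lives.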

stweil added the enhancement (New feature or request) label on Jan 6, 2020
@ronaldaug
Author

OK, I'll prepare and send a pull request to tesseract-ocr/langdata_lstm based on mya and other languages.

@ronaldaug
Author

@stweil
Sorry for bothering you.
Is this repo still active?
I've created a PR for the Shan language.
Do I have to train it myself?

@stweil
Contributor

stweil commented Aug 26, 2021

Yes, the repo is active. I also noticed your pull request, but have not had time to review it yet. Ideally, Shan support and training should be done by someone who knows the language (so not by me).

@ronaldaug
Author

Thanks for your quick response.
Though I'm not very familiar with the Tesseract OCR training process, I'll try it.


2 participants