Add support for Armenian #67

Open
Shreeshrii opened this issue Apr 11, 2017 · 32 comments

@Shreeshrii
Contributor

commented Apr 11, 2017

copied from: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/zn4Xd-8wKe8/B6VpQkuZAwAJ

Dear all,

I have been trying Tesseract recently, and it is a really good product. Is there a tutorial, or a set of steps, for adding support for a new language to the package, for example Armenian?

Thank you in advance.

Regards,
Vahe

@Shreeshrii

Contributor Author

commented Apr 11, 2017

Vahe, please add the following info:

  • Which language code: arm or hye

  • Modern Armenian or Classical Armenian

  • Sources of primary Unicode texts in Armenian to use for training

  • Freely available Unicode fonts to render the text

@Shreeshrii

Contributor Author

commented Apr 11, 2017

langdata has https://github.com/tesseract-ocr/langdata/blob/master/Armenian.unicharset

but no folders for Armenian.

@theraysmith Is this one of the new languages included in your current training?

I had closed an earlier issue - #51

@vahenr

commented Apr 12, 2017

Thanks for all the comments (sorry for the late response):
Language code is: arm
Modern Armenian: Eastern_Armenian
For fonts please refer to this link: http://armunicode.com/en/fonts/unicode/

@vahenr

commented Apr 12, 2017

For this one:
Sources for primary texts in unicode the Armenian language to use for training

Do you need any pages of Armenian text?

@Shreeshrii

Contributor Author

commented Apr 13, 2017

@vahenr

commented Apr 13, 2017

Yes, there is an Armenian Wikipedia; this is the link:
https://hy.wikipedia.org/wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D5%A7%D5%BB

I will try to get some Unicode text resources and share them with you.

Thank you once again.

@vahenr

commented Apr 13, 2017

text-v2.docx

I have attached a file with some Armenian Unicode text; I hope it helps. If you need any more, please let me know.

@Shreeshrii

Contributor Author

commented Apr 14, 2017

Thanks, I will give it a try and let you know.

@Shreeshrii

Contributor Author

commented Apr 14, 2017

Attached is a zip file with arm.traineddata for use with --oem 0 (i.e. the legacy engine only), for testing. Please give it a try; I have not done any evaluation on it.

I did the training using the following command:


training/tesstrain.sh \
  --fonts_dir /mnt/c/Windows/Fonts \
  --lang arm \
  --exposures "0" \
  --langdata_dir ../langdata \
  --tessdata_dir ../tessdata \
  --output_dir ~/tesstutorial/arm \
  --fontlist "Arial" \
    "Consolas" \
    "Courier New" \
    "DejaVu Sans" \
    "DejaVu Sans Mono" \
    "DejaVu Serif" \
    "FreeMono" \
    "FreeSans" \
    "FreeSerif" \
    "Microsoft Sans Serif" \
    "Segoe UI" \
    "Sylfaen" \
    "Tahoma" \
    "Times New Roman," \
    "Trebuchet MS" \
    "Verdana" \
    "Verdana Bold" \
    "Verdana Bold Italic" \
    "Verdana Italic"

arm.zip
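
For anyone trying the attached file, a minimal sketch of how it could be run with the legacy engine follows. The function name ocr_legacy, the TESSDATA_DIR and DRY_RUN variables, and the file names are hypothetical placeholders; only the tesseract options --tessdata-dir, -l and --oem are taken from the tool itself.

```shell
#!/bin/sh
# Sketch: run OCR with a custom traineddata file using the legacy engine.
# TESSDATA_DIR is assumed to be the folder holding arm.traineddata.
ocr_legacy() {
  image=$1
  outbase=$2
  lang=${3:-arm}
  tessdata=${TESSDATA_DIR:-./tessdata}
  if [ -n "${DRY_RUN:-}" ]; then
    # Print the command instead of running it (hypothetical helper switch).
    echo tesseract "$image" "$outbase" --tessdata-dir "$tessdata" -l "$lang" --oem 0
  else
    # --oem 0 selects the legacy engine; this traineddata has no LSTM model.
    tesseract "$image" "$outbase" --tessdata-dir "$tessdata" -l "$lang" --oem 0
  fi
}
```

Something like `TESSDATA_DIR=~/tesstutorial/arm ocr_legacy page.png out` would then write the recognized text to out.txt, assuming the traineddata file has been unpacked into that folder.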

@Shreeshrii

Contributor Author

commented Apr 14, 2017

Attached is an eval report using one of the training text images - arm.Sylfaen.exp0.txt

CER 2.91
WER 5.02
WER (order independent) 4.63

arm_report.zip
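
The tool that produced this report is not named in the thread, so the following is only a rough sketch of how a word error rate of this kind is commonly defined: word-level edit distance divided by the number of reference words, as a percentage. The function name wer is hypothetical, and the report's own tool may normalize text differently.

```shell
#!/bin/sh
# Sketch: word error rate (percent) between a reference string and OCR output.
# WER = 100 * (word-level Levenshtein distance) / (number of reference words).
wer() {
  awk -v ref="$1" -v hyp="$2" '
    function min3(a, b, c) { return (a < b) ? (a < c ? a : c) : (b < c ? b : c) }
    BEGIN {
      n = split(ref, r, " ")   # reference words
      m = split(hyp, h, " ")   # hypothesis (OCR) words
      if (n == 0) { print "undefined"; exit }
      for (i = 0; i <= n; i++) d[i, 0] = i
      for (j = 0; j <= m; j++) d[0, j] = j
      for (i = 1; i <= n; i++)
        for (j = 1; j <= m; j++) {
          # Appending "" forces a string comparison of the two words.
          cost = ((r[i] "") == (h[j] "")) ? 0 : 1
          d[i, j] = min3(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + cost)
        }
      printf "%.2f\n", 100 * d[n, m] / n
    }'
}
```

A character error rate follows the same pattern with characters instead of words, but character splitting of Armenian text in awk depends on the awk implementation and locale, so it is left out here.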

@vahenr

commented Apr 14, 2017

Thanks a lot for the files. Could you please tell me exactly what to do for the next step, and what we are missing?
Thank you very much once again.

@vahenr

commented Apr 14, 2017

I did some tests. For the first one I got:
Error in pixGenHalftoneMask: pix too small: w = 270, h = 97
But the output overall is not bad (I am attaching the original and the output); some characters are wrong.
armeniantext
armeniantext.txt

@vahenr

commented Apr 14, 2017

The next test was better, no errors.
fedrasansarmenian
second.txt

@vahenr

commented Apr 14, 2017

Waiting for your suggestions.

@Shreeshrii

Contributor Author

commented Apr 15, 2017

@Shreeshrii

Contributor Author

commented Apr 15, 2017

Please see attached zip file.

arm-2.zip

It has a newer arm.traineddata as well as the training_text, fonts list, etc. that I used. You can test to see if this is better than the earlier version. Use --oem 0, since it does not have LSTM traineddata.

You can do your own training by modifying the training text etc.
You will need to add arm as a valid language code in

https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L21

and also add a line similar to https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L921 for arm.
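
To illustrate why the language code has to be registered first, here is a minimal sketch of a membership check against a space-separated list of codes. The variable VALID_CODES, its contents, and the function is_valid_code are illustrative stand-ins, not the actual contents of language-specific.sh, which should be checked directly.

```shell
#!/bin/sh
# Illustrative stand-in for a list of accepted codes; the real script's
# variable name and contents may differ.
VALID_CODES="eng fra hin"

# Succeeds if $1 is one of the space-separated codes in $2.
is_valid_code() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Before the edit, arm is rejected; after appending it, arm is accepted.
is_valid_code arm "$VALID_CODES" || echo "arm is not in the list yet"
VALID_CODES="$VALID_CODES arm"
is_valid_code arm "$VALID_CODES" && echo "arm is now accepted"
```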

@vahenr

commented Apr 16, 2017

Thank you very much once again.
I will try to do the test on Monday and post the result. I tested the new arm-2.zip and got the same output, with no big difference.

@Shreeshrii

Contributor Author

commented Apr 16, 2017

@vahenr

commented Apr 18, 2017

Could you please help me with this issue:
training/./tesstrain.sh --fonts_dir /root/ocr/training/Fonts --lang arm --exposures "0" --langdata_dir ../langdata --tessdata_dir ../tessdata --output_dir /root/ocr/training_output --fontlist "Aramian Normal" "Arial AM"

=== Starting training for language 'arm'
ERROR: Error: arm is not a valid language code

Thank you once again.

@theraysmith

Contributor

commented Apr 18, 2017

@Shreeshrii

Contributor Author

commented Apr 19, 2017

Thanks, Ray.

However, hye is marked as an unusable language code. Also, there is no folder for hye in langdata.

https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L36

@Shreeshrii

Contributor Author

commented Apr 19, 2017

@vahenr Please see earlier comment at #67 (comment)

You will need to add arm as a valid language code in

https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L21

and also add a line similar to https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L921 for arm.

Or as suggested by Ray, use hye as the language code.

@vahenr

commented Apr 20, 2017

What do I need to put in the file arm.training_text? This is for the option --langdata_dir ../langdata.

@Shreeshrii

Contributor Author

commented Apr 20, 2017

https://github.com/tesseract-ocr/langdata/files/923560/arm-2.zip

The above zip file has the files that I used. Put them in a folder named arm under langdata. The training text I used has the text from the doc file you sent, the Unicode text of the UDHR, and some text copied from Wikipedia.

The wordlist is taken from the Crubadan site; the link is given in an earlier comment in this thread.

These will be sufficient for legacy training. My attempts at LSTM training were not successful. Hopefully Ray will provide new traineddata for Armenian soon.
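
A small sketch for checking that the files ended up where the training script will look for them follows. Only arm.training_text is named in this thread; the name arm.wordlist follows the usual langdata naming and is an assumption, as is the helper name missing_langdata.

```shell
#!/bin/sh
# Sketch: list expected training inputs missing from <langdata_dir>/<lang>.
# The file list is an assumption; adjust it to what the zip file contains.
missing_langdata() {
  dir=$1
  lang=$2
  for f in "$lang.training_text" "$lang.wordlist"; do
    [ -f "$dir/$lang/$f" ] || echo "$dir/$lang/$f"
  done
}
```

Running `missing_langdata ../langdata arm` prints nothing when both files are in place, and otherwise prints the paths that still need to be created.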

@Shreeshrii

Contributor Author

commented Apr 20, 2017

Also download the other required files from the langdata repo. Read the readme file for requirements, or just clone the whole repo.

@Shreeshrii

Contributor Author

commented Apr 20, 2017

@gelinger777

commented Apr 27, 2017

Thank you @Shreeshrii for your help in adding Armenian to tessa!!!

@amitdo

commented Aug 1, 2017

https://github.com/tesseract-ocr/tessdata/tree/master/best

Armenian.traineddata
hye.traineddata

@Shreeshrii

Contributor Author

commented Aug 4, 2017

@vahenr @gelinger777 Please test Armenian support with the newly posted best traineddata, for use with the LSTM engine.
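
One way to confirm that the new file is picked up before testing is to look for the language code in the output of `tesseract --list-langs`. The helper below is a sketch; the name has_lang is hypothetical, and it assumes the listing prints one language code per line.

```shell
#!/bin/sh
# Sketch: read the output of `tesseract --list-langs` on stdin and succeed
# if the given language code appears on a line by itself.
has_lang() {
  grep -qxF -- "$1"
}
```

For example, `tesseract --list-langs 2>&1 | has_lang hye && tesseract page.png out -l hye --oem 1` would run the LSTM engine only when hye.traineddata is visible to Tesseract; page.png and out are placeholder names.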

@arm2arm

commented Sep 15, 2019

Is there any progress on this ticket?
