
Updated langdata #83

Open
ahmed-alaa opened this issue Aug 4, 2017 · 23 comments

@ahmed-alaa

We need the updated langdata with the updated unicharset, especially for the Arabic language, so that we can maintain the same accuracy in the new .traineddata.

Thanks

@amitdo

amitdo commented Aug 4, 2017

I don't know if Ray plans to update these files.

Anyway, it seems that you can now extract the unicharset and the dawg files used by the new LSTM engine from the traineddata file.

@stweil

stweil commented Aug 4, 2017

Yes, using combine_tessdata is an easy way to get the unicharset and the dawg files from a traineddata file. In a second step, the word lists can be regenerated from those files using dawg2wordlist.

Then you can remove unwanted components, fix word lists and reverse the whole process to create your own new traineddata file.
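Concretely, the round trip described above might look like this (a sketch only; the `ara` language code and file names are illustrative, and the unpacked component names assume an LSTM traineddata):

```shell
# List the components inside a traineddata file
combine_tessdata -d ara.traineddata

# Unpack all components into ara.* files
combine_tessdata -u ara.traineddata ara.

# Recover a plain word list from an unpacked dawg
dawg2wordlist ara.lstm-unicharset ara.lstm-word-dawg ara.wordlist

# ...edit or remove components and fix word lists as needed...

# Reverse the process: overwrite a component in the traineddata
combine_tessdata -o ara.traineddata ara.lstm-word-dawg
```

The wordlist2dawg tool does the opposite conversion when you want to pack an edited word list back into dawg form.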

@ahmed-alaa

Actually, I'm trying to fine-tune, continuing from the new Arabic.traineddata, but the newly generated traineddata file can't reach the same accuracy as Arabic.traineddata.

Also, the current unicharset in the langdata repo has around 2048 entries, but the one extracted from the new traineddata has only around 300. Is that expected?
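For reference, a fine-tuning run continuing from an existing traineddata typically looks something like this (a sketch based on the standard lstmtraining interface; the output path, list file, and iteration count are placeholders):

```shell
# Extract the LSTM model from the existing traineddata
combine_tessdata -e ara.traineddata ara.lstm

# Continue training from it; keeping the original accuracy depends on
# the new training data matching the original unicharset
lstmtraining \
  --model_output output/ara_finetuned \
  --continue_from ara.lstm \
  --traineddata ara.traineddata \
  --train_listfile ara.training_files.txt \
  --max_iterations 400
```

Note that extraction from the "best" (float) models is required; the integer models in tessdata_fast cannot be fine-tuned.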

@theraysmith

Point taken. It needs updating. I was going to push until I discovered a bug with the RTL word lists.
Then I also need to integrate this issues list, which I haven't looked at in a while, and rerun training.

@Shreeshrii

@theraysmith Is it ready for update now?

@Shreeshrii

@jbreiden Do you have the files to update this repo for 4.0.0?

Alternatively, should we try to reverse engineer the files from tessdata_fast? They would not be complete (config, wordlist, numbers, punc, unicharset).

@jbreiden

Do you have the files to update this repo for 4.0.0?

No, I don't. But I have been looking into this, and continue to.

@theraysmith

Hmm. Sorry. I thought I had done this in September.
The Google repo is up-to-date apart from the redundant files that need to be deleted.
I'll work with Jeff to get this done.

@Shreeshrii

Thanks!

Will the training process, tesstrain.sh and related scripts also need changes?

@Shreeshrii

Also, what about the possibility of training from scanned images?

@stweil

stweil commented Mar 20, 2018

Also, what about the possibility of training from scanned images?

It is possible and seems to work pretty well, as I heard from @wrznr.

@Shreeshrii

@stweil Do you know how the box files for the scanned images were created?

AFAIK, box files generated by tesseract makebox do not match the format of the files from text2image.

@Shreeshrii

Shreeshrii commented Mar 21, 2018

@theraysmith

  1. Since training depends on the fonts used, I suggest also uploading a file with the list of fonts used for training each language and script, in their subdirectories in langdata. This file can then be referred to by tesstrain.sh/language_specific.sh.

  2. What is the recommended method for combining languages to create a script traineddata?

  3. Is it possible to use multiple languages/scripts to continue from for creating a 'script' type of traineddata by finetuning? If so, how?

@theraysmith

theraysmith commented Mar 21, 2018 via email

@wrznr

wrznr commented Mar 22, 2018

@Shreeshrii You're right, the box files were created using an extra script. It is rather straightforward:

  • split your GT line into characters
  • print them ‘one-char-per-line’, adding the coordinates of the whole line to each character
  • add a tab stop (as an EOL indicator) at the end of the line's character sequence, with its coordinates shifted by +1
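The steps above can be sketched in a few lines of Python (an illustrative helper, not @wrznr's actual script; it assumes the usual box-file field order `char left bottom right top page`):

```python
def line_to_boxes(gt_line, left, bottom, right, top, page=0):
    """Turn a ground-truth text line plus its bounding box into
    LSTM-style box-file lines: one character per line, each carrying
    the coordinates of the whole line."""
    boxes = [f"{ch} {left} {bottom} {right} {top} {page}" for ch in gt_line]
    # A tab stop marks end-of-line; its coordinates are shifted by +1
    boxes.append(f"\t {left + 1} {bottom + 1} {right + 1} {top + 1} {page}")
    return boxes

# Example: a line reading "ab" with bounding box (10, 20)-(110, 60)
print("\n".join(line_to_boxes("ab", 10, 20, 110, 60)))
```

Because the LSTM trainer only uses box coordinates at line granularity, giving every character the whole line's box is sufficient.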


@amitdo

amitdo commented Mar 22, 2018

@wrznr's method is similar to, but easier than, my proposal.

@Shreeshrii

@wrznr

Please share the script, if possible. I would like to test it for Indic/complex scripts. It will also be useful to many others who have been asking for this feature.

You could create a PR to put it in https://github.com/tesseract-ocr/tesseract/tree/master/contrib

Thanks!

@amitdo

amitdo commented Mar 22, 2018

It won't work well for complex scripts like the Indic scripts.

@Shreeshrii

Shreeshrii commented Apr 3, 2018 via email

@wrznr

wrznr commented May 3, 2018

@Shreeshrii FYI: https://github.com/OCR-D/ocrd-train

@Shreeshrii

@wrznr Thank you for the Makefile for doing LSTM training from scratch. I will give it a try.

Do you also have a variant for fine-tuning or for adding a layer?

@Shreeshrii

https://github.com/tesseract-ocr/langdata_lstm
has the updated langdata files for LSTM training.

This issue can be closed.
