
Updated langdata #83

Open
ahmed-alaa opened this issue Aug 4, 2017 · 23 comments

@ahmed-alaa

We need the updated langdata with the updated unicharset, especially for the Arabic language, so that we can maintain the same accuracy in the new .traineddata.

Thanks

@amitdo

amitdo commented Aug 4, 2017

I don't know if Ray plans to update these files.

Anyway, it seems that you can now extract the unicharset and the dawg files used by the new LSTM engine from the traineddata file.

@stweil

stweil commented Aug 4, 2017

Yes, using combine_tessdata is an easy way to get the unicharset and the dawg files from a traineddata file. In a second step, the word lists can be regenerated from those files using dawg2wordlist.

Then you can remove unwanted components, fix word lists and reverse the whole process to create your own new traineddata file.
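Concretely, the round trip described above might look like this (a sketch only; the `ara` language code and file names are illustrative, and the unpacked component names assume an LSTM traineddata):

```shell
# List the components inside a traineddata file
combine_tessdata -d ara.traineddata

# Unpack all components into ara.* files
combine_tessdata -u ara.traineddata ara.

# Recover a plain word list from an unpacked dawg
dawg2wordlist ara.lstm-unicharset ara.lstm-word-dawg ara.wordlist

# ...edit or remove components and fix word lists as needed...

# Reverse the process: overwrite a component in the traineddata
combine_tessdata -o ara.traineddata ara.lstm-word-dawg
```

The wordlist2dawg tool does the opposite conversion when you want to pack an edited word list back into dawg form.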

@ahmed-alaa

Actually, I'm trying to fine-tune, continuing from the new Arabic.traineddata, but the newly generated traineddata file can't reach the same accuracy as Arabic.traineddata.

Also, the current unicharset in the langdata repo has around 2048 entries, but the one extracted from the new traineddata has only around 300. Is that expected?
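For reference, a fine-tuning run continuing from an existing traineddata typically looks something like this (a sketch based on the standard lstmtraining interface; the output path, list file, and iteration count are placeholders):

```shell
# Extract the LSTM model from the existing traineddata
combine_tessdata -e ara.traineddata ara.lstm

# Continue training from it; keeping the original accuracy depends on
# the new training data matching the original unicharset
lstmtraining \
  --model_output output/ara_finetuned \
  --continue_from ara.lstm \
  --traineddata ara.traineddata \
  --train_listfile ara.training_files.txt \
  --max_iterations 400
```

Note that extraction from the "best" (float) models is required; the integer models in tessdata_fast cannot be fine-tuned.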

@theraysmith

Point taken. It needs updating. I was going to push until I discovered a bug with the RTL word lists.
Then I also need to integrate this issues list, which I haven't looked at in a while, and rerun training.

@Shreeshrii

@theraysmith Is it ready for update now?

@Shreeshrii

@jbreiden Do you have the files to update this repo for 4.0.0?

Alternatively, should we try to reverse engineer the files from tessdata_fast? They would not be complete (config, wordlist, numbers, punc, unicharset).

@jbreiden

Do you have the files to update this repo for 4.0.0?

No, I don't. But I have been looking into this, and continue to.

@theraysmith

Hmm. Sorry. I thought I had done this in September.
The Google repo is up-to-date apart from the redundant files that need to be deleted.
I'll work with Jeff to get this done.

@Shreeshrii

Thanks!

Will the training process, tesstrain.sh and related scripts also need changes?

@Shreeshrii

Also, what about the possibility of training from scanned images?

@stweil

stweil commented Mar 20, 2018

Also, what about the possibility of training from scanned images?

It is possible and seems to work pretty well, as I heard from @wrznr.

@Shreeshrii

@stweil Do you know how the box files for the scanned images were created?

AFAIK, box files generated by tesseract makebox do not match the format of the files from text2image.

@Shreeshrii

Shreeshrii commented Mar 21, 2018

@theraysmith

  1. Since training depends on the fonts used, I suggest also uploading a file with the list of fonts used for training each language and script, in their subdirectories in langdata. This file can then be referred to by tesstrain.sh/language_specific.sh.

  2. What is the recommended method for combining languages to create a script traineddata?

  3. Is it possible to use multiple languages/scripts to continue from for creating a 'script' type of traineddata by finetuning? If so, how?

@theraysmith

theraysmith commented Mar 21, 2018 via email

@wrznr

wrznr commented Mar 22, 2018

@Shreeshrii You're right, the box files were created using an extra script. It is rather straightforward:

  • split your GT line into characters
  • print them ‘one-char-per-line’, adding the coordinates of the whole line to each character
  • add a tab stop (as an EOL indicator) at the end of the line's character sequence, with its coordinates shifted by +1
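The steps above can be sketched in a few lines of Python (an illustrative helper, not @wrznr's actual script; it assumes the usual box-file field order `char left bottom right top page`):

```python
def line_to_boxes(gt_line, left, bottom, right, top, page=0):
    """Turn a ground-truth text line plus its bounding box into
    LSTM-style box-file lines: one character per line, each carrying
    the coordinates of the whole line."""
    boxes = [f"{ch} {left} {bottom} {right} {top} {page}" for ch in gt_line]
    # A tab stop marks end-of-line; its coordinates are shifted by +1
    boxes.append(f"\t {left + 1} {bottom + 1} {right + 1} {top + 1} {page}")
    return boxes

# Example: a line reading "ab" with bounding box (10, 20)-(110, 60)
print("\n".join(line_to_boxes("ab", 10, 20, 110, 60)))
```

Because the LSTM trainer only uses box coordinates at line granularity, giving every character the whole line's box is sufficient.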


@amitdo

amitdo commented Mar 22, 2018

@wrznr's method is similar to, but easier than, my proposal.

@Shreeshrii

@wrznr

Please share the script, if possible. I would like to test it for Indic/complex scripts. It will also be useful to many others who have been asking for this feature.

You could create a PR to put it in https://github.com/tesseract-ocr/tesseract/tree/master/contrib

Thanks!

@amitdo

amitdo commented Mar 22, 2018

It won't work well for complex scripts like the Indic scripts.

@Shreeshrii

Shreeshrii commented Apr 3, 2018 via email

@wrznr

wrznr commented May 3, 2018

@Shreeshrii FYI: https://github.com/OCR-D/ocrd-train

@Shreeshrii

@wrznr Thank you for the Makefile for doing LSTM training from scratch. I will give it a try.

Do you also have a variant for fine-tuning or for adding a layer?

@Shreeshrii

https://github.com/tesseract-ocr/langdata_lstm
has the updated langdata files for LSTM training.

This issue can be closed.
