update readme for integerized LSTM models #88

Shreeshrii · 2018-03-22T09:39:30Z

Will upload the traineddata files next.

GitHub upload failed for files > 25mb

amitdo · 2018-03-22T10:20:44Z

GitHub upload failed for files > 25mb

Which files?

Shreeshrii · 2018-03-22T11:00:25Z

I have created another PR with a second set of files; please merge that too. After that, I will list the languages that could not be uploaded.


Shreeshrii · 2018-03-22T11:04:36Z

So far, the files that did not upload are: bod, chi_sim, chi_tra, jpn, kan, khm, lao, mya, san. The older versions of these files are still there.
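A minimal sketch of how the oversized files could be listed locally before attempting an upload. It assumes the limit is 25 MiB (the thread only says "25mb") and that the models sit in one directory as `*.traineddata` files; the function name `oversized_traineddata` is hypothetical.

```python
from pathlib import Path

# Assumption: "25mb" is read as 25 MiB; adjust if the real limit differs.
LIMIT_BYTES = 25 * 1024 * 1024


def oversized_traineddata(directory, limit=LIMIT_BYTES):
    """Return the sorted names of *.traineddata files larger than `limit` bytes."""
    return sorted(
        p.name
        for p in Path(directory).glob("*.traineddata")
        if p.is_file() and p.stat().st_size > limit
    )


if __name__ == "__main__":
    # Lists oversized models in the current directory, one per line.
    for name in oversized_traineddata("."):
        print(name)
```

Run from a checkout of the repository, this prints the models that would need another upload route.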


stweil · 2018-03-22T19:10:06Z

@Shreeshrii, how did you handle deu_frak.traineddata and the other files which have no best traineddata?

Shreeshrii · 2018-03-22T23:21:04Z

@stweil I did not check whether each file exists in tessdata_best. For files without a best model, only the version string would have been updated.
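A minimal sketch of the kind of existence check that was skipped, assuming two local directories of `*.traineddata` files. The function name `missing_in_best` and both directory arguments are hypothetical, not part of any Tesseract tooling.

```python
from pathlib import Path


def missing_in_best(target_dir, best_dir):
    """Return sorted names of *.traineddata files in target_dir with no counterpart in best_dir."""
    target = {p.name for p in Path(target_dir).glob("*.traineddata")}
    best = {p.name for p in Path(best_dir).glob("*.traineddata")}
    # Files present only in the target directory have no best model to convert from.
    return sorted(target - best)
```

Any file this reports could then be skipped instead of being rewritten with a new version string.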

Shreeshrii · 2018-03-23T05:27:28Z

@stweil

Is it just the three user-contributed files, dan_frak, deu_frak and slk_frak?
Is there a way to go back to an older commit for these files only (the functionality is still the same, just the version string has changed), or should I re-upload an older version?

stweil · 2018-03-23T06:41:03Z

No, I don't think that this is necessary. We can keep them as they are.

amitdo · 2018-03-23T08:47:18Z

kur has no lstm. Does it have Latin or Arabic letters?
kur_ara is from best.

Shreeshrii · 2018-03-23T12:02:37Z

I looked at the langdata repo just now. It has both kur and kur_ara. It looks like the language code was changed but the files were not moved.

I had taken the list of RTL languages from language_specific.sh; it does not have kur, but it has kur_ara.

langdata/kur has training text, wordlists, etc. in Arabic script, while langdata/kur_ara only has a list of desired and forbidden characters. Hence the kur_ara traineddata file in tessdata_best is not correct. The same probably applies to tessdata_fast, but I haven't checked.

I will file an issue in langdata mentioning this. Hopefully all this will be fixed when Ray/Jeff update langdata for 4.0.0.

Shreeshrii · 2018-03-23T12:21:49Z

kur has no lstm. Does it have Latin or Arabic letters?
kur_ara is from best.

@amitdo Thanks for bringing this to notice.

https://en.wikipedia.org/wiki/Kurdish_languages says

In use:

Hawar alphabet (Latin script; used mostly in Turkey and Syria)

Sorani alphabet (Perso-Arabic script; used mostly in Iraq and Iran)

Not used:

Cyrillic alphabet (former Soviet Union)

So probably both kur and kur_ara can be there with appropriate langdata.

amitdo · 2018-03-23T13:37:59Z

An older issue about kur: #45

Shreeshrii added 2 commits March 22, 2018 15:08

update readme for integerized LSTM models

1366556

Update LSTM Models to integerized tessdata_best for files < 25mb

d87b3cb

GitHub upload failed for files > 25mb

zdenop merged commit 1438f22 into tesseract-ocr:master Mar 22, 2018

This was referenced Mar 23, 2018

kur needs to be merged into kur_ara tesseract-ocr/langdata#116

Closed

kur_ara does not have Arabic unicharset. tesseract-ocr/tessdata_fast#14

Open
