[Suggestion] Script for installing only selected languages from github tessdata_fast #1440

Wikinaut · 2018-03-29T08:53:54Z

(this issue is probably something for #1423 )

When running tesseract from the sources it often appears to be overkill (in terms of download bandwidth, time, disk space) to install all languages from github tessdata_fast. I propose a small helper script, or a configuration file, so that only certain user-configurable languages are checked out from github.

@stweil @Shreeshrii What do you think?

Shreeshrii · 2018-03-29T08:58:34Z

Agree. #1423 (comment)

Wikinaut · 2018-03-29T09:05:28Z

@Shreeshrii Also, please add an explanation of whether one can delete the subdirectory /script below /tessdata.

Even after reading the wiki pages, I have no idea what these scripts are good for, or which files under tessdata can be deleted (for example, if one has only deu-fra-eng installed, what can be deleted from a fresh and full GitHub checkout?).

Shreeshrii · 2018-03-29T09:18:55Z

If you are only using deu-fra-eng, you may be interested in trying out script/Latin, which covers these three plus other languages written in the Latin alphabet (the script, not the language).

For a few files, you can wget the raw traineddata files rather than cloning the whole repo.

@stweil may offer a more elegant solution :-)

amitdo · 2018-03-29T09:52:12Z

It would be useful to have a bash utility that has these options:

'--list': List all available langs 'eng (English)'
'--get ': download langs using wget/curl
'--install': move the downloaded langs to the tessdata dir.
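
The three options above could be sketched roughly as follows. This is a minimal illustration in Python rather than bash, not an existing Tesseract tool. The base URL (repository path and branch name), the `--cache` option, the `LANG` arguments to `--get`, and the tiny `KNOWN_LANGS` table are all assumptions made for the example; a real `--list` would have to query the repository.

```python
import argparse
import shutil
import urllib.request
from pathlib import Path

# Assumption: raw model files are served from this repository and branch.
BASE_URL = "https://github.com/tesseract-ocr/tessdata_fast/raw/main"

# Illustrative subset only; a real --list would query the repository.
KNOWN_LANGS = {"deu": "German", "eng": "English", "fra": "French"}


def traineddata_url(lang):
    """Build the download URL for one model, e.g. 'eng' or 'script/Latin'."""
    return f"{BASE_URL}/{lang}.traineddata"


def get(langs, cache_dir):
    """Download each requested model into cache_dir and return the paths."""
    paths = []
    for lang in langs:
        target = cache_dir / f"{lang}.traineddata"
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(traineddata_url(lang), target)
        paths.append(target)
    return paths


def install(cache_dir, tessdata_dir):
    """Move every downloaded model from cache_dir into tessdata_dir."""
    moved = []
    for src in sorted(cache_dir.rglob("*.traineddata")):
        dest = tessdata_dir / src.relative_to(cache_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved


def main(argv=None):
    """Hypothetical command-line front end for the three proposed options."""
    parser = argparse.ArgumentParser(description="Fetch selected traineddata files")
    parser.add_argument("--list", action="store_true", help="list known languages")
    parser.add_argument("--get", nargs="+", metavar="LANG", help="download models")
    parser.add_argument("--install", metavar="TESSDATA_DIR", help="move downloads here")
    parser.add_argument("--cache", default="tessdata_cache", help="download directory")
    args = parser.parse_args(argv)

    cache_dir = Path(args.cache)
    if args.list:
        for code, name in sorted(KNOWN_LANGS.items()):
            print(f"{code} ({name})")
    if args.get:
        get(args.get, cache_dir)
    if args.install:
        install(cache_dir, Path(args.install))
    return 0
```

For example, `main(["--get", "deu", "eng", "fra", "--install", "/usr/local/share/tessdata"])` would download three models and move them, where the target directory is only a placeholder for wherever the local installation keeps its tessdata.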

stweil · 2018-03-29T12:23:03Z

Well, I could imagine many more variants.

Tesseract could download missing traineddata automatically and store it in a local cache.
If we had a comfortable way for users to get traineddata, it would no longer be necessary to store traineddata files on GitHub. We could store the individual components instead. That would make updates and smaller fixes much easier. Best and fast could be handled in a single repository. ...
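
The "download missing traineddata into a local cache" idea can be illustrated with a small download-on-miss helper. This is only a sketch of the concept, written outside Tesseract itself; the base URL (repository path and branch name) and the function names `download` and `ensure_traineddata` are assumptions for the example.

```python
import urllib.request
from pathlib import Path

# Assumption: raw model files are served from this repository and branch.
BASE_URL = "https://github.com/tesseract-ocr/tessdata_fast/raw/main"


def download(url, target):
    """Default fetcher: store the file behind url at the target path."""
    urllib.request.urlretrieve(url, target)


def ensure_traineddata(lang, cache_dir, fetch=download):
    """Return the cached model path, downloading the file only if it is missing."""
    target = Path(cache_dir) / f"{lang}.traineddata"
    if not target.exists():
        # Models such as 'script/Latin' live in a subdirectory, so create it.
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(f"{BASE_URL}/{lang}.traineddata", target)
    return target
```

The cache directory could then be passed to tesseract through its `--tessdata-dir` option. The `fetch` parameter exists only so that the download step can be replaced, for instance in tests.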

stweil · 2018-03-29T12:31:16Z

@Wikinaut, script contains traineddata files which were trained for a script, not for a single language. Example: While deu.traineddata supports the German language, script/Latin.traineddata supports "all" languages using Latin script, which includes German but also other western European languages. This has the advantage that Latin script covers not only German umlauts but also other accented characters, so even a German text containing Café could be recognized better. So much for the theory. Recently colleagues told me that they got much better results with deu than with script/Latin.

Wikinaut · 2018-03-29T13:53:37Z

@stweil The problem is that this is "biased" developer knowledge. As an occasional user, I would like to have a short page where the best practice is explained.

My serious follow-up questions to your answer above:

How exactly can I select the script/Latin?
Is this a "language", or would a better term be "meta-language", or "language cluster"?
Would it improve detection quality to select script/Latin and (for example) deu+eng+fra if I need these three languages in a three-language document?

stweil · 2018-03-29T14:05:39Z

> How exactly can I select the script/Latin?

You can select it by running tesseract -l script/Latin ....

> Is this a "language", or would a better term be "meta-language", or "language cluster"?

Which language is ABC? It's neither a language, nor a meta-language or a language cluster. It's a script, Latin script in my example (German: Schrift). See the Wikipedia article for the full description.

> Would it improve detection quality to select script/Latin and (for example) deu+eng+fra if I need these three languages in a three-language document?

You have to try it for your class of documents. You can combine script/Latin+deu+eng+fra, but even the order is important and changes the results.
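
Since the order of the combined models changes the results, one way to "try it" is to run the same page through every ordering and compare the output. The following is a rough sketch under the assumption that the `tesseract` executable is on the PATH and the listed models are installed; the helper names and the image file name are made up for the example.

```python
import itertools
import subprocess


def lang_orderings(models):
    """Return every ordering of the given models as a '-l' argument string."""
    return ["+".join(p) for p in itertools.permutations(models)]


def build_command(image, langs):
    """Build a tesseract command line that writes the recognized text to stdout."""
    return ["tesseract", image, "stdout", "-l", langs]


def run_ocr(image, langs):
    """Run tesseract for one model combination and return the recognized text."""
    result = subprocess.run(
        build_command(image, langs), capture_output=True, text=True, check=True
    )
    return result.stdout


def compare_orderings(image, models):
    """Map each ordering to its OCR output so the results can be compared."""
    return {langs: run_ocr(image, langs) for langs in lang_orderings(models)}
```

Calling `compare_orderings("page.png", ["script/Latin", "deu", "eng", "fra"])` would run 24 combinations, so for a first comparison it may be enough to test a few orderings by hand with `run_ocr`.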

Wikinaut · 2018-03-29T14:12:36Z

@stweil I did not know before that combining them is possible.

> You have to try it for your class of documents. You can combine script/Latin+deu+eng+fra, but even the order is important and changes the results.

"script" should be renamed to "font" or "fonts".

Shreeshrii · 2018-03-29T14:21:07Z

> You can combine script/Latin+deu+eng+fra, but even the order is important and changes the results.

Alternatively, you can try

deu+eng+fra

vs

script/Latin

stweil · 2018-03-29T14:24:09Z

> "script" should be renamed to "font" or "fonts".

No, font is not the same as script. Arial, Times New Roman or Helvetica are different fonts, but all of them are typically used with Latin script.

Wikinaut · 2018-03-29T14:24:29Z

Okay. I am closing this now. Thanks.

Wikinaut closed this as completed Mar 29, 2018
