[Suggestion] Script for installing only selected languages from github tessdata_fast #1440

Closed
Wikinaut opened this issue Mar 29, 2018 · 12 comments

@Wikinaut
Contributor

Wikinaut commented Mar 29, 2018

(This issue probably belongs under #1423.)

When building and running Tesseract from source, installing all languages from the GitHub tessdata_fast repository is often overkill in terms of download bandwidth, time, and disk space. I propose a small helper script, or a configuration file, so that only user-selected languages are checked out from GitHub.

@stweil @Shreeshrii What do you think?

@Shreeshrii
Collaborator

Agree. #1423 (comment)

@Wikinaut
Contributor Author

@Shreeshrii Also, please add an explanation of whether one can delete the subdirectory /script below /tessdata.

Even after reading the wiki pages on this, I have no idea what these scripts are good for, or which files under tessdata can be deleted (for example, if one only needs deu, fra, and eng, what can be deleted from a fresh and full GitHub checkout?).

@Shreeshrii
Collaborator

If you are only using deu, fra, and eng, you may be interested in trying out script/Latin, which covers these three plus other languages written in the Latin alphabet (the script, not the Latin language).

If you need only a few files, you can wget the raw traineddata files rather than cloning the whole repo.
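
As a rough illustration of fetching only selected files, here is a minimal Python sketch. The base URL and the branch name "main" are assumptions about the current layout of the tessdata_fast repository, and the function names are hypothetical, not part of Tesseract.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed URL layout for raw files in the tessdata_fast repository;
# the branch name ("main") is an assumption and may differ.
RAW_BASE = "https://github.com/tesseract-ocr/tessdata_fast/raw/main"


def traineddata_url(name: str, base: str = RAW_BASE) -> str:
    """Return the raw download URL for a model such as 'deu' or 'script/Latin'."""
    return f"{base}/{name}.traineddata"


def download_models(names, tessdata_dir):
    """Download each named model into tessdata_dir, keeping the script/ subdirectory."""
    tessdata = Path(tessdata_dir)
    saved = []
    for name in names:
        target = tessdata / f"{name}.traineddata"
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(traineddata_url(name), str(target))  # network access happens here
        saved.append(target)
    return saved
```

Calling download_models(["deu", "eng", "fra"], "tessdata") would fetch three files into a local tessdata directory (the directory name is only an example), instead of cloning the full repository.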

@stweil may offer a more elegant solution :-)

@amitdo
Collaborator

amitdo commented Mar 29, 2018

It would be useful to have a bash utility that has these options:

  • '--list': list all available langs, e.g. 'eng (English)'
  • '--get ': download langs using wget/curl
  • '--install': move the downloaded langs to the tessdata dir.
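
The three options above could be wired up roughly as follows. This is a sketch in Python rather than bash, purely for illustration; the program name tessdata-helper, the function names, and the three-entry language table are hypothetical, and a real tool would need the full list of available models.

```python
import argparse
import shutil
from pathlib import Path

# Small illustrative subset; a real tool would need the full list of models.
KNOWN_LANGS = {"deu": "German", "eng": "English", "fra": "French"}


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with the three options proposed in the thread."""
    parser = argparse.ArgumentParser(prog="tessdata-helper")  # hypothetical name
    parser.add_argument("--list", action="store_true",
                        help="list available languages")
    parser.add_argument("--get", nargs="+", metavar="LANG", default=[],
                        help="languages to download")
    parser.add_argument("--install", action="store_true",
                        help="move downloaded files to the tessdata dir")
    return parser


def format_listing(langs=KNOWN_LANGS):
    """Return lines such as 'eng (English)', sorted by language code."""
    return [f"{code} ({name})" for code, name in sorted(langs.items())]


def install(download_dir, tessdata_dir):
    """Move every *.traineddata file from download_dir into tessdata_dir."""
    dest = Path(tessdata_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(Path(download_dir).glob("*.traineddata")):
        shutil.move(str(src), str(dest / src.name))
        moved.append(src.name)
    return moved
```

Keeping download and install as separate steps, as proposed, means the download can run as a normal user while only the final move needs write access to the tessdata directory.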

@stweil
Contributor

stweil commented Mar 29, 2018

Well, I could imagine many more variants.

  • Tesseract could download missing traineddata automatically and store it in a local cache.
  • If we had a comfortable way for users to get traineddata, it would no longer be necessary to store traineddata files on GitHub. We could store the individual components instead. That would make updates and smaller fixes much easier. Best and fast could be handled in a single repository. ...
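
The first variant (download on demand into a local cache) might look like the sketch below. It is written as a wrapper outside Tesseract, not as something Tesseract does; ensure_traineddata and its fetch callback are hypothetical names, and the actual download is left to the caller.

```python
from pathlib import Path
from typing import Callable


def ensure_traineddata(name: str, cache_dir, fetch: Callable[[str, Path], None]) -> Path:
    """Return the cached path for a model, calling fetch(name, target) only on a cache miss."""
    target = Path(cache_dir) / f"{name}.traineddata"
    if not target.exists():
        # Create parent directories so names like 'script/Latin' also work.
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(name, target)  # e.g. an HTTP download; injected so it can be swapped or tested
    return target
```

Passing the download step in as a function keeps the cache logic independent of where the files come from, which would matter if the data were later assembled from individual components as suggested in the second variant.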

@stweil
Contributor

stweil commented Mar 29, 2018

@Wikinaut, the script directory contains traineddata files that were trained for a script (writing system), not for a single language. Example: while deu.traineddata supports the German language, script/Latin.traineddata supports "all" languages written in Latin script, which includes German but also other western European languages. The advantage is that the Latin script model covers not only German umlauts but also other accented characters, so even a German text containing "Café" could be recognized better. So much for the theory. Recently, colleagues told me that they got much better results with deu than with script/Latin.

@Wikinaut
Contributor Author

Wikinaut commented Mar 29, 2018

@stweil The problem is that this is "insider" developer knowledge. As an occasional user, I would like to have a short page that explains the best practice.

My seriously meant questions as follow-up to your answer above:

  • How exactly can I select the script/Latin?
  • Is this a "language", or would a better term be "meta-language", or "language cluster"?
  • Would it improve detection quality to select script/Latin and (for example) deu+eng+fra if I need these three languages in a three-language document?

@stweil
Contributor

stweil commented Mar 29, 2018

> How exactly can I select the script/Latin?

You can select it by running tesseract -l script/Latin ....

> Is this a "language", or would a better term be "meta-language", or "language cluster"?

Which language is ABC? It's neither a language nor a meta-language nor a language cluster. It's a script, Latin script in my example (German: Schrift). See the Wikipedia article for the full description.

> Would it improve detection quality to select script/Latin and (for example) deu+eng+fra if I need these three languages in a three-language document?

You have to try it for your class of documents. You can combine script/Latin+deu+eng+fra, but even the order is important and changes the results.
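
Since the order of the combined models matters, one way to try this systematically is to generate a command line for each ordering and compare the outputs on a sample of your documents. The sketch below only builds the argument lists; the helper names and the file names page.png and out are hypothetical, and it assumes the usual invocation form tesseract IMAGE OUTPUTBASE -l LANGS.

```python
from itertools import permutations


def tesseract_command(image, output_base, models):
    """Build the argument list for one run; models are joined with '+' in the given order."""
    return ["tesseract", image, output_base, "-l", "+".join(models)]


def candidate_orders(models):
    """Yield every ordering of the models, since order can change the results."""
    for order in permutations(models):
        yield list(order)
```

Each list returned by tesseract_command("page.png", "out", order) can be passed to subprocess.run, and the resulting text files compared against a known-good transcription. Note that the number of orderings grows factorially, so this is only practical for a handful of models.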

@Wikinaut
Contributor Author

@stweil I did not know before that combining them is possible:

> You have to try it for your class of documents. You can combine script/Latin+deu+eng+fra, but even the order is important and changes the results.

"script" should be renamed to "font" or "fonts".

@Shreeshrii
Collaborator

> You can combine script/Latin+deu+eng+fra, but even the order is important and changes the results.

Alternatively, you can compare

deu+eng+fra

vs

script/Latin

@stweil
Contributor

stweil commented Mar 29, 2018

> "script" should be renamed to "font" or "fonts".

No, font is not the same as script. Arial, Times New Roman or Helvetica are different fonts, but all of them are typically used with Latin script.

@Wikinaut
Contributor Author

Okay. I am closing this now. Thanks.
