Duplicate and incomplete data for German fraktur #49

Open
stweil opened this issue Mar 13, 2017 · 8 comments

Comments

@stweil
Contributor

stweil commented Mar 13, 2017

Both deu_frak.traineddata and frk.traineddata try to support German fraktur.

deu_frak is not part of the official tesseract-ocr/langdata, but comes from paalberti/tesseract-dan-fraktur. It does not support the new LSTM recognizer introduced by Tesseract 4, but currently gives better results for fraktur texts than frk (which supports LSTM).

frk could be improved considerably by adding missing characters (primarily the long s, ſ, but also the paragraph and dollar signs and possibly more) and by building on the latest corrections to langdata. With an improved frk, deu_frak would no longer be needed.
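As a rough illustration of the missing-character point, the following minimal sketch reports which of a set of required characters never occur in a given text (for example a training text or a word list). The function name `missing_characters` and the `REQUIRED_CHARS` table are hypothetical, and reading "paragraph sign" as "§" is an assumption; nothing here is part of Tesseract or langdata.

```python
# Characters named in the comment above. Treating the "paragraph sign"
# as "§" is an assumption made for this sketch.
REQUIRED_CHARS = {
    "ſ": "long s",
    "§": "paragraph sign (assumed)",
    "$": "dollar sign",
}


def missing_characters(text, required=REQUIRED_CHARS):
    """Return the required characters that never occur in `text`."""
    present = set(text)
    return {ch: name for ch, name in required.items() if ch not in present}


if __name__ == "__main__":
    sample = "Waſſer koſtet 3 $"
    for ch, name in missing_characters(sample).items():
        print(f"missing: {ch} ({name})")
```

Running the same check against the real training text or unicharset would need the file to be read as UTF-8 first; that step is left out here.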

It is unclear who invented the name frk for Frankish. Maybe it should be renamed.

@amitdo

amitdo commented Mar 13, 2017

> It is unclear who invented the name frk for Frankish. Maybe it should be renamed.

frk is the ISO 639-3 code for Frankish.

@amitdo

amitdo commented Mar 13, 2017

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

FYI, the source of the 'Language' column in the tables is the old Google Code download page. Ray uploaded the official traineddata files to that old page, and Zdenko added a few third-party files.

@stweil
Contributor Author

stweil commented Mar 13, 2017

I should have explained my question better. Why is German fraktur called Frankish? Neither the characters, nor the words, nor the fonts used are Frankish. And without hints from others I'd never have thought of using frk for German fraktur.

@amitdo

amitdo commented Mar 13, 2017

It seems frk is trained using a modern German corpus and a small number of fonts.

@amitdo

amitdo commented Aug 17, 2017

@stweil, maybe you want to close this issue?

@stweil
Contributor Author

stweil commented Aug 17, 2017

Do you think that frk is the right name? Or should it be renamed, maybe deu_old or deu_frak (as people are used to that name)? "Frankish" is definitely the wrong description for the current frk.

@amitdo

amitdo commented Aug 18, 2017

Is 'frk' only for German Fraktur?

@stweil
Contributor Author

stweil commented Aug 18, 2017

I expect that the frk LSTM model will also work quite well with Fraktur text in other languages. But the word list of frk is mainly based on German words (I estimate that more than 95 % of the 473228 words are German). The list also includes a few words from English, Spanish, French, Latin, Russian and other languages. Many of them would not be expected in Fraktur text (jQuery, motherboard, ...). The German words show many of the known problems: ß/B, ii/ü and other confusions, lower-case nouns (which should always be capitalized in German), upper-case adjectives (which should normally be lower case), random words in all upper case, lots of web sites (also not typical for Fraktur), and so on.

@theraysmith, it would be really interesting to know more about the process that produced this word list and the others. They look like extracts from random web sites. I don't think that good word lists for Fraktur can be produced that way.
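The kinds of word-list problems listed above could be flagged with a few simple heuristics. The following is only a sketch: the function names `classify` and `lint`, the flag names, and the regular expressions are all assumptions made for illustration, and they will produce both false positives and misses on a real list. Checking noun and adjective capitalization would need a dictionary and is not attempted.

```python
import re

# Heuristic: entries that look like host names or URLs.
# The list of top-level domains is an arbitrary assumption.
WEB_RE = re.compile(r"^(https?://|www\.)|\.(com|de|org|net)$", re.IGNORECASE)

# A capital directly after a lowercase letter, e.g. "jQuery",
# or a "B" that may stand for "ß", e.g. "StraBe".
INNER_CAPITAL_RE = re.compile(r"[a-zäöüß][A-ZÄÖÜ]")


def classify(word):
    """Return a list of heuristic flags for one word-list entry."""
    flags = []
    if WEB_RE.search(word):
        flags.append("web")
    if len(word) > 1 and word.isalpha() and word.isupper():
        flags.append("all_caps")
    if INNER_CAPITAL_RE.search(word):
        flags.append("inner_capital")
    if "ii" in word:
        flags.append("possible_ii_for_ue")
    return flags


def lint(words):
    """Map each suspicious word to its flags; clean words are omitted."""
    report = {}
    for word in words:
        flags = classify(word)
        if flags:
            report[word] = flags
    return report


if __name__ == "__main__":
    sample = ["Haus", "jQuery", "www.example.com", "HAUS", "fiir", "StraBe"]
    for word, flags in lint(sample).items():
        print(word, flags)
```

A report like this only narrows down candidates for manual review; it does not decide whether an entry belongs in a Fraktur word list.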

danpla added a commit to danpla/dpscreenocr that referenced this issue May 15, 2022
Although frk is Frankish in ISO 639-3, the data is actually for
German Fraktur. See:

tesseract-ocr/tessdata_best#68
tesseract-ocr/tessdata#49
tesseract-ocr/langdata#61