Fix some wrong German words (confusion B / ß) #54

stweil · 2017-02-18T22:59:24Z

No description provided.

Replace wrong "EinfluB" and other variants by "Einfluß" and remove duplicate entries. Remove it also from the Latin wordlist. Signed-off-by: Stefan Weil <sw@weilnetz.de>

Remove wrong entries with "daB". All of them already exist in the correct form "daß". Signed-off-by: Stefan Weil <sw@weilnetz.de>

Remove wrong entries with "muB" (or its variants). All of them already exist in the correct form "muß". Signed-off-by: Stefan Weil <sw@weilnetz.de>
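For readers who want to apply the same kind of cleanup to their own wordlist, here is a minimal sketch in Python. It is not the method used for these commits. The pattern `WRONG_ESZETT` and the function `clean_wordlist` are hypothetical names, and the sketch assumes that a capital "B" directly after a lowercase letter, followed by a lowercase letter or the end of the word, is always a misrecognized "ß".

```python
import re

# Assumption: a capital "B" that follows a lowercase letter and is followed by
# a lowercase letter or the end of the entry stands for a misrecognized "ß".
WRONG_ESZETT = re.compile(r"(?<=[a-zäöü])B(?=[a-zäöü]|$)")


def clean_wordlist(words):
    """Return the entries with "B" corrected to "ß" and duplicates removed.

    The order of first occurrence is preserved. A wrong entry whose corrected
    form is already in the list is simply dropped.
    """
    seen = set()
    cleaned = []
    for word in words:
        fixed = WRONG_ESZETT.sub("ß", word.strip())
        if fixed and fixed not in seen:
            seen.add(fixed)
            cleaned.append(fixed)
    return cleaned


if __name__ == "__main__":
    sample = ["daß", "daB", "EinfluB", "Einfluß", "Berlin"]
    print(clean_wordlist(sample))  # ['daß', 'Einfluß', 'Berlin']
```

A real wordlist may contain legitimate entries that match the pattern, so the result should be reviewed before it is committed.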

Wikinaut · 2017-03-14T09:48:10Z

@stweil Stefan, as far as I know, an update of tesseract-ocr/langdata does not mean that tessdata is updated as well.

Your recent patches are very important for German (and would save me from publishing my post-filter sed script, which is still needed to repair OCR output of German texts), so could you please make sure that the deu*.traineddata files are updated (retrained) as well?
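The sed script mentioned above is not published in this thread. As an illustration only, a post-filter of this kind could look like the following Python sketch. The pattern `ESZETT_AS_B` and the function `fix_eszett` are hypothetical, and the heuristic (a capital "B" between a lowercase letter and either a lowercase letter or a word boundary is really "ß") is an assumption that produces false positives for names such as "eBay".

```python
import re

# Assumption: inside a German word, a capital "B" preceded by a lowercase
# letter and followed by a lowercase letter or a word boundary is a
# misrecognized "ß".
ESZETT_AS_B = re.compile(r"(?<=[a-zäöü])B(?=[a-zäöü]|\b)")


def fix_eszett(text):
    """Replace misrecognized "B" with "ß" in OCR output."""
    return ESZETT_AS_B.sub("ß", text)


if __name__ == "__main__":
    print(fix_eszett("Die StraBe, daB er muBte."))
    # Die Straße, daß er mußte.
```

A word-initial "B" ("Berlin") and a "B" inside an all-caps word ("ABC") are left untouched, because the lookbehind requires a preceding lowercase letter.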

stweil · 2017-03-14T10:10:46Z

Yes, fixes in langdata need additional work to result in updates of tessdata.

I'm afraid that I cannot do much there. See another comment which I have just written.

To summarize, tessdata updates for Tesseract 3.x are manageable for contributors like me, but updates for Tesseract 4.x with LSTM are much more difficult. @theraysmith recently announced that he is going to start the training for an update of tessdata with an improved training process, and I hope that this is sufficient to fix some of the problems which we see for German texts (although I am not sure about that, see also the tesseract-dev forum).

Shreeshrii · 2018-05-24T11:40:04Z

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/wQ8AUCkvDbo/QQJ91biKAAAJ

Very often tesseract detects "StraBe" instead of "Straße".

Yes, I use -l=deu

stweil · 2018-05-25T07:26:40Z

Ray did not use the fixed dictionary when he created the LSTM traineddata files, but used new dictionaries with the old (and additional new) errors.

amitdo · 2018-05-25T08:12:48Z

tesseract-ocr/tessdata#62 (comment)

guettli · 2018-05-25T09:56:30Z

@stweil you said that improving the OCR result is harder with 4.0. What does "harder" mean?

I can provide a lot of German scans that produce broken results.

If training material is needed, just tell me.

I could provide cropped scans that each show a single German word, together with the corresponding word in UTF-8.

Wikinaut · 2018-05-25T10:08:04Z

I am sorry to drop in here, but please have a look at the proposal (not a solution yet) tesseract-ocr/tesseract#1442 [Suggestion] "Training light" - Learning by doing - re-feeding a corrected text file to retrain tesseract.

stweil added 3 commits February 18, 2017 23:57

Fix German word "Einfluß"

8d620a7

Replace wrong "EinfluB" and other variants by "Einfluß" and remove duplicate entries. Remove it also from the Latin wordlist. Signed-off-by: Stefan Weil <sw@weilnetz.de>

Fix German word "daß"

31bb55d

Remove wrong entries with "daB". All of them already exist in the correct form "daß". Signed-off-by: Stefan Weil <sw@weilnetz.de>

Fix German word "muß"

e5fa99e

Remove wrong entries with "muB" (or its variants). All of them already exist in the correct form "muß". Signed-off-by: Stefan Weil <sw@weilnetz.de>

zdenop merged commit dba5f9f into tesseract-ocr:master Feb 19, 2017

stweil mentioned this pull request Mar 9, 2017

Q&A: Training Wiki Updates and Request for Info tesseract-ocr/tesseract#659

Open
