Fix some wrong German words (confusion B / ß) #54

Merged: 3 commits, Feb 19, 2017


@stweil
Contributor

stweil commented Feb 18, 2017

No description provided.

Replace wrong "EinfluB" and other variants with "Einfluß"
and remove duplicate entries.

Also remove it from the Latin wordlist.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

Remove wrong entries with "daB". All of them exist in the correct form "daß".

Signed-off-by: Stefan Weil <sw@weilnetz.de>

Remove wrong entries with "muB" (or its variants).
All of them exist in the correct form "muß".

Signed-off-by: Stefan Weil <sw@weilnetz.de>
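
The cleanup described in these three commits (rewrite a misread "B" as "ß", then drop the duplicates that result) could be approximated by a script along the following lines. This is only a sketch, not the procedure actually used for this pull request: the function name `fix_eszett_confusion` and the heuristic "a capital B directly after a lowercase letter is a misread ß" are assumptions made for illustration. The heuristic would also rewrite legitimate mixed-case names, so its output would need manual review.

```python
import re

# Assumption: a capital "B" directly after a lowercase letter is a misread "ß".
# A "B" at the start of a word ("Buch") or after another capital ("ABC") is kept.
SUSPICIOUS_B = re.compile(r"(?<=[a-zäöü])B")


def fix_eszett_confusion(words):
    """Return (cleaned, changed) for an iterable of wordlist entries.

    Entries with a suspicious "B" are rewritten with "ß". Duplicates that
    result from the rewrite are dropped; the original order is kept.
    """
    cleaned = []
    changed = []
    seen = set()
    for word in words:
        fixed = SUSPICIOUS_B.sub("ß", word)
        if fixed != word:
            changed.append((word, fixed))
        if fixed not in seen:
            seen.add(fixed)
            cleaned.append(fixed)
    return cleaned, changed


if __name__ == "__main__":
    sample = ["daß", "daB", "EinfluB", "Einfluß", "muBte", "Buch", "ABC"]
    cleaned, changed = fix_eszett_confusion(sample)
    print(cleaned)
    print(changed)
```

The `changed` list is returned separately so that every rewrite can be checked by hand before the wordlist is committed.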
@Wikinaut

@stweil Stefan, as far as I know, an update of tesseract-ocr/langdata does not mean that /tessdata is updated as well.

Your recent patches are very important for German (and would save me from publishing my post-filter sed script, which is still needed to repair German OCR output). Could you please make sure that the deu*.traineddata files are updated (retrained) as well?
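
For readers who have not seen such a post-filter: the following is a minimal Python sketch of the idea, not the sed script mentioned above. The function name `repair_text` and the `vocabulary` argument are hypothetical. A word is only rewritten when replacing a non-initial "B" with "ß" yields a word from the supplied vocabulary, which avoids touching words that legitimately contain "B".

```python
import re

# Runs of German letters; punctuation and digits are left untouched.
WORD = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def repair_text(text, vocabulary):
    """Replace a misread "B" with "ß" in OCR output.

    `vocabulary` is a set of known-good words (hypothetical input, for
    example the corrected wordlist). A word is changed only if it is not
    in the vocabulary itself but its "ß" variant is.
    """

    def repair(match):
        word = match.group(0)
        # An initial "B" is a normal capital letter, so only look after it.
        if "B" not in word[1:]:
            return word
        candidate = word[0] + word[1:].replace("B", "ß")
        if candidate in vocabulary and word not in vocabulary:
            return candidate
        return word

    return WORD.sub(repair, text)


if __name__ == "__main__":
    known = {"Straße", "daß"}
    print(repair_text("Die StraBe ist naß, daB weiß Bernd.", known))
```

Such a filter only hides the symptom in the output text. The wrong dictionary entries still have to be fixed in langdata and the traineddata files regenerated.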

@stweil
Contributor Author

stweil commented Mar 14, 2017

Yes, fixes in langdata need additional work to result in updates of tessdata.

I'm afraid that I cannot do much there. See another comment which I have just written.

To summarize, tessdata updates for Tesseract 3.x are manageable for contributors like me, but updates for Tesseract 4.x with LSTM are much more difficult. @theraysmith recently announced that he is going to start the training for an update of tessdata with an improved training process, and I hope that this is sufficient to fix some of the problems which we see for German texts (although I am not sure about that, see also the tesseract-dev forum).

@Shreeshrii
Contributor

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/wQ8AUCkvDbo/QQJ91biKAAAJ

Very often tesseract detects "StraBe" instead of "Straße".

Yes, I use -l=deu

@stweil
Contributor Author

stweil commented May 25, 2018

Ray did not use the fixed dictionary when he created the LSTM traineddata files, but used new dictionaries with the old (and additional new) errors.

@amitdo

amitdo commented May 25, 2018

@guettli

guettli commented May 25, 2018

@stweil you said that improving the OCR result is harder with 4.0. What does "harder" mean?

I can provide a lot of scanned German input that produces broken results.

If training material is needed, just tell me.

I could provide cropped scans that each show a single German word, together with the corresponding word in UTF-8.

@Wikinaut

I am sorry to drop in here, but please have a look at the proposal (not a solution yet) tesseract-ocr/tesseract#1442: [Suggestion] "Training light" - Learning by doing - re-feeding a corrected text file to retrain tesseract.

