Fix some wrong German words (confusion B / ß) #54
Replace wrong "EinfluB" and other variants by "Einfluß" and remove duplicate entries. Remove it also from the Latin wordlist. Signed-off-by: Stefan Weil <sw@weilnetz.de>
Remove wrong entries with "daB". All of them exist in the correct form "daß". Signed-off-by: Stefan Weil <sw@weilnetz.de>
Remove wrong entries with "muB" (or its variants). All of them exist in the correct form "muß". Signed-off-by: Stefan Weil <sw@weilnetz.de>
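The kind of cleanup described in these commits could be sketched roughly as follows. This is only an illustration, not the procedure actually used for this pull request: the function name `fix_eszett_confusion` and the heuristic (an uppercase "B" directly after a lowercase letter is assumed to be a misread "ß") are assumptions made for the example.

```python
import re

# Hypothetical heuristic (an assumption, not taken from the commits): an
# uppercase "B" that directly follows a lowercase letter is treated as a
# misrecognized "ß", as in "daB", "EinfluB" or "muBte".
_SUSPECT_B = re.compile(r"(?<=[a-zäöü])B")


def fix_eszett_confusion(words):
    """Replace suspect "B" with "ß" and drop resulting duplicates.

    The order of first occurrence is preserved.
    """
    seen = set()
    cleaned = []
    for word in words:
        fixed = _SUSPECT_B.sub("ß", word)
        if fixed not in seen:  # skip entries already present in correct form
            seen.add(fixed)
            cleaned.append(fixed)
    return cleaned


if __name__ == "__main__":
    sample = ["daß", "daB", "EinfluB", "Einfluß", "muBte", "Buch", "USB"]
    print(fix_eszett_confusion(sample))
```

Such a heuristic would also rewrite legitimate mixed-case words like "eBay", so any real wordlist change would still need manual review.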
@stweil Stefan, as far as I know, an update of tesseract-ocr/langdata does not mean that tessdata is updated as well. As your recent patches are very important for the German language (and would save me from publishing my post-filter
Yes, fixes in langdata need additional work to result in updates of tessdata. I'm afraid that I cannot do much there. See another comment which I have just written. To summarize, tessdata updates for Tesseract 3.x are manageable for contributors like me, but updates for Tesseract 4.x with LSTM are much more difficult. @theraysmith recently announced that he is going to start the training for an update of tessdata with an improved training process, and I hope that this is sufficient to fix some of the problems which we see for German texts (although I am not sure about that, see also the tesseract-dev forum).
Ray did not use the fixed dictionary when he created the LSTM traineddata files, but used new dictionaries with the old (and additional new) errors.
@stweil you said that improving the OCR result is harder with 4.0. What does "harder" mean? I can provide a lot of German scan input which produces broken results. If training material is needed, just tell me. I could provide parts of scans where a single German word is visible, together with the corresponding word in UTF-8.
I am sorry to drop in here, but please have a look at the proposal (not a solution yet) tesseract-ocr/tesseract#1442 [Suggestion] "Training light" - Learning by doing - re-feeding a corrected text file to retrain tesseract.