[Suggestion] "Training light" - Learning by doing - re-feeding a corrected text file to retrain tesseract #1442

Closed
Wikinaut opened this issue Mar 29, 2018 · 6 comments

@Wikinaut
Contributor

(probably something for #1423)

When I have a manually corrected version of the OCR text (the txt output file), I would very much like an easy way to retrain Tesseract based on (1) the original Tesseract txt output and (2) the corresponding manually corrected txt file.

I remember asking this question some years ago, but the situation has changed since then (more and different developers, new algorithms).

Do you like the idea? Is it possible?

@stweil
Contributor

stweil commented Mar 29, 2018

The idea is good and it is possible, but today there is no easy way to do it. Work remains to be done to make it easy.

@Wikinaut
Contributor Author

@stweil Can I help with a donation?

@stweil
Contributor

stweil commented Mar 29, 2018

See http://gepris.dfg.de/gepris/projekt/394264782 (German).

@Wikinaut
Contributor Author

Wikinaut commented Mar 29, 2018

I understand. Some years ago I developed a server (an open-source replacement for a FineReader appliance) with a scheduler and so on. Users could upload multi-page PDF documents through a web form (with no size limit; we could OCR PDFs of more than 1,000 pages), and they received an e-mail with a link when their "OCR jobs" were ready.

@Wikinaut
Contributor Author

Wikinaut commented Mar 29, 2018

That was when I discovered that Tesseract (at the time) internally re-encoded documents with lossy compression, which is a no-go for CG and for anything with sharp edges, such as fonts, the very thing we want to OCR.

@amitdo
Collaborator

amitdo commented Sep 10, 2021

As @stweil already said, this feature was implemented years ago (fine-tuning an LSTM model).

We should not keep this issue open just for the "make it easy" part. PRs to this repo or to the tesstrain repo are welcome.
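
For readers wondering how a corrected text file could feed into fine-tuning, here is a minimal sketch, not an official tool. It assumes the page has already been split into one image per text line and that the corrected txt file has exactly one line per image, in the same order. It then writes the `.gt.txt` transcription files that tesstrain pairs with line images. The function name `write_ground_truth` and the example file names are hypothetical, not part of Tesseract or tesstrain.

```python
from pathlib import Path


def write_ground_truth(line_images, corrected_lines, out_dir):
    """Write one <image stem>.gt.txt per line image with its corrected text.

    Assumption: line_images and corrected_lines are aligned one-to-one.
    Hypothetical helper, not part of Tesseract or tesstrain.
    """
    if len(line_images) != len(corrected_lines):
        raise ValueError("need exactly one corrected line per line image")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for image, text in zip(line_images, corrected_lines):
        # line_001.png -> line_001.gt.txt
        gt_path = out / (Path(image).stem + ".gt.txt")
        # One transcription line per file, without surrounding whitespace.
        gt_path.write_text(text.strip() + "\n", encoding="utf-8")
        written.append(gt_path)
    return written
```

The line images would need to sit in the same ground-truth directory as the generated `.gt.txt` files. Fine-tuning from an existing model is then started through the tesstrain Makefile; see its README for variables such as `MODEL_NAME` and `START_MODEL`.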

@amitdo amitdo closed this as completed Sep 10, 2021