Avoid text recognition in image areas of the page #38
Comments
Generally, that requires good text/image segmentation, which isn't part of ocropy yet. The plan is to provide a 2D LSTM implementation to learn text/image segmentation. For that, I've been working on the CLSTM implementation.
Okay, I see. Can you say something about an ETA for such a feature? Days, weeks, months?
It's probably going to be months; it's a pretty hard problem to solve in general, and there is only limited training data. Here is a paper describing an older approach: http://www.csse.uwa.edu.au/~shafait/papers/Bukhari-Text-Image-Segmentation-DAS10.pdf
I'm doing my master's thesis on this topic at the moment and would like to get involved in implementing segmentation useful for ocropy. @tmbdev, are you looking for people to help?
Any updates on this issue?
Both Tesseract and ocropus 0.4 use the Leptonica library to do text/image segmentation.
In several tests ocropus/ocropy tries to recognize some text within an image area in my pages. How can this be avoided? Here is an example of such a page: normal-beispiel-mit-bild
I run the same sequence of commands as in
run-test
and ocropus/ocropy recognizes two columns: one with the actual text and the other with some nonsensical symbols from the picture.
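Until proper text/image segmentation exists, one crude workaround is to blank out oversized connected components on the binarized page before running recognition. The sketch below is only an illustration of that idea: it is not part of ocropy, it is not what Leptonica does, and the function name `mask_large_components` and its size thresholds are hypothetical. It assumes the page is a plain list of rows with 1 for ink and 0 for background.

```python
from collections import deque


def mask_large_components(page, max_height, max_width):
    """Return a copy of `page` with oversized ink components cleared.

    `page` is a list of rows of 0/1 values (1 = ink). A 4-connected
    component is cleared when its bounding box is taller than
    `max_height` or wider than `max_width` (hypothetical thresholds,
    e.g. a few times the expected character size).
    """
    rows, cols = len(page), len(page[0])
    out = [row[:] for row in page]
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if page[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill collects one connected component.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and page[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Measure the bounding box of the component.
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            height = max(ys) - min(ys) + 1
            width = max(xs) - min(xs) + 1
            if height > max_height or width > max_width:
                for y, x in pixels:
                    out[y][x] = 0  # treat as image area, erase it
    return out


# Toy page: three isolated "character" pixels and one 3x4 "image" block.
page = [
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
]
cleaned = mask_large_components(page, max_height=2, max_width=2)
```

This only helps when the picture binarizes into large solid blobs. Halftone or textured images tend to break into many small specks that look like characters to a size filter, so a learned or texture-based segmentation would still be needed for those pages.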