Avoid text recognition in image areas of the page #38

zuphilip · 2015-05-16T09:43:12Z

In several tests ocropus/ocropy tries to recognize some text within an image area in my pages. How can this be avoided? Here is an example of such a page: normal-beispiel-mit-bild

I run the same sequence of commands as in run-test, and ocropus/ocropy recognizes two columns: one with the actual text and the other with nonsensical symbols from the picture.

tmbdev · 2015-06-05T09:23:27Z

Generally, that requires good text/image segmentation, which isn't part of ocropy yet.

The plan is to provide a 2D LSTM implementation to learn text/image segmentation. For that, I've been working on the CLSTM implementation.

zuphilip · 2015-06-05T17:29:45Z

Okay, I see. Can you say something about an ETA for such a feature? Days, weeks, months?

tmbdev · 2015-06-05T19:44:41Z

It's probably going to be months; it's a pretty hard problem to solve in general, and there is only limited training data.

Here is a paper describing an older approach: http://www.csse.uwa.edu.au/~shafait/papers/Bukhari-Text-Image-Segmentation-DAS10.pdf

JamesOwers · 2015-06-16T12:31:57Z

I'm doing my master's thesis on this topic at the moment and would like to get involved in implementing segmentation useful for ocropy. @tmbdev, are you looking for people to help?

ankitag9 · 2015-09-06T12:11:07Z

@tmbdev @danvk I have been using ocropy for some time now and have made a few changes as well. I am planning to work on ML-based image segmentation next. Do you have any alpha-version code for this that I can build on?

ChillarAnand · 2017-07-15T17:03:26Z

Any updates on this issue?

zuphilip · 2017-12-27T13:34:34Z

Old C++ code is here: https://github.com/michaelyin/ocropus-git/blob/ba6930627f3f81392a05a02bfcf7dac2595c35cf/ocr-layout/ocr-text-image-seg.cc and some Python code (~2012) is here: https://github.com/zuphilip/ocropus-from-searchcode/blob/master/ocropy/ocropus-tiseg and https://github.com/zuphilip/ocropus-from-searchcode/blob/master/ocropy/OLD/ocropy/oldsimpleti.py

amitdo · 2017-12-27T16:08:29Z

Both Tesseract and ocropus 0.4 use the Leptonica library to do text/image segmentation.

zuphilip added the ✨ enhancement label Oct 31, 2016

zuphilip added the 🙇‍♂️ help wanted label Apr 25, 2017

zuphilip mentioned this issue Jul 5, 2017: Support for Japanese Language #229 (Open)

lehzwo mentioned this issue Jul 21, 2017: ocropus-gpageseg: Enable usage of masks to specify column separators/ignore areas of scan #236 (Merged)
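
Until learned text/image segmentation is available, one stopgap in the spirit of the mask idea above is to blank out known picture regions by hand before running page segmentation. The following is only a minimal sketch of that idea and not the mechanism implemented in #236: the function name `blank_regions`, the box format `(top, left, bottom, right)`, and the assumption of a grayscale or binarized page with a white (255) background are all hypothetical choices made for illustration.

```python
import numpy as np


def blank_regions(page, boxes, background=255):
    """Return a copy of `page` with each box painted in the background value.

    `page` is a 2D array (assumed: dark text on a white background).
    `boxes` is a list of (top, left, bottom, right) pixel coordinates,
    with bottom/right exclusive. Both conventions are assumptions of
    this sketch, not ocropy's API.
    """
    out = page.copy()
    height, width = out.shape[:2]
    for top, left, bottom, right in boxes:
        # Clip to the image bounds so negative or oversized coordinates
        # cannot wrap around through NumPy's negative indexing.
        top, bottom = max(0, top), min(height, bottom)
        left, right = max(0, left), min(width, right)
        if top < bottom and left < right:
            out[top:bottom, left:right] = background
    return out


# Tiny synthetic page: white background, one "text" block, one "picture" block.
page = np.full((10, 12), 255, dtype=np.uint8)
page[1:4, 1:5] = 0    # stands in for a text column
page[5:9, 6:11] = 0   # stands in for a picture

# Blank out the picture by hand; the text block is left untouched.
cleaned = blank_regions(page, [(5, 6, 9, 11)])
```

The cleaned array could then be saved and passed to the usual segmentation step in place of the original page. The boxes still have to come from somewhere (manual annotation or another tool), which is exactly the part that proper text/image segmentation would automate.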

zuphilip mentioned this issue Jan 13, 2019: can ocropy give classification of different type of text line? #317 (Closed)
