Page segmentation deletes some parts of the text, how to avoid? #259

franvillamil · 2017-11-14T13:43:55Z

I am using ocropy to extract data from old documents that list electoral results. These are big pages arranged in up to 4 columns, and I have taken a screenshot of each column. See a sample of the raw image (I know the quality is very bad, but the choice is basically between running OCR on this and copying it entirely by hand): https://user-images.githubusercontent.com/3774527/32782400-9111bab6-c948-11e7-9ea6-6266cc828627.png

To avoid problems, I'm trying to make ocropy read the text as a one-column text (see issue #240), after deleting every black line, and so far it's going more or less well. In some cases, however, ocropy deletes some parts (mainly numbers) when it does the segmentation. See below two screenshots of the original binary file and the segmentation output (the .nrm.png files):

Missing some part of '207', original binary:

Image after segmentation:

The '3' is completely removed, original binary:

After segmentation:

In some cases (e.g. when there are a few 1s, see below), it seems ocropy thinks these are black lines delimiting columns and tries to ignore them. But I don't want it to do this, as I'm already removing in Gimp every black line that could be mistaken for a separator.

Several 1s together might appear like a black line?:

Solution?

Does anyone know if there is any piece of code I can modify to avoid this? I've been looking into the gpageseg code but haven't found anything.

Your Environment

Python version: 2.7.14
Git revision of ocropy: Downloaded & installed today (Nov 2017)
Operating System and version: OS X 10.12


lehzwo · 2017-11-16T21:03:30Z

Hey @franvillamil,

could you please provide a whole example page? It is hard to reproduce the error with an image of a single column.

urhub · 2017-11-17T00:05:31Z

Several 1s together might appear like a black line?
It probably could. As far as I know, a "black line separator" needs a height of at least 20*scale (or 20*xheight) to be recognized as delimiting a column; please see issue #250.
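To make the height criterion concrete, here is a minimal sketch, assuming each connected component is summarized by its bounding-box height and width in pixels. The function name `looks_like_column_separator`, the `max_width_factor` parameter, and the sample numbers are hypothetical; only the 20*scale figure comes from the comment above, and this is not ocropy's actual implementation.

```python
def looks_like_column_separator(height, width, scale,
                                min_height_factor=20, max_width_factor=1.0):
    """Return True if a component is tall and thin enough to pass for a black separator.

    min_height_factor=20 mirrors the "20*scale" rule of thumb quoted above;
    max_width_factor is an illustrative assumption, not a value taken from ocropy.
    """
    return height >= min_height_factor * scale and width <= max_width_factor * scale


# Hypothetical numbers: a scale of 10 px, a single digit "1", and a stack of
# vertically touching "1"s that merged into one connected component.
scale = 10
single_one = (14, 3)     # (height, width) in pixels
merged_ones = (220, 4)

print(looks_like_column_separator(*single_one, scale=scale))   # False
print(looks_like_column_separator(*merged_ones, scale=scale))  # True
```

Under these assumptions an isolated "1" is far too short to qualify, so a misclassification would only be plausible if many digits touch vertically and form one very tall component.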

zuphilip · 2017-12-29T12:16:24Z

During page segmentation there is also a step which deletes small components (remove_noise) and another one which deletes horizontal lines (remove_hlines). Moreover, it is possible that some characters interfere with the column segmentation itself.

However, as @lehzwo already mentioned, I also cannot reproduce your issues with the images you provided, and I don't know whether you use any special parameters in the call.

@franvillamil if you provide more information so that we can reproduce the issue, then we can look at this again; otherwise I suggest closing this issue.
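The two clean-up steps named in this comment can be illustrated with a small pure-Python sketch. It is a simplified stand-in, not ocropy's code: the helper names `label_components` and `clean`, the thresholds `min_pixels` and `max_width`, and the toy page are all assumptions chosen for illustration. It shows how a component-size filter can silently drop a faint or fragmented digit along with real noise.

```python
from collections import deque


def label_components(grid):
    """Group 4-connected foreground pixels; return one list of (row, col) per component."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(pixels)
    return components


def clean(grid, min_pixels, max_width):
    """Drop components with fewer than min_pixels (noise) or wider than max_width (rules)."""
    out = [[0] * len(grid[0]) for _ in grid]
    for pixels in label_components(grid):
        xs = [x for _, x in pixels]
        width = max(xs) - min(xs) + 1
        if len(pixels) < min_pixels or width > max_width:
            continue  # component is removed from the output
        for y, x in pixels:
            out[y][x] = 1
    return out


page = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # long horizontal rule
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 1],  # a glyph-sized blob and a single speck
    [0, 1, 1, 0, 0, 0, 0, 0],
]
for row in clean(page, min_pixels=2, max_width=5):
    print(row)
```

Only the 2x2 blob survives: the rule is wider than `max_width` and the speck has fewer than `min_pixels` pixels. A badly binarized digit that breaks into tiny fragments would be treated like the speck.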

ChillarAnand · 2017-12-30T15:59:14Z

I also faced this issue. Here are some sample images.

Commands used to process the image

pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-nlbin /home/chillaranand/projects/ocr/data/vishadam-021.png -o output -n 
pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-gpageseg output/????.bin.png -n 
pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-rpred -Q 4 -m /home/chillaranand/projects/ocr/ocropy/models/te.pyrnn.gz output/????.bin.png -n

zuphilip · 2017-12-30T17:44:35Z

@ChillarAnand Okay, it looks like the parts below the baseline (descenders?) in your examples are larger than expected. The computed lines then look like this:

and lines from the descender part are then neglected. Try to adjust the vscale/scale manually, e.g.

./ocropus-gpageseg temp/chillar0001.bin.png --debug -n --vscale 1.5

which should work well, except for the page number on the top left (but I think this is a known issue).
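If several values need to be compared, the command above can be generated for a range of --vscale settings. This is only a sketch: the helper name `gpageseg_commands` and the list of values are assumptions, the flags are the ones shown in the command above, and nothing is executed unless you pass the argument lists to a process runner yourself.

```python
def gpageseg_commands(image_glob, vscales, script="./ocropus-gpageseg"):
    """Build one argument list per --vscale value; nothing is executed here."""
    return [
        [script, image_glob, "--debug", "-n", "--vscale", str(v)]
        for v in vscales
    ]


for argv in gpageseg_commands("temp/chillar0001.bin.png", [1.0, 1.5, 2.0]):
    print(" ".join(argv))
    # To actually run a command (requires ocropy to be installed), pass argv
    # to subprocess.run(argv, check=True) after importing subprocess.
```

Since --debug writes its pictures into the current folder, it may be safer to run each variant from its own working directory so the debug images of different runs do not overwrite each other.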

ChillarAnand · 2018-01-21T13:17:51Z

Thank you @zuphilip. With --vscale it is segmenting correctly. How did you generate the image above?

zuphilip · 2018-01-21T14:54:01Z

How did you generate the image above?

The --debug option produces such pictures in the folder where you are calling the script.

Shreeshrii · 2020-02-13T07:25:20Z

I tried the above image with different --vscale values. With --vscale 2.0 it gets all the text, but with 2 lines per image (i.e. it gets 13 images instead of 26). Without it, it gets only 2 lines from the whole page.

 ocropus-gpageseg 'book/????.bin.png' --debug -n --vscale 2.0  --maxcolseps 0 --maxseps 0
INFO:
INFO:  ########## /usr/local/bin/ocropus-gpageseg book/????.bin.png --debug -n
INFO:
INFO:  book/0001.bin.png
INFO:  scale 57.236352
INFO:  computing segmentation
INFO:  computing column separators
INFO:  considering at most 0 whitespace column separators
INFO:  debug _1thresh.png
INFO:  debug _2grad.png
INFO:  debug _3seps.png
INFO:  debug _4seps.png
INFO:  debug _colwsseps.png
INFO:  computing lines
INFO:  debug _cleaned.png
INFO:  debug _lineseeds.png
INFO:  debug _seeds.png
INFO:  propagating labels
INFO:  spreading labels
INFO:  number of lines 14
INFO:  finding reading order
INFO:  writing lines
INFO:      12  book/0001.bin.png 57.2 13

ocropus-gpageseg 'book/????.bin.png'
INFO:
INFO:  ########## /usr/local/bin/ocropus-gpageseg book/????.bin.png
INFO:
INFO:  book/0001.bin.png
INFO:  scale 57.236352
INFO:  computing segmentation
INFO:  computing column separators
INFO:  considering at most 3 whitespace column separators
INFO:  computing lines
INFO:  propagating labels
INFO:  spreading labels
INFO:  number of lines 54
INFO:  finding reading order
INFO:  writing lines
INFO:       1  book/0001.bin.png 57.2 2

_lineseeds.png seems to be identifying all the lines.
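To compare runs like the two above without reading the logs by eye, a small parser can pull out the detected and written line counts. This is a sketch under assumptions: the function name `summarize` is hypothetical, and it assumes the last field of the final summary line is the number of line images written, which matches the counts reported in this comment (13 and 2).

```python
import re

# "INFO:       1  book/0001.bin.png 57.2 2" -> index, file name, scale, lines written
SUMMARY = re.compile(r"^INFO:\s+(\d+)\s+(\S+)\s+([\d.]+)\s+(\d+)\s*$")
# "INFO:  number of lines 54" -> lines detected before reading-order filtering
DETECTED = re.compile(r"^INFO:\s+number of lines\s+(\d+)\s*$")


def summarize(log_text):
    """Return {file name: (detected, written)} from gpageseg-style INFO output."""
    results, detected = {}, None
    for line in log_text.splitlines():
        m = DETECTED.match(line)
        if m:
            detected = int(m.group(1))
            continue
        m = SUMMARY.match(line)
        if m:
            results[m.group(2)] = (detected, int(m.group(4)))
            detected = None
    return results


log = """INFO:  book/0001.bin.png
INFO:  scale 57.236352
INFO:  number of lines 54
INFO:  writing lines
INFO:       1  book/0001.bin.png 57.2 2
"""
print(summarize(log))  # {'book/0001.bin.png': (54, 2)}
```

A large gap between the two numbers, as in the default run (54 detected, 2 written), suggests that most candidate lines are discarded after detection rather than never found.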

zuphilip added the ❔ question label Dec 29, 2017
