Page segmentation deletes some parts of the text, how to avoid? #259

Open
franvillamil opened this issue Nov 14, 2017 · 8 comments

@franvillamil
franvillamil commented Nov 14, 2017

I am using ocropy to extract data from old documents that list electoral results. These are big pages arranged in up to 4 columns, and I have taken a screenshot of each column. See a sample of the raw image (I know the quality is very bad, but the choice is basically between running OCR on this and copying it entirely by hand): https://user-images.githubusercontent.com/3774527/32782400-9111bab6-c948-11e7-9ea6-6266cc828627.png

To avoid problems, I'm trying to make ocropy read the text as one-column text (see issue #240), after deleting every black line, and so far it's going more or less well. In some cases, however, ocropy deletes some parts (mainly numbers) when it does the segmentation. Below are two pairs of screenshots showing the original binary file and the segmentation output (the .nrm.png files):

Missing some part of '207', original binary:
ocropy1a
Image after segmentation:
ocropy1a

The '3' is completely removed, original binary:
ocropy2a
After segmentation:
ocropy2b

In some cases (e.g. when there are a few 1s, see below), it seems ocropy thinks these are black lines delimiting columns and tries to ignore them. But I don't want it to do this, as I'm already removing in Gimp every black line that could be mistaken for a column separator.

Several 1s together might appear like a black line?:
ocropy3

Solution?

Does anyone know if there is any piece of code I can modify to avoid this? I've been looking into the gpageseg code but haven't found anything.

Your Environment

  • Python version: 2.7.14
  • Git revision of ocropy: Downloaded & installed today (Nov 2017)
  • Operating System and version: OS X 10.12
@lehzwo
Contributor

lehzwo commented Nov 16, 2017

Hey @franvillamil,

could you please provide a whole example page? It is hard to reproduce the error with an image of a single column.

@urhub
urhub commented Nov 17, 2017

Several 1s together might appear like a black line?
It probably could. I know that a "black line separator" needs a height of 20*scale or 20*xheight to be recognized as delimiting a column; please see issue #250.
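To make the height threshold concrete, here is a minimal sketch in plain Python. It is not ocropy's code (ocropy works on image arrays with morphological operations); the function names and the list-of-lists image format are assumptions for illustration, and only the factor 20 comes from the comment above. It measures the tallest vertical run of black pixels and compares it with 20*scale.

```python
def tallest_black_run(binary):
    """Return the longest vertical run of 1-pixels found in any column.

    `binary` is a list of rows, each row a list of 0/1 values (1 = black).
    """
    if not binary:
        return 0
    height, width = len(binary), len(binary[0])
    best = 0
    for x in range(width):
        run = 0
        for y in range(height):
            if binary[y][x]:
                run += 1
                best = max(best, run)
            else:
                run = 0
    return best


def could_be_column_separator(binary, scale, factor=20):
    """True if some vertical black run is at least factor*scale pixels tall."""
    return tallest_black_run(binary) >= factor * scale
```

Under this reading, a few stacked 1s would only look like a separator if they touch each other after binarization and together reach the threshold height, which is worth checking on the binarized column image.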

@zuphilip
Collaborator

During page segmentation there is also a step which deletes small components (remove_noise) and another one which deletes horizontal lines (remove_hlines). Moreover, it is possible that some characters interfere with the column segmentation itself.
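As a rough illustration of how cleanup steps of this kind can erase real characters, the sketch below removes connected components that are very small or very wide. It is a plain-Python stand-in, not the actual remove_noise or remove_hlines code; the function names, the 4-connectivity, and the thresholds `min_pixels` and `max_width_factor` are hypothetical, and treating "line-like" as "much wider than the text scale" is an assumption based on the name remove_hlines.

```python
from collections import deque


def connected_components(binary):
    """Yield each 4-connected component of 1-pixels as a list of (y, x)."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    for y0 in range(height):
        for x0 in range(width):
            if not binary[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue, pixels = deque([(y0, x0)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            yield pixels


def clean(binary, scale, min_pixels=8, max_width_factor=10):
    """Return a copy with tiny components and very wide components erased."""
    out = [row[:] for row in binary]
    for pixels in connected_components(binary):
        xs = [x for _, x in pixels]
        width = max(xs) - min(xs) + 1
        too_small = len(pixels) < min_pixels              # noise-like speck
        too_wide = width > max_width_factor * scale       # line-like stroke
        if too_small or too_wide:
            for y, x in pixels:
                out[y][x] = 0
    return out
```

With filters like these, a faint digit that breaks into a few small fragments after binarization could be dropped as noise, so comparing the debug images before and after cleanup is a reasonable first check.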

However, as @lehzwo already mentioned, I also cannot reproduce your issues with the images you provided, and I don't know whether you use any special parameters during the call.

@franvillamil if you provide more information so that we can reproduce the issue, then we can look at this again; otherwise I suggest closing this issue.

@ChillarAnand
Contributor

I also faced this issue. Here are some sample images.

Commands used to process the image

pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-nlbin /home/chillaranand/projects/ocr/data/vishadam-021.png -o output -n 
pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-gpageseg output/????.bin.png -n 
pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-rpred -Q 4 -m /home/chillaranand/projects/ocr/ocropy/models/te.pyrnn.gz output/????.bin.png -n

@zuphilip
Collaborator

@ChillarAnand Okay, it looks like the parts below the baseline (descenders?) in your examples are larger than expected. The computed lines then look like this:

_lineseeds

and lines from the descender part are then neglected. Try to adjust the vscale/scale manually, e.g.

./ocropus-gpageseg temp/chillar0001.bin.png --debug -n --vscale 1.5

which should work well, except for the page number on the top left (but I think this is a known issue).
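If it is unclear which value works for a given page, one option is to try a few in turn. The sketch below only builds the command lines and does not run anything; the helper name `gpageseg_commands` and the list of values are hypothetical, while the script path and the flags (--debug, -n, --vscale) are the ones from the command above.

```python
import shlex


def gpageseg_commands(image_glob, vscales, script="./ocropus-gpageseg"):
    """Build one command line per --vscale value (nothing is executed here)."""
    commands = []
    for vscale in vscales:
        commands.append(
            [script, image_glob, "--debug", "-n", "--vscale", str(vscale)]
        )
    return commands


if __name__ == "__main__":
    # Print the commands so they can be inspected or pasted into a shell.
    for cmd in gpageseg_commands("temp/chillar0001.bin.png", [1.0, 1.5, 2.0]):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Each list can be passed to `subprocess.run` as is. It is probably safest to run each value from its own working directory, so that debug images from one run are not overwritten by the next.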

@ChillarAnand
Contributor

Thank you @zuphilip. With --vscale it is segmenting correctly. How did you generate the image above?

@zuphilip
Collaborator

How did you generate the image above?

The --debug option produces such pictures in the folder where you are calling the script.

@Shreeshrii

03

I tried the above image with different --vscale values. With --vscale 2.0 it gets all the text but puts 2 lines in each image (i.e. it produces 13 images instead of 26). Without it, it gets only 2 lines from the whole page.

 ocropus-gpageseg 'book/????.bin.png' --debug -n --vscale 2.0  --maxcolseps 0 --maxseps 0
INFO:
INFO:  ########## /usr/local/bin/ocropus-gpageseg book/????.bin.png --debug -n
INFO:
INFO:  book/0001.bin.png
INFO:  scale 57.236352
INFO:  computing segmentation
INFO:  computing column separators
INFO:  considering at most 0 whitespace column separators
INFO:  debug _1thresh.png
INFO:  debug _2grad.png
INFO:  debug _3seps.png
INFO:  debug _4seps.png
INFO:  debug _colwsseps.png
INFO:  computing lines
INFO:  debug _cleaned.png
INFO:  debug _lineseeds.png
INFO:  debug _seeds.png
INFO:  propagating labels
INFO:  spreading labels
INFO:  number of lines 14
INFO:  finding reading order
INFO:  writing lines
INFO:      12  book/0001.bin.png 57.2 13
ocropus-gpageseg 'book/????.bin.png'
INFO:
INFO:  ########## /usr/local/bin/ocropus-gpageseg book/????.bin.png
INFO:
INFO:  book/0001.bin.png
INFO:  scale 57.236352
INFO:  computing segmentation
INFO:  computing column separators
INFO:  considering at most 3 whitespace column separators
INFO:  computing lines
INFO:  propagating labels
INFO:  spreading labels
INFO:  number of lines 54
INFO:  finding reading order
INFO:  writing lines
INFO:       1  book/0001.bin.png 57.2 2

_lineseeds.png seems to be identifying all the lines.

_lineseeds
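The gap between the "number of lines" message and the count on the final summary line is easier to track across runs with a small log parser. This is a sketch under an assumption: that the last field of the summary line (13 and 2 in the logs above) is the number of line images actually written. The function name and regular expressions are hypothetical and only match the log format shown in this comment.

```python
import re

LINES_FOUND = re.compile(r"^INFO:\s+number of lines (\d+)\s*$")
# Assumed summary format: index, file name, scale, count of line images written.
SUMMARY = re.compile(r"^INFO:\s+(\d+)\s+(\S+)\s+([\d.]+)\s+(\d+)\s*$")


def summarize(log_text):
    """Return a list of (file name, lines found, lines written) tuples."""
    results, found = [], None
    for line in log_text.splitlines():
        m = LINES_FOUND.match(line)
        if m:
            found = int(m.group(1))
            continue
        m = SUMMARY.match(line)
        if m:
            results.append((m.group(2), found, int(m.group(4))))
            found = None  # the next page starts with no count yet
    return results
```

For the second run above this gives 54 lines found against 2 written, which suggests the line seeds are detected and that most lines are lost in a later step.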
