ocropus-gpageseg: Defective line splitting #210

Open
wrznr opened this Issue May 2, 2017 · 7 comments


wrznr commented May 2, 2017

Expected Behavior

Simple running text should be consistently split into lines.

Current Behavior

I am currently working on data from the Grenzboten project together with @uvius. For some images, line splitting does not work. It is not clear why, since very similar images are split correctly.

Steps to Reproduce (for bugs)

  1. Download test files:

179411_01 nrm
179411_01 bin

  2. Run ocropus-gpageseg on testfile(s).
  3. Inspect results.

Test files have been created with ocropus-nlbin. Tested various command line parameter settings without success.

Your Environment

  • Python version: Python 2.7.3
  • Git revision of ocropy: commit 49c7f9e
  • Operating System and version: Linux lal 3.2.0-4-amd64 #1 SMP Debian 3.2.73-2+deb7u2 x86_64 GNU/Linux
Collaborator

zuphilip commented May 2, 2017

Okay, I looked at the debug output with --debug and it seems that the detected scale is too small (approximately half of the correct size):

[Image: scale-default_lineseeds]

The disconnected (red) components then end up as separate lines.

If you increase that value by hand by setting the --scale parameter:

ocropus-gpageseg grenzboten.bin.png -n --debug --scale 30

then the output looks good:

[Image: scale-30_lineseeds]

(Don't forget to remove all old images from the directory containing the lines.)
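The cleanup and the re-run can be scripted. The following is a minimal Python sketch, not part of ocropy: the helper names `clear_line_images` and `gpageseg_command` are made up for illustration, and it assumes that the old line images are PNG files in a single directory. The command-line flags are exactly the ones from the call above.

```python
import glob
import os


def clear_line_images(line_dir):
    """Delete PNG files left in line_dir by an earlier run.

    Assumption: the old line images are *.png files directly inside line_dir.
    Returns the number of files removed.
    """
    stale = glob.glob(os.path.join(line_dir, "*.png"))
    for path in stale:
        os.remove(path)
    return len(stale)


def gpageseg_command(image, scale):
    """Build the argument list for the call shown above with a manual scale."""
    return ["ocropus-gpageseg", image, "-n", "--debug", "--scale", str(scale)]
```

To use it, call `clear_line_images` on the directory that holds the old lines, then pass the result of `gpageseg_command("grenzboten.bin.png", 30)` to `subprocess.check_call`.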

wrznr commented May 16, 2017

Alright, thanks. That fixes the issue for the specific image (and many others). But if I set this parameter globally for the whole (pre-segmented) book, new problems arise with smaller (e.g. on-line images). Is there a known bug in the scale detection?

Collaborator

zuphilip commented May 16, 2017

> new problems arise with smaller (e.g. on-line images)

I don't know exactly what you mean by "on-line images", but in general, when font sizes vary a lot (header vs. body text vs. footnote text), ocropus has some problems and you might need some additional steps.

> Is there a known bug in the scale detection?

Nothing I am aware of, but in the example you provide, the scale that ocropus guesses does not look optimal. My guess is that for your test image the binarization produces characters that are split into several connected components, and this influences the estimation of the scale parameter. I tried another binarization method here, and then the result seems also okay.

wrznr commented May 17, 2017

Sorry @zuphilip. This is a typo and should be "one-line images" (i.e., images which cover only a single line). So it's not the varying font size but rather varying clipping sizes from the whole page image which cause the issues.

> I tried another binarization method here, and then the result seems also okay.

This is a great idea. I used ocropus-nlbin, which seems the most obvious choice. In my experience, the Tesseract line splitting is far superior to ocropus-gpageseg, but this probably boils down to binarization.

Many thanks for your ongoing support!

Collaborator

zuphilip commented May 17, 2017

The scale estimation in ocropus for your example produces this scalemap:

[Image: grenzboten-scalemap]

As far as I understand, the following then happens: for each of these boxes, the algorithm calculates the area and takes its square root (i.e. the geometric mean of width and height). The median of these numbers (excluding outliers) is then taken as the scale. Maybe in your example there are too many small connected components and/or the font is too narrow...

( The corresponding Jupyter notebook is here: https://gist.github.com/zuphilip/e551ba6b733b5094749799651e4fbd3e )
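The estimation described above can be illustrated with a small standalone Python 3 sketch. It follows the description in this comment rather than the ocropy source: the function name `estimate_scale`, the per-box median, and the outlier bounds `lo` and `hi` are assumptions made for illustration, and the box sizes are invented. It shows how many small fragments can pull the median down to about half of the size of the intact characters.

```python
import math
from statistics import median


def estimate_scale(boxes, lo=3.0, hi=100.0):
    """Median of sqrt(height * width) over component bounding boxes.

    boxes: iterable of (height, width) pairs in pixels.
    lo, hi: illustrative bounds; sizes outside (lo, hi) count as outliers.
    """
    sizes = [math.sqrt(h * w) for h, w in boxes]
    kept = [s for s in sizes if lo < s < hi]
    if not kept:
        raise ValueError("no component within the size bounds")
    return median(kept)


# Invented example data: ten intact characters of 30 x 30 pixels ...
intact = [(30, 30)] * 10
# ... plus twelve fragments of 15 x 15 pixels from broken characters.
fragmented = intact + [(15, 15)] * 12

print(estimate_scale(intact))      # 30.0
print(estimate_scale(fragmented))  # 15.0
```

As soon as the fragments outnumber the intact components, the median lands on the fragment size, which would match the observation that the detected scale is roughly half of the correct one.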

Sauvola is one possibility, and I think ocropus-nlbin has more parameters to try out. Moreover, it should be possible to mix some of the steps of Tesseract with some of the steps of Ocropus.

@wrznr Thank you for asking interesting questions!

wrznr commented May 30, 2017

Indeed, using e.g. scantailor for binarization results in an almost error-free line splitting! Only small one-line segments like page numbers and signature marks (which is probably to be expected) are not correctly processed. Significant step forward!

While this is great for me, is it a problem for ocropus (i.e., does it point to problems in the combination of nlbin and gpageseg)?
