ocropus-gpageseg: Enable usage of masks to specify column separators/ ignore areas of scan #236
Current behavior of the mask
A white on black mask can be applied to an image file. The white areas of the mask are interpreted as column separators and are merged with the column separators that have been identified by OCRopus itself.
The mask file is named after the basename of the image it applies to, extended by ".mask.png".
E.g. if the basename of the file is ".../book/0003", the mask is ".../book/0003.mask.png".
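A minimal sketch of that naming rule, assuming input images are named like "book/0003.bin.png" and that the basename is everything before the first dot of the file name. The helper `mask_path_for` is hypothetical and not part of OCRopus:

```python
import os

def mask_path_for(image_path, suffix=".mask.png"):
    """Return the mask file name that belongs to an image file.

    Assumption: the basename is the file name up to its first dot,
    so "book/0003.bin.png" has the basename "book/0003".
    """
    directory, name = os.path.split(image_path)
    base = name.split(".", 1)[0]  # drop all extensions (".bin.png", ".png", ...)
    return os.path.join(directory, base + suffix)
```

With this helper, `mask_path_for("book/0003.bin.png")` points at the "0003.mask.png" file in the same directory as the image.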
OCRopus currently deletes the pixels beneath the column separators it identifies, and the column separators specified by the mask behave the same way. Because of this, the mask can also be used to ignore areas of a scan. This might screw up the segmentation, though!
This way, images such as the one in the example of #38, or other areas, could be ignored.
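The merge-then-delete behavior can be sketched roughly as follows. This is only an illustration, not the actual ocropus-gpageseg code: the function name `apply_separators` and the convention that ink and separator pixels are 1 and everything else is 0 are assumptions.

```python
import numpy as np

def apply_separators(binary, colseps, mask=None):
    """Merge an optional mask into the detected separators and erase ink under them.

    binary  -- 2D array, 1 = ink, 0 = background (assumed convention)
    colseps -- 2D array, 1 = separator found by the segmenter
    mask    -- optional 2D array, 1 = white area of the user-supplied mask
    """
    if mask is not None:
        # White mask areas are treated exactly like detected separators.
        colseps = np.maximum(colseps, mask)
    # Every pixel underneath a separator is deleted from the page image.
    cleaned = binary * (1 - colseps)
    return cleaned, colseps
```

Because the mask is merged with an element-wise maximum, it can only add separator pixels, never remove ones that were already detected.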
Potential improvements/ changes
Using the mask to ignore areas might lead to segmentation problems if not done with care. Because of this, it might be desirable to support different kinds of masks, e.g. a mask extended by ".mask.sep.png" for column separators and another mask extended by ".mask.ia.png" that ignores areas of the scan.
Another potential improvement would be to introduce a new parameter to ocropus-gpageseg through which a "master mask" (or several master masks, if multiple masks are allowed) can be passed that is applied to every image processed by ocropus-gpageseg. This way, unwanted recurring patterns can easily be ignored on each scan without renaming or copying an existing mask.
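One way these two proposals could fit together is sketched below. Everything in it is hypothetical: `--mastermask` is not an existing ocropus-gpageseg option, `masks_for` is an illustrative helper, and treating master masks as "ignore area" masks is an assumption based on the use case described above.

```python
import argparse
import os

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("files", nargs="+", help="input images")
    # Hypothetical flag, may be given several times to pass multiple master masks.
    parser.add_argument("--mastermask", action="append", default=[],
                        help="mask applied to every input image")
    return parser

def masks_for(image_path, master_masks):
    """Return candidate (separator_masks, ignore_masks) paths for one image."""
    directory, name = os.path.split(image_path)
    base = os.path.join(directory, name.split(".", 1)[0])
    separator_masks = [base + ".mask.sep.png"]
    # Assumption: master masks are used to ignore recurring patterns.
    ignore_masks = [base + ".mask.ia.png"] + list(master_masks)
    return separator_masks, ignore_masks
```

The returned paths are only candidates; a caller would still have to check which of the files actually exist before loading them.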
Mask for first image of #208 as well as lineseeds with and without mask. Mask is used to identify correct separator and prevent identification of wrong separators in the text of the scan.
Disabled column separator identification by OCRopus when the mask is used, since in this case we only want to apply the separators specified by the mask.
Mask for image of #89 as well as lineseeds with and without mask. Mask is used to prevent "holes" in the separators (it would have been enough to mark the holes in the mask instead of the complete separators, since the specified areas are merged with the separators found by OCRopus).
Mask for image of #38 as well as lineseeds with and without mask. Mask is used to ignore image on scan.
tldr: Thank you very much @lehzwo for this nice PR!
I updated the PR and added a test case. However, I think that the smoke test here may not detect many errors. Currently, the output of the test looks like:
I didn't adjust the maximum number of lines, because ordering the lines takes quite some time. Thus, the run ends with this ERROR, but it is still a valid test case for us.
Yes, this is another nice application of such masks. I cannot think of any negative effect on the segmentation. The "smearing" step will not go into the white areas of the mask, but it will go everywhere else it would go in a normal run as well. This looks like exactly the behavior we want. @lehzwo Did you have anything specific in mind here? I would rather go further from here: with these "masks" it is also possible to run the recognition only in a limited area by just coloring everything else white. Thus, one can for example recognize only one column and ignore the remaining parts of the page.
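As a rough illustration of limiting recognition to one area, a mask that is white everywhere except a rectangle to keep could be built like this. The helper `mask_outside` and the rectangle coordinates are made up for the example, and 255 is assumed to be the value for white:

```python
import numpy as np

def mask_outside(shape, top, bottom, left, right):
    """Build a mask that is white (255) everywhere except one rectangle.

    The rectangle rows top..bottom-1 and columns left..right-1 stay
    black (0), so only that region is left for segmentation.
    """
    mask = np.full(shape, 255, dtype=np.uint8)
    mask[top:bottom, left:right] = 0
    return mask
```

The resulting array would then have to be saved next to the scan under the ".mask.png" name, using whatever image library is at hand.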
The only concern I have at the moment is that this feature is now very hidden. We certainly should document it well, e.g. in the wiki https://github.com/tmbdev/ocropy/wiki/Page-Segmentation . But maybe it also makes sense to add an additional parameter for it. With an additional parameter we can then change the default name as you suggest, but more important IMO is that the feature appears in