Proposal: A color layer in mixed mode to keep color text #264

Open
trufanov-nok opened this issue Mar 6, 2017 · 7 comments

@trufanov-nok

commented Mar 6, 2017

Hi,

I've implemented a few enhancements in my ScanTailor fork and would like to propose some of them.
The most interesting one is "Color layer" support in Mixed mode. It addresses the following problems:

I. If your scan contains colored text, or some colored arrows or glyphs, the picture detector can easily miss these areas because they are too thin or small and don't look like a square picture. In that case you either accept that they will come out in black and white or select their areas manually. Manual selection is very unproductive, and usually you can't select the area you need without also capturing the background. And the background may be non-white, simply because the page is old and yellowish.

For example, this page:
[image: original page]

in mixed mode, without adding manual zones, becomes:

[image: mixed page]

Or take this page, which contains a lot of particles, some of them drawn in blue:

[image: page3]

That information is completely lost by the mixed mode picture detector:

[image: page4]

II. Some text can't be binarized well enough. Because it was printed very small or the print quality was poor, its letters are surrounded by grayish areas or contain grayish pixels between closely spaced parts of a glyph. Like this:

[image: letter1]

Whatever DPI you scan at, there will be gray inside the "e" and "a", simply because that is how it was printed at the printing house. And you'll get:
[image: 2]
or
[image: 3]
in ScanTailor for the default (0) and the thinnest (-50) binarization settings, respectively. A human can still read it, but it causes problems for OCR. The only way to improve this is to preserve the original text quality rather than degrade it, for example by selecting the text in mixed mode together with its surrounding background. It will then look like the original picture.
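
To illustrate why no single threshold rescues such text, here is a minimal sketch. It assumes a plain global threshold and invented gray values for one scanline through a letter; ScanTailor's actual binarization, and the meaning of its slider values, are not modeled here.

```python
# One scanline through the bowl of an "e": two dark strokes with a
# grayish smear between them, on light paper. The values are invented
# for illustration (0 = black, 255 = white).
scanline = [230, 40, 35, 140, 150, 145, 38, 42, 228]


def binarize_row(row, threshold):
    """Return '#' for pixels darker than the threshold, '.' otherwise."""
    return "".join("#" if value < threshold else "." for value in row)


# A high threshold turns the gray smear black: the gap inside the letter closes.
filled = binarize_row(scanline, 160)

# A low threshold keeps the gap open, but the smear is simply discarded.
thinned = binarize_row(scanline, 100)

print(filled)
print(thinned)
```

Either way the intermediate gray levels are gone after binarization, which is the information the proposal below tries to keep.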

Solution.

Both problems can be addressed. The results are:
[image: 4]

Above is text of the same quality, but on a white background. I'm not sure whether modern OCR algorithms can benefit from grayscale text, but I'm glad to keep this information in the scans for the future.

[image: 5]

All arrows and glyphs are in color. Text is in BW.

[image: 6]

All particles are in color. Text is in BW.

The trick is simple. We just need to process the original scan in 2 modes: black & white and grayscale/color, then use the first as a mask for the second. That gives us colored text and glyphs on a white background.
So the user has 3 sources for mixed mode: BW, COLOR and MASKED COLOR. You can now combine them: for example, take some paragraphs with large enough text from the BW layer, some with small text from the MASKED COLOR layer, and pictures from the COLOR layer.
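
The masking step and the choice between the three sources can be sketched roughly as follows. This is a minimal illustration, not the fork's code: the fixed threshold stands in for ScanTailor's binarization, and `binarize`, `masked_color`, `compose` and the rectangular zone tuples are hypothetical names invented for this example.

```python
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)


def luma(pixel):
    """Approximate brightness of an (r, g, b) pixel, 0..255."""
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) // 1000


def binarize(image, threshold=128):
    """BW source as a mask: True where the pixel counts as ink.
    Assumption: a fixed global threshold, for illustration only."""
    return [[luma(p) < threshold for p in row] for row in image]


def masked_color(image, mask):
    """MASKED COLOR source: original pixels where the mask has ink, white elsewhere."""
    return [[p if ink else WHITE for p, ink in zip(row, mask_row)]
            for row, mask_row in zip(image, mask)]


def compose(image, zones, default="BW"):
    """Take each output pixel from the source of the first zone containing it.
    A zone is (x0, y0, x1, y1, source) with exclusive x1/y1, where source is
    one of "BW", "COLOR" or "MASKED_COLOR"."""
    mask = binarize(image)
    sources = {
        "BW": [[BLACK if ink else WHITE for ink in row] for row in mask],
        "COLOR": image,
        "MASKED_COLOR": masked_color(image, mask),
    }
    out = []
    for y, row in enumerate(image):
        out_row = []
        for x in range(len(row)):
            name = default
            for x0, y0, x1, y1, source in zones:
                if x0 <= x < x1 and y0 <= y < y1:
                    name = source
                    break
            out_row.append(sources[name][y][x])
        out.append(out_row)
    return out


PAPER = (240, 230, 200)   # yellowish background
BLUE_INK = (30, 40, 200)  # a colored arrow or glyph
DARK_INK = (20, 20, 20)   # ordinary text

page = [
    [PAPER, BLUE_INK, PAPER],
    [DARK_INK, PAPER, BLUE_INK],
]

# Row 0 is taken from the masked color source; row 1 falls back to BW.
result = compose(page, zones=[(0, 0, 3, 1, "MASKED_COLOR")])
```

In this toy page the blue glyph in row 0 keeps its color on a white background, while the same blue in row 1, outside any zone, comes out black.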

This is, in a nutshell, what I implemented in my ScanTailor fork to automate this process. I built it on top of mixed mode and added 2 more controls to it:
[image: controls]
By default the color layer is disabled. The auto layer is the result of the default picture region detector. By the way, it makes sense to have a control to disable it: sometimes its results are completely wrong, and it's easier to discard them and select zones manually. Currently I have to create a big "remove from auto layer" zone around the whole page to achieve this.
If the color layer is enabled, everything that is not white in BW mode glows bluish in the Picture Zones tab. Manually set zones can now be applied to the color layer too. In fact, if the auto layer is disabled, all zones are applied to the color layer. If the auto layer is enabled, the "Subtract from all layers" zone type is applied to both auto and color zones. The content of a color zone that lies inside such a zone is displayed in BW. This way you can decide whether you want a given part of the text in grayscale or black.
[image: controls2]

Still there are known limitations:

  1. I've disabled this layer when Dewarping is used. It can be implemented for that mode too, but I ran into the complexity of displaying a non-dewarped color layer in the Picture Zones tab.
  2. It would be better to have 2 thickness controls, one for the black-and-white text and one for the black-and-white text that I use as a mask, but the current ScanTailor UI seems to be overloaded. They just don't fit.
  3. As I tried to change as little as possible in the original code, there are some inconsistencies in how layers are highlighted in the Picture Zones tab. For example, pictures can belong to both the color layer and the auto layer, simply because BW mode "sees" them too. And if you put an image in a "remove from auto layer" zone, it should still glow, because this zone doesn't affect the color layer. But it doesn't, as I'm simulating these layers while working with a single mask image composed from them beforehand.

P.S. I have had no chance to test it on a Windows machine, but these changes shouldn't be platform dependent.

@GalloglyP

commented Mar 6, 2017

I was about to post an issue before I read this. Would this also address pages that include both black text on a white background and white text on a black background? Using the threshold slider improves either the black text or the white text while degrading the other on the same page.

@trufanov-nok

Author

commented Mar 6, 2017

Perhaps. The threshold is treated differently for color layer zones. It depends. Please attach a sample page here or upload it somewhere and I'll check.

@GalloglyP

commented Mar 6, 2017

[image: test]

@trufanov-nok

Author

commented Mar 6, 2017

It works exactly like black and white mode, but the letters are a bit grayish. That's because mixed mode's picture detector fails to recognize the black area as a picture and the binarization goes crazy. If I fix that, it'll give you pretty much the same result as grayscale mode: in fact, the unchanged original image (as there is no noise in the background anyway). I'm afraid that's the best you can get from such an image. Even if you delete the black part of the picture and play with BW mode's binarization, the results will not be as good as the original picture. Assuming you have no dirty background, grayscale mode will be your best option. And if you do have a dirty background, the color layer might come in handy (if the picture autodetection gets fixed).

@trufanov-nok

Author

commented Mar 6, 2017

Btw, how about this? Would it be better for you than the original? Note: it's not black and white, just sharpened grayscale. [image: pic]

@GalloglyP

commented Mar 6, 2017

I am not completely following; I just installed this yesterday. I was trying to use mixed mode. Is that result from your grayscale text or just the default color/grayscale setting?

@trufanov-nok

Author

commented Mar 6, 2017

The last image is the original file edited in the GIMP image editor. I played with filters\enhance\sharpen or filters\enhance\unsharp mask. GIMP is a cross-platform, open-source professional image editor and thus has much more sophisticated image processing algorithms. It can also be automated to batch process image files by passing a script via command line parameters. So it may make sense in your case to try this approach: preprocess or postprocess your scans in GIMP in batch mode. Or perhaps GIMP could even replace ScanTailor in your case, depending on whether your pages contain real images. I believe your case is beyond ScanTailor's scope, and users rarely come across such scans.
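
For reference, the idea behind an unsharp mask can be sketched in a few lines. This illustrates the general technique only, not GIMP's implementation: it uses a 3-pixel box blur on a single row of gray values, whereas GIMP's filter uses a Gaussian blur on the whole image, and `amount` and the sample values are chosen arbitrarily.

```python
def box_blur(row):
    """Blur a row of gray values with a 3-pixel box filter (edges are clamped)."""
    last = len(row) - 1
    return [
        (row[max(i - 1, 0)] + row[i] + row[min(i + 1, last)]) / 3
        for i in range(len(row))
    ]


def unsharp_mask(row, amount=1.0):
    """Sharpen by adding back the difference between the row and its blur."""
    blurred = box_blur(row)
    sharpened = []
    for value, soft in zip(row, blurred):
        result = value + amount * (value - soft)
        # Clamp to the valid 8-bit gray range.
        sharpened.append(int(min(255, max(0, round(result)))))
    return sharpened


# A soft edge from dark text to light paper (invented values).
edge = [60, 60, 120, 180, 180]
sharp = unsharp_mask(edge)
print(sharp)
```

The output is still grayscale: the dark side of the edge gets darker and the light side lighter, so the transition looks crisper without being forced to pure black and white.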
