Comparison with DjVuSolo 3.1: I see a bold character where it shouldn't be. #14

Open
rmast opened this issue Sep 15, 2021 · 9 comments

Comments

@rmast

rmast commented Sep 15, 2021

Hi Alex,

I made a sample PBM with scantailor and compressed it with DjVuSolo 3.1 (bitonal, 300 dpi) and with minidjvu-mod --lossy for comparison.

Sample.zip

In the minidjvu-mod version you can see that a normal character has been replaced by a bold one:

image

That switch didn't happen with DjVuSolo 3.1 bitonal 300, which still produced a file 3k smaller.

Some links on the subject of recognizing italic and bold, just from some googling:
https://www.researchgate.net/publication/235412971_Automatic_Text_Clustering_and_Classification_Based_on_Font_Geometrical_Characteristics

https://stackoverflow.com/questions/62947592/does-google-cloud-vision-api-detect-formatting-in-ocred-text-like-bold-italics

https://github.com/tesseract-ocr/tesseract/issues/1371

https://studylib.net/doc/18711914/detection-of-bold-italic-and-underline-fonts-for-hindi-ocr

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.735.9226&rep=rep1&type=pdf

@trufanov-nok
Owner

Thanks, I'll check this out. What were the minidjvu-mod params? "-l"?

@rmast
Author

rmast commented Sep 15, 2021 via email

@trufanov-nok
Owner

Hey, which version of minidjvu-mod did you use? I see no AT&T tag in the djvu header. I think I've fixed this bug in a recent version...

@rmast
Author

rmast commented Sep 15, 2021

Commit 94a78c5
I built the new one, and that sure gives a smaller picture now. But if you look closely at the 'a's, some other ones are bold now.
foreground2bitonalga2.zip

image
image

@rmast
Author

rmast commented Sep 17, 2021

I would expect that just counting the black pixels of the bold and non-bold characters would reveal a significant percentage difference.

@rmast
Author

rmast commented Sep 17, 2021

I extracted the separate characters of the jb2-image with djvutoy, opened the 'a' characters that look like the two 'a's in the above text "dan via", and used OpenCV-python to count the black pixels.

This bold character has 11% more pixels than this non-bold character.
0001_0014

0001_0034
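
The pixel count described above can be sketched roughly as follows. This is not the script used for the measurement: it is a minimal pure-Python illustration, and the function names and the tiny 4x4 bitmaps are made up for the example (1 = black, 0 = white).

```python
def black_pixel_count(bitmap):
    """Count black pixels in a bitonal bitmap given as rows of 0 (white) / 1 (black)."""
    return sum(sum(row) for row in bitmap)


def percent_more(a, b):
    """Return how many percent more black pixels bitmap `a` has than bitmap `b`."""
    count_a = black_pixel_count(a)
    count_b = black_pixel_count(b)
    return 100.0 * (count_a - count_b) / count_b


# Hypothetical toy glyphs, only to exercise the functions.
regular = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
bold = [
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
]

if __name__ == "__main__":
    print(black_pixel_count(regular), black_pixel_count(bold))
    print(f"bold has {percent_more(bold, regular):.1f}% more black pixels")
```

With OpenCV-python the same count could presumably be obtained from a thresholded grayscale image; note that `cv2.countNonZero` counts the non-zero (white) pixels, so the black count is the total number of pixels minus that value.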

@trufanov-nok
Owner

I guess it's less than 10%. minidjvu-mod already counts pixels and uses a 10% difference threshold. That's a bitmap param called "mass" in the program. But the problem isn't in finding differences. If it were, we would end up with lossless compression. The problem is in finding visually important differences, which would allow us to keep the page's original look while compressing it with "equal" character substitutions. And the disadvantages of decreasing the "mass" threshold further are too big: narrowing its comparison threshold will significantly increase file size. It's too simple a feature.
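
One possible way to see how "11% more pixels" and "less than 10%" can both hold is that the percentage depends on which bitmap is taken as the baseline. The sketch below only illustrates that arithmetic. It is an assumption, not minidjvu-mod's actual "mass" comparison: the function names, the `baseline` parameter, and the masses 100 and 111 are all hypothetical.

```python
def relative_difference(mass_a, mass_b, baseline="larger"):
    """Relative difference between two pixel counts.

    `baseline` selects the denominator: the larger or the smaller mass.
    This is an illustrative formula, not taken from minidjvu-mod.
    """
    reference = max(mass_a, mass_b) if baseline == "larger" else min(mass_a, mass_b)
    if reference == 0:
        return 0.0
    return abs(mass_a - mass_b) / reference


def exceeds_threshold(mass_a, mass_b, threshold=0.10, baseline="larger"):
    """True if the two masses differ by more than `threshold`."""
    return relative_difference(mass_a, mass_b, baseline) > threshold


if __name__ == "__main__":
    regular_mass, bold_mass = 100, 111  # hypothetical counts, 11% apart
    for baseline in ("smaller", "larger"):
        diff = relative_difference(regular_mass, bold_mass, baseline)
        flagged = exceeds_threshold(regular_mass, bold_mass, baseline=baseline)
        print(f"{baseline}: {diff:.3f} -> different: {flagged}")
```

Measured against the smaller mass the difference is 11% and passes a 10% threshold. Measured against the larger mass it is just under 10% and does not.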
Browsing the articles you've suggested

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.735.9226&rep=rep1&type=pdf

I've found a better image "feature": the ratio of the average height to the average width of the stripes. It looks like it works better than mass at distinguishing bold from non-bold, and it's easy to calculate. But during the tests I ran into the fact that the classification bug I hoped was already fixed in the current version is still there. So now I'm working on it. I've probably found a bug. It will take days to fix and to fine-tune the encoding parameters again.
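
A rough sketch of such a stripe feature, under one interpretation: "stripes" are taken to be runs of consecutive black pixels, with heights measured along columns and widths along rows. That reading, the function names, and the toy bitmaps are assumptions for illustration, not the code in minidjvu-mod or the exact definition from the paper.

```python
def run_lengths(line):
    """Lengths of consecutive runs of black (1) pixels in one row or column."""
    runs, current = [], 0
    for pixel in line:
        if pixel:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def stripe_ratio(bitmap):
    """Average vertical run length divided by average horizontal run length."""
    widths = [r for row in bitmap for r in run_lengths(row)]
    heights = [r for col in zip(*bitmap) for r in run_lengths(col)]
    if not widths or not heights:
        return 0.0  # empty bitmap: no stripes to measure
    return (sum(heights) / len(heights)) / (sum(widths) / len(widths))


# Hypothetical toy glyphs: a vertical stroke 1 pixel wide and one 2 pixels wide.
thin_stroke = [[0, 1, 0, 0]] * 4
thick_stroke = [[0, 1, 1, 0]] * 4

if __name__ == "__main__":
    print(stripe_ratio(thin_stroke), stripe_ratio(thick_stroke))
```

In this toy case the thicker stroke has wider horizontal runs at the same height, so its ratio is lower, which is the kind of separation a bold versus non-bold test would rely on.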

@rmast
Author

rmast commented Sep 18, 2021 via email

@rmast
Author

rmast commented Sep 18, 2021 via email
