Comparison with DjVuSolo 3.1: I see a bold character where it shouldn't be. #14

rmast · 2021-09-15T12:47:11Z

Hi Alex,

I made a sample PBM with scantailor and compressed it with DjVuSolo 3.1 (bitonal, 300 dpi) and with minidjvu-mod --lossy for comparison.

Sample.zip

In the version of minidjvu-mod you can see that a normal character has been switched for a bold one:

That switch didn't take place with DjVuSolo 3.1 (bitonal, 300 dpi), which still produced a result 3k smaller.

Some links on the subject of recognizing italic and bold, just from some googling:
https://www.researchgate.net/publication/235412971_Automatic_Text_Clustering_and_Classification_Based_on_Font_Geometrical_Characteristics

https://stackoverflow.com/questions/62947592/does-google-cloud-vision-api-detect-formatting-in-ocred-text-like-bold-italics

https://github.com/tesseract-ocr/tesseract/issues/1371

https://studylib.net/doc/18711914/detection-of-bold-italic-and-underline-fonts-for-hindi-ocr

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.735.9226&rep=rep1&type=pdf

trufanov-nok · 2021-09-15T14:06:06Z

Thanks, I'll check this out. What were the minidjvu-mod params? "-l"?

rmast · 2021-09-15T14:35:53Z

Yes

trufanov-nok · 2021-09-15T15:37:29Z

Hey, what version of minidjvu-mod did you use? I see no AT&T tag in the DjVu header. I think I fixed this bug in a recent version...

rmast · 2021-09-15T16:59:14Z

Commit 94a78c5
I built the new one, and that sure gives a smaller picture now. But look closely at the 'a's: some other ones are bold now.
foreground2bitonalga2.zip

rmast · 2021-09-17T22:11:36Z

I would expect that just counting the black pixels of the bold and the non-bold character would reveal a significant percentage difference.

rmast · 2021-09-17T23:18:04Z

I extracted the separate characters of the jb2-image with djvutoy, opened the 'a' characters that look like the two 'a's in the above text "dan via", and used OpenCV-python to count the black pixels.

This bold character has 11% more pixels than this non-bold character.
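The measurement above was done with OpenCV-python on glyphs extracted by djvutoy. As a rough, dependency-free sketch of the same idea, the snippet below counts black pixels in plain-text PBM (P1) glyphs, where `1` means black. The helper names (`parse_plain_pbm`, `black_pixel_count`, `percent_more`) and the two toy glyphs are made up for illustration; they are not the images from this thread.

```python
def parse_plain_pbm(text):
    """Parse a plain-text (P1) PBM image into a list of rows of 0/1 ints."""
    # Comments run from '#' to the end of the line; drop them first.
    lines = [line.split("#", 1)[0] for line in text.splitlines()]
    tokens = " ".join(lines).split()
    if not tokens or tokens[0] != "P1":
        raise ValueError("not a plain PBM (P1) image")
    width, height = int(tokens[1]), int(tokens[2])
    # In P1 the pixel digits may be packed without whitespace, so join them.
    bits = "".join(tokens[3:])
    if len(bits) < width * height:
        raise ValueError("truncated pixel data")
    return [[int(b) for b in bits[r * width:(r + 1) * width]]
            for r in range(height)]


def black_pixel_count(bitmap):
    """Number of black (1) pixels in the bitmap."""
    return sum(sum(row) for row in bitmap)


def percent_more(bold_count, regular_count):
    """How many percent more black pixels the bold glyph has."""
    return 100.0 * (bold_count - regular_count) / regular_count


# Toy glyphs (hypothetical): a 1-pixel-wide and a 2-pixel-wide vertical stroke.
REGULAR = "P1\n# thin stroke\n3 3\n0 1 0\n0 1 0\n0 1 0\n"
BOLD = "P1\n3 3\n110\n110\n110\n"

regular_mass = black_pixel_count(parse_plain_pbm(REGULAR))
bold_mass = black_pixel_count(parse_plain_pbm(BOLD))
print(regular_mass, bold_mass, percent_more(bold_mass, regular_mass))
```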

trufanov-nok · 2021-09-17T23:26:55Z

I guess it's less than 10%. minidjvu-mod already counts pixels and uses a 10% difference threshold; that's a bitmap parameter called "mass" in the program. But the problem isn't finding differences. If it were, we would end up with lossless compression. The problem is finding visually important differences, which lets us keep the page's original look while compressing it by substituting "equal" characters. And the disadvantages of lowering the "mass" threshold further are too big: narrowing its comparison threshold would significantly increase the file size. It's too simple a feature.
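A minimal sketch of what a 10% "mass" check could look like. This is not minidjvu-mod's actual code: the function names and the `relative_to` switch are assumptions for illustration. It also shows that whether a given pair passes a 10% threshold can depend on which of the two masses the difference is normalized by.

```python
def relative_difference(mass_a, mass_b, relative_to="smaller"):
    """Relative difference between two black-pixel counts.

    `relative_to` picks the denominator: the smaller or the larger mass.
    Which convention a real encoder uses is an assumption here.
    """
    small, large = sorted((mass_a, mass_b))
    base = small if relative_to == "smaller" else large
    if base == 0:
        return 0.0 if large == 0 else float("inf")
    return (large - small) / base


def masses_compatible(mass_a, mass_b, threshold=0.10, relative_to="smaller"):
    """True if the two glyphs are close enough in mass to be substitution candidates."""
    return relative_difference(mass_a, mass_b, relative_to) <= threshold


# A glyph with 11% more pixels than another (hypothetical counts):
print(masses_compatible(100, 111, relative_to="smaller"))  # 11/100 is above 0.10
print(masses_compatible(100, 111, relative_to="larger"))   # 11/111 is below 0.10
```

Lowering `threshold` rejects more candidate pairs, so fewer glyphs can share one prototype, which is the file-size cost described above.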
Browsing the articles you've suggested

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.735.9226&rep=rep1&type=pdf

I've found a better image "feature": the ratio of the average height to the average width of the stripes. It looks like it works better than mass at distinguishing bold from non-bold, and it is easy to calculate. But during the tests I found that the classification bug that I hoped was already fixed in the current version is still there. So now I'm working on it. I've probably found a bug. It will take days to fix it and fine-tune the encoding parameters again.
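One possible reading of this feature, sketched under assumptions: treat a "stripe" as a run of consecutive black pixels, take the average length of vertical runs as the height and the average length of horizontal runs as the width, and divide. The exact definition in the linked paper and in minidjvu-mod may differ; `run_lengths` and `stripe_ratio` are hypothetical names.

```python
def run_lengths(line):
    """Lengths of the runs of consecutive black (truthy) pixels in one row or column."""
    runs, current = [], 0
    for pixel in line:
        if pixel:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def stripe_ratio(bitmap):
    """Average vertical run length divided by average horizontal run length."""
    horizontal = [r for row in bitmap for r in run_lengths(row)]
    vertical = [r for col in zip(*bitmap) for r in run_lengths(col)]
    if not horizontal:
        raise ValueError("bitmap has no black pixels")
    avg_width = sum(horizontal) / len(horizontal)
    avg_height = sum(vertical) / len(vertical)
    return avg_height / avg_width


# Toy strokes (hypothetical): same height, one twice as thick as the other.
THIN = [[0, 1, 0, 0] for _ in range(5)]
BOLD = [[0, 1, 1, 0] for _ in range(5)]
print(stripe_ratio(THIN), stripe_ratio(BOLD))
```

In this toy case doubling the stroke thickness halves the ratio, whereas the height of the glyph is unchanged, which is why such a ratio can separate bold from non-bold more sharply than the raw pixel count.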

rmast · 2021-09-18T11:14:15Z

Would it be an idea to collect a set of example files to somehow automate the testing and fine-tuning?

rmast · 2021-09-18T11:27:46Z

Does the algorithm compare individual mass percentages, or does it first build some heuristic or statistic over them to pinpoint a dip? I could imagine that bold often comes in complete words, which could help when it's hard to decide for just one character. I could even imagine that OCR recognition of a word in a language could raise the confidence of recognizing all symbols in that word, including the dots and other little marks near letters.
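To make the "pinpoint a dip" idea concrete, here is a small sketch, not something minidjvu-mod is known to do: collect the masses of all glyphs of one letter, sort them, and split at the largest relative jump between neighbours, if there is one big enough. The function name and the `min_ratio` default are assumptions for illustration.

```python
def split_at_largest_gap(masses, min_ratio=1.05):
    """Split sorted masses into (lighter, heavier) at the largest relative gap.

    Returns (all_masses, []) when no neighbouring pair differs by more
    than `min_ratio`, i.e. when there is no clear dip in the distribution.
    """
    ordered = sorted(masses)
    best_index, best_ratio = None, min_ratio
    for i in range(len(ordered) - 1):
        low, high = ordered[i], ordered[i + 1]
        if low > 0 and high / low > best_ratio:
            best_index, best_ratio = i, high / low
    if best_index is None:
        return ordered, []
    return ordered[:best_index + 1], ordered[best_index + 1:]


# Hypothetical black-pixel counts for several instances of the letter 'a'.
regular, bold = split_at_largest_gap([100, 102, 101, 113, 112, 99])
print(regular, bold)
```

Unlike a fixed pairwise threshold, this looks at the whole population of a letter first, so two clusters that are only about 10% apart can still be separated when the values inside each cluster are tight.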

