Improve OCR layer compatibility with MacOS Preview via hocr renderer #35

watou · 2021-10-18T11:52:38Z

I do duplex scans only with my Fujitsu ScanSnap S1500M and edit the resulting searchable PDF files in macOS Preview to remove any pages I don't want. However, once a PDF combined by pdfunite has been edited this way, the embedded OCR'd text turns into strings of question-mark-in-box characters, so the file is no longer searchable.

Changing the scan script as follows does NOT solve this problem, nor did a number of other permutations of gs parameters for PDF output that I tried.

310c310
<           pdfunite "${pdffiles[@]}" "${OUTPUT[$index]}" && rm $TMP_DIR/scan-*(0)$scanno.pdf
---
>           gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -dDetectDuplicateImages -dCompressFonts=true -r150 -sOutputFile="${OUTPUT[$index]}" "${pdffiles[@]}" && rm $TMP_DIR/scan-*(0)$scanno.pdf
329c329
<     pdfunite "${pdffiles[@]}" "$OUTPUT" && rm $TMP_DIR/scan-[0-9]*.pdf
---
>     gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -dDetectDuplicateImages -dCompressFonts=true -r150 -sOutputFile="$OUTPUT" "${pdffiles[@]}" && rm $TMP_DIR/scan-[0-9]*.pdf

Verified that this problem occurs on a different computer running a different macOS version.

The commercial ScanSnap software and included OCR produces PDF files that can be edited and the OCR'd text remains usable.

rocketraman · 2021-10-22T06:18:04Z

I tried to replicate this problem on Linux with pdfarranger, but was unable to -- the edited PDF remained searchable.

You said that switching from pdfunite to gs did not solve the problem?

I don't have many ideas here, except to note the problem appears to be with macOS Preview.

watou · 2021-10-22T11:06:08Z

Thank you very much for trying to replicate the problem. On both a Big Sur and a Mojave machine, after Preview on either machine removes a page, saves, closes and then re-opens the now-shorter PDF, the text is no longer searchable and if copied to the clipboard and pasted, it's all the same question-mark-in-a-box glyph. Also, the file actually grows slightly when a page is removed. This happens when either pdfunite or gs combines the PDFs as described above.

Without being certain, the most likely explanation so far is that Preview breaks the PDF when it removes a page, but doesn't do this to PDFs generated by the Fujitsu software.
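The symptom described above (text that can no longer be searched after Preview saves the file) could be checked from a command line with something like the sketch below. It assumes pdftotext from poppler-utils is installed (the same package that provides pdfunite) and that pdftotext sees the same breakage Preview shows, which has not been verified in this thread. The function name and file names are hypothetical.

```sh
#!/bin/sh
# Sketch: succeed if a known word can still be extracted from a PDF's text layer.
# Assumes pdftotext (poppler-utils) is on PATH; pdf_contains_word is a made-up name.
pdf_contains_word() {
    pdf=$1
    word=$2
    # "-" makes pdftotext write the extracted text to stdout instead of a file.
    # grep: -q quiet, -i case-insensitive, -F literal string (no regex).
    pdftotext "$pdf" - 2>/dev/null | grep -qiF -- "$word"
}

# Example with hypothetical file names:
#   pdf_contains_word before.pdf "invoice" && echo "before: searchable"
#   pdf_contains_word after-preview.pdf "invoice" || echo "after: text layer broken"
```

Running it on the file before and after the Preview edit, with a word known to be on a page that was kept, would show whether the text layer survived.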

rocketraman · 2021-10-22T11:24:20Z

Maybe the Fujitsu software outputs a different character encoding? Can you upload a small sample scanned and OCRed with the Fujitsu software?

rocketraman · 2022-01-07T14:31:19Z

Any updates on this @watou ?

rocketraman · 2022-01-16T04:15:27Z

No further information provided from user, closing. Feel free to re-open if you have additional info to provide.

watou · 2022-03-08T19:18:14Z

Just reporting that the following updated components still yield the same unwanted result: if one removes one or more pages using macOS Preview from a multi-page PDF created by this script that includes text added by tesseract, re-opening the PDF in any viewer shows all text changed into the same "nonsense" glyph.

current commit version of this project (a5d56a0)
macOS Monterey 12.2.1
instructions followed in [Chris Schuld blog post](https://chrisschuld.com/2020/01/network-scanner-with-scansnap-and-raspberry-pi/)
Raspberry Pi 2B running Linux net 4.19.66-v7+ #1253 SMP Thu Aug 15 11:49:46 BST 2019 armv7l GNU/Linux and latest required packages.

I'm amazed this result isn't reproduced by others!

rocketraman · 2022-03-08T20:40:54Z

@watou Thanks for the update. Unfortunately, the information doesn't help me. What I would ideally need from you is an upload of a PDF scanned and OCRed via this project, and one scanned and OCRed via the Fujitsu software.

I would like to compare the two to see if there are any notable differences in the OCRed content.

I'd also like to try opening both PDFs on a Mac machine I have access to.

watou · 2022-03-08T20:44:30Z

I will send you those shortly; thanks for having a look when you have them!

watou · 2022-03-08T21:03:23Z

Hi @rocketraman, the file SaneScanPdf.pdf was scanned with a Fujitsu ScanSnap S1500M using your script, and the command line

#!/bin/sh
now=$(date +"%Y%m%d-%H%M")
/home/pi/sane-scan-pdf/scan -d -r 300 -v -m Lineart --crop --ocr --skip-empty-pages -o "/home/pi/scans/$now.pdf"

The file ScanSnapManager.pdf was scanned with ScanSnap Manager Version 6.3 L70 on macOS Mojave 10.14.6.

If you edit the first file with macOS Preview (multiple versions) and remove the second page, save, exit, re-open, the text you can select and copy to the clipboard has become illegible. This does not happen when editing the second file. This problem occurs with a variety of paper document tests, not just this two-page example.

Thank you for your efforts in identifying this issue. It's a genuine nuisance because the duplex scanning often fails to remove the blank reverse of many pages, and when I use Preview to remove them, the OCR is fatally garbled as a result.

rocketraman · 2022-03-08T21:27:16Z

I've replicated the issue with your (non-manager) scan, as well as with another I did locally. I'll continue to do a bit of research to see if there is something simple to change to improve compatibility with MacOS Preview. However, this appears to be far from the first time people have noted this buggy behavior in MacOS Preview; the following post links to 4 or 5 other posts about the issue:

https://annoying.technology/posts/86f4ea27e4cd90d0/

and a related HackerNews thread:

https://news.ycombinator.com/item?id=25447830

Therefore, it's unlikely I'll spend much effort on this, as it is definitely Apple's problem.

watou · 2022-03-08T21:42:19Z

Thanks for looking. From the first link you provided, it sounds like Apple did some vendor-specific hacks to keep Preview from breaking ABBYY FineReader-produced OCR; if so, that's very fragile territory to try to improve. But just for completeness, the "garbage" I see always looks like the image below, not the garbled text in the image at your first link. Thanks again!

rocketraman · 2022-03-08T21:46:07Z

I also replicated the issue by not using OCR in sane-scan-pdf, and instead adding the OCR layer post-scanning via OCRmyPDF. Same issue when adding OCR via that tool using the default PDF render mode which is "sandwich". However, when using the older "hocr" PDF render mode, editing the PDF in Preview worked fine.

However, the "hocr" mode requires combining the hocr XHTML output from Tesseract with the scanned PDF, and I'd need to do a bit more research to understand how ocrmypdf does that. It may be possible to replicate that behavior in sane-scan-pdf.

It's unlikely I'll have the time to do this myself in the near future, but hopefully that helps someone else who may want to take a crack at this!
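For anyone who wants to repeat the renderer comparison described in this comment, a minimal sketch follows. It assumes the input PDF was scanned without --ocr, so it has no text layer yet, and that the installed OCRmyPDF accepts both renderer names mentioned above; availability may depend on the OCRmyPDF version. The function name, the output naming scheme, and the OCRMYPDF override variable are hypothetical.

```sh
#!/bin/sh
# Sketch: produce one OCRed copy of a scan per OCRmyPDF renderer, so both
# copies can be edited in Preview and compared afterwards.
# OCRMYPDF can be set to substitute another binary (hypothetical hook).
make_renderer_variants() {
    in_pdf=$1
    base=${in_pdf%.pdf}    # strip the .pdf suffix to build output names
    for renderer in sandwich hocr; do
        "${OCRMYPDF:-ocrmypdf}" --pdf-renderer "$renderer" \
            "$in_pdf" "$base.$renderer.pdf" || return 1
    done
}

# Example with a hypothetical file name:
#   make_renderer_variants scan.pdf   # writes scan.sandwich.pdf and scan.hocr.pdf
```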

rocketraman · 2022-03-08T22:35:46Z

Probably the easiest option is to just use ocrmypdf as an alternative to directly using tesseract.

We could support this if, for example, --ocrmypdf were passed as an argument instead of --ocr, with a value that is passed through to ocrmypdf as its options. That would allow something like --ocrmypdf "--pdf-renderer hocr" when compatibility with MacOS Preview is desired.
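The option handling proposed above might look roughly like the sketch below. This is only an illustration: --ocrmypdf does not exist in the script today, and the variable and function names here are made up rather than taken from sane-scan-pdf.

```sh
#!/bin/sh
# Sketch only: --ocrmypdf is the option proposed in this comment, not an existing flag.
OCR_MODE=none        # none | tesseract | ocrmypdf
OCRMYPDF_OPTS=""     # extra options to pass through to ocrmypdf

parse_ocr_args() {
    while [ $# -gt 0 ]; do
        case $1 in
            --ocr)
                OCR_MODE=tesseract
                ;;
            --ocrmypdf)
                # The proposed option always takes one value (it may be "").
                [ $# -ge 2 ] || { echo "--ocrmypdf needs an argument" >&2; return 1; }
                OCR_MODE=ocrmypdf
                OCRMYPDF_OPTS=$2
                shift
                ;;
        esac
        shift    # all other arguments are ignored in this sketch
    done
}

# Example:
#   parse_ocr_args -d --ocrmypdf "--pdf-renderer hocr" -o out.pdf
#   ocrmypdf $OCRMYPDF_OPTS in.pdf out.pdf   # unquoted on purpose, to split the options
```

Requiring a value after --ocrmypdf keeps the parsing unambiguous: the word that follows is never mistaken for another scan option.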

watou · 2022-03-08T22:55:42Z

@rocketraman If the pipeline had the OCR option you've described, that would be a perfect solution, especially since ocrmypdf seems like a mature and well packaged (optional) dependency.

rocketraman · 2022-03-08T22:59:04Z

@watou On the other hand, I'm not sure that using ocrmypdf directly in the scanner script adds much value. You could very easily just not use --ocr in the scanner, and then call ocrmypdf yourself on the results of the scan. Thoughts?
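The two-step alternative suggested here could be wrapped in a small script such as the sketch below. The scan path and options are copied from the command line quoted earlier in this thread, minus --ocr; the intermediate file name, the function name, and the RUN dry-run variable are hypothetical.

```sh
#!/bin/sh
# Sketch: scan without --ocr, then add the OCR layer with ocrmypdf's hocr renderer.
# Set RUN=echo to print the commands instead of running them (hypothetical hook).
scan_then_ocr() {
    out=$1
    raw="${out%.pdf}.raw.pdf"    # un-OCRed intermediate, removed on success
    ${RUN:-} /home/pi/sane-scan-pdf/scan -d -r 300 -v -m Lineart --crop --skip-empty-pages -o "$raw" &&
    ${RUN:-} ocrmypdf --pdf-renderer hocr "$raw" "$out" &&
    ${RUN:-} rm -- "$raw"
}

# Example:
#   scan_then_ocr "/home/pi/scans/$(date +"%Y%m%d-%H%M").pdf"
```

Because the steps are chained with &&, the intermediate scan is kept if ocrmypdf fails, so the OCR step can be retried without rescanning.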

watou · 2022-03-08T23:06:27Z

I like the option of having your script produce the "default" handling of scanned documents, which in my case includes embedded OCR. Your script already does this, but only with a limited range of choices, none of which overcomes the Preview problem. Optionally supporting a choice that does, as a known workaround, keeps your solution to one step per document. Of course I could modify your script or mine and run ocrmypdf in place directly there, but it's nice if everything is kept to options you offer. I'm happy to fix it in my own scanbd script, but then it's something for the next poor Apple user to bump up against. But whichever path makes more sense to you is great!

watou changed the title ~~pdfunite creates "fragile" searchable PDFs~~ searchable PDFs unsearchable after editing unlike commercial scanner software Oct 18, 2021

rocketraman added the invalid label Jan 16, 2022

rocketraman closed this as completed Jan 16, 2022

rocketraman reopened this Mar 8, 2022

rocketraman added help wanted and removed invalid labels Mar 8, 2022

rocketraman changed the title ~~searchable PDFs unsearchable after editing unlike commercial scanner software~~ Improve OCR layer compatibility with MacOS Preview via hocr renderer Mar 8, 2022

rocketraman added the enhancement label Apr 2, 2022
