Fix word box boundaries in rendered PDF #3139
Please provide a test case for the problem and other details (e.g., which PDF viewer you use).
I tried the patch on |
See #2879.
@stweil thanks for submitting an example. I have tested exclusively under Linux, and only seen improved results with the following PDF viewers:
I just got hold of a Windows machine to test with Adobe Acrobat Reader DC 2020.012.20048 (English), and the results are improved there as well (the words are completely covered on selection). I had no problems selecting words by double-clicking during any of my tests. From visual inspection, the height of the selected text stays the same. I find it strange that the macOS viewer shows a different height for the text box, as this code change only tweaks the horizontal stretch. Do you see this behavior in all readers on macOS? (I don't have a Mac myself, so I can't check.)
(I made a Python port of pdfrenderer.cpp that reads hOCR files rather than Tesseract results, but am seeing the same problem. Code: https://git.archive.org/merlijn/archive-pdf-tools/-/blob/master/pdfrenderer.py) This image shows the 'original' code, with the In both the left and center PDFs, I pressed Ctrl+A in evince to 'highlight' all text. The reason you actually see glyphs is that I inserted a second font that references a system font to show glyphs, which doesn't break text copy/paste. (I plan to submit a PR for this later this month.) While the new code does seem to make some bounding boxes better, it also looks like the titles overlap more (but that might be an artifact of my port, or just of the hOCR files).
First, you pasted an incorrect link. Here is the script location: I don't see any license information in your repo.
See this comment: #2879 (comment) You should try to test with this patch, which was provided by the person that wrote the pdf output code in Tesseract. |
That is technically not correct either, since that specific branch has a Chinese font inserted.
Don't want to derail, but just wanted to say that the license will be the same as the pdfrenderer.cpp in Tesseract, of course. There is a comment in the file that already hints at that. I'm planning to finish that this week - sorry for the noise.
I will open a separate issue for this later, not to derail discussion here.
Ok, thanks, I will test that this week and let you know if it improves the situation.
I think @jbreiden's patch should be used instead of this PR.
This patch has a good impact on Firefox's pdf.js, unlike the patch from #2879 (comment). Both have a bad impact on macOS's Preview.
This seems to work better than what we currently have for Evince, Chromium and Firefox. Maybe we should accept this PR with a (runtime) flag that can disable it.
@amitdo - Do we know what exactly made it regress on macOS Preview? You had an idea that it might be the zero-width space and also had a suggestion on how to fix it? Perhaps we can attempt to fix it for Preview as well?
Merlijn, I'm not an expert in the PDF format. What I do know is that the PDF spec is considered very complex. It has lots of ambiguities, and the implementations sometimes use heuristics. macOS Preview's handling of word spacing seems to be weaker than that of the other common PDF viewers. It's hard to make all PDF viewers happy. Since you are working on this yourself for IA (in Python), you have a better chance than me to find a proper solution... :-)
I just tried the patch briefly on evince, and I am not sure it improves the situation. At least in my case it seems to break selecting a single column in a two-column layout, so I'm not sure the workaround works as intended. I wonder if it makes sense if we make I uploaded three text PDFs here: https://archive.org/~merlijn/pdf-text/ Maybe someone can verify that
Oh, I think I might be conflating this issue (word box boundaries) with the other one you linked, which is about no spaces being added to the copied text. In any case, the patch on this issue as is doesn't improve the situation on evince for me; it makes it worse.
Unfortunately, it seems that the problem is caused by differences between PDF viewers. I tried macOS Chrome, Safari and Preview, combined with marking a complete line, marking a word, double-clicking a word, and finding 'Magazine.' None of the combinations works perfectly, and the behaviour differs across the viewers. And now the bad news: Try The other problems with the display in viewers are more cosmetic. OK, Mac inserts a space at the beginning of a line, while Chrome inserts one at the end of a line. None of the viewers works correctly.
@wollmers thanks for the suggested test case. What pdftotext version do you use? I have this: When I use Could you elaborate on what is not acceptable? I've uploaded the |
Regarding the viewers, this bug report in particular is about improving the highlighting of a word (the box drawn around it using the glyphless font), the |
My version:
Yes, |
Be careful about |
@zdenop - Thanks for the heads up, at least in this case we have a similar codebase (but not the same version):
@wollmers - I'm going to experiment a bit with the patch here and the zero-width space patch in the other issue, to see if I can figure out something that works on macOS. Thanks for testing.
Merlijn, maybe you want to try to implement the suggestion in this comment: BTW, here are two Python-based PDF + OCR layer renderers: https://github.com/jbarlow83/OCRmyPDF/blob/master/src/ocrmypdf/hocrtransform.py https://github.com/ocropus/hocr-tools/blob/master/hocr-pdf Both of them use the reportlab library.
Right, that was what I was thinking of trying.
Thanks for the pointers. I did see and try both of these before I decided to port the Tesseract renderer to Python last year (https://github.com/internetarchive/archive-pdf-tools/blob/master/internetarchivepdf/pdfrenderer.py). In any case, I'll try to toy with this and see what I can find beyond what others already found (maybe nothing, we'll see).
I ran into this the other day: nickjwhite/gofpdf@3e3f1fb, mentioned in https://blog.rescribe.xyz/posts/pdfs/ - maybe it has something useful there.
This fixes the dimensions of the text boxes in the generated PDFs, which are slightly too small to cover the full extent of the words (essentially an off-by-one bug). Here is an example of some text selected in a sandwiched PDF:
Sandwiched document regenerated after fix:
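For readers following along, here is a rough sketch of how a horizontal stretch value relates to the width of a word box. It is not the code from this PR or from pdfrenderer.cpp. It only assumes that the invisible text uses a fixed-advance font and that the renderer sets the PDF `Tz` (horizontal scaling, in percent) operator per word. The function names and the `advance_em` value of 0.5 are hypothetical, chosen for illustration.

```python
def px_to_pt(pixels: float, dpi: float) -> float:
    """Convert image pixels to PDF points (1 pt = 1/72 inch)."""
    return pixels * 72.0 / dpi


def horizontal_stretch_percent(box_width_pt: float, font_size_pt: float,
                               n_chars: int, advance_em: float = 0.5) -> float:
    """Return a value for the PDF `Tz` operator (percent of normal width).

    Assumption: every glyph advances by `advance_em` times the font size,
    so the unscaled word is n_chars * advance_em * font_size_pt wide.
    The stretch is the ratio of the wanted box width to that natural width.
    """
    if n_chars <= 0 or font_size_pt <= 0 or advance_em <= 0:
        raise ValueError("n_chars, font_size_pt and advance_em must be positive")
    natural_width_pt = n_chars * advance_em * font_size_pt
    return 100.0 * box_width_pt / natural_width_pt


if __name__ == "__main__":
    # Hypothetical word: 250 px wide in a 300 dpi scan, 12 characters, 10 pt font.
    width_pt = px_to_pt(250, dpi=300)
    print(horizontal_stretch_percent(width_pt, 10.0, 12))
```

Under these assumptions the stretch is directly proportional to the box width, so a box width that is underestimated even slightly makes the selection rectangle fall short of the word by the same proportion.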