Fix word box boundaries in rendered PDF #3139

Closed

@sbjorn

sbjorn commented Oct 23, 2020

This addresses the dimensions of the text boxes in the generated PDFs, which are slightly too small to cover the full extent of the words (essentially an off-by-one bug). Here is an example of some text selected in a sandwiched PDF:

[Image: pre_000]

Sandwiched document regenerated after fix:

[Image: post_000]

@zdenop
Contributor

zdenop commented Oct 24, 2020

Please provide a test case for the problem and other details (e.g., which PDF viewer you use).

@stweil
Contributor

stweil commented Oct 24, 2020

I tried the patch on test/testing/8087_054.3G.tif. In Firefox the result looks pretty good. With macOS Preview it has several problems: for some words the new boxes look better, but many words are no longer separate, so it is, for example, no longer possible to select a word by double-clicking. Some lines even change height (they are now too high). Therefore I think this needs more changes to be really good.

PDF without patch
PDF with patch

@amitdo
Collaborator

amitdo commented Oct 25, 2020

See #2879.

@sbjorn
Author

sbjorn commented Oct 26, 2020

@stweil thanks for submitting an example. I have tested exclusively under Linux, and have only seen improved results with the following PDF viewers:

  • evince version 3.34.2 (poppler)
  • chromium version 85 (pdfium)

I just got hold of a Windows machine to test with Adobe Acrobat Reader DC 2020.012.20048 (English), with improved results as well (the words are completely covered on selection). I had no problems selecting words by double-clicking during any of my tests. The height of the selected text stays the same, as far as I can tell by visual inspection. I find it strange that the macOS viewer shows a different height for the text box, as this code change only tweaks the horizontal stretch. Do you see this behavior in all readers on macOS? (I don't have a Mac myself, so I can't check.)

@MerlijnWajer
Contributor

(I made a Python port of pdfrenderer.cpp that reads hOCR files rather than Tesseract results, but am seeing the same problem. Code: https://git.archive.org/merlijn/archive-pdf-tools/-/blob/master/pdfrenderer.py)

In this image, the left PDF is generated with the 'original' code, with the pdf_word_len++ increase in place. The center PDF is generated with pdf_word_len++ removed. The right PDF is provided just for reference, since the bounding boxes hide much of the PDF.

In both the left and center PDFs, I pressed Ctrl+A in evince to 'highlight' all text. The reason that you actually see glyphs is that I inserted a second font that references a system font to show glyphs, which doesn't break text copy / paste. (I plan to submit a PR for this later this month.)

[Image: pdfspaces]

While the new code does seem to make some bounding boxes better, it also looks like the titles overlap more (but that might be an artifact of my port, or just of the hOCR file(s)).

@amitdo
Collaborator

amitdo commented Nov 17, 2020

@MerlijnWajer,

First, you pasted an incorrect link. Here is the script location:
https://git.archive.org/merlijn/archive-pdf-tools/-/blob/show-text-on-selection/internetarchivepdf/pdfrenderer.py

I don't see any license information in your repo.

> I inserted a second font that references a system font to show glyphs, which doesn't break text copy / paste. (I plan to submit a PR for this later this month)

See this comment: #2879 (comment)

You should try to test with this patch, which was provided by the person that wrote the pdf output code in Tesseract.

@MerlijnWajer
Contributor

MerlijnWajer commented Nov 17, 2020

> @MerlijnWajer,
>
> First, you pasted an incorrect link. Here is the script location:
> https://git.archive.org/merlijn/archive-pdf-tools/-/blob/show-text-on-selection/internetarchivepdf/pdfrenderer.py

That is technically not correct either, since that specific branch has a Chinese font inserted.

> I don't see any license information in your repo.

Don't want to derail, but just wanted to say that the license will be the same as the pdfrenderer.cpp in Tesseract, of course. There is a comment in the file that already hints at that. I'm planning to finish that this week - sorry for the noise.

> > I inserted a second font that references a system font to show glyphs, which doesn't break text copy / paste. (I plan to submit a PR for this later this month)
>
> See this comment: #2879 (comment)

I will open a separate issue for this later, so as not to derail the discussion here.

> You should try to test with this patch, which was provided by the person that wrote the pdf output code in Tesseract.

Ok, thanks, I will test that this week and let you know if it improves the situation.

@amitdo
Collaborator

amitdo commented Feb 4, 2021

I think @jbreiden's patch should be used instead of this PR.

#2879 (comment)

amitdo closed this Feb 4, 2021

amitdo added the PDF label Feb 17, 2021

@amitdo
Collaborator

amitdo commented Feb 17, 2021

This patch has a good impact on Firefox's pdf.js, unlike the patch from #2879 (comment).

Both have a bad impact on macOS's Preview.

@amitdo
Collaborator

amitdo commented Oct 27, 2021

This seems to work better than what we currently have for Evince, Chromium and Firefox.

Maybe we should accept this PR with a (runtime) flag that can disable it.

@MerlijnWajer
Contributor

@amitdo - Do we know what exactly made it regress on macOS Preview? You had an idea that it might be the zero-width space, and also had a suggestion on how to fix it? Perhaps we can attempt to fix it for Preview as well?

@amitdo
Collaborator

amitdo commented Oct 27, 2021

Merlijn,

I'm not an expert in the PDF format.

What I do know is that the PDF spec is considered very complex. It has lots of ambiguities and the implementations sometimes use heuristics.

macOS Preview's word-spacing handling seems to be weaker than that of the other common PDF viewers.

It's hard to make all pdf viewers happy.

Since you are working on this yourself for IA (in Python), you have a better chance than me to find a proper solution... :-)

@MerlijnWajer
Contributor

MerlijnWajer commented Oct 27, 2021

I just tried the patch briefly on evince, and I am not sure that it improves the situation. At least in my case it seems to break selecting a single column in a two-column layout, so I'm not sure if the workaround works as intended.

I wonder if it makes sense to make pdf_word_len a float and add something like 0.5 instead of 1. That seems to make the bounding boxes render OK on evince, doesn't break selection the way not adding 1 does, and maybe it also helps with text selection on macOS Preview? (I don't have macOS available right now.)
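To make the effect of the increment concrete, here is a minimal sketch in Python. It is an illustrative model only, not code taken from pdfrenderer.cpp: it assumes the renderer scales each word horizontally so that (glyph count + extra) glyphs of an invisible font fill the OCR word box, and that one glyph advances by half the font size. The names `GLYPH_WIDTH_RATIO`, `h_stretch_percent` and `covered_width` are hypothetical.

```python
# Illustrative model only: the constant and the formula below are assumptions,
# not code copied from Tesseract's pdfrenderer.cpp.
GLYPH_WIDTH_RATIO = 0.5  # assumed advance of one invisible glyph, per unit of font size


def h_stretch_percent(box_width, fontsize, glyph_count, extra):
    """Horizontal scaling (in percent) that spreads glyph_count + extra glyphs over box_width."""
    natural_width = fontsize * GLYPH_WIDTH_RATIO * (glyph_count + extra)
    return 100.0 * box_width / natural_width


def covered_width(box_width, fontsize, glyph_count, extra):
    """Width covered by the word's own glyphs (without the extra slot) after scaling."""
    stretch = h_stretch_percent(box_width, fontsize, glyph_count, extra)
    return glyph_count * fontsize * GLYPH_WIDTH_RATIO * stretch / 100.0


if __name__ == "__main__":
    # A 5-letter word whose OCR bounding box is 60 units wide, at font size 12.
    for label, extra in [("add 1", 1.0), ("add 0.5", 0.5), ("add 0", 0.0)]:
        print(label, round(covered_width(60.0, 12.0, 5, extra), 2))
```

Under these assumptions the word's glyphs cover about 50 of the 60 units when 1 is added, about 54.5 when 0.5 is added, and the full 60 when nothing is added. That would be consistent with selection boxes that are slightly too narrow by default, but it says nothing about how a given viewer segments words.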

I uploaded three text PDFs here: https://archive.org/~merlijn/pdf-text/

Maybe someone can verify that do-not-add-1.pdf indeed does not contain spaces, whereas default.pdf does. And while they are checking that, maybe they can see whether add-half.pdf does add spaces in Preview when copying. (Edit: Never mind the checking for spaces; the spaces should just always work. I was confused with the proposed solution in the other issue. Still, this solution could potentially just work.)

@MerlijnWajer
Contributor

MerlijnWajer commented Oct 27, 2021

Oh, I think I might be conflating this issue (no spaces being added) and the other one you linked, with regard to no spaces being added to the copied text. In any case, the patch on this issue as is doesn't improve the situation on evince for me; it makes it worse.

@wollmers

@MerlijnWajer

> I uploaded three text PDFs here: https://archive.org/~merlijn/pdf-text/

Unfortunately, it seems that the problem is caused by differences between PDF viewers. I tried macOS Chrome, Safari and Preview, with these actions: mark a complete line, mark a word, double-click on a word, and find 'Magazine.' None of the combinations works perfectly, and the behaviour differs across the viewers.

And now the bad news: try pdftotext -layout <PDF-file> [<text-file>] and you will see that only the result of default.pdf is acceptable. For me this is critical, because I use it in production.

The other problems with the display in viewers are more cosmetic. OK, Mac inserts a space at the beginning of a line, while Chrome inserts one at the end of a line. None of the viewers works correctly.

@MerlijnWajer
Contributor

MerlijnWajer commented Oct 27, 2021

@wollmers thanks for the suggested test case. What pdftotext version do you use? I have this: pdftotext version 21.07.0.

When I use pdftotext -layout add-half.pdf I get arguably better results than pdftotext -layout default.pdf, but the do-not-add-1.pdf utterly fails.

Could you elaborate on what is not acceptable? I've uploaded the .txt files for reference to https://archive.org/~merlijn/pdf-text/
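For anyone who wants to repeat this comparison, a rough sketch in Python follows. It assumes a Poppler `pdftotext` binary is on the PATH and that the three PDFs have been downloaded locally. The helper names (`pdftotext_cmd`, `extract_layout_text`, `summarize`, `compare`) and the chosen indicators are hypothetical, and only crude proxies for what "acceptable" means here.

```python
import subprocess


def pdftotext_cmd(pdf_path):
    """Command line that prints the layout-preserving text of pdf_path to stdout."""
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_layout_text(pdf_path):
    """Run pdftotext (assumed to be installed) and return the extracted text."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path), capture_output=True, text=True, check=True
    )
    return result.stdout


def summarize(text):
    """Crude indicators: merged words show up as a long longest_word,
    and lost centering shows up as fewer indented lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    words = text.split()
    return {
        "lines": len(lines),
        "words": len(words),
        "longest_word": max((len(word) for word in words), default=0),
        "indented_lines": sum(1 for line in lines if line[0] == " "),
    }


def compare(pdf_paths):
    """Print one summary per PDF so the variants can be compared side by side."""
    for path in pdf_paths:
        print(path, summarize(extract_layout_text(path)))
```

Calling `compare(["default.pdf", "add-half.pdf", "do-not-add-1.pdf"])` prints one summary per file. Since different pdftotext versions and forks may give different results, the numbers are only comparable when produced by the same binary.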

@MerlijnWajer
Contributor

Regarding the viewers: this bug report in particular is about improving the highlighting of a word (the box drawn around it using the glyphless font). In default.pdf the box is usually too small, whereas in do-not-add-1.pdf it is usually a good fit, but that variant causes trouble (for me, and for pdftotext) when selecting the text. add-half.pdf mostly looks fine visually, but clearly doesn't fully encompass the words. Is that what you see in the viewers, or were you commenting on just the text selection?

@wollmers

@MerlijnWajer

My version:

$ pdftotext -v
pdftotext version 20.12.1
Copyright 2005-2020 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

Yes, add-half.pdf is acceptable, but not the best. default.pdf keeps the centered layout of the first line with your version of pdftotext.

@zdenop
Contributor

zdenop commented Oct 28, 2021

Be careful with pdftotext: there are several versions/forks of this tool that give different results. See #2426 (comment)

@MerlijnWajer
Contributor

@zdenop - Thanks for the heads up, at least in this case we have a similar codebase (but not the same version):

$ pdftotext -v
pdftotext version 21.07.0
Copyright 2005-2021 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

@wollmers - I'm going to experiment a bit with the patch here and the zero width space patch in the other issue, see if I can figure something out that works on mac. Thanks for testing.

@amitdo
Collaborator

amitdo commented Oct 28, 2021

Merlijn,

Maybe you want to try to implement the suggestion in this comment:

#2879 (comment)

BTW, here are two Python based pdf + ocr layer renderers:

https://github.com/jbarlow83/OCRmyPDF/blob/master/src/ocrmypdf/hocrtransform.py

https://github.com/ocropus/hocr-tools/blob/master/hocr-pdf
This one was written by the same developer that wrote the pdf code in Tesseract.

Both of them use the reportlab library.

@MerlijnWajer
Contributor

> Merlijn,
>
> Maybe you want to try to implement the suggestion in this comment:
>
> #2879 (comment)

Right, that was what I was thinking of trying.

> BTW, here are two Python based pdf + ocr layer renderers:
>
> https://github.com/jbarlow83/OCRmyPDF/blob/master/src/ocrmypdf/hocrtransform.py
>
> https://github.com/ocropus/hocr-tools/blob/master/hocr-pdf
> This one was written by the same developer that wrote the pdf code in Tesseract.
>
> Both of them use the reportlab library.

Thanks for the pointers. I did see and try both of these before I decided to port the Tesseract renderer to Python last year (https://github.com/internetarchive/archive-pdf-tools/blob/master/internetarchivepdf/pdfrenderer.py). hocr-pdf in particular gave me a lot of problems, so I gave up on that pretty quickly. The other tooling doesn't deal that well with large hOCR files (streaming-wise, memory-usage-wise) and also (I think) didn't do as good a job as Tesseract did. I'm inclined to see what can be improved in Tesseract (in particular, whether we can make it work on Preview and have the selection boxes look good), because in my experience that's the best text renderer.

In any case, I'll try to toy with this and see what I can find beyond what others already found (maybe nothing, we'll see).

@MerlijnWajer
Contributor

I ran into this the other day: nickjwhite/gofpdf@3e3f1fb, mentioned in https://blog.rescribe.xyz/posts/pdfs/. Maybe it has something useful.


6 participants