Fix word box boundaries in rendered PDF #3139

sbjorn · 2020-10-23T22:42:10Z

This addresses the dimensions of the text boxes in the generated PDFs, which are slightly too small to cover the full extent of the words (essentially an off-by-one bug). Here is an example of some text selected in a sandwiched PDF:

Sandwiched document regenerated after fix:

zdenop · 2020-10-24T08:14:25Z

Please provide a test case for the problem and other details (e.g., which PDF viewer you use).

stweil · 2020-10-24T19:38:20Z

I tried the patch on test/testing/8087_054.3G.tif. In Firefox the result looks pretty good. With macOS Preview it has several problems: for some words the new boxes look better, but many words are no longer separate, so it is, for example, no longer possible to select a word by double-clicking. Some lines even change height (they are now too tall). Therefore I think this needs more changes to be really good.

PDF without patch
PDF with patch

amitdo · 2020-10-25T01:14:45Z

See #2879.

sbjorn · 2020-10-26T14:45:03Z

@stweil thanks for submitting an example. I have tested exclusively under Linux, and only seen improved results with the following PDF viewers:

evince version 3.34.2 (poppler)
chromium version 85 (pdfium)

I just got hold of a Windows machine to test with Adobe Acrobat Reader DC 2020.012.20048 (English), with improved results as well (the words are completely covered on selection). I had no problems selecting words by double-clicking during any of my tests. The height of the selected text stays the same, from what I can see by visual inspection. I find it strange that the macOS viewer shows a different height for the text box, as this code change only tweaks the horizontal stretch. Do you see this behavior in all readers on macOS? (I don't have a Mac myself, so I can't check.)

MerlijnWajer · 2020-11-17T15:08:18Z

(I made a Python port of pdfrenderer.cpp that reads hOCR files rather than Tesseract results, but am seeing the same problem. Code: https://git.archive.org/merlijn/archive-pdf-tools/-/blob/master/pdfrenderer.py)

In this image, the left PDF is generated by the 'original' code, with the pdf_word_len++ increment in place. The center PDF is generated with pdf_word_len++ removed. The right PDF is provided just for reference, since the bounding boxes hide much of the PDF.

In both the left and center PDFs, I pressed Ctrl+A in evince to 'highlight' all text. The reason that you actually see glyphs is that I inserted a second font that references a system font to show glyphs, which doesn't break text copy / paste. (I plan to submit a PR for this later this month.)

While the new code does seem to improve some bounding boxes, it also looks like the titles overlap more (but that might be an artifact of my port, or just of the hOCR file(s)).

amitdo · 2020-11-17T16:25:20Z

@MerlijnWajer,

First, you pasted an incorrect link. Here is the script location:
https://git.archive.org/merlijn/archive-pdf-tools/-/blob/show-text-on-selection/internetarchivepdf/pdfrenderer.py

I don't see any license information in your repo.

> I inserted a second font that references a system font to show glyphs, which doesn't break text copy / paste. (I plan to submit a PR for this later this month)

See this comment: #2879 (comment)

You should try to test with this patch, which was provided by the person that wrote the pdf output code in Tesseract.

MerlijnWajer · 2020-11-17T16:55:32Z

> @MerlijnWajer,
>
> First, you pasted an incorrect link. Here is the script location:
> https://git.archive.org/merlijn/archive-pdf-tools/-/blob/show-text-on-selection/internetarchivepdf/pdfrenderer.py

That is technically not correct either, since that specific branch has a Chinese font inserted.

> I don't see any license information in your repo.

Don't want to derail, but just wanted to say that the license will be the same as the pdfrenderer.cpp in Tesseract, of course. There is a comment in the file that already hints at that. I'm planning to finish that this week - sorry for the noise.

> I inserted a second font that references a system font to show glyphs, which doesn't break text copy / paste. (I plan to submit a PR for this later this month)
>
> See this comment: #2879 (comment)

I will open a separate issue for this later, not to derail discussion here.

> You should try to test with this patch, which was provided by the person that wrote the pdf output code in Tesseract.

Ok, thanks, I will test that this week and let you know if it improves the situation.

amitdo · 2021-02-04T14:26:29Z

I think @jbreiden's patch should be used instead of this PR.

#2879 (comment)

amitdo · 2021-02-17T23:32:20Z

This patch has a good impact on Firefox's pdf.js, unlike the patch from #2879 (comment).

Both have a bad impact on macOS's Preview.

amitdo · 2021-10-27T11:49:29Z

This seems to work better than what we currently have for Evince, Chromium and Firefox.

Maybe we should accept this PR with a (runtime) flag that can disable it.

MerlijnWajer · 2021-10-27T14:44:20Z

@amitdo - Do we know what exactly made it regress on macOS Preview? You had an idea that it might be the zero-width space, and also a suggestion on how to fix it. Perhaps we can attempt to fix it for Preview as well?

amitdo · 2021-10-27T15:22:04Z

Merlijn,

I'm not an expert in the PDF format.

What I do know is that the PDF spec is considered very complex. It has lots of ambiguities and the implementations sometimes use heuristics.

macOS Preview's handling of word spacing seems to be weaker than that of the other common PDF viewers.

It's hard to make all pdf viewers happy.

Since you are working on this yourself for IA (in Python), you have a better chance than me to find a proper solution... :-)

MerlijnWajer · 2021-10-27T16:20:53Z

I just tried the patch briefly in evince, and I am not sure it improves the situation. At least in my case it seems to break selecting a single column in a two-column layout, so I'm not sure the workaround works as intended.

I wonder if it makes sense to make pdf_word_len a float and add something like 0.5 instead of 1. That seems to make the bounding boxes render OK in evince, doesn't break selection like not adding 1 does, and maybe it also helps with text selection in macOS Preview? (I don't have macOS available right now.)
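To make the effect of adding 1, 0.5 or 0 concrete, here is a minimal sketch of the arithmetic. It is not the code from pdfrenderer.cpp. It assumes a glyphless font in which every glyph has the same advance (`advance_em`, set to 0.5 em here as an assumption) and a horizontal scale chosen so that `num_glyphs + extra` advances fill the OCR word box. The names `extra`, `advance_em`, `horizontal_scale_percent` and `covered_fraction` are hypothetical.

```python
def horizontal_scale_percent(word_width, font_size, num_glyphs,
                             extra=1.0, advance_em=0.5):
    """Horizontal scaling (percent) that makes (num_glyphs + extra)
    fixed-width advances span word_width exactly.

    advance_em is an assumed glyph advance in em units; extra stands in
    for the value added to the glyph count (1, 0.5 or 0 in this thread).
    """
    if num_glyphs <= 0 or font_size <= 0:
        raise ValueError("num_glyphs and font_size must be positive")
    natural_width = font_size * advance_em * (num_glyphs + extra)
    return 100.0 * word_width / natural_width


def covered_fraction(num_glyphs, extra=1.0):
    """Fraction of the OCR word box spanned by the word's own glyphs."""
    if num_glyphs <= 0:
        raise ValueError("num_glyphs must be positive")
    return num_glyphs / (num_glyphs + extra)


# Compare the three variants discussed here for a 4-glyph word.
for extra in (1.0, 0.5, 0.0):
    fraction = covered_fraction(4, extra)
    print(f"extra={extra}: glyphs cover {fraction:.0%} of the word box")
```

Under these assumptions the shortfall is largest for short words, which would match the observation that the default boxes look slightly too small, while adding nothing fits the box but leaves no room between adjacent words.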

I uploaded three text PDFs here: https://archive.org/~merlijn/pdf-text/

Maybe someone can verify that do-not-add-1.pdf indeed does not contain spaces, whereas default.pdf does. While checking that, maybe they can also see whether add-half.pdf adds spaces in Preview when copying. (Edit: never mind the check for spaces; the spaces should just always work. I confused this with the solution proposed in the other issue. Still, this solution could potentially just work.)

MerlijnWajer · 2021-10-27T16:24:16Z

Oh, I think I might be conflating this issue (no spaces being added) with the other one you linked, with regard to no spaces being added to the copied text. In any case, the patch on this issue as is doesn't improve the situation in evince for me; it makes it worse.

wollmers · 2021-10-27T17:30:19Z

@MerlijnWajer

> I uploaded three text PDFs here: https://archive.org/~merlijn/pdf-text/

Unfortunately, it seems that the problem is caused by differences between PDF viewers. I tried Chrome, Safari and Preview on macOS, with these actions: mark a complete line, mark a word, double-click on a word, and find 'Magazine.' None of the combinations works perfectly, and the behaviour differs across the viewers.

And now the bad news: try pdftotext -layout <PDF-file> [<text-file>] and you will see that only the result of default.pdf is acceptable. For me this is critical, because I use it in production.

The other problems with the display in viewers are more cosmetic. OK, Mac inserts a space at the beginning of a line, while Chrome inserts one at the end of a line. None of the viewers works correctly.

MerlijnWajer · 2021-10-27T22:27:02Z

@wollmers thanks for the suggested test case. What pdftotext version do you use? I have this: pdftotext version 21.07.0.

When I use pdftotext -layout add-half.pdf I get arguably better results than pdftotext -layout default.pdf, but the do-not-add-1.pdf utterly fails.

Could you elaborate on what is not acceptable? I've uploaded the .txt files for reference to https://archive.org/~merlijn/pdf-text/
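For anyone who wants to repeat this comparison, here is a minimal sketch that runs pdftotext -layout on two of the PDFs and prints a unified diff of the extracted text. It assumes that poppler's pdftotext is on the PATH and that the PDFs from the link above have been downloaded into the current directory. The helper names are hypothetical.

```python
import difflib
import subprocess


def pdftotext_cmd(pdf_path):
    """Command line for layout-preserving extraction; "-" writes to stdout."""
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_layout_text(pdf_path):
    """Run pdftotext and return the extracted text.

    Raises CalledProcessError if pdftotext exits with an error.
    """
    result = subprocess.run(pdftotext_cmd(pdf_path), check=True,
                            capture_output=True, text=True)
    return result.stdout


def unified_diff(text_a, text_b, name_a="a", name_b="b"):
    """Return a unified diff of two texts, or "" if they are identical."""
    lines = difflib.unified_diff(text_a.splitlines(), text_b.splitlines(),
                                 fromfile=name_a, tofile=name_b, lineterm="")
    return "\n".join(lines)


# Usage (file names as in this thread, assumed to be local copies):
# print(unified_diff(extract_layout_text("default.pdf"),
#                    extract_layout_text("add-half.pdf"),
#                    "default.pdf", "add-half.pdf"))
```

A diff only shows where the outputs differ; whether a given layout is acceptable is still a judgment call.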

MerlijnWajer · 2021-10-27T22:28:27Z

Regarding the viewers: this bug report in particular is about improving the highlighting of a word (the box drawn around it using the glyphless font). The box in default.pdf is usually too small, whereas in do-not-add-1.pdf it is usually a good fit but causes trouble (for me, and for pdftotext) when selecting text. add-half.pdf mostly looks fine visually, but clearly doesn't fully encompass the words. Is that what you see in the viewers, or were you commenting on just the text selection?

wollmers · 2021-10-28T04:47:33Z

@MerlijnWajer

My version:

$ pdftotext -v
pdftotext version 20.12.1
Copyright 2005-2020 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

Yes, add-half.pdf is acceptable, but not the best. default.pdf keeps the centered layout of the first line with your version of pdftotext.

zdenop · 2021-10-28T07:20:42Z

Be careful with pdftotext: there are several versions/forks of this tool that produce different results. See #2426 (comment).
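When comparing results across machines, it can help to record which pdftotext build produced each text file. The sketch below parses the banner format shown in the `pdftotext -v` output above. It assumes the banner contains "pdftotext version X.Y.Z" and, for poppler builds, the "Poppler Developers" copyright line. The function names are hypothetical.

```python
import re
import subprocess


def parse_pdftotext_version(banner):
    """Return the version as a tuple of ints, or None if not found."""
    match = re.search(r"pdftotext version (\d+(?:\.\d+)*)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def is_poppler_build(banner):
    """Heuristic: poppler builds name 'The Poppler Developers' in the banner."""
    return "Poppler Developers" in banner


def installed_pdftotext_banner():
    """Return the banner of the pdftotext found on PATH.

    The banner may be printed on stderr, so both streams are combined.
    """
    result = subprocess.run(["pdftotext", "-v"],
                            capture_output=True, text=True)
    return result.stderr + result.stdout
```

For example, `parse_pdftotext_version(installed_pdftotext_banner())` could be logged next to each extracted text file.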

MerlijnWajer · 2021-10-28T07:32:06Z

@zdenop - Thanks for the heads up, at least in this case we have a similar codebase (but not the same version):

$ pdftotext -v
pdftotext version 21.07.0
Copyright 2005-2021 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

@wollmers - I'm going to experiment a bit with the patch here and the zero width space patch in the other issue, see if I can figure something out that works on mac. Thanks for testing.

amitdo · 2021-10-28T08:14:30Z

Merlijn,

Maybe you want to try to implement the suggestion in this comment:

#2879 (comment)

BTW, here are two Python based pdf + ocr layer renderers:

https://github.com/jbarlow83/OCRmyPDF/blob/master/src/ocrmypdf/hocrtransform.py

https://github.com/ocropus/hocr-tools/blob/master/hocr-pdf
This one was written by the same developer that wrote the pdf code in Tesseract.

Both of them use the reportlab library.

MerlijnWajer · 2021-10-28T08:46:17Z

> Merlijn,
>
> Maybe you want to try to implement the suggestion in this comment:
>
> #2879 (comment)

Right, that was what I was thinking of trying.

> BTW, here are two Python based pdf + ocr layer renderers:
>
> https://github.com/jbarlow83/OCRmyPDF/blob/master/src/ocrmypdf/hocrtransform.py
>
> https://github.com/ocropus/hocr-tools/blob/master/hocr-pdf
> This one was written by the same developer that wrote the pdf code in Tesseract.
>
> Both of them use the reportlab library.

Thanks for the pointers. I did see and try both of these before I decided last year to port the Tesseract renderer to Python (https://github.com/internetarchive/archive-pdf-tools/blob/master/internetarchivepdf/pdfrenderer.py). hocr-pdf in particular gave me a lot of problems, so I gave up on it pretty quickly. The other tool doesn't deal that well with large hOCR files (streaming-wise, memory-usage-wise) and also (I think) didn't do as good a job as Tesseract did. I'm inclined to see what can be improved in Tesseract (in particular, whether we can make it work in Preview and have the selection boxes look good), because in my experience it is the best text renderer.

In any case, I'll try to toy with this and see what I can find beyond what others already found (maybe nothing, we'll see).

MerlijnWajer · 2021-11-28T18:26:38Z

I ran into this the other day: nickjwhite/gofpdf@3e3f1fb

mentioned in https://blog.rescribe.xyz/posts/pdfs/ - maybe it has something useful there.

Fix word box boundaries in rendered PDF

733814f

amitdo closed this Feb 4, 2021

amitdo added the PDF label Feb 17, 2021

amitdo mentioned this pull request Feb 26, 2024

Invisible glyph bounds at wrong positions in PDF #2879

Open

amitdo mentioned this pull request Mar 9, 2024

Plans for tesseract 5.x.y #3673

Open
