wkhtmltopdf wrongly removes spaces before hyperlinks #4960
Comments
I vaguely remember hearing about a similar problem with I just tried your example, on Win10 with 0.12.5 (patched qt), and it did not drop the space. The only change I made to your HTML was to replace the two quotes with |
@PhilterPaper Thank you for your answer.
Unfortunately, I have very little control over what the HTML looks like. It is rendered with Tp-Note which uses the
It is a pity that you cannot reproduce the misbehaviour. To be sure, I copied and pasted the above myself once more and I can reproduce it. Maybe the typographic quotes are the problem? Then, I noticed that you are not using the latest version, 0.12.6. |
Could you at least try with the "typographic quotes" (MS Smart Quotes?) replaced by the entities, on 0.12.6? That might at least narrow down the problem to whether it's those quotes messing up something in wkHTMLtoPDF. If they're Smart Quotes stored as Windows-125x bytes, and you're trying to encode as UTF-8, they're not going to be compatible (those bytes are not valid Latin-x/ISO-8859-x or UTF-8 quotes). If it's a matter of having to replace quotes with entities, or split a line, that could conceivably be automated with a preprocessor on the HTML file. |
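A minimal sketch of such a preprocessor, for the quote case only. It assumes the HTML file is UTF-8 encoded and that the quotes are U+201C and U+201D; `quotes_to_entities` is a hypothetical helper name, not something from wkhtmltopdf or Tp-Note.

```shell
# Hypothetical helper: replace typographic double quotes with HTML entities.
# Assumes UTF-8 input. The "&" in each replacement is escaped because sed
# otherwise expands a bare "&" to the matched text.
quotes_to_entities() {
    sed -e 's/“/\&ldquo;/g' -e 's/”/\&rdquo;/g'
}
```

It reads HTML on stdin and writes the result to stdout, so it could sit in a pipeline in front of wkhtmltopdf.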
I tried both with the quotes replaced and without typographic quotes. Both result in a dropped space before “cathedral”, which means the error has nothing to do with the quotes. |
If it wasn't the mangled quotes, the only other thing I could suggest is splitting the line so that |
Bingo! <div class="noteBody"><p>This week I read an essay “A brief explanation of the
<a href="https://graymirror.substack.com/p/a-brief-explanation-of-the-cathedral">cathedral</a>”.</p> If this is a general problem, your suggested workaround might work. Anyway, this bug makes automatic |
If you're directly processing the URL as input to wkHTMLtoPDF, you'd have to find a way to copy the HTML to a local file, preprocess it with sed or the equivalent, and then run wkHTMLtoPDF against that modified file. You may be able to leave the supporting cast (JS, CSS, etc.) back on the original site. Anyway, it's a bit ugly but might be automatable until this bug gets fixed. |
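The steps described above could be sketched roughly as follows. This is an untested outline, not a confirmed fix: `break_before_links` and `render_url` are hypothetical names, and the sketch assumes that starting each `<a ...>` element on its own line is enough to keep the space, as it was in the example in this thread.

```shell
#!/bin/sh
# Hypothetical helper: start every <a ...> element on its own line.
# The backslash-newline inside the sed replacement is the portable (POSIX)
# way to insert a newline; closing </a> tags are not matched.
break_before_links() {
    sed 's/<a /\
<a /g'
}

# Hypothetical wrapper: copy the page to a local file, preprocess it,
# then render the local copy. $1 = source URL, $2 = output PDF path.
render_url() {
    workdir=$(mktemp -d) || return 1
    curl -fsSL "$1" | break_before_links > "$workdir/page.html"
    wkhtmltopdf "$workdir/page.html" "$2"
    status=$?
    rm -rf "$workdir"
    return "$status"
}
```

Note that relative links to CSS or JS will no longer resolve once the page is rendered from a local copy, so this only fits pages whose resources use absolute URLs.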
Thank you for the detailed workaround. BTW: are all versions affected by this bug? Maybe one could just downgrade? |
I've seen this bug mentioned a number of times before, but I was surprised that it seemed to work OK in my 0.12.5. Maybe there's some sort of instability in the code (which would be very bad). Anyway, splitting lines at tags (HTML elements) may work for you if your documents experience this problem. |
I am not sure what you mean by “instability”, but the error should be relatively easy to fix: it is reproducible and deterministic. It becomes much harder when an error appears randomly, caused by, for example, memory leaks or dangling pointers. I would be surprised if memory safety were the cause in this case. Sounds more like a typical parser problem to me. |
Yes, that's what I meant by an instability. Inconsistent, hard to reproduce, not obviously deterministic. I wouldn't necessarily call this particular problem "reproducible and deterministic" if it only applies (somewhat consistently) to certain input data patterns and not across all versions, but we all have our own favored terminology. For example, here we have a word, space, HTML tag, word (without spaces), and another HTML tag; where the first space disappears. I don't think it's consistent enough that I could conjure up an example from scratch with full expectations that it would fail in this way. All we have is a (broken) example that might mysteriously heal if some preceding text were changed. I may be wrong, but I fear it will be a difficult fix. |
The workaround is documented on page 35 of the Tp-Note manual: tp-note --export=- "$1" | sed 's_<_\r<_g' - | wkhtmltopdf --footer-center "[page]/[topage]" -B 2cm -L 2cm -R 2cm -T 2cm - "{$1}.pdf" |
OK, so you're editing the HTML to make each HTML tag (element) start on a new line. That's probably overkill, but lacking any way to tell in advance which tags are going to cause a problem, that could be a workable solution. If it works consistently, go for it! |
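A slightly more cautious variant of the same idea, as a sketch only. It breaks before every tag with a newline, but leaves lines inside `<pre>` blocks alone, since extra line breaks there become visible in the output. `break_before_tags` is a hypothetical name, and whether `<pre>` is the only place where the blanket split causes trouble is an assumption.

```shell
# Hypothetical helper: put each tag on a new line, except on lines that
# belong to a <pre>...</pre> block. Limitation: a line that contains <pre>
# is left untouched as a whole, including any text before the tag.
break_before_tags() {
    awk '
        /<pre[ >]/ { in_pre = 1 }
        { if (!in_pre) gsub(/</, "\n<"); print }
        /<\/pre>/  { in_pre = 0 }
    '
}
```

Like the sed step in the posted command, it filters stdin to stdout, so it could take the place of that step in the pipeline.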
This quick and dirty hack certainly introduces other rendering errors. But so far, I was lucky as these did not manifest in my documents yet. |
This seems to be an issue with fontconfig as seen here: #2118 (comment) |
wkhtmltopdf version(s) affected:
wkhtmltopdf 0.12.6 (with patched qt)
OS information
Linux version 4.19.0-14-amd64 (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.171-2 (2021-01-30)
Description
Wkhtmltopdf wrongly removes spaces in some circumstances.
How to reproduce
Render the HTML below (I could not attach it here) and observe the missing space before the word “cathedral” (my rendition is attached below).
Expected behavior
Space before the word “cathedral”.
20210227-test.md.pdf
Possible Solution
none found