This repository has been archived by the owner on Jan 2, 2023. It is now read-only.

wkhtmltopdf wrongly removes spaces before hyperlinks #4960

Open
getreu opened this issue Feb 27, 2021 · 15 comments

@getreu

getreu commented Feb 27, 2021

wkhtmltopdf version(s) affected:

wkhtmltopdf 0.12.6 (with patched qt)

OS information

Linux version 4.19.0-14-amd64 (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.171-2 (2021-01-30)

Description

Wkhtmltopdf wrongly removes spaces in some circumstances.

How to reproduce
Render the HTML below (I could not attach it here) and observe the missing space before the word “cathedral” (my rendition is attached below).

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>test</title>
<style>
table, th, td { font-weight: normal; }
table.center {
  margin-left: auto;
  margin-right: auto;
  background-color: #f3f2e4;
  border:1px solid grey;
}
th, td {
  padding: 3px;
  padding-left:15px;
  padding-right:15px;
}
th.key{ color:#444444; text-align:right; }
th.val{
  color:#316128;
  text-align:left;
  font-family:sans-serif;
}
th.keygrey{ color:grey; text-align:right; }
th.valgrey{ color:grey; text-align:left; }
pre { white-space: pre-wrap; }
em { color: #523626; }
a { color: #316128; }
h1 { font-size: 150% }
h2 { font-size: 132% }
h3 { font-size: 115% }
h4, h5, h6 { font-size: 100% }
h1, h2, h3, h4, h5, h6 { color: #263292; font-family:sans-serif; }

</style>
  </head>
  <body>
  <table class="center">
    <tr>
    <th class="key">title:</th>
    <th class="val"><b>test</b></th>
  </tr>
    <tr>
    <th class="key">subtitle:</th>
    <th class="val"></th>
  </tr>
    <tr>
    <th class="keygrey">date:</th>
    <th class="valgrey"></th>
  </tr>
  
  </table>
  <div class="noteBody"><p>This week I read an essay “A brief explanation of the <a href="https://graymirror.substack.com/p/a-brief-explanation-of-the-cathedral">cathedral</a>”.</p>
</div>
</body>
</html>

Expected behavior

Space before the word “cathedral”.

20210227-test.md.pdf

Possible Solution

none found

@PhilterPaper

I vaguely remember hearing about a similar problem with <span> losing the space before a word. You might do a search involving "span". IIRC, the solution was to put the offended word (here, "cathedral") starting a new line, or the <a> tag starting a line.

I just tried your example, on Win10 with 0.12.5 (patched qt), and it did not drop the space. The only change I made to your HTML was to replace the two quotes with &ldquo; and &rdquo;, as they did not copy-paste over well.

@getreu
Author

getreu commented Mar 1, 2021

@PhilterPaper Thank you for your answer.

the solution was to put the offended word (here, "cathedral") starting a new line, or the <a> tag starting a line.

Unfortunately, I have very little control over what the HTML looks like. It is rendered with Tp-Note, which uses the pulldown_cmark library.

The only change I made to your HTML was to replace the two quotes with &ldquo; and &rdquo;, as they did not copy-paste over well.

It is a pity that you cannot reproduce the misbehaviour. To be sure, I copied and pasted the above myself once more, and I can reproduce it. Maybe the typographic quotes are the problem? I also noticed that you are not using the latest version, 0.12.6.

@PhilterPaper

Could you at least try with the "typographic quotes" (MS Smart Quotes?) replaced by the entities, on 0.12.6? That might at least narrow down the problem to whether it's those quotes messing up something in wkHTMLtoPDF. If they're Smart Quotes, and you're trying to encode as UTF-8, they're not going to be compatible (they're Windows-125x, not Latin-x/ISO-8859-x or UTF-8).

If it's a matter of having to replace quotes with entities, or split a line, that could conceivably be automated with a preprocessor on the HTML file.

@getreu
Author

getreu commented Mar 1, 2021

Could you at least try with the "typographic quotes" (MS Smart Quotes?) replaced by the entities, on 0.12.6?

I tried both with the quotes replaced by entities and with no typographic quotes at all. Both result in a dropped space before “cathedral”. This means the error has nothing to do with the quotes.

@PhilterPaper

If it wasn't the mangled quotes, the only other thing I could suggest is splitting the line so that <a.../a> is on a new line. If that does the trick, and since that may be a general problem with any inline tag (such as <span>), that could be automated with some tool like "sed" to split each tag (at <) onto a new line. A more elaborate preprocessor might do this just for inline tags. It's worth considering if you can't find any other workaround.
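
The "more elaborate preprocessor" idea can be sketched with sed alone. This is a minimal sketch, not a verified fix: the function name `split_before_inline_tags` and the list of inline tags are assumptions made for illustration. It only touches an inline opening tag that already has whitespace in front of it, keeps that whitespace, and adds a line break after it, so the tag starts a new line as suggested above. Whether this cures every occurrence of the dropped space is not known.

```shell
# Hypothetical filter: reads HTML on stdin, writes HTML on stdout.
# Before each lowercase inline opening tag that is preceded by whitespace,
# keep the whitespace and add a newline so the tag starts a new line.
# The tag list (a, span, em, strong, b, i, code) is an assumption.
split_before_inline_tags() {
    # Backslash followed by a literal newline is the portable way to put
    # a newline into a sed replacement.
    sed -E 's/([[:space:]])(<(a|span|em|strong|b|i|code)[[:space:]>])/\1\
\2/g'
}
```

It could be used as a filter in front of the converter, for example `split_before_inline_tags < note.html | wkhtmltopdf - note.pdf`, where `note.html` and `note.pdf` are placeholder file names. Because only whitespace that already exists is extended, normal text flow should render the same; text inside `<pre>` or other whitespace-preserving elements would still show the extra line break.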

@getreu
Author

getreu commented Mar 2, 2021

Bingo!
When I split the line just before <a...>, the missing space reappears correctly!

  <div class="noteBody"><p>This week I read an essay &ldquo;A brief explanation of the 
<a href="https://graymirror.substack.com/p/a-brief-explanation-of-the-cathedral">cathedral</a>&rdquo;.</p>

If this is a general problem, your suggested workaround might work. Still, this bug makes automatic
document processing impossible, which I suppose is a major use case.

@PhilterPaper

If you're directly processing the URL as input to wkHTMLtoPDF, you'd have to find a way to copy the HTML to a local file, preprocess it with sed or the equivalent, and then run wkHTMLtoPDF against that modified file. You may be able to leave the supporting cast (JS, CSS, etc.) back on the original site. Anyway, it's a bit ugly but might be automatable until this bug gets fixed.
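
The three steps described here (copy, preprocess, convert) might look like the following. This is a sketch under assumptions: the function name `url_to_pdf` and the temporary file names are made up for illustration, and `curl` is assumed to be available for the download. The sed step splits at every `<`, which is the blunt variant discussed in this thread.

```shell
# Hypothetical helper. Usage: url_to_pdf URL OUTPUT.pdf
url_to_pdf() {
    url=$1
    out=$2
    workdir=$(mktemp -d) || return 1

    # 1. Copy the remote HTML to a local file.
    if ! curl -fsSL "$url" -o "$workdir/original.html"; then
        rm -rf "$workdir"
        return 1
    fi

    # 2. Start every tag on a new line. Backslash followed by a literal
    #    newline is the portable way to write a newline in a sed replacement.
    sed 's/</\
</g' "$workdir/original.html" > "$workdir/split.html"

    # 3. Run wkhtmltopdf against the modified local copy.
    wkhtmltopdf "$workdir/split.html" "$out"
    status=$?

    rm -rf "$workdir"
    return "$status"
}
```

Stylesheets, scripts, and images referenced with absolute URLs would still be fetched from the original site. Relative references would no longer resolve from the local copy and would need separate handling.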

@getreu
Author

getreu commented Mar 3, 2021

Thank you for the detailed workaround. BTW: are all versions affected by this bug? Maybe one could just downgrade?

@PhilterPaper

I've seen this bug mentioned a number of times before, but I was surprised that it seemed to work OK in my 0.12.5. Maybe there's some sort of instability in the code (which would be very bad). Anyway, splitting lines at tags (HTML elements) may work for you if your documents experience this problem.

@getreu
Author

getreu commented Mar 3, 2021

Maybe there's some sort of instability in the code (which would be very bad)

I am not sure what you mean by “instability”, but the error should be relatively easy to fix: it is reproducible and deterministic. It becomes much harder, when it appears randomly, caused by for example, memory leaks or dangling pointers etc. I would be surprised if for example memory safety is the cause in this case. Sounds more like a typical parser problem to me.

@PhilterPaper

It becomes much harder, when it appears randomly, caused by for example, memory leaks or dangling pointers etc.

Yes, that's what I meant by an instability. Inconsistent, hard to reproduce, not obviously deterministic. I wouldn't necessarily call this particular problem "reproducible and deterministic" if it only applies (somewhat consistently) to certain input data patterns and not across all versions, but we all have our own favored terminology. For example, here we have a word, space, HTML tag, word (without spaces), and another HTML tag; where the first space disappears. I don't think it's consistent enough that I could conjure up an example from scratch with full expectations that it would fail in this way. All we have is a (broken) example that might mysteriously heal if some preceding text were changed. I may be wrong, but I fear it will be a difficult fix.

@getreu
Author

getreu commented Mar 16, 2021

The workaround is documented on page 35 of the Tp-Note manual.

 tp-note --export=- "$1" | sed 's_<_\r<_g' -  | wkhtmltopdf --footer-center "[page]/[topage]" -B 2cm -L 2cm -R 2cm -T 2cm - "${1}.pdf"

@PhilterPaper

OK, so you're editing the HTML to make each HTML tag (element) start on a new line. That's probably overkill, but lacking any way to tell in advance which tags are going to cause a problem, that could be a workable solution. If it works consistently, go for it!

@getreu
Author

getreu commented Mar 17, 2021

This quick and dirty hack certainly introduces other rendering errors. But so far I have been lucky: they have not shown up in my documents yet.
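
One plausible source of such errors is preformatted text: with `pre { white-space: pre-wrap; }`, as in the stylesheet of the test document, whitespace inserted before a tag inside a `<pre>` block can become visible. The sketch below narrows the hack by leaving lines inside `<pre>` blocks untouched. The function name `split_tags_outside_pre` is an assumption for illustration, and the approach is line based, so it is only an approximation.

```shell
# Hypothetical filter: reads HTML on stdin, writes HTML on stdout.
# Lines from one containing "<pre" through the next later line containing
# "</pre>" are passed through unchanged; on all other lines every tag is
# moved to the start of a new line.
split_tags_outside_pre() {
    # "addr1,addr2!cmd" applies cmd only to lines outside the range.
    sed '/<pre/,/<\/pre>/!s/</\
</g'
}
```

A known limitation of the range address: sed looks for the closing `</pre>` only on lines after the one that opened the range, so a `<pre>...</pre>` written entirely on one line would keep the range open until a later `</pre>` or the end of the input.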

@vitorreus

This seems to be an issue with fontconfig as seen here: #2118 (comment)
