Part of Arabic character gets separated after converting XHTML to PDF #3249

Open
sukhoi191 opened this Issue Dec 14, 2016 · 3 comments

@sukhoi191

XHTML displayed in Chrome:

[Screenshot: chrome]

After converting to PDF:

[Screenshot: pdf]

XHTML contents to reproduce: Arabic issue.zip

I'm using the standalone version of wkhtmltopdf.
Command: wkhtmltopdf name_of_input_file.xhtml name_of_output_file.pdf

@PhilterPaper

Problems have been reported with other scripts (Sinhalese, for example) where the parts of a composite character are output separately. See #2764 -- it might be the same, or a similar, issue.

@sukhoi191
sukhoi191 commented Dec 14, 2016 edited

@PhilterPaper - You're right, it seems like a very similar (if not the same) issue.
It's worth mentioning that when I use a non-standard font in the XHTML (e.g. Harmattan), the problem is no longer present and the characters are rendered correctly.

@PhilterPaper

With some alphabets, some characters are defined in Unicode as "combining" marks, to be printed over another character without otherwise changing either character. With others, it appears that two (or more) characters are supposed to be merged into an entirely new single character (presumably with its own entry in Unicode). Thinking about it some more, I suspect that your Arabic characters fall into the first category, and the Sinhalese into the second. The two cases would very likely be handled by quite different code, and be quite separate problems. Plus, an Arabic letter has different forms depending on where in the word it appears, which might have some bearing on the problem. It's quite possible that WebKit (or maybe wkhtmltopdf) just wasn't written with such support in mind, only simple Latin script with maybe some diacritic overprints.
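
To make the distinction above concrete, here is a minimal sketch using Python's standard `unicodedata` module. It only inspects Unicode character properties; it does not touch wkhtmltopdf or WebKit and says nothing about where the rendering bug actually lies. The helper name `describe` and the sample strings are illustrative assumptions, not taken from the attached XHTML.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, general category, combining class) per character."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.category(ch),
            unicodedata.combining(ch),
        )
        for ch in text
    ]


# Case 1: a base letter followed by a combining mark (Arabic BEH + FATHA).
# The mark has category "Mn" (nonspacing mark) and a non-zero combining class.
beh_fatha = "\u0628\u064E"
for row in describe(beh_fatha):
    print(row)

# NFC merges Latin "e" + COMBINING ACUTE ACCENT into one precomposed code point,
# but BEH + FATHA has no precomposed form, so it remains two code points and
# the renderer and font must position the mark over the base letter.
latin_nfc = unicodedata.normalize("NFC", "e\u0301")
arabic_nfc = unicodedata.normalize("NFC", beh_fatha)
print(len(latin_nfc), len(arabic_nfc))

# Case 2: positional forms. Text normally stores only the base letter (U+0628);
# the initial/medial/final shape is chosen at rendering time. Unicode also has
# legacy "presentation form" code points, which NFKC folds back to the base letter.
beh_initial = "\uFE91"
print(unicodedata.name(beh_initial))
beh_folded = unicodedata.normalize("NFKC", beh_initial)
print(describe(beh_folded))
```

If the input text is stored as base letters plus combining marks, as in the first case, correct output depends on the text shaper and the font cooperating, which would be consistent with the observation that switching to a different font changes the result.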
