This repository has been archived by the owner on Jan 2, 2023. It is now read-only.

Incorrect Chinese Characters is converted in PDF outline . #3841

Open
liub-a opened this issue Mar 5, 2018 · 7 comments

@liub-a

liub-a commented Mar 5, 2018

Hi,

When converting a Chinese HTML file to PDF, some Chinese characters in the TOC (outline) of the resulting PDF are not converted correctly.

The test.html is:
<h1>不</h1>

According to the PDF standard, section 3.2: "If an end-of-line marker appears within a literal string without a preceding backslash, the result is equivalent to \n (regardless of whether the end-of-line marker was a carriage return, a line feed, or both)."

So an end-of-line marker appearing within a literal string without a preceding backslash shall be treated as a byte value of (0Ah).

「不」 (U+4E0D) is incorrectly converted to 「上」 (U+4E0A).
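
The reader-side rule quoted above can be illustrated with a minimal sketch in standard C++. `decodeLiteralBody` is a hypothetical helper, not a function from Qt or from any PDF library, and it is deliberately simplified: it covers the unescaped end-of-line rule and the single-character backslash escapes, but not octal escapes or line continuations.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of Qt or any PDF library): decode the bytes
// found between the parentheses of a PDF literal string.
// Simplified: octal escapes and backslash line continuations are omitted.
std::string decodeLiteralBody(const std::string &raw) {
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(raw[i]);
        if (c == '\\' && i + 1 < raw.size()) {
            char e = raw[++i];
            switch (e) {
                case 'n': out += '\n'; break;
                case 'r': out += '\r'; break;
                case 't': out += '\t'; break;
                case 'b': out += '\b'; break;
                case 'f': out += '\f'; break;
                default:  out += e;    break;  // covers \( \) and a doubled backslash
            }
        } else if (c == '\r') {
            // An unescaped CR, or CR followed by LF, is read as a single 0x0A.
            if (i + 1 < raw.size() && raw[i + 1] == '\n') ++i;
            out += '\n';
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

// Usage check: the two bytes of U+4E0D, written raw and written escaped.
void demoUnescapedCr() {
    assert(decodeLiteralBody("\x4E\x0D") == "\x4E\x0A");  // read back as U+4E0A
    assert(decodeLiteralBody("\x4E\\r") == "\x4E\x0D");   // read back as U+4E0D
}
```

Under these assumptions, the raw bytes 4E 0D come back as 4E 0A, which matches the 「不」 to 「上」 substitution reported here.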

good luck.

Sorry For My Bad English.

@PhilterPaper

Is this character (U+4E0D) rendered correctly in the document body, and incorrect only in the outline/bookmarks? Is the Chinese text in UTF-8 or something else, such as UTF-16? Have you checked that the HTML encoding declaration matches the actual encoding (such as UTF-8), and that the HTML file wasn't damaged during file transfer, with the 0D turned into 0A there? If this character is being seen with a 0D in it, it sounds like it's not UTF-8. Are your file transfers in binary or ASCII mode? Is an older Mac involved at any point (line-end x0D)?

I have very little experience with multibyte encodings and PDF, so the above are just some guesses at possible problem points to check.

@liub-a

liub-a commented Mar 6, 2018

CJK characters are converted to UTF-16 when the PDF is written.
The character is rendered correctly in the body text, which is compressed; it is wrong only in the outline, whose "/Title" property is not compressed.
It seems a preceding backslash (\) needs to be added when a character containing the byte 0D is written to the PDF.
0D should be converted to \r (5C 72), which is two bytes in the PDF.
「不」 (U+4E0D, bytes 4E 0D) needs to be written as the bytes 4E 5C 72.
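
The byte sequence described above can be checked outside Qt with a small sketch. `toPdfTextString` is a hypothetical, standard-library-only helper written for illustration; it is not the Qt function, and it assumes the title is serialized as big-endian UTF-16 with a leading FE FF byte order mark.

```cpp
#include <cassert>
#include <string>

// Hypothetical, Qt-free helper: serialize UTF-16 code units as a PDF literal
// text string (big-endian, leading FE FF byte order mark), escaping the bytes
// that are special inside a literal string.
std::string toPdfTextString(const std::u16string &text) {
    std::string out = "(\xFE\xFF";
    for (char16_t unit : text) {
        const unsigned char bytes[2] = {
            static_cast<unsigned char>(unit >> 8),
            static_cast<unsigned char>(unit & 0xFF)};
        for (unsigned char b : bytes) {
            if (b == 0x0D) {
                // A raw CR would be read back as LF, so emit the two bytes 5C 72.
                out += "\\r";
            } else {
                // Parentheses and backslashes need a preceding backslash.
                if (b == '(' || b == ')' || b == '\\') out += '\\';
                out += static_cast<char>(b);
            }
        }
    }
    out += ')';
    return out;
}

// Usage check: U+4E0D is written as 4E 5C 72 between the BOM and the ')'.
void demoEscapeCr() {
    assert(toPdfTextString(u"\u4E0D") == "(\xFE\xFF\x4E\\r)");
}
```

With this escaping, a code unit such as U+4E0A, which contains no 0D byte, is written unchanged.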

The attached file contains two PDF files. Incorrect.pdf was generated by wkhtmltopdf.

In Incorrect.pdf, 4E 0D is converted to 4E 0A when a PDF reader (Adobe Reader) reads it.

In Correct.pdf, 「不」 is written as the bytes 4E 5C 72, so the PDF reader converts it back to 4E 0D (U+4E0D).

The problem seems to be in QPdfEngine. According to the PDF standard, it should escape such bytes when printString writes the string to the PDF.

testpdf.zip

Forgive my poor English!

@liub-a

liub-a commented Mar 6, 2018

I fixed it by changing printString as follows:

void QPdfEnginePrivate::printString(const QString &string) {
    // The 'text string' type in PDF is encoded either as PDFDocEncoding, or
    // Unicode UTF-16 with a Unicode byte order mark as the first character
    // (0xfeff), with the high-order byte first.
    QByteArray array("(\xfe\xff");
    const ushort *utf16 = string.utf16();

    for (int i = 0; i < string.size(); ++i) {
        // Split each UTF-16 code unit into its high and low byte.
        char part[2] = {char(utf16[i] >> 8), char(utf16[i] & 0xff)};
        for (int j = 0; j < 2; ++j) {
            // Parentheses and backslashes must be escaped in a literal string.
            if (part[j] == '(' || part[j] == ')' || part[j] == '\\')
                array.append('\\');
            // A raw 0x0D would be read back as 0x0A, so write it as "\r".
            if (part[j] == '\r')
                array.append("\\r");
            else
                array.append(part[j]);
        }
    }
    array.append(")");
    write(array);
}

It works fine.
fixed_incorrect_chinese_characters.patch.txt

@doublesixwings

I also encountered this problem, but as a non-programmer I still don't know how to apply this patch.txt to fix the bug. Can someone help me? Thank you so much!

@rainydayprogrammer

I encountered this today. I have neither a build environment nor the time to rebuild, so I downloaded the HTML page and replaced the troublesome character with a character reference:

<h1>「テスト」</h1>
to
<h1>「テスト&#xFF63;</h1>
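
The workaround above depends on knowing which characters are affected. Going by the analysis earlier in this thread, those are the characters whose UTF-16 code units contain a 0D byte: 「」」 is U+300D (bytes 30 0D), while the replacement U+FF63 has no such byte. A minimal sketch for locating them in a heading follows; `hasCrByte` and `findCrUnits` are hypothetical helper names, not part of wkhtmltopdf or Qt.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: true if either byte of a UTF-16 code unit is 0x0D.
bool hasCrByte(char16_t unit) {
    return (unit >> 8) == 0x0D || (unit & 0xFF) == 0x0D;
}

// Return the indices of the code units in `text` that contain a 0x0D byte.
std::vector<std::size_t> findCrUnits(const std::u16string &text) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (hasCrByte(text[i])) hits.push_back(i);
    }
    return hits;
}

// Usage check: only the closing bracket U+300D of the original heading is
// flagged; the heading using U+FF63 has no affected code units.
void demoFindCrUnits() {
    assert(findCrUnits(u"\u300C\u30C6\u30B9\u30C8\u300D") ==
           std::vector<std::size_t>{4});
    assert(findCrUnits(u"\u300C\u30C6\u30B9\u30C8\uFF63").empty());
}
```

Code units in the range U+0D00 to U+0DFF are flagged as well, because their high byte is 0D.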

@langsz

langsz commented Aug 9, 2021

#3841 (comment)

Does this require rebuilding Qt?

@PhilterPaper

So it sounds to me like an unescaped x0D in text is treated differently between the main PDF and whatever code is handling the outline entries. That is, it's translating x0D to x0A during the building of the PDF, and not after during file transfer or by the Reader. If so, the solution would be to find the outline code and see if other code (as used in the main PDF) could be substituted, or otherwise rewritten to avoid the code which is doing this on-the-fly translation. Perhaps the offending code is not wide-character aware, and treats everything as ASCII?
