This repository has been archived by the owner on Jan 2, 2023. It is now read-only.

[qt] generated PDF does not pass the PDF validations #1573

Open
lindi2 opened this issue Mar 5, 2014 · 24 comments
Labels
UpstreamChangeNeeded Verified The issue is verified.
Milestone

Comments

@lindi2

lindi2 commented Mar 5, 2014

Steps to reproduce:

  1. Ensure that you are using a Debian 7 amd64 system
  2. Download and unpack http://downloads.sourceforge.net/project/wkhtmltopdf/0.12.0/wkhtmltox-linux-amd64_0.12.0-03c001d.tar.xz
  3. Download PDF validator from the Apache project: http://www.apache.org/dyn/closer.cgi/pdfbox/1.8.4/preflight-app-1.8.4.jar
  4. Create hello.html:
cat > hello.html <<EOF
<html>
<head>
<title>title</title>
</head>
<body>
<h1>h1</h1>
<p>p</p>
</body>
</html>
EOF
  5. wkhtmltox/bin/wkhtmltopdf hello.html hello.pdf
  6. java -jar preflight-app-1.8.4.jar hello.pdf

Expected results:
6) output of wkhtmltopdf passes the PDF validator

Actual results:
6) the PDF validator complains:

The filehello.pdf is not valid, error(s) :
1.1 : Body Syntax error, Second line must contains at least 4 bytes greater than 127
1.0 : Syntax error, Object (9:0) at offset 326 does not end with 'endobj'.

More info:

  1. The file hello.pdf can be downloaded from http://lindi.iki.fi/lindi/wkhtmltopdf/hello.03c001de254b857f08eba80b62d4b6490ffed41d.pdf -- its md5sum is efe87c6b91ad3b8c93071d8689fe144e.
  2. If I look at the PDF internals I can see that object 9:0 indeed does not end with endobj (comments with "#" added by me):
...
9 0 obj      # obj 9:0 starts here
<<
/__WKANCHOR_2 8 0 R
>>
11 0 obj      # obj 11:0 starts here
<</Title (\376\377^@h^@1)
  /Parent 10 0 R
  /Dest /__WKANCHOR_2
  /Count 0
>>
endobj       # obj 11:0 ends here
10 0 obj     # obj 10:0 starts here
...
  3. This seems to be the only object that does not end with endobj since there are 26 objs and 25 endobjs:
$ grep -ca " obj$" hello.pdf
26
$ grep -ca "endobj" hello.pdf
25
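The count mismatch above only says that one object is unterminated, not which one. A minimal Python sketch that names the offending object follows. It is a heuristic scan rather than a real PDF parser (it assumes object headers start at the beginning of a line and that compressed stream data does not happen to contain something that looks like a header), and the function name `find_unterminated_objects` is made up for illustration.

```python
import re

# Matches an object header such as "9 0 obj" at the start of a line.
OBJ_START = re.compile(rb"^(\d+) (\d+) obj\b", re.MULTILINE)


def find_unterminated_objects(pdf_bytes: bytes) -> list[tuple[int, int]]:
    """Return (object number, generation) for every object that has no
    'endobj' keyword before the next object header (or end of input)."""
    starts = list(OBJ_START.finditer(pdf_bytes))
    missing = []
    for idx, match in enumerate(starts):
        # The body of this object runs up to the next header, if any.
        end = starts[idx + 1].start() if idx + 1 < len(starts) else len(pdf_bytes)
        body = pdf_bytes[match.end():end]
        if b"endobj" not in body:
            missing.append((int(match.group(1)), int(match.group(2))))
    return missing


# Fragment modelled on the excerpt above (without the "#" annotations).
sample = (
    b"9 0 obj\n<<\n/__WKANCHOR_2 8 0 R\n>>\n"
    b"11 0 obj\n<</Title (h1)\n  /Parent 10 0 R\n>>\nendobj\n"
)
print(find_unterminated_objects(sample))  # [(9, 0)]
```

On a real file, something like `find_unterminated_objects(open("hello.pdf", "rb").read())` should point at the same object that the validator reports.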
@ashkulz
Member

ashkulz commented Mar 6, 2014

First of all, thank you for the very detailed and comprehensive bug report! I wish all bug reports were like this 👍

I think the following patch for the patched Qt should fix the issue you mentioned:

diff --git a/src/gui/painting/qprintengine_pdf.cpp b/src/gui/painting/qprintengine_pdf.cpp
index fd4e297..5dac92b 100644
--- a/src/gui/painting/qprintengine_pdf.cpp
+++ b/src/gui/painting/qprintengine_pdf.cpp
@@ -175,7 +175,7 @@ bool QPdfEngine::begin(QPaintDevice *pdev)
 bool QPdfEngine::end()
 {
     Q_D(QPdfEngine);
-   
+
     uint dests;
     if (d->anchors.size()) {
         dests = d->addXrefEntry(-1);
@@ -185,8 +185,8 @@ bool QPdfEngine::end()
             d->printAnchor(i.key());
             d->xprintf(" %d 0 R\n", i.value());
         }
-        d->xprintf(">>\n");
-    }  
+        d->xprintf(">>\nendobj\n");
+    }

     if (d->outlineRoot) {
         d->outlineRoot->obj = d->requestObject();
@@ -608,7 +608,7 @@ int QPdfEnginePrivate::gradientBrush(const QBrush &b, const QMatrix &matrix, int
                 ">>\n"
                 "stream\n"
               << content
-              << "endstream\n"
+              << "\nendstream\n"
                 "endobj\n";

             int softMaskFormObject = addXrefEntry(-1);
@@ -1236,7 +1236,7 @@ int QPdfEnginePrivate::writeImage(const QByteArray &data, int width, int height,
             xprintf(">>\nstream\n");
         len = writeCompressed(data);
     }
-    xprintf("endstream\n"
+    xprintf("\nendstream\n"
             "endobj\n");
     addXrefEntry(lenobj);
     xprintf("%d\n"
@@ -1376,7 +1376,7 @@ void QPdfEnginePrivate::embedFont(QFontSubset *font)
             "/CapHeight " << properties.capHeight.toReal()*scale << "\n"
             "/StemV " << properties.lineWidth.toReal()*scale << "\n"
             "/FontFile2 " << fontstream << "0 R\n"
-            ">> endobj\n";
+            ">>\nendobj\n";
         write(descriptor);
     }
     {
@@ -1394,7 +1394,7 @@ void QPdfEnginePrivate::embedFont(QFontSubset *font)
             "stream\n";
         write(header);
         int len = writeCompressed(fontData);
-        write("endstream\n"
+        write("\nendstream\n"
               "endobj\n");
         addXrefEntry(length_object);
         xprintf("%d\n"
@@ -1622,7 +1622,7 @@ void QPdfEnginePrivate::writePage()
     xprintf("stream\n");
     QIODevice *content = currentPage->stream();
     int len = writeCompressed(content);
-    xprintf("endstream\n"
+    xprintf("\nendstream\n"
             "endobj\n");

     addXrefEntry(pageStreamLength);

However, the following errors are still reported, and I don't know why they are being generated:

The filehello.pdf is not valid, error(s) :
1.1 : Body Syntax error, Second line must contains at least 4 bytes greater than 127
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, The operator "f" can't be used without Color Profile
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
3.1.11 : Invalid Font definition, The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, The FontFile can't be read
2.4.3 : Invalid Color space, DestOutputProfile is missing
3.1.11 : Invalid Font definition, The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, The FontFile can't be read
2.4.3 : Invalid Color space, DestOutputProfile is missing
1.4.1 : Trailer Syntax error, The trailer dictionary doesn't contain ID
1.2.5 : Body Syntax error, Stream length is invalide
1.2.5 : Body Syntax error, Stream length is invalide
7.1 : Error on MetaData, Missing Metadata Key in catalog
1.4.9 : Trailer Syntax error, Outline Hierarchy doesn't have Count entry

@ashkulz ashkulz added this to the 0.12.1 milestone Mar 6, 2014
@ashkulz
Member

ashkulz commented Mar 10, 2014

@lindi2: any updates? otherwise I will commit the above changes and close the issue.

@ashkulz ashkulz added Fixed and removed Verified labels Mar 28, 2014
@lindi2
Author

lindi2 commented Apr 1, 2014

@ashkulz I don't have a build environment set up right now. Did you start getting those other validation errors when you updated Qt? I don't get them with the exact versions listed in the report.

@ashkulz
Member

ashkulz commented Apr 1, 2014

@lindi2: I am getting them with the updated Qt while testing on Windows. I don't think the relevant code has changed in a long time... Can you check after compiling with the latest code?

@lindi2
Author

lindi2 commented Apr 1, 2014

@ashkulz I can put that on my todo list, but it'll take some time to set up the build system.

@ashkulz
Member

ashkulz commented Apr 1, 2014

Let me see if I can build and upload a new development snapshot; it will take a day or two to compile all the combinations.

@ashkulz
Member

ashkulz commented Apr 2, 2014

@lindi2: setting up a development environment is very simple, see INSTALL.md. Either way, a new development snapshot is available and linked from the website.

@lindi2
Author

lindi2 commented Apr 2, 2014

@ashkulz sure, I just need to find the time to do it :) Anyways, wkhtmltox-0.12.1-b3e000e_linux-wheezy-amd64.tar.xz generates the following validation errors here:

1.1 : Body Syntax error, Second line must contains at least 4 bytes greater than 127
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, The operator "f" can't be used without Color Profile
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
3.1.11 : Invalid Font definition, The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, The FontFile can't be read
2.4.3 : Invalid Color space, DestOutputProfile is missing
3.1.11 : Invalid Font definition, The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, The FontFile can't be read
2.4.3 : Invalid Color space, DestOutputProfile is missing
1.4.1 : Trailer Syntax error, The trailer dictionary doesn't contain ID
1.2.5 : Body Syntax error, Stream length is invalide
1.2.5 : Body Syntax error, Stream length is invalide
7.1 : Error on MetaData, Missing Metadata Key in catalog
1.4.9 : Trailer Syntax error, Outline Hierarchy doesn't have Count entry

Could git bisect be used to figure out which commit created these? Does git bisect work sensibly with git submodules (I'm assuming that qt is a git submodule)?

@ashkulz
Member

ashkulz commented Apr 2, 2014

Are you saying you didn't get any other errors with 0.12.0?

@lindi2
Author

lindi2 commented Apr 3, 2014

That's right. With wkhtmltox-linux-amd64_0.12.0-03c001d.tar.xz I only get those two validation errors. Anyways, I got a VM for building wkhtmltopdf now and built b3e000e which gives me

The filehello.pdf is not valid, error(s) :
1.1 : Body Syntax error, Second line must contains at least 4 bytes greater than 127
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, The operator "f" can't be used without Color Profile
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
3.1.11 : Invalid Font definition, The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, The FontFile can't be read
2.4.3 : Invalid Color space, DestOutputProfile is missing
3.1.11 : Invalid Font definition, The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, The FontFile can't be read
2.4.3 : Invalid Color space, DestOutputProfile is missing
1.4.1 : Trailer Syntax error, The trailer dictionary doesn't contain ID
1.2.5 : Body Syntax error, Stream length is invalide
1.2.5 : Body Syntax error, Stream length is invalide
7.1 : Error on MetaData, Missing Metadata Key in catalog
1.4.9 : Trailer Syntax error, Outline Hierarchy doesn't have Count entry

just like with the static binary as expected.

@ashkulz
Member

ashkulz commented Apr 3, 2014

There aren't any changes to the PDF rendering code, let me recompile 0.12.0 and verify it.

@lindi2
Author

lindi2 commented Apr 3, 2014

Build instructions seem to have changed between these two versions so it's not exactly trivial to git bisect I guess?

@ashkulz
Member

ashkulz commented Apr 4, 2014

Verified, just two errors reported for 0.12.0. Bisection is not straightforward because Qt is a submodule and the changes are on two separate branches (rebased with the wkhtmltopdf-specific patches). I'll look at it over the weekend.

@ashkulz ashkulz reopened this Apr 4, 2014
@ashkulz ashkulz added Verified and removed Fixed labels Apr 4, 2014
@ashkulz
Member

ashkulz commented Apr 6, 2014

It looks like you start getting these errors as soon as you add the missing endobj -- probably the other errors were getting masked by it and are now getting reported. Do you have any idea of what they mean?
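One way to see exactly which checks were unmasked is to compare the sets of error codes in two validator reports. The Python sketch below assumes each report line has the shape "<dotted code> : <message>", as in the output quoted in this thread; lines that do not match (the "The file... is not valid" header, or fragments of wrapped console lines) are simply ignored. The helper names are hypothetical.

```python
import re

# Assumed report line shape: "<dotted code> : <message>".
ERROR_LINE = re.compile(r"^(\d+(?:\.\d+)*) : ")


def error_codes(report: str) -> set[str]:
    """Collect the distinct error codes that appear in a validator report."""
    codes = set()
    for line in report.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            codes.add(match.group(1))
    return codes


def newly_reported(before: str, after: str) -> list[str]:
    """Codes present in the 'after' report but absent from the 'before' one."""
    return sorted(error_codes(after) - error_codes(before))


# Abbreviated reports using messages quoted earlier in this thread.
before = """The filehello.pdf is not valid, error(s) :
1.1 : Body Syntax error, Second line must contains at least 4 bytes greater than 127
1.0 : Syntax error, Object (9:0) at offset 326 does not end with 'endobj'.
"""
after = """The filehello.pdf is not valid, error(s) :
1.1 : Body Syntax error, Second line must contains at least 4 bytes greater than 127
2.4.3 : Invalid Color space, DestOutputProfile is missing
2.4.3 : Invalid Color space, DestOutputProfile is missing
1.4.1 : Trailer Syntax error, The trailer dictionary doesn't contain ID
7.1 : Error on MetaData, Missing Metadata Key in catalog
"""
print(newly_reported(before, after))  # ['1.4.1', '2.4.3', '7.1']
```

Swapping the arguments lists the codes that went away instead, which for the two reports above is just 1.0.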

@lindi2
Author

lindi2 commented Apr 7, 2014

1.1 : Body Syntax error, Second line must contains at least 4 bytes greater than 127
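As background (an assumption on my part, not something confirmed in this thread): this message appears to correspond to the PDF/A-1 file header rule, under which the "%PDF-x.y" line is immediately followed by a comment line holding at least four bytes with values above 127, so that tools treat the file as binary. A small Python sketch of that check, with a hypothetical function name and made-up sample headers:

```python
def has_binary_comment(pdf_bytes: bytes) -> bool:
    """Heuristic: is the second line a '%' comment that contains at least
    four bytes with values greater than 127?"""
    lines = pdf_bytes.splitlines()
    if len(lines) < 2 or not lines[0].startswith(b"%PDF-"):
        return False
    second = lines[1]
    if not second.startswith(b"%"):
        return False
    # Iterating over a bytes object yields integers in the range 0..255.
    return sum(1 for byte in second[1:] if byte > 127) >= 4


# Header that goes straight from the version line to the first object.
print(has_binary_comment(b"%PDF-1.4\n1 0 obj\n"))  # False

# Header with a binary comment line; these four bytes are just an example.
print(has_binary_comment(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"))  # True
```

If that reading is right, a fix on the generator side would amount to writing such a comment line right after the version header.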

@ashkulz
Member

ashkulz commented Apr 7, 2014

If you can give me explanations like that, I think we can crack this very soon 👍 I've added a potential fix in the issue1573 branch, am compiling and testing it right now.

@lindi2
Author

lindi2 commented Apr 7, 2014

I already made such a patch and tested that it works. Working on more patches at the moment.

@ashkulz
Member

ashkulz commented Apr 7, 2014

Great 👍 I'll wait for a pull request from you, then.

@lindi2
Author

lindi2 commented Apr 7, 2014

Is there some way to incrementally rebuild only changed files? "scripts/build.py wheezy-amd64" doesn't seem to pick up my changes to qt/src/gui/painting/qprintengine_pdf.cpp

@ashkulz
Member

ashkulz commented Apr 7, 2014

You'll have to do it manually, I'm afraid. What I do is run "make install" inside static-build/wheezy-amd64/qt_build and "make" inside static-build/wheezy-amd64/app. Having support for incremental builds is on my TODO list.

@lindi2
Author

lindi2 commented Apr 7, 2014

Couldn't figure out how to do incremental builds so had to wait a while for full builds. Does wkhtmltopdf/qt#6 look ok to you?

@ashkulz
Member

ashkulz commented Apr 8, 2014

I've implemented incremental builds in 3995b6e, let me know if it works for you.

@pinx

pinx commented Jun 17, 2014

Using the latest version solved my problems. I haven't investigated, but I found a bug in the haml file that was the basis for the PDF: I had accidentally put the body tag inside the head tag. An HTML validator pointed out that I was generating invalid HTML. I fixed this bug and installed the latest wkhtmltopdf, and things are working for me.
Do you check the generated PDF with a PDF validator? I still got some warnings (although my PDF viewers no longer complain).

@ashkulz
Member

ashkulz commented Jun 18, 2014

The issues are with the upstream Qt PDF generation code, so will have to be fixed/pushed upstream first. Considering that it works fine in most viewers, it is not a high priority as such 😄


4 participants