Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wrong file embedding #176

Closed
ousia opened this issue Jun 8, 2018 · 35 comments
Closed

wrong file embedding #176

ousia opened this issue Jun 8, 2018 · 35 comments

Comments

@ousia
Copy link
Contributor

ousia commented Jun 8, 2018

Using examples/embedded-import.py, I get the following output document: embedded-pymupdf.pdf

Only Acrobat Reader has an issue displaying (or accessing to) the content of the embedded document:

sumatrapdf-ok

evince-fine

acrobatreader-wrong

I get the similar results when using mutool portfolio: embedded-mupdf.pdf.

I reported the issue upstream. I suggested a simpler approach than the portfolio. I don’t know when this will be fixed.

In the meantime, would it be possible that simple file attachments are available, so that they also work with Adobe Acrobat?

I guess my question may be related to #174, regarding the implementation of FileAttachment annotation type.

@JorjMcKie
Copy link
Collaborator

  • As MuPDF has pointed out, (file-) name and description should be ascii only for the time being, when using the PDF EmbeddedFile technique. This is not the case in your example.
  • I tried out your file on Windows successfully with SumatraPDF, Nitro 5 PDF Reader and PDF-XChange Viewer. They all behave just fine like Adobe's Manual prescribes they should: Show a page of the containing file and offer a list of embedded files, from which these may be extracted - perfect.
  • Only Adobe's Reader deviates from its own guidelines and does not show the embedded file. This is independent from "correctly" choosing the name of the embedded file or its description (ascii-only vs. general unicode). So I guess, this feature is simply missing in Adobe Reader!
  • Let me comment on your request for a "simple" file attachment:
    • portfolio / EmbeddedFiles technique is something fairly equivalent to ZIP archives. It means embedding stuff on the document level. You can use any compression yourself before you store the content, btw. I may consider offering this as an option in PyMuPDF.
    • FileAttachment is something different: it is an annotation type and as such linked to a specific page, where it is usually indicated with an appropriate icon (like "PushPin"). Like all annotations, it can be added or removed by users who only have annotation permission (note necessarily the full access). Adding this type of thing is a feature still missing in PyMuPDF. What I can do so far is extracting and replacing existing FileAttachment content and changing some of its meta information.
    • I am in the process of implementing "Add Annotation" features to PyMuPDF. So far I am supporting the first 10 types - please look at the current docu for details. Missing from the relevant subset of all (theoretical) 24 are FileAttachment, Ink and Widgets (i.e. PDF fields). I don't know when I will be able to focus on your request - it is in competition with PDF form fields to be honest ... :-)

@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

@JorjMcKie, many thanks for your reply and sorry for mixing file embedding in document at document level and file attachment annotations (it was my fault, since I knew the difference).

The following document (include-attachments.pdf) shows what I mean. And Acrobat displays it fine:

show-attachments

For that, no FileAttachment annotations are required (as I incorrectly took for granted).

Just having a /PageMode /UseAttachments in the catalog dictionary does the work in the document above.

I hope this is something easy to implement.

@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

@JorjMcKie, on second thoughts, I would like to comment your previous comment.

First of all, I know that the implementation comes from MuPDF, so that the Python bindings have really little to do here.

  • I tried out your file on Windows successfully with SumatraPDF [...] They all behave just fine like Adobe's Manual prescribes they should: Show a page of the containing file and offer a list of embedded files, from which these may be extracted - perfect.

The first screenshot from my original report shows what SumatraPDF does. Evince is the second screenshot, and Acrobat the third.

Both SumatraPDF and Evince only display the list of embedded files (as in /PageMode /UseAttachments). To the best of my knowledge, none of them implement collections (I would say only Acrobat implements it). Only Acrobat fails, because it does something different.

  • Only Adobe's Reader deviates from its own guidelines and does not show the embedded file. [...] So I guess, this feature is simply missing in Adobe Reader!

It might be that Adobe failed to implement collections (“12.3.5. Collections”, from the PDF specification [sorry, I don’t know how to link to sections in a PDF document]) in Acrobat, both Reader and Professional versions.

But it might be that MuPDF is missing something in the implementation of what Adobe Acrobat sells as portfolios.

Collections are described as (in my opinion, something slightly different and more complex than /PageMode /UseAttachments):

The intent of portable collections is to present, sort, and search collections of related documents embedded in the containing PDF document [...]

A glimpse of the result may be displayed at https://acrobatusers.com/tutorials/how-create-pdf-portfolio (I didn’t watch the whole video).

All other PDF viewers display attached documents fine, because of the following remark from the PDF spec:

The file attachments comprising a collection shall be located in the EmbeddedFiles name tree.

Only Acrobat deals with the /Collection entry in the catalog dictionary.

Why does Acrobat fail with the portfolios generated with MuPDF?

My personal take is that, although all entries in a /Collection dictionary are optional (as stated here), an empty dictionary may be the cause of the wrong display.

Besides avoiding the issues (no matter where they come from), implementing the display of the embedded files is a much more consistent and uncomplicated approach that using collections.

Of course, this is something inherited from MuPDF.

I wanted to comment the issue with the common assumption among technical people implementing PDF solution: “Adobe doesn’t follow its own specification”.

Are there inconsistencies between the PDF specification and Acrobat? Yes, but there may be not so many as we assume.

Sorry, this is not directed to you. In this particular case, we don’t know whether the implementation from MuPDF is correct (when MuPDF isn’t simply pursuing another feature).

I remember a conversation with a PDF engineer (since this is a public issue, this may be a good description) about an inconsistency in what the PDF spec described and how Acrobat dealt with that feature.

At the end of our discussion, I realized that the whole problem was that the engineer was simply not wanting to consider the requirement stated in another part of the PDF documentation.

@JorjMcKie
Copy link
Collaborator

ok, understood.
Investigated the difference to your previous example:

  • the same technique is used in either case (i.e. EmbeddedFiles)
  • your new example contains "/PageMode/UseAttachments", whereas the 1st one has "/PageMode/UseNone",
  • your new example contains no "/Limits" entry in the "/EmbeddedFiles", whereas the first one has it.
  • I did the following manual adjustments with your first example:
    1. Replaced the /PageMode property with "/UseAttachments". Result: AcrobatReader shows a file attachment pane (without entries) on open
    2. Then removed the "/Limits" from "/EmbeddedFiles" and ... HEUREKA ... AdobeReader behaved like the other readers.
    3. Left out the first step and only removed the "/Limits" entry ==> AdobeReader also showed the attachments as it should (in a little bit different layout than with /UseAttachments)

Conclusion / Action
Obviously, I only need to remove the "/Limits" entry and AdobeReader is a good boy again. Let me make these changes to PyMuPDF to confirm that this is all that's needed.

@JorjMcKie
Copy link
Collaborator

addendum:
In iii. above, I did not touch the (empty) /Collections entry, and it still worked.

@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

The document include-attachments.pdf was generated with TeX.

I removed the /Limits from embedded-pymupdf.pdf entry and it didn’t work for me: no-limit.pdf.

Steps:

  1. Uncompress PDF file with mutool clean -d embedded-pymupdf.pdf temporary.pdf.
  2. Removed the line with /Limits from temporary.pdf using SciTE.
  3. Fixed broken XRef with mutool clean temporary.pdf no-limit.pdf.

Versions:

  • MuPDF is version 1.13.
  • Acrobat Reader is version 9 (the highest for Linux).

Could you share the PDF document that works for you?

@JorjMcKie
Copy link
Collaborator

this-works-pretty.pdf
I am on Windows. The Adobe Reader I am using is
grafik
It does read the last file you sent!

Your version seems definitely broken then. Semantically, there is no difference in having the complete /EmbeddedFiles entry in the /Catalog object or having it as a seperate object with an indirect reference in the /Catalog (as in your working example).

@JorjMcKie
Copy link
Collaborator

BTW for future manual PDF manipulations, you can conveniently use PyMuPDF:

  • locate the object to manipulate and read its source using its xref: source = doc._getXrefString(xref)
  • manipulate source using Python string features (like .replace)
  • write result back into the PDF: doc._updateObject(xref, source)
  • save the document as usual

This will calculate all the object locations in the file for you.

@JorjMcKie
Copy link
Collaborator

Here is another successfully "cleaned" example (the MuPDF one):
cleaned-embedded-mupdf.pdf
I have developed a little utility method doc.embeddedFileClean() which deletes the /Limits``entry. This one produced the cleaned example. For the time being (MuPDF did not change anything), it seems the adequate way to "clean" a PDF before saving it.

@JorjMcKie
Copy link
Collaborator

You are right with your other admonition (re: storing attachment w/o compression).
This can however be overcome by actively deflating the resulting PDF (deflate = True argument in PyMuPDF-save).

@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

Your version seems definitely broken then. Semantically, there is no difference in having the complete /EmbeddedFiles entry in the /Catalog object or having it as a separate object with an indirect reference in the /Catalog (as in your working example).

@JorjMcKie, many thanks for your help and your fix.

If my Acrobat version didn’t allow indirect objects, it would be totally broken (I agree). But it might be something different.

Could you post screenshots with Acrobat displaying both cleaned-embedded-mupdf.pdf and this-works-fine.pdf?

With my Acrobat, this-works-pretty.pdf is displayed fine:

this-works-pretty

cleaned-embedded-mupdf.pdf isn’t displayed fine:

cleaned-embedded-mupdf

Even if I want, I cannot get attachments displayed (I have access only to the navigation panels for pages, layers and comments [as Acrobat calls them]).

no-attachments

At least, I need to remove the /Collection to get the standard view:

standard-view

And there I have access to the attachments navigation panel:

attachment-panel

Replacing /PageMode /UseNone with /PageNone /UseAttachments makes it easier not to overlook that the PDF document contains attachments.

It might be that Acrobat in its newer versions isn’t so strict about how to add a /Collection. But I think the empty /Collection dictionary should be removed.

You are right with your other admonition (re: storing attachment w/o compression).

I’m glad to read it helped, but I cannot recall where I commented it.

@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

A comment on your cleaned-embedded-mupdf.pdf about file names.

Here is a sample:

    /EmbeddedFiles <<
      /Names [ (a) <<
            /CI <<
            >>
            /EF <<
              /F 15 0 R
            >>
            /F (a)
            /UF (a)
            /Desc (a)
            /Type /Filespec
          >> ]
    >>

/F and /UF are file name and Unicode file name . The value in the sample above is (a.pdf).

Here you have the output from this little improvement (sorry, if that was obvious to you):

name-attachments

@JorjMcKie
Copy link
Collaborator

Yeah, I am actually not sure, why MuPDF stated this comment about ASCII-only entries. Might be they just didn't want to be bothered with complaints from people not finding their successfully embedded files - just because they made errors with encoding ... I don't know.
You are right, I knew about the /UF entry, but never mind ;-)

Here is the screen print:
grafik

@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

Could you also provide a screenshot from this-works-pretty.pdf? (I would like to compare both.)

About the /F and /UF entries, I didn’t mention them because of Unicode. They need to include the complete file name (in that case, a.pdf).

@JorjMcKie
Copy link
Collaborator

I might add another utility function that also removes any empty /Collection entry from the catalog if you think that helps. Or include this feature in that new doc.embeddedFileClean().
grafik

@JorjMcKie
Copy link
Collaborator

What I currently do in PyMuPDF, is inserting identical stuff in both entries, /F and /UF. In an effort not to overload the PyMuPDF programmer API.

Your remark on not compressing embedded files was made in the issue system of MuPDF (bugzilla).

@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

I’m interested in having a fix for MuPDF itself.

I hope you agree with me that it is weird to fix in bindings what is wrong in the parent library (or whichever the name may be).

I can investigate the issue further and see whether /Collection should be removed or not.

Sorry, but I think that the minimal difference isn’t worth the more complex implementation (if this is all what portfolios have to show).

You insert identical stuff in the three entries /F, /UF and also /Desc. /Desc is right, but /F and /UF should get a proper file name.

Of course, it is fine that both entries have ASCII only in the file names, but they need to have their proper extensions.

I suspect that if files have extensions, Acrobat may be able to display them in a different way.

@JorjMcKie
Copy link
Collaborator

No, I put in /Desc whatever has been entered in doc.embeddedFileAdd(). It's only /F and /UF that I (and MuPDF) treat as being identical.
And whatever was specified in the PyMuPDF doc.embeddedFileAdd() method, goes in there (with or without any specified extension).
The complete hierarchy is such:

  1. name is mandatory and must be different from all existing entries.
  2. If filenameis not specified, I put the content of name into it. The final content of this entry is then used as /F and as /UF.
  3. If desc is not specified, I take the filename content.

@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

I see, I don’t know how you generated the PDF document, but it seems you invoked doc.embeddedFileAdd("a").

How about generating and posting the resulting PDF document (plus screenshot) with doc.embeddedFileAdd("a", "a.pdf")?

Again, if portfolio is working on Acrobat DC, it should be able to display PDF documents in a different way.

@JorjMcKie
Copy link
Collaborator

I think I disagree with your use of the word "complexity":
What is implemented in MuPDF's CLI mutool portfolio is not more and not less than handling embedded files, just like the tool you have been using. The only spots where differences between both exist are the /Collection and the /Limits entries.

The fullblown "portfolio" capability BTW is present in the C-library of MuPDF - but it is not used in mutool (nor in PyMuPDF). It provides an extended way of storing meta-information for embedded files.

@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

What is implemented in MuPDF's CLI mutool portfolio is not more and not less than handling embedded files, just like the tool you have been using. The only spots where differences between both exist are the /Collection and the /Limits entries.

Well, I always thought that portfolios were based in /Collection.

But in any case, I wonder why /Collection and /Limits are needed, if /PageMode /UseAttachments would be enough (and [you say] the former entries don”t provide extra functionality).

“Complexity” could be in this case adding code (in the PDF document) that has no significant gain (and it isn’t clear that is valid PDF code [at least, to me]).

@JorjMcKie
Copy link
Collaborator

After reading the Adobe manual carefully again: "(Intermediate and leaf nodes only; required)" it seems to me that indeed there exists some ambiguity:
The /EmbeddedFiles name tree is flat i.e. all entries have the same hierarchy level. If you view these entries as leaves, then you must insert a /Limits entry. If they are regarded as root entries, then not. So maybe, the Adobe Reader just implements how this should be interpreted ...

@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

In any case, /Collection is also an issue, because it makes no sense (at least, I don’t get it) to add it as empty dictionary.

@JorjMcKie
Copy link
Collaborator

I read about /Collection and about the full portfolio feature in the Adobe document.
And when we agree to focus on this book, we also agree that this indeed is overly complex.
(The same applies to the linear file format of PDF, by the way - if you read MuPDF's C-sources you find several complaints by the Artifex guys ...)

@JorjMcKie
Copy link
Collaborator

Can you confirm that removing /Limits and /Collection solves your problem?
If yes, I will enter code in PyMuPDF that removes those entries. It will happen automatically with every doc.save() and doc.write() and only requires 2 handful of lines of code.
I will also formulate a suggestion to MuPDF to make that change to the respective source module (it's only one).

@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

And when we agree to focus on this book, we also agree that this indeed is overly complex.

This is the main reason why one should be careful before stating that the PDF spec is inconsistent.

In some cases, the spec is wrong. But it might be that all wrongdoings come from the specification.

(The same applies to the linear file format of PDF, by the way - if you read MuPDF's C-sources you find several complaints by the Artifex guys ...)

I have paged through the PDF spec (the previous ISO version) and I wish das käme mir Spanisch vor (I must admit that it’s all Greek to me [pun intended 😉]).

I think the experience with the implementation of the PDF spec may be similar to the translation of a book. It is way too easy to criticize the translation of a passage. Of course, one may be fully right. The not-so-easy task is translating the full book, bearing the responsibility of making the decisions.

Of course, I guess it may be easy to criticize Adobe for inconsistencies in an over thousand pages specification. These inconsistencies should be removed, I totally agree. But it is a totally different task to write the full spec.

Can you confirm that removing /Limits and /Collection solves your problem?
If yes, I will enter code in PyMuPDF that removes those entries. It will happen automatically with every doc.save() and doc.write() and only requires 2 handful of lines of code.

I can confirm that removing /Limits and /Collection solved my problem with the file I checked.

I will also formulate a suggestion to MuPDF to make that change to the respective source module (it's only one).

Many thanks for your help with this. It will fix a long-standing bug in mutool portfolio.

But, please, don’t forget that in this case /PageMode /UseAttachments is essential (also for PyMuPDF).

@JorjMcKie
Copy link
Collaborator

OK, just finished the update. As indicated, it will always modify the /Catalog of PDF about to be saved in the following way:

  • /Catalog of PDF to be saved:
<</Type/Catalog/Pages 2 0 R/Collection<<>>/Names<<
/EmbeddedFiles<</Names[(name)<</CI<<>>/EF<</F 6 0 R>>/F(name)/UF(name)/Desc(name)
/Type/Filespec>>]
/Limits[(name)(name)]>>>>>>
  • Issuing doc.save(...) will modify the /Catalog entry like this:
<</Type/Catalog/Pages 2 0 R/PageMode/UseAttachments/Names<<
/EmbeddedFiles<</Names[(name)<</CI<<>>/EF<</F 6 0 R>>/F(name)/UF(name)/Desc(name)
/Type/Filespec>>]>>>>>>

I will now upload the corresponding version 1.13.9 and create the respective wheels.

@JorjMcKie
Copy link
Collaborator

Cómo saves tanto Alemán? Obviamente vives en una zona horaria compatible con la mia (Venezuela) y no en Alemania ...!

JorjMcKie added a commit that referenced this issue Jun 9, 2018
@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

OK, just finished the update. [...]
I will now upload the corresponding version 1.13.9 and create the respective wheels.

From what I read in your PDF code, it looks fine to me. About your patch, I cannot say anything (again, it is all Greek to me, since I cannot code).

Many thanks for your fix (I almost forgot to write).

I don’t know whether you intended to keep the issue open. It would be great, if you could report the progress of the fix in MuPDF’s C library. (But that doesn’t require the issue to be open.)

And many thanks for contributing the fixing code to the original library.

Cómo saves tanto Alemán? Obviamente vives en una zona horaria compatible con la mia (Venezuela) y no en Alemania ...!

Das stimmt zum Teil. Zu meiner Zeitzone gehört auch Deutschland, aber seit fast zwei Jahrzehnten wohne ich in Deutschland nicht mehr. (Selbstverständlich befinde ich mich jetzt in Spanien.)

Entschuldingen Sie dann mein Deutsch. Nach so vielen Jahren, habe ich es nur verlernt.

@JorjMcKie
Copy link
Collaborator

Wow Pablo, meinen Respekt! Dein Deutsch ist viel besser als mein Spanisch! Ich lebe seit 2009 in Venezuela und bin mit meinem Spanisch sehr unzufrieden ...

Was die anderen Dinge betrifft:
Por favor, prueba el fix. Los Linux wheels están listos. Si todo funciona, por favor digame o cierre el issue.
Yo voy a escribir algo en el sistema de Artifex / Bugzilla.

@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

[Si no te parece mal, uso tú en español, porque lo usas en alemán.]

Jorj, tu español es muy bueno. Escribes con acentos, algo que no hace mucha gente en España.

Puedes engañarte pensando que tu español no es bueno. Tienes razón. Imagínate que en una página de un libro encuentras una pequeña mancha. Te puedes obsesionar con la mancha, pero puedes ver la página entera. La mancha seguirá ahí, pero desaparece su importancia (y se deja de ver) cuando se contempla la página entera.

I forgot to mention, Fedora is going to ship PyMuPDF soon: https://bugzilla.redhat.com/show_bug.cgi?id=1586324.

Once I compile the new release, I will confirm how it went with the fixed version.

@ousia
Copy link
Contributor Author

ousia commented Jun 9, 2018

I have just checked it myself and it works perfectly fine (just invoking python3 embedded-import.py aæ.pdf aæ.pdf).

Many thanks for your help and your excellent work, Jorj.

@ousia ousia closed this as completed Jun 9, 2018
@ousia
Copy link
Contributor Author

ousia commented Jun 12, 2018

@JorjMcKie, many thanks for your comment at Artifex.

Sorry for asking that, but doesn’t /PageMode /UseAttachments belong to that proposal?

@JorjMcKie
Copy link
Collaborator

Oops, sorry about that - you are right.
I have added this just a minute ago.

BTW, I am looking at enhancing the embedded-import.py script. I think it should be more flexibly controllable via its command line options.
I think it should offer all three options -n(ame) -f(ilename) -d(esc) with filename being the only mandatory one.
Would you agree?

@ousia
Copy link
Contributor Author

ousia commented Jun 12, 2018

I am the one to blame about forgetting /PageMode /UseAttachments.

Many thanks for reporting that at Artifex.

I agree with the three options, being only one mandatory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants