
Missing base64_decode on PDF-Keyword-Reading (AppleKeywords) #32

Open
stevenbuehner opened this issue Oct 8, 2016 · 0 comments

stevenbuehner commented Oct 8, 2016

When a PDF has keywords added by Apple software, some of them are returned incorrectly, especially when non-ASCII characters are involved. I added a Test.pdf to demonstrate this.

There are three "normal" keywords labeled Test1, Test2, and Test3. A fourth keyword is a bit more complex. It contains a comma (!) and some German umlauts: Base64 encoded äöü and comma, foo bar.

This is the exiftool XML output of the Keywords and AppleKeywords sections:

<PDF:Keywords>
  <rdf:Bag>
   <rdf:li>Test1</rdf:li>
   <rdf:li>Test2</rdf:li>
   <rdf:li>Base64 encoded äöü and comma</rdf:li>
   <rdf:li>foo bar</rdf:li>
   <rdf:li>Test3</rdf:li>
  </rdf:Bag>
 </PDF:Keywords>
 <PDF:AppleKeywords>
  <rdf:Bag>
   <rdf:li>Test1</rdf:li>
   <rdf:li>Test2</rdf:li>
   <rdf:li rdf:datatype='http://www.w3.org/2001/XMLSchema#base64Binary'>
/v8AQgBhAHMAZQA2ADQAIABlAG4AYwBvAGQAZQBkACAA5AD2APwAIABhAG4A
ZAAgAGMAbwBtAG0AYQAsACAAZgBvAG8AIABiAGEAcg==
</rdf:li>
   <rdf:li>Test3</rdf:li>
  </rdf:Bag>
 </PDF:AppleKeywords>
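
To show what the base64 item above actually contains, here is a minimal sketch in Python (used only for illustration, this is not PHPExiftool code). Decoding the payload by hand suggests it is UTF-16 with a leading byte-order mark, so a plain base64 decode alone would not yield a readable string; a charset conversion is needed as well. Whether every such value uses UTF-16 is an assumption based on this single sample.

```python
import base64

# Payload copied verbatim from the rdf:li element above (line break included).
payload = """
/v8AQgBhAHMAZQA2ADQAIABlAG4AYwBvAGQAZQBkACAA5AD2APwAIABhAG4A
ZAAgAGMAbwBtAG0AYQAsACAAZgBvAG8AIABiAGEAcg==
"""

# b64decode ignores characters outside the base64 alphabet (such as
# newlines) unless validate=True is passed.
raw = base64.b64decode(payload)

# The first two bytes are a byte-order mark; the "utf-16" codec uses it
# to pick the byte order and drops it from the result.
keyword = raw.decode("utf-16")

print(raw[:2])   # the byte-order mark
print(keyword)   # the original keyword, comma and umlauts intact
```

Note that the decoded value is a single keyword that still contains the comma, unlike the split entries in the PDF:Keywords section.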

As you can see, there are two problems involved:

  1. In the "PDF:Keywords" section, the comma inside the keyword itself is treated as a separator, so the keyword is split into two separate keywords. That is an exiftool problem and not part of this issue.
  2. The "PDF:AppleKeywords" section is recognized correctly, but exiftool base64-encodes the keyword containing the umlauts. Up to here everything is fine. The issue is that PHPExiftool does not decode the string and returns the raw base64 string.
    Instead, I would expect PHPExiftool to recognize the attribute rdf:datatype='http://www.w3.org/2001/XMLSchema#base64Binary' and decode the string automatically.

If I have seen this right, this behaviour is already implemented for Mono types (see source), but I guess it also needs to be implemented for Multi types.
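
The behaviour I would expect for multi-valued tags can be sketched as follows. This is Python for illustration only and does not reflect PHPExiftool's actual classes; `decode_bag_items` is a hypothetical helper name, and the fallback to UTF-8 when no byte-order mark is present is my assumption.

```python
import base64
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BASE64_TYPE = "http://www.w3.org/2001/XMLSchema#base64Binary"


def decode_bag_items(bag):
    """Return the text of every rdf:li in a bag, decoding base64 items.

    Hypothetical helper: checks the rdf:datatype attribute per item
    instead of only once for a single-valued tag.
    """
    values = []
    for li in bag.findall(f"{{{RDF_NS}}}li"):
        text = (li.text or "").strip()
        if li.get(f"{{{RDF_NS}}}datatype") == BASE64_TYPE:
            raw = base64.b64decode(text)
            # Assumption: a byte-order mark means UTF-16, otherwise UTF-8.
            if raw[:2] in (b"\xfe\xff", b"\xff\xfe"):
                text = raw.decode("utf-16")
            else:
                text = raw.decode("utf-8", errors="replace")
        values.append(text)
    return values


# The AppleKeywords bag from this issue, with the rdf namespace declared
# so that the fragment parses on its own.
SAMPLE = """<rdf:Bag xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rdf:li>Test1</rdf:li>
 <rdf:li>Test2</rdf:li>
 <rdf:li rdf:datatype='http://www.w3.org/2001/XMLSchema#base64Binary'>
/v8AQgBhAHMAZQA2ADQAIABlAG4AYwBvAGQAZQBkACAA5AD2APwAIABhAG4A
ZAAgAGMAbwBtAG0AYQAsACAAZgBvAG8AIABiAGEAcg==
</rdf:li>
 <rdf:li>Test3</rdf:li>
</rdf:Bag>"""

keywords = decode_bag_items(ET.fromstring(SAMPLE))
print(keywords)
```

The point of the sketch is that the datatype check happens per list item, so plain and base64-encoded entries can be mixed within the same bag.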

This is the test file mentioned above: Test.pdf
