Possible encoding issue in PDF metadata #4564

Closed
@griai

Description of the bug

I am not sure whether we hit an actual bug here or whether the behavior is intended.
I have a PDF document from which we extract, among other things, some metadata. We noticed that our PyMuPDF-based extraction occasionally fails for specific metadata values.

For one such PDF file, I looked up the metadata using mutool info and got the following output:

> mutool info file.pdf
file.pdf:

PDF-1.4
Info object (1 0 R):
<</Creator(Canon iR-ADV C3325  PDF)/CreationDate(D:20200812164013+02'00')/Producer<FEFF00410064006F00620065002000500053004C00200031002E0033006500200066006F0072002000430061006E006F006E0000>>>
Pages: 1

Retrieving info from pages 1-1...
Mediaboxes (1):
        ...

Images (9):
        ...

The problematic value is "Producer", which is given as a hex string containing UTF-16 encoded text with a byte order mark ("FE FF" as the first two bytes, i.e. big-endian). The encoded string is terminated by two NUL bytes ("00 00").

Opening this file with PyMuPDF and reading the metadata dictionary results in the following:

In [1]: import pymupdf

In [2]: doc = pymupdf.open("file.pdf")

In [3]: doc.metadata
Out[3]: 
{'format': 'PDF 1.4',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'Canon iR-ADV C3325  PDF',
 'producer': 'Adobe PSL 1.3e for Canon\udcc0\udc80',
 'creationDate': "D:20200812164013+02'00'",
...}

We noticed that the decoded string ends with two lone UTF-16 low surrogate code points, which is why our subsequent encoding step did not behave as expected. I know that Python has the "surrogateescape" error handler (see, e.g., https://peps.python.org/pep-0383/ ), which PyMuPDF might also be using when decoding the bytes. However, I am wondering where the additional bytes come from in the first place.
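
To make the question concrete, the two trailing code points can be mapped back to raw bytes. This is a minimal sketch, assuming the string came out of a UTF-8 decode with the "surrogateescape" handler (an assumption, not something confirmed from the PyMuPDF sources). Under that assumption the trailing bytes are C0 80, which happens to be the overlong two-byte form that "modified UTF-8" uses for U+0000. Whether MuPDF actually emits that form here is a guess.

```python
# Value as reported by doc.metadata["producer"] above.
producer = "Adobe PSL 1.3e for Canon\udcc0\udc80"

# A strict UTF-8 encode rejects lone surrogates.
try:
    producer.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# surrogateescape maps U+DC80..U+DCFF back to the raw bytes 0x80..0xFF.
raw = producer.encode("utf-8", errors="surrogateescape")
tail = raw[-2:]

# Assumption: the original decode also used surrogateescape. Decoding the
# two bytes C0 80 that way reproduces exactly the observed surrogates.
roundtrip = b"\xc0\x80".decode("utf-8", errors="surrogateescape")

print(strict_ok, tail.hex(), roundtrip == "\udcc0\udc80")
```

If that assumption holds, the "additional bytes" would not be extra data in the file but a re-encoding of the terminating NUL.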

Note that a normal UTF-16 decoding of the given bytes produces the following:

In [4]: b = (b"\xFE\xFF\x00\x41\x00\x64\x00\x6F\x00\x62\x00\x65\x00\x20\x00\x50\x00\x53\x00\x4C\x00\x20\x00\x31"
   ...:      b"\x00\x2E\x00\x33\x00\x65\x00\x20\x00\x66\x00\x6F\x00\x72\x00\x20\x00\x43\x00\x61\x00\x6E\x00\x6F"
   ...:      b"\x00\x6E\x00\x00")

In [5]: b.decode("utf-16")
Out[5]: 'Adobe PSL 1.3e for Canon\x00'

The string is intact and shows no surrogate characters. However, there is still an explicit NUL character at the end, which is not desired (but easy to deal with).
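
Dealing with the trailing NUL on our side could look like the following sketch. The helper name `decode_utf16_text` is hypothetical, and it only covers the BOM-prefixed UTF-16 case shown here, not other PDF string encodings.

```python
import codecs


def decode_utf16_text(raw: bytes) -> str:
    """Decode a BOM-prefixed UTF-16 byte string and drop trailing NULs."""
    if not raw.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        raise ValueError("expected a UTF-16 byte order mark")
    # The "utf-16" codec consumes the BOM and takes the byte order from it.
    return raw.decode("utf-16").rstrip("\x00")


# Hex string of the Producer entry, copied from the mutool output above.
hex_value = (
    "FEFF00410064006F00620065002000500053004C00200031002E0033006500"
    "200066006F0072002000430061006E006F006E0000"
)
producer = decode_utf16_text(bytes.fromhex(hex_value))
print(repr(producer))
```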

Also, a hex-editor view of the original file shows nothing but the NUL bytes at the end of the encoded string:

Image

So, my question is: Is this behavior intended, or at least expected?

In the meantime, we are sanitizing the strings on our side, but I would be interested to know what happened here. I apologize if everything is in order and this is to be expected.
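
For illustration, the kind of sanitizing meant here can be sketched as follows. `sanitize_metadata` is a hypothetical helper on our side, not part of PyMuPDF, and the sample dictionary is made up to mirror the output above. It simply drops lone surrogates and NUL characters from every string value.

```python
def sanitize_metadata(meta: dict) -> dict:
    """Return a copy of meta with lone surrogates and NULs removed from str values."""
    cleaned = {}
    for key, value in meta.items():
        if isinstance(value, str):
            # Code points U+D800..U+DFFF cannot be encoded as strict UTF-8.
            value = "".join(ch for ch in value if not 0xD800 <= ord(ch) <= 0xDFFF)
            value = value.replace("\x00", "")
        cleaned[key] = value
    return cleaned


# Sample input mirroring the metadata dictionary shown earlier.
meta = {
    "creator": "Canon iR-ADV C3325  PDF",
    "producer": "Adobe PSL 1.3e for Canon\udcc0\udc80",
    "encryption": None,
}
clean = sanitize_metadata(meta)
print(clean["producer"].encode("utf-8"))
```

Dropping the characters loses information, so this only suits cases like ours where the affected code points are known to be padding.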

How to reproduce the bug

Unfortunately, the actual document is customer data that I am not allowed to share, so I cannot provide a working example. I have tried to include all the relevant details in the description above.

PyMuPDF version

1.26.0

Operating system

Linux

Python version

3.12
