Description
Description of the bug
I am not sure whether we hit an actual bug here or whether the behavior is intended.
I have a PDF document from which we (among other things) extract some metadata. We noticed that our PyMuPDF-based extraction occasionally fails for specific metadata values, although this happens rarely.
For one such PDF file, I looked up the metadata using mutool info and got the following output:
> mutool info file.pdf
file.pdf:
PDF-1.4
Info object (1 0 R):
<</Creator(Canon iR-ADV C3325 PDF)/CreationDate(D:20200812164013+02'00')/Producer<FEFF00410064006F00620065002000500053004C00200031002E0033006500200066006F0072002000430061006E006F006E0000>>>
Pages: 1
Retrieving info from pages 1-1...
Mediaboxes (1):
...
Images (9):
...
The problematic value is the "Producer", which seems to be a UTF-16 (big-endian) encoded string with a byte order mark ("FE FF" as the first two bytes). This encoded string is terminated by two NULL bytes ("00 00").
Opening this file with PyMuPDF and reading the metadata dictionary results in the following:
In [1]: import pymupdf
In [2]: doc = pymupdf.open("file.pdf")
In [3]: doc.metadata
Out[3]:
{'format': 'PDF 1.4',
'title': '',
'author': '',
'subject': '',
'keywords': '',
'creator': 'Canon iR-ADV C3325 PDF',
'producer': 'Adobe PSL 1.3e for Canon\udcc0\udc80',
'creationDate': "D:20200812164013+02'00'",
...}
The decoded string ends with two lone UTF-16 low surrogates (U+DCC0 and U+DC80), which is why our subsequent encoding step did not behave as expected. I know that Python has the "surrogateescape" error handler (see, e.g., https://peps.python.org/pep-0383/ ), which PyMuPDF might also use when decoding the bytes. However, I am wondering where the additional bytes come from in the first place.
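To narrow this down, here is a minimal sketch that maps the two trailing surrogates back to the raw bytes they stand for under "surrogateescape". It only uses the Python standard library and the value shown in the output above. The bytes it recovers, C0 80, are the overlong two-byte form that "modified UTF-8" uses for U+0000. My guess, which I have not verified against the MuPDF or PyMuPDF sources, is that the trailing U+0000 is encoded this way somewhere before the string reaches Python.

```python
# The two trailing code points PyMuPDF reported for 'producer'.
tail = "\udcc0\udc80"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN,
# so encoding with the same handler recovers the raw bytes.
raw = tail.encode("utf-8", errors="surrogateescape")
print(raw)  # b'\xc0\x80'

# A strict UTF-8 decoder rejects these bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print("strict UTF-8 decode succeeded:", strict_ok)  # False

# Decoding with surrogateescape reproduces exactly what we see in doc.metadata.
roundtrip = raw.decode("utf-8", errors="surrogateescape")
print(roundtrip == tail)  # True
```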
Note that a normal UTF-16 decoding of the given bytes produces the following:
In [4]: b = (b"\xFE\xFF\x00\x41\x00\x64\x00\x6F\x00\x62\x00\x65\x00\x20\x00\x50\x00\x53\x00\x4C\x00\x20\x00\x31"
   ...:      b"\x00\x2E\x00\x33\x00\x65\x00\x20\x00\x66\x00\x6F\x00\x72\x00\x20\x00\x43\x00\x61\x00\x6E\x00\x6F\x00\x6E\x00\x00")
In [5]: b.decode("utf-16")
Out[5]: 'Adobe PSL 1.3e for Canon\x00'
The string is intact and does not contain the surrogate characters. However, there is still an explicit NUL character at the end, which is not desired (but easy to deal with).
A hex-editor view of the original file also shows nothing except the NULL bytes at the end of the encoded string.
So, my question is: Is this behavior intended? Or at least expected?
In the meantime, we try to sanitize the strings on our side, but I would be interested to know what happened here. And I apologize if everything is in order and to be expected like this.
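For reference, our interim sanitizing looks roughly like the sketch below. The helper name `sanitize_metadata_value` is our own (hypothetical, not part of PyMuPDF), and it assumes that lone surrogates and trailing NUL characters never carry meaningful content in metadata values.

```python
def sanitize_metadata_value(value: str) -> str:
    """Hypothetical helper: drop lone surrogates and trailing NUL characters."""
    # Lone surrogates (U+D800..U+DFFF) cannot be encoded as strict UTF-8.
    without_surrogates = "".join(
        ch for ch in value if not 0xD800 <= ord(ch) <= 0xDFFF
    )
    # Remove any NUL terminator that survived decoding.
    return without_surrogates.rstrip("\x00")


producer = "Adobe PSL 1.3e for Canon\udcc0\udc80"
clean = sanitize_metadata_value(producer)
print(repr(clean))  # 'Adobe PSL 1.3e for Canon'

# The cleaned value now encodes without errors.
encoded = clean.encode("utf-8")
```

This simply discards the offending code points; it does not try to reconstruct whatever bytes they originally represented.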
How to reproduce the bug
Because the actual document is customer data that I am not allowed to share, I unfortunately cannot provide a working example. I have tried to include all the relevant details in the description above.
PyMuPDF version
1.26.0
Operating system
Linux
Python version
3.12