Description of the bug
Attempts to open a .zip file using pymupdf.open() succeed, leading to unexpected results.
How to reproduce the bug
To reproduce, use this code:
import pymupdf
import tempfile
zipfile_content = b'PK\x03\x04\n\x00\x00\x00\x00\x00\x19U0[\xf40\x8b&\x1b\x00\x00\x00\x1b\x00\x00\x00\x08\x00\x1c\x00textfileUT\t\x00\x03\x92"\xc9h\x94"\xc9hux\x0b\x00\x01\x04\xf5\x01\x00\x00\x04\x14\x00\x00\x00This is a plain text file.\nPK\x01\x02\x1e\x03\n\x00\x00\x00\x00\x00\x19U0[\xf40\x8b&\x1b\x00\x00\x00\x1b\x00\x00\x00\x08\x00\x18\x00\x00\x00\x00\x00\x01\x00\x00\x00\xa4\x81\x00\x00\x00\x00textfileUT\x05\x00\x03\x92"\xc9hux\x0b\x00\x01\x04\xf5\x01\x00\x00\x04\x14\x00\x00\x00PK\x05\x06\x00\x00\x00\x00\x01\x00\x01\x00N\x00\x00\x00]\x00\x00\x00\x00\x00'
tmpfile = tempfile.NamedTemporaryFile(suffix='.zip', delete=True)
with open(tmpfile.name, 'wb') as f:
    f.write(zipfile_content)
with pymupdf.open(tmpfile.name) as doc:
    print(f"doc.page_count={doc.page_count}")
This (very short!) .zip file contains one plain text file.
The code executes cleanly and prints "doc.page_count=0".
Expectation: PyMuPDF would recognize that the file content is not a PDF and raise.
In this example, PyMuPDF does raise when called as pymupdf.open(tmpfile, filetype='pdf'). But:
(1) We'd expect it to fail even without specifying filetype, I'd hope...?
(2) With longer .zip files, it succeeds even when filetype='pdf' is specified; in the instance I tried, it reported 120 PDF pages. To be clear, that .zip file contained no PDF content. I can share it if needed, but I expect the behavior documented here to be problematic enough to merit a fix.
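Until this is resolved, a caller-side guard along these lines might help. It is only a rough sketch: it inspects the leading bytes of the file rather than trusting the suffix, and it says nothing about how PyMuPDF detects formats internally. The names sniff_file_kind and require_pdf and the 1024-byte probe size are my own assumptions, not part of PyMuPDF.

```python
# Hypothetical caller-side guard; sniff_file_kind / require_pdf are
# illustrative names and are NOT part of PyMuPDF.
ZIP_MAGICS = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")  # ZIP record signatures
PDF_MAGIC = b"%PDF-"


def sniff_file_kind(path, probe_size=1024):
    """Return 'zip', 'pdf', or 'unknown' based on the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(probe_size)
    # Check for ZIP first so an archive holding a stored PDF is not mistaken for one.
    if head.startswith(ZIP_MAGICS):
        return "zip"
    # Assumption: accept the PDF marker anywhere in the probed bytes,
    # since some PDFs carry a few junk bytes before the header.
    if PDF_MAGIC in head:
        return "pdf"
    return "unknown"


def require_pdf(path):
    """Return path unchanged if it looks like a PDF, else raise ValueError."""
    kind = sniff_file_kind(path)
    if kind != "pdf":
        raise ValueError(f"{path} does not look like a PDF (detected: {kind})")
    return path
```

With this in place, the reproduction above could call pymupdf.open(require_pdf(tmpfile.name)) and would get a ValueError for the .zip file instead of a zero-page document.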
PyMuPDF version
1.26.4
Operating system
macOS
Python version
3.13