
Dealing with Embedded Files

Jorj X. McKie edited this page Apr 29, 2017 · 7 revisions

Starting with v1.11.0 (based on MuPDF v1.11), PyMuPDF can deal with embedded files.

This feature (PDF 1.4 format) allows attaching arbitrary data or files to PDF documents. With PyMuPDF, such embedded data can be added, deleted, extracted and modified.

We have added some example scripts to the respective directory that demonstrate the use of this new feature.

Here we show some interactive sessions:

>>> import fitz
>>> doc = fitz.open("test.pdf")
>>> doc.embeddedFileCount                         # show number of embedded files
7
>>> for i in range(doc.embeddedFileCount):        # display info about them
        print(doc.embeddedFileInfo(i))


{'name': 'pdftest', 'file': 'pdftest', 'desc': 'pdftest', 'size': 609, 'length': 609}
{'name': 'umlaute?', 'file': 't-ink.pdf', 'desc': 'können wir Ùmláútê?', 'size': 2389, 'length': 2389}
{'name': 'testann.py', 'file': 'testann.py', 'desc': 'Beschreibung', 'size': 1222, 'length': 1222}
{'name': 'minpdf.py', 'file': 'minpdf.py', 'desc': 'minpdf.py', 'size': 1693, 'length': 1693}
{'name': 'mit Latin', 'file': 'latin.log', 'desc': 'mit Latin in der Beschreibung, S†áe!', 'size': 40, 'length': 40}
{'name': 'test1.pdf', 'file': 'test1.pdf', 'desc': 'test1.pdf', 'size': 65917, 'length': 65917}
{'name': 'ink-demo', 'file': 't-ink.pdf', 'desc': 'Test neues FileAdd', 'size': 2389, 'length': 2389}
>>>
>>> # change the description of one entry
>>> doc.embeddedFileSetInfo("mit Latin", None, "new description without problematic characters")
0
>>> for i in range(doc.embeddedFileCount):        # show what happened
        print(doc.embeddedFileInfo(i))


{'name': 'pdftest', 'file': 'pdftest', 'desc': 'pdftest', 'size': 609, 'length': 609}
{'name': 'umlaute?', 'file': 't-ink.pdf', 'desc': 'können wir Ùmláútê?', 'size': 2389, 'length': 2389}
{'name': 'testann.py', 'file': 'testann.py', 'desc': 'Beschreibung', 'size': 1222, 'length': 1222}
{'name': 'minpdf.py', 'file': 'minpdf.py', 'desc': 'minpdf.py', 'size': 1693, 'length': 1693}
{'name': 'mit Latin', 'file': 'latin.log', 'desc': 'new description without problematic characters', 'size': 40, 'length': 40}
{'name': 'test1.pdf', 'file': 'test1.pdf', 'desc': 'test1.pdf', 'size': 65917, 'length': 65917}
{'name': 'ink-demo', 'file': 't-ink.pdf', 'desc': 'Test neues FileAdd', 'size': 2389, 'length': 2389}
>>>
>>> # a new entry can be entered from arbitrary data (bytes or bytearray)
>>> doc.embeddedFileAdd(b"some arbitrary data", "new data", None, "we do not need files for this")
1
>>> for i in range(doc.embeddedFileCount):        # again show the result
        print(doc.embeddedFileInfo(i))


{'name': 'pdftest', 'file': 'pdftest', 'desc': 'pdftest', 'size': 609, 'length': 609}
{'name': 'umlaute?', 'file': 't-ink.pdf', 'desc': 'können wir Ùmláútê?', 'size': 2389, 'length': 2389}
{'name': 'testann.py', 'file': 'testann.py', 'desc': 'Beschreibung', 'size': 1222, 'length': 1222}
{'name': 'minpdf.py', 'file': 'minpdf.py', 'desc': 'minpdf.py', 'size': 1693, 'length': 1693}
{'name': 'mit Latin', 'file': 'latin.log', 'desc': 'new description without problematic characters', 'size': 40, 'length': 40}
{'name': 'test1.pdf', 'file': 'test1.pdf', 'desc': 'test1.pdf', 'size': 65917, 'length': 65917}
{'name': 'ink-demo', 'file': 't-ink.pdf', 'desc': 'Test neues FileAdd', 'size': 2389, 'length': 2389}
{'name': 'new data', 'file': 'new data', 'desc': 'we do not need files for this', 'size': 19, 'length': 19}
>>> 
>>> # new names must be unique:
>>> doc.embeddedFileAdd(b"some arbitrary data", "new data", None, "we do not need files for this")
Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    doc.embeddedFileAdd(b"some arbitrary data", "new data", None, "we do not need files for this")
  File "C:\Users\Jorj\AppData\Local\Programs\Python\Python36\lib\site-packages\fitz\fitz.py", line 335, in embeddedFileAdd
    return _fitz.Document_embeddedFileAdd(self, buffer, name, filename, desc)
Exception: Name already exists in embedded files
>>> 
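
Because adding a duplicate name raises an exception, it can help to check the existing names first. The following is a minimal sketch, not part of PyMuPDF: `embedded_names` and `add_if_new` are hypothetical helper names, and they rely only on the `embeddedFileCount`, `embeddedFileInfo` and `embeddedFileAdd` calls shown in the session above.

```python
def embedded_names(doc):
    """Return the 'name' of every embedded entry, in index order."""
    return [doc.embeddedFileInfo(i)["name"]
            for i in range(doc.embeddedFileCount)]


def add_if_new(doc, data, name, desc):
    """Embed `data` under `name` unless that name is already taken.

    Hypothetical helper: returns True if an entry was added,
    False if the name already existed (nothing is changed then).
    """
    if name in embedded_names(doc):
        return False
    # Same argument order as in the session: buffer, name, filename, desc.
    doc.embeddedFileAdd(data, name, None, desc)
    return True
```

With the document from the session, `add_if_new(doc, b"some arbitrary data", "new data", "we do not need files for this")` would return `False` instead of raising, since that name already exists.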

If an entry is of a supported file type, we can extract it and open it directly from memory:

>>> stream = doc.embeddedFileGet("test1.pdf")
>>> len(stream)
65917
>>> doc2 = fitz.open("pdf", stream)
>>> doc2.pageCount
1
>>> 
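
An extracted entry does not have to be opened in memory; its content can also be written to disk. This is a minimal sketch, assuming `embeddedFileGet` returns a bytes-like object, as the `len(stream)` call in the session above suggests. `save_embedded` is a hypothetical helper, not a PyMuPDF function.

```python
def save_embedded(doc, name, path):
    """Write the embedded entry `name` to the file `path`.

    Hypothetical helper: returns the number of bytes written.
    The caller chooses `path`, because entry names such as
    'umlaute?' are not always valid file names.
    """
    data = doc.embeddedFileGet(name)   # assumed to be bytes-like
    with open(path, "wb") as out:
        out.write(data)
    return len(data)
```

For the document in the session, `save_embedded(doc, "test1.pdf", "test1-copy.pdf")` would be expected to write the 65917 bytes reported for that entry; the output file name here is only an example.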