Hi,
First off, thanks a lot for this awesome library. It does a great job with most of the normal cases.
Recently, however, I came across a case (Word.Body.getOoxml() from Microsoft's Office-JS) where I couldn't find a way to parse or load the result into a docx object directly.
I receive an XML document (MIME type application/xml) with the following structure:
<?xml version="1.0" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">
<pkg:part pkg:name="/_rels/.rels" pkg:contentType="application/vnd.openxmlformats-package.relationships+xml" pkg:padding="512">
<pkg:xmlData>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>
</pkg:xmlData>
</pkg:part>
<pkg:part pkg:name="/word/_rels/document.xml.rels" pkg:contentType="application/vnd.openxmlformats-package.relationships+xml" pkg:padding="256">
<pkg:xmlData>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer1.xml"/>
...
<Relationship Id="rId9" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer2.xml"/>
</Relationships>
</pkg:xmlData>
</pkg:part>
<pkg:part pkg:name="/word/document.xml" pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml">
<pkg:xmlData>
<w:document ...>
<w:body>
<w:p w:rsidR="009606CA" w:rsidRDefault="009606CA">
...
</w:p>
...
</w:body>
</w:document>
</pkg:xmlData>
</pkg:part>
<pkg:part pkg:name="/word/footer2.xml" pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml">
<pkg:xmlData>
<w:ftr ...>
...
</w:ftr>
</pkg:xmlData>
</pkg:part>
...
<pkg:part pkg:name="/word/media/image1.jpg" pkg:contentType="image/jpeg" pkg:compression="store">
<pkg:binaryData>... base64 encoded image binary...
</pkg:binaryData>
</pkg:part>
...
<pkg:part pkg:name="/word/webSettings.xml" pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml">
</pkg:part>
</pkg:package>
It looked like a serialised form of the docx package, though I couldn't find any schema (XSD etc.) for <pkg:package>. So I attempted to implement support for it directly in python-docx, and got that working for my case.
The work-in-progress diff of the implementation: master...c-bik:load-from-opc-ooxml-support
Usage:
from docx import Document

if __name__ == '__main__':
    # XML package as returned by Word.Body.getOoxml() (content elided here)
    ooxml: bytes = b'<pkg:package>...</pkg:package>'
    # Passing the raw XML bytes relies on the work-in-progress branch linked above
    document = Document(ooxml)
    print([p.text for p in document.paragraphs])

Now I am wondering how this should really be done, so that I can open a pull request for such a feature. If this feature is useful or interesting for a future python-docx release, I can invest some time in turning it into an acceptable PR.
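Independently of the branch above, the same XML (a format commonly referred to as "Flat OPC") can in principle be repackaged as a regular .docx zip before it is handed to an unmodified python-docx. The following is only a minimal sketch of that idea using the standard library; the function name `flat_opc_to_docx` is hypothetical and is not part of python-docx or of the linked branch. It assumes that every part carries either a `pkg:xmlData` or a `pkg:binaryData` child, and it writes one `Override` entry per part into `[Content_Types].xml` rather than reproducing the `Default` entries of the original package.

```python
import base64
import io
import zipfile
import xml.etree.ElementTree as ET

PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def flat_opc_to_docx(flat_xml: bytes) -> io.BytesIO:
    """Repackage a <pkg:package> XML document as an in-memory zip (hypothetical helper)."""
    root = ET.fromstring(flat_xml)
    types = ET.Element(f"{{{CT_NS}}}Types")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for part in root.findall(f"{{{PKG_NS}}}part"):
            name = part.get(f"{{{PKG_NS}}}name")
            content_type = part.get(f"{{{PKG_NS}}}contentType")
            xml_data = part.find(f"{{{PKG_NS}}}xmlData")
            bin_data = part.find(f"{{{PKG_NS}}}binaryData")
            if xml_data is not None and len(xml_data):
                # The single child of pkg:xmlData is the root element of the part.
                blob = ET.tostring(xml_data[0], encoding="UTF-8", xml_declaration=True)
            elif bin_data is not None:
                # Binary parts (e.g. images) are stored as base64 text.
                blob = base64.b64decode(bin_data.text or "")
            else:
                continue  # part without payload: skipped in this sketch
            # Zip member names have no leading slash; content-type part names keep it.
            zf.writestr(name.lstrip("/"), blob)
            ET.SubElement(
                types, f"{{{CT_NS}}}Override", PartName=name, ContentType=content_type
            )
        zf.writestr(
            "[Content_Types].xml",
            ET.tostring(types, encoding="UTF-8", xml_declaration=True),
        )
    buf.seek(0)
    return buf
```

Since `Document()` accepts a file-like object, the returned buffer could then be passed as `Document(flat_opc_to_docx(ooxml))`. One caveat of this sketch: ElementTree may rename namespace prefixes when serialising, which is equivalent for most consumers but can matter where attribute values refer to prefixes by name (for example `mc:Ignorable`); a prefix-preserving parser such as lxml would likely be a safer basis for a real implementation.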
Looking forward to your feedback.
Best,
Bikram