Enhancement: OPC OOXML support #892

@c-bik

Hi,

First off, thanks a lot for this awesome library. It does a great job with most of the normal cases.

Recently, however, I came across a case (Word.Body.getOoxml() from Microsoft's Office-JS) where I couldn't find a way to parse or load the result into a docx object directly.

I am receiving an XML document (MIME type application/xml) with the following structure:

<?xml version="1.0" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">
    <pkg:part pkg:name="/_rels/.rels" pkg:contentType="application/vnd.openxmlformats-package.relationships+xml" pkg:padding="512">
        <pkg:xmlData>
            <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
                <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
            </Relationships>
        </pkg:xmlData>
    </pkg:part>
    <pkg:part pkg:name="/word/_rels/document.xml.rels" pkg:contentType="application/vnd.openxmlformats-package.relationships+xml" pkg:padding="256">
        <pkg:xmlData>
            <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
                <Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer1.xml"/>
                ...
                <Relationship Id="rId9" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer2.xml"/>
            </Relationships>
        </pkg:xmlData>
    </pkg:part>
    <pkg:part pkg:name="/word/document.xml" pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml">
        <pkg:xmlData>
            <w:document ...>
                <w:body>
                    <w:p w:rsidR="009606CA" w:rsidRDefault="009606CA">
                        ...
                    </w:p>
                    ...
                </w:body>
            </w:document>
        </pkg:xmlData>
    </pkg:part>
    <pkg:part pkg:name="/word/footer2.xml" pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml">
        <pkg:xmlData>
            <w:ftr ...>
                ...
            </w:ftr>
        </pkg:xmlData>
    </pkg:part>
    ...
    <pkg:part pkg:name="/word/media/image1.jpg" pkg:contentType="image/jpeg" pkg:compression="store">
        <pkg:binaryData>... base64 encoded image binary...
        </pkg:binaryData>
    </pkg:part>
    ...
    <pkg:part pkg:name="/word/webSettings.xml" pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml">
    </pkg:part>
</pkg:package>
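To see which parts such a package carries, the `pkg:part` elements can be walked with the standard library alone. This is a minimal sketch, not part of python-docx; the helper name `list_parts` is hypothetical, and it assumes the whole package fits in memory as bytes.

```python
import xml.etree.ElementTree as ET

# Namespace of the <pkg:package> wrapper, taken from the XML above.
PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"


def list_parts(flat_opc: bytes):
    """Return (part name, content type, payload kind) for every pkg:part.

    Hypothetical helper for inspection only.
    """
    root = ET.fromstring(flat_opc)
    parts = []
    for part in root.findall(f"{{{PKG_NS}}}part"):
        name = part.get(f"{{{PKG_NS}}}name")
        content_type = part.get(f"{{{PKG_NS}}}contentType")
        if part.find(f"{{{PKG_NS}}}xmlData") is not None:
            kind = "xml"        # payload is an inline XML tree
        elif part.find(f"{{{PKG_NS}}}binaryData") is not None:
            kind = "binary"     # payload is base64 text
        else:
            kind = "empty"      # part declared without a payload
        parts.append((name, content_type, kind))
    return parts
```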

It looked like a serialised form of the docx package, though I couldn't find any schema (XSD etc.) for <pkg:package>. So I attempted to implement support directly in python-docx, and that got it working for me.
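For comparison, a workaround that leaves python-docx untouched is to repack the XML into an ordinary zip-based package first. The sketch below is not the implementation from the linked branch; it assumes the input is the single-file layout shown above (usually referred to as Flat OPC), the function name `flat_opc_to_docx` is hypothetical, and it generates `[Content_Types].xml` from the `pkg:contentType` attributes because the flat form carries no such part.

```python
import base64
import io
import xml.etree.ElementTree as ET
import zipfile

PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def flat_opc_to_docx(flat_opc: bytes) -> bytes:
    """Repack a <pkg:package> document as zip bytes (hypothetical helper)."""
    root = ET.fromstring(flat_opc)
    types = ET.Element(f"{{{CT_NS}}}Types")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for part in root.findall(f"{{{PKG_NS}}}part"):
            name = part.get(f"{{{PKG_NS}}}name")
            content_type = part.get(f"{{{PKG_NS}}}contentType")
            xml_data = part.find(f"{{{PKG_NS}}}xmlData")
            binary_data = part.find(f"{{{PKG_NS}}}binaryData")
            if xml_data is not None and len(xml_data):
                # The single child of pkg:xmlData is the part's root element.
                payload = ET.tostring(
                    xml_data[0], encoding="UTF-8", xml_declaration=True
                )
            elif binary_data is not None:
                # b64decode skips the line breaks inside the base64 text.
                payload = base64.b64decode(binary_data.text or "")
            else:
                payload = b""  # part declared without a payload
            # Zip member names carry no leading slash.
            archive.writestr(name.lstrip("/"), payload)
            if content_type is not None:
                # One Override per part replaces the usual Default entries.
                ET.SubElement(
                    types,
                    f"{{{CT_NS}}}Override",
                    PartName=name,
                    ContentType=content_type,
                )
        archive.writestr(
            "[Content_Types].xml",
            ET.tostring(types, encoding="UTF-8", xml_declaration=True),
        )
    return buffer.getvalue()
```

The result can then be wrapped in `io.BytesIO` and passed to `Document`, which accepts a file-like object. One caveat: ElementTree rewrites namespace prefixes on output (`ns0`, `ns1`, ...), which is harmless for namespace-aware parsing but can matter for attributes that list prefixes by name, so lxml, which keeps the original prefixes, may be the safer choice if the file has to be reopened in Word.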

The work-in-progress diff of the implementation: master...c-bik:load-from-opc-ooxml-support

Usage:

from docx import Document

if __name__ == '__main__':
    ooxml: bytes = b'<pkg:package>...</pkg:package>'
    document = Document(ooxml)
    print([p.text for p in document.paragraphs])

Now I am wondering how it should really be done, so that I can potentially submit a pull request for such a feature. If this feature is useful or interesting for a future python-docx release, I can invest some time to turn it into an acceptable PR.

Looking forward.

Best,
Bikram
