How to Accept ("Preserve") DOCX Commented Changes In order to parse resulting text #566

evbo · 2018-10-31T16:48:47Z

Hi,

Thank you for this amazing library! I'm trying to read the .text attribute of paragraphs and recently realized that the state of that text content can be defined by an array of DOCX commented changes.

My question is: how can I programmatically "accept" all commented changes in a DOCX file, so that the resulting text content has all of its diffs (or versioned history) merged?

Some background:
What makes parsing commented DOCX files confusing is that in Microsoft Word you might not know that comments exist, because you can ignore them (allowing you to see the full merge of different authors' changes). When parsing with python-docx, however, unless the comments are "accepted" or "rejected", only the characters not affected by the "diff" (or series of changes) appear in the .text attribute. So in order to see the resulting text, you must first accept or reject all comments, save the document, and then parse it. Only then will the document you viewed in Word match the parsed text.

Here is a snippet of text (from a large document; I hope having a sample is helpful!) that should read:

SOME CHANGE AND ANOTHER CHANGE AND MORE CHANGES HERE

But actually, that text is formed by a series of <w:t></w:t> objects, some additive, and one deletion defined as follows:

<w:t>SOME CHANGE</w:t>undefined</w:r>undefined<w:ins w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1480">
<w:r w:rsidR="00464094">
    <w:rPr>
        <w:rFonts w:eastAsia="Times New Roman" />
        <w:sz w:val="16" />
        <w:szCs w:val="16" />
        <w:lang w:eastAsia="en-IN" />
    </w:rPr>
 <w:t> AND ANOTHER CHANGE</w:t></w:r>undefined</w:ins>undefined<w:del w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1481">
<w:r w:rsidDel="00464094">
    <w:rPr>
        <w:rFonts w:eastAsia="Times New Roman" />
        <w:sz w:val="16" />
        <w:szCs w:val="16" />
        <w:lang w:eastAsia="en-IN" />
    </w:rPr>
    <w:delText>A DELETE CHANGE HERE</w:delText>
</w:r>undefined</w:del>undefined<w:r>
<w:rPr>
    <w:rFonts w:eastAsia="Times New Roman" />
    <w:sz w:val="16" />
    <w:szCs w:val="16" />
    <w:lang w:eastAsia="en-IN" />
</w:rPr>
<w:t> AND MORE CHANGES HERE</w:t>undefined</w:r>undefined</w:p>undefined</w:tc>undefined<w:tc>undefined<w:tcPr>undefined<w:tcW w:w="3631" w:type="dxa" />undefined<w:tcBorders>

So in order to parse the text, all comments must be accepted (marked with "preserve") as follows:

<w:t xml:space="preserve">SOME CHANGE</w:t>undefined</w:r>undefined<w:ins w:author="Leroy Jim" w:id="182" w:date="2018-10-28T19:55:29Z">
<w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
    <w:rPr>
        <w:sz w:val="16"/>
        <w:szCs w:val="16"/>
        <w:rtl w:val="0"/>
    </w:rPr>
    <w:t xml:space="preserve"> AND ANOTHER CHANGE</w:t>
</w:r>undefined</w:ins>undefined<w:del w:author="Leroy Jim" w:id="182" w:date="2018-10-28T19:55:29Z">
<w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
    <w:rPr>
        <w:sz w:val="16"/>
        <w:szCs w:val="16"/>
        <w:rtl w:val="0"/>
    </w:rPr>
    <w:delText xml:space="preserve">A DELETE CHANGE HERE</w:delText>
</w:r>undefined</w:del>undefined<w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
<w:rPr>
    <w:sz w:val="16"/>
    <w:szCs w:val="16"/>
    <w:rtl w:val="0"/>
</w:rPr>
<w:t xml:space="preserve"> AND MORE CHANGES HERE</w:t>undefined</w:r>undefined<w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">

What would be the recipe for "preserving" all commented changes so that I get the expected result? Is there a shortcut to just "accept" all? Or will I need to edit the raw xml of this document?

Thank you, and here are related issues to handling comments that I found useful:

#483
#93


scanny · 2018-11-01T17:30:59Z

Well, the short answer is there is no support yet for revision marks in python-docx. The complexity you've already encountered a bit of is the reason why; nobody yet has wanted the functionality badly enough to work through all that.

The xml:space="preserve" items you're seeing don't have to do with revision marks, by the way. All that means is that leading and trailing spaces between the tags for that element should be preserved when reading the XML.
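A minimal sketch of that point, using the standard-library xml.etree.ElementTree as a stand-in parser (an assumption made only to keep the example self-contained; the element name `t` is simplified): the attribute is ordinary metadata on the element and says nothing about whether a revision was accepted.

```python
import xml.etree.ElementTree as ET

# The "xml" prefix is predefined and always maps to this namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

plain = ET.fromstring("<t> AND ANOTHER CHANGE</t>")
kept = ET.fromstring('<t xml:space="preserve"> AND ANOTHER CHANGE</t>')

# The attribute is only a whitespace-handling hint for consumers.
print(kept.get(XML_SPACE))
print(plain.get(XML_SPACE))

# The text content itself is identical in both elements.
print(plain.text == kept.text)
```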

If you format the XML, properly indicating the hierarchical relationships, it's easier to see what's going on (also, you should trim out all those 'undefined' items, those are not meaningful or valid .docx XML). Also, I've removed all the redundant font formatting since that doesn't bear on the current question.

<w:p>
  <w:r>
    <w:t>SOME CHANGE</w:t>
  </w:r>
  <w:ins w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1480">
    <w:r w:rsidR="00464094">
      <w:t> AND ANOTHER CHANGE</w:t>
    </w:r>
  </w:ins>
  <w:del w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1481">
    <w:r w:rsidDel="00464094">
      <w:delText>A DELETE CHANGE HERE</w:delText>
    </w:r>
  </w:del>
  <w:r>
    <w:t> AND MORE CHANGES HERE</w:t>
  </w:r>
</w:p>

Basically, you have a paragraph element w:p. When you get its text, python-docx reads its runs (w:r) one by one and strings them together. What it doesn't do is pay attention to w:del deletions or w:ins insertions. That explains why you're getting some text and not others. Basically you're getting the "hide revision marks" or "original version" view.

If you want to process these to get a different view (like perhaps "with changes") you need to parse the w:ins and include them in the right sequence, and skip the w:del elements.
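The "with changes" view described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not python-docx functionality: it uses the standard-library xml.etree.ElementTree on a raw XML string so it runs on its own, the helper name `accepted_text` is hypothetical, and it ignores tabs, line breaks, and other run content. It also skips w:moveFrom, on the assumption that moved-away text should disappear like deleted text.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Elements whose content disappears once tracked changes are accepted.
SKIP = {f"{{{W}}}del", f"{{{W}}}moveFrom"}
T = f"{{{W}}}t"

def accepted_text(p):
    """Hypothetical helper: paragraph text as if all changes were accepted."""
    parts = []

    def walk(element):
        for child in element:
            if child.tag in SKIP:
                continue  # drop deleted (or moved-away) runs entirely
            if child.tag == T:
                parts.append(child.text or "")
            else:
                walk(child)  # descend into w:r, w:ins, and so on

    walk(p)
    return "".join(parts)

SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t>SOME CHANGE</w:t></w:r>
  <w:ins w:author="Leroy Jim" w:id="1480">
    <w:r><w:t> AND ANOTHER CHANGE</w:t></w:r>
  </w:ins>
  <w:del w:author="Leroy Jim" w:id="1481">
    <w:r><w:delText>A DELETE CHANGE HERE</w:delText></w:r>
  </w:del>
  <w:r><w:t> AND MORE CHANGES HERE</w:t></w:r>
</w:p>
"""

print(accepted_text(ET.fromstring(SAMPLE)))
# -> SOME CHANGE AND ANOTHER CHANGE AND MORE CHANGES HERE
```

Because the function only iterates over children and compares tags, the same traversal should also work on the lxml element behind a python-docx paragraph (commonly reached through the private `paragraph._p` attribute, which is an assumption here and not a documented API).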

evbo · 2018-11-01T21:21:55Z

Okay, thank you, that makes sense. So is there a way to access a list of ins elements for a given paragraph? Also, do you have built-in parsers I should try to leverage?

scanny · 2018-11-02T20:56:23Z

You'll have to use lxml primitives. Each element is a subclass of lxml.etree._Element, which has methods like .getchildren() for that sort of thing:
https://lxml.de/api/lxml.etree._Element-class.html
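As a rough illustration of listing the w:ins children of a paragraph, here is a sketch that uses the standard-library xml.etree.ElementTree so it runs on its own. The helper name `inserted_runs` is hypothetical. lxml elements accept the same `findall(path, namespaces)` and `iterfind` calls, so the approach should carry over to the element behind a python-docx paragraph, though that is an assumption and not something confirmed in this thread.

```python
import xml.etree.ElementTree as ET

NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
AUTHOR = f"{{{NS['w']}}}author"

def inserted_runs(p):
    """Hypothetical helper: (author, text) for each w:ins directly under p."""
    result = []
    for ins in p.findall("w:ins", NS):
        # Join the text of every w:t nested anywhere inside this insertion.
        text = "".join(t.text or "" for t in ins.iterfind(".//w:t", NS))
        result.append((ins.get(AUTHOR), text))
    return result

XML = f"""
<w:p xmlns:w="{NS['w']}">
  <w:r><w:t>SOME CHANGE</w:t></w:r>
  <w:ins w:author="Leroy Jim" w:id="1480">
    <w:r><w:t> AND ANOTHER CHANGE</w:t></w:r>
  </w:ins>
</w:p>
"""

print(inserted_runs(ET.fromstring(XML)))
# -> [('Leroy Jim', ' AND ANOTHER CHANGE')]
```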

scanny closed this as completed Nov 2, 2018
