How to Accept ("Preserve") DOCX Commented Changes In order to parse resulting text #566

Closed
evbo opened this issue Oct 31, 2018 · 3 comments


evbo commented Oct 31, 2018

Hi,

Thank you for this amazing library! I'm trying to read the .text attribute of paragraphs and recently realized that the state of that text content can be defined by an array of DOCX commented changes.

My question is: how can I programmatically "accept" all commented changes in a DOCX file, so that the resulting text content has all of its diffs (or versioned history) merged?

Some background:
What makes parsing commented DOCX files confusing is that in Microsoft Word you might not know that comments exist, because you can ignore them (allowing you to see the full merge of different authors' changes). When parsing with python-docx, however, unless the comments are "accepted" or "rejected", only the characters not affected by the "diff" (or series of changes) appear in the .text attribute. So in order to see the resulting text, you must accept or reject all comments first, save the document, then parse it. Only then will the document you viewed in Word match the parsed text.

Here is a snippet of text (from a large document; I hope a sample is more helpful!) that should read:

SOME CHANGE AND ANOTHER CHANGE AND MORE CHANGES HERE

But actually, that text is formed by a series of <w:t></w:t> objects, some additive, and one deletion defined as follows:

<w:t>SOME CHANGE</w:t>undefined</w:r>undefined<w:ins w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1480">
<w:r w:rsidR="00464094">
    <w:rPr>
        <w:rFonts w:eastAsia="Times New Roman" />
        <w:sz w:val="16" />
        <w:szCs w:val="16" />
        <w:lang w:eastAsia="en-IN" />
    </w:rPr>
 <w:t> AND ANOTHER CHANGE</w:t></w:r>undefined</w:ins>undefined<w:del w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1481">
<w:r w:rsidDel="00464094">
    <w:rPr>
        <w:rFonts w:eastAsia="Times New Roman" />
        <w:sz w:val="16" />
        <w:szCs w:val="16" />
        <w:lang w:eastAsia="en-IN" />
    </w:rPr>
    <w:delText>A DELETE CHANGE HERE</w:delText>
</w:r>undefined</w:del>undefined<w:r>
<w:rPr>
    <w:rFonts w:eastAsia="Times New Roman" />
    <w:sz w:val="16" />
    <w:szCs w:val="16" />
    <w:lang w:eastAsia="en-IN" />
</w:rPr>
<w:t> AND MORE CHANGES HERE</w:t>undefined</w:r>undefined</w:p>undefined</w:tc>undefined<w:tc>undefined<w:tcPr>undefined<w:tcW w:w="3631" w:type="dxa" />undefined<w:tcBorders>

So in order to parse the text, all comments must be accepted (marked with "preserve") as follows:

<w:t xml:space="preserve">SOME CHANGE</w:t>undefined</w:r>undefined<w:ins w:author="Leroy Jim" w:id="182" w:date="2018-10-28T19:55:29Z">
<w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
    <w:rPr>
        <w:sz w:val="16"/>
        <w:szCs w:val="16"/>
        <w:rtl w:val="0"/>
    </w:rPr>
    <w:t xml:space="preserve"> AND ANOTHER CHANGE</w:t>
</w:r>undefined</w:ins>undefined<w:del w:author="Leroy Jim" w:id="182" w:date="2018-10-28T19:55:29Z">
<w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
    <w:rPr>
        <w:sz w:val="16"/>
        <w:szCs w:val="16"/>
        <w:rtl w:val="0"/>
    </w:rPr>
    <w:delText xml:space="preserve">A DELETE CHANGE HERE</w:delText>
</w:r>undefined</w:del>undefined<w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
<w:rPr>
    <w:sz w:val="16"/>
    <w:szCs w:val="16"/>
    <w:rtl w:val="0"/>
</w:rPr>
<w:t xml:space="preserve"> AND MORE CHANGES HERE</w:t>undefined</w:r>undefined<w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">

What would be the recipe for "preserving" all commented changes so that I get the expected result? Is there a shortcut to just "accept" all? Or will I need to edit the raw xml of this document?

Thank you, and here are related issues to handling comments that I found useful:

#483
#93


scanny commented Nov 1, 2018

Well, the short answer is there is no support yet for revision marks in python-docx. The complexity you've already encountered a bit of is the reason why; nobody yet has wanted the functionality badly enough to work through all that.

The xml:space="preserve" items you're seeing don't have to do with revision marks, by the way. All that means is that leading and trailing spaces between the tags for that element should be preserved when reading the XML.
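A small sketch of that point, using Python's standard-library xml.etree.ElementTree instead of python-docx (an assumption made only so the snippet runs on its own). It shows that xml:space is an ordinary attribute in the reserved XML namespace, carried by the w:t element whether or not the run sits inside a revision; the SNIPPET string is a hypothetical fragment modeled on the XML above.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Namespace that the reserved "xml" prefix is always bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical fragment modeled on the issue's XML (formatting removed).
SNIPPET = (
    '<w:r xmlns:w="%s">'
    '<w:t xml:space="preserve"> AND ANOTHER CHANGE</w:t>'
    "</w:r>" % W_NS
)

run = ET.fromstring(SNIPPET)
t = run.find("{%s}t" % W_NS)

# The attribute only describes whitespace handling for this element's text.
space_attr = t.get("{%s}space" % XML_NS)
print(space_attr)
print(repr(t.text))  # the leading space is part of the run text
```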

If you format the XML, properly indicating the hierarchical relationships, it's easier to see what's going on (also, you should trim out all those 'undefined' items, those are not meaningful or valid .docx XML). Also, I've removed all the redundant font formatting since that doesn't bear on the current question.

<w:p>
  <w:r>
    <w:t>SOME CHANGE</w:t>
  </w:r>
  <w:ins w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1480">
    <w:r w:rsidR="00464094">
      <w:t> AND ANOTHER CHANGE</w:t>
    </w:r>
  </w:ins>
  <w:del w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1481">
    <w:r w:rsidDel="00464094">
      <w:delText>A DELETE CHANGE HERE</w:delText>
    </w:r>
  </w:del>
  <w:r>
    <w:t> AND MORE CHANGES HERE</w:t>
  </w:r>
</w:p>

Basically, you have a paragraph element w:p. When you get its text, python-docx reads its runs (w:r) one by one and strings them together. What it doesn't do is pay attention to w:del deletions or w:ins insertions. That explains why you're getting some text and not others. Basically you're getting the "hide revision marks" or "original version" view.
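To make that concrete, here is a rough sketch that imitates reading only the runs that are direct children of w:p. It uses the standard-library xml.etree.ElementTree on a copy of the formatted XML above, not python-docx's own code, and direct_run_text is a hypothetical helper name. Note that in this sketch the text inside both w:ins and w:del is left out, since neither is a direct child run.

```python
import xml.etree.ElementTree as ET

NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# The paragraph from the issue, with the namespace declared so it parses alone.
SAMPLE = """<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t>SOME CHANGE</w:t></w:r>
  <w:ins w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1480">
    <w:r w:rsidR="00464094"><w:t> AND ANOTHER CHANGE</w:t></w:r>
  </w:ins>
  <w:del w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1481">
    <w:r w:rsidDel="00464094"><w:delText>A DELETE CHANGE HERE</w:delText></w:r>
  </w:del>
  <w:r><w:t> AND MORE CHANGES HERE</w:t></w:r>
</w:p>"""


def direct_run_text(p):
    """Join the w:t text of runs that are direct children of the paragraph.

    Hypothetical helper: an approximation of a runs-only read, not the
    library's actual implementation.
    """
    parts = []
    for r in p.findall("w:r", NS):        # direct child runs only
        for t in r.findall("w:t", NS):    # w:delText is a different tag
            parts.append(t.text or "")
    return "".join(parts)


p = ET.fromstring(SAMPLE)
print(direct_run_text(p))  # SOME CHANGE AND MORE CHANGES HERE
```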

If you want to process these to get a different view (like perhaps "with changes") you need to parse the w:ins and include them in the right sequence, and skip the w:del elements.
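A minimal sketch of that approach, assuming the revisions are plain w:ins and w:del wrappers as in the sample (moved text, tabs, line breaks and other constructs are not handled). It is written against the standard-library xml.etree.ElementTree so it runs on its own; revision_text is a hypothetical helper, not part of python-docx.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

SAMPLE = """<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t>SOME CHANGE</w:t></w:r>
  <w:ins w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1480">
    <w:r w:rsidR="00464094"><w:t> AND ANOTHER CHANGE</w:t></w:r>
  </w:ins>
  <w:del w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1481">
    <w:r w:rsidDel="00464094"><w:delText>A DELETE CHANGE HERE</w:delText></w:r>
  </w:del>
  <w:r><w:t> AND MORE CHANGES HERE</w:t></w:r>
</w:p>"""


def revision_text(element, accept=True):
    """Return paragraph text as if all revisions were accepted or rejected.

    accept=True : keep text inside w:ins, skip w:del subtrees.
    accept=False: skip w:ins subtrees, keep w:delText inside w:del.
    """
    skip = W + ("del" if accept else "ins")
    wanted = {W + "t"} if accept else {W + "t", W + "delText"}
    parts = []

    def walk(node):
        # Children are visited in document order, so text stays in sequence.
        for child in node:
            if child.tag == skip:
                continue
            if child.tag in wanted:
                parts.append(child.text or "")
            else:
                walk(child)

    walk(element)
    return "".join(parts)


p = ET.fromstring(SAMPLE)
accepted = revision_text(p, accept=True)
rejected = revision_text(p, accept=False)
print(accepted)  # SOME CHANGE AND ANOTHER CHANGE AND MORE CHANGES HERE
print(rejected)
```

With python-docx, the same function could likely be applied to `paragraph._p`, the lxml element behind a paragraph, since it relies only on child iteration, `.tag` and `.text`.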


evbo commented Nov 1, 2018

Okay, thank you, that makes sense. So is there a way to access a list of ins elements for a given paragraph? Also, do you have built-in parsers I should try to leverage?


scanny commented Nov 2, 2018

You'll have to use lxml primitives. Each element is a subclass of lxml.etree._Element, which has methods like .getchildren() for that sort of thing:
https://lxml.de/api/lxml.etree._Element-class.html
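As an illustration of listing the w:ins elements of one paragraph, here is a sketch that uses only the ElementTree-style calls `.iter()` and `.get()`, which lxml elements also provide. It is written with the standard library so it runs without python-docx installed; list_insertions is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

SAMPLE = """<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t>SOME CHANGE</w:t></w:r>
  <w:ins w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1480">
    <w:r w:rsidR="00464094"><w:t> AND ANOTHER CHANGE</w:t></w:r>
  </w:ins>
  <w:del w:author="Leroy Jim" w:date="2018-10-10T14:12:00Z" w:id="1481">
    <w:r w:rsidDel="00464094"><w:delText>A DELETE CHANGE HERE</w:delText></w:r>
  </w:del>
  <w:r><w:t> AND MORE CHANGES HERE</w:t></w:r>
</w:p>"""


def list_insertions(p):
    """Return (author, date, text) for every w:ins found under the paragraph."""
    found = []
    for ins in p.iter(W + "ins"):
        # Collect the text of all runs wrapped by this insertion.
        text = "".join(t.text or "" for t in ins.iter(W + "t"))
        # Attributes are namespaced too, hence the W prefix on the keys.
        found.append((ins.get(W + "author"), ins.get(W + "date"), text))
    return found


p = ET.fromstring(SAMPLE)
insertions = list_insertions(p)
for author, date, text in insertions:
    print(author, date, repr(text))
```

In python-docx the element to pass would presumably be `paragraph._p`; that attribute is private, so treat this as a workaround and not a supported API.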

scanny closed this as completed Nov 2, 2018