Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue on 'for loop' with shapes inside? #61

Closed
fernandobrito opened this issue Jul 2, 2017 · 9 comments · Fixed by #89
Closed

Issue on 'for loop' with shapes inside? #61

fernandobrito opened this issue Jul 2, 2017 · 9 comments · Fixed by #89
Assignees
Labels

Comments

@fernandobrito
Copy link

fernandobrito commented Jul 2, 2017

Hello,

When I have a "for loop" with a shape inside, and the loop runs more than once, Word claims that output is corrupted.

Sample input:
image

If loop runs only once, output is as expected (one blue rectangle). If loop is run more than once, Word 2016 complains:
image

But if I try to proceed and let Word recover the document, output is as expected (two rectangles). If I replace the figure with text, everything works fine.

I will try to investigate the issue soon. Just posting it here to force myself to come later with a following up and to ask if this is a known bug.

Thanks again for the great work on the gem.

@fernandobrito
Copy link
Author

The error occurs because the for loop generates figures with duplicated ids. I will see if I can come up with a fix.

@senny
Copy link
Owner

senny commented Jul 3, 2017

@fernandobrito thanks for looking into this! You are definitely on the right track. The way sablon loops work is that they copy the contents for every iteration. If there are elements inside the loop that have to have unique properties, Word will most likely complain.

@fernandobrito
Copy link
Author

Any suggestions on how to fix it? I needed something quick for a client, so I'm just appending a unique number to each node inside the loop body that has an "id" attribute. It solved my issue with figures.

However, I'm not sure if it is always safe to update all nodes with an "id" attribute or if other types of nodes may have different unique ids but use some other name on the node attribute.

fernandobrito@eebd895

@stadelmanma
Copy link
Collaborator

stadelmanma commented Jul 3, 2017

@fernandobrito I would assume id's are unique across the entire document but to really be sure you'd need to check the XML spec (5500 pages of light reading if you need something to do on a weekend). While gigantic it is very well organized and bookmarked so its not as bad as it seems. Another good reference for WordML and DrawingML: http://officeopenxml.com, they reference the 3rd edition XML spec but I think that is simply due to websites age.

Could you throw out a small excerpt of the corrupted version i.e. duplicated id's and a excerpt of your fixed version? Seeing what is actually going on will help me out a bit.

I think the best way to fix it is to add functionality that finds the max id value in use and increments it from that point forward. You'll need to check every element that has an id attribute not just the drawing related ones. I'd track the value on the environment instance. You may need to check other files such as footers, headers, etc. as well. Depending on the implementation the 'next' id values could just be stored in a hash with a key matching the unique attribute name.

@fernandobrito
Copy link
Author

fernandobrito commented Jul 14, 2017

@stadelmanma thanks for the references.

I've done a quick search on the XML spec and it seems most figures/drawings have their unique ids described as an attribute either on the wp:docPr or on the wp:cNvPr tag. I couldn't really understand when which one is used.

The description for the id attribute (which should be a unique integer) is:

id (Unique Identifier)
Specifies a unique identifier for the current DrawingML object within the current
document. This ID can be used to assist in uniquely identifying this object so that it can
be referred to by other parts of the document

If multiple objects within the same document share the same id attribute value, then the
document shall be considered non-conformant.
https://msdn.microsoft.com/en-us/library/documentformat.openxml.drawing.wordprocessing.docproperties(v=office.15).aspx

Here you can find a small example of a rectangle inside a loop: https://github.com/fernandobrito/sablon/blob/eebd895954b157e659fbf180ea8312b027a05a69/test/fixtures/xml/figure_loop.xml#L30. Line 30 has a docPr element with an id attribute which will get duplicated when sablon make copies of the loop body, corrupting the output file.

I've started playing around with your idea of finding the highest id and then assigning new ids for each loop iteration (on this fork), but I just realized I did a big mistake. I started doing my work on top of the images support PR :(. When I have more time I will cherry-pick my changes on top of senny/sablon master so you can take a look and maybe provide some feedback.

By the way, I've been using a lot a tool from Microsoft: https://www.microsoft.com/en-us/download/details.aspx?id=30425. It lets me generate diffs between OOXML files and also validate them. Is there anywhere where I should add it for future contributors? Perhaps on a wiki page on this repo?

@stadelmanma
Copy link
Collaborator

@fernandobrito I completely missed the last bit of your comment about the MS tool. Sadly, I don't have access to a Windows computer to try it out.

Does it generate a diff like git showing you the line by line changes to each XML file in the document?

The validation part sounds extremely useful since MS Word tends to be pretty ambiguous when a docx gets corrupted. What kind of information does it generate when validating a document?

@fernandobrito
Copy link
Author

@stadelmanma I had to install Windows on a virtual machine in order to use it.

Yes, the tool provides diffs, such as:
image, but I think you can use normal diff tools or web tools such as: https://www.corefiling.com/opensource/xmldiff/.

Maybe the biggest win is the validation feature. It shows which lines are causing problems:
image

That's how I found about the duplicated id issue on figures.

@stadelmanma
Copy link
Collaborator

I think this is more easily fixable with the new DOM logic, I have a branch on my fork to work on it. All r:id attributes should be unmodified because that will break relationships defined in a *.rels file.

Additional notes on any elements that use the id attribute:

  • Table cell identifier (17.4.66), w:id, only unique within the table itself, optional, any string value. Probably can be ignored.
  • Annotation identifier (17.13.4.2) w:id, applies to comments it says that if more than one comment has the same value for this attribute then only one of them is ignored.
    • This appears to be the same for all of the change tracking elements from 17.13.4 - 17.13.5. I think I should leave these id attributes alone.
    • Changes tracking in a template doesn't make sense from a use case aside from providing the consumer with notes on what they may need to manually up date so I'm less worried about support this corner case.
  • The w:bookmarkStart and w:bookmarkEnd elements share a unique ID value.
    • If I copy and change one I need to ensure the corresponding end element also gets changed.
    • If both elements aren't present in the repeated content then I should just drop the clones
  • The w:permStart and w:permEnd elements follow the same convention as bookmarks.
  • The last relevant id is in cNcPR (19.3.1.12) which we already know needs to be unique.

Side note there is a <w:id> element but we shouldn't need to worry about this one (17.5.2.18).

@stadelmanma stadelmanma self-assigned this Jan 24, 2018
@stadelmanma
Copy link
Collaborator

stadelmanma commented Jan 25, 2018

Implementation stages:

  • Auto-increment the id attribute in the wp:docPr and wp:cNvPr tags. I think I should find these tags for any namespace (https://stackoverflow.com/questions/4440451/how-to-ignore-namespaces-with-xpath)
    • I'll need to test what happens when shapes are in footnotes, endnotes, headers, etc.
  • Handle the bookmark tags and migrate this logic to perm tags
    • Several bookmarks with overlapping id's does not appear to corrupt the document. I will leave this to the end user to handle as bookmarks spanning a loop in a odd fashion shouldn't be the original intent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants