Run a docx validator ahead of providing output #6365

Open
Tracked by #7978
cscheid opened this issue Jul 28, 2023 · 4 comments
Labels: docx (Issues with the docx format), enhancement (New feature or request)

Comments

@cscheid
Collaborator

cscheid commented Jul 28, 2023

There are some docx validators out there. If we found a way to run XML validation in TypeScript, we could check our output before emitting it and avoid situations like #6357.
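
A minimal sketch of what "check before emitting" could look like, assuming validation can be expressed as functions over the XML text of one package part. The names `Validator`, `checkBeforeEmit`, and `hasDocumentRoot` are hypothetical and are not Quarto or Pandoc API, and the single example check is a string-level heuristic rather than real schema validation.

```typescript
// Hypothetical types and names; none of this is Quarto or Pandoc API.
// A validator returns a list of error messages, empty when the input is clean.
type Validator = (xml: string) => string[];

interface CheckResult {
  ok: boolean;
  errors: string[];
}

// Run every validator and collect all messages,
// so that one failure does not hide another.
function checkBeforeEmit(xml: string, validators: Validator[]): CheckResult {
  const errors: string[] = [];
  for (const validate of validators) {
    errors.push(...validate(xml));
  }
  return { ok: errors.length === 0, errors };
}

// Example check (heuristic, not schema validation):
// word/document.xml should have w:document as its root element.
const hasDocumentRoot: Validator = (xml) =>
  /^\s*(<\?xml[^>]*\?>)?\s*<w:document[\s>]/.test(xml)
    ? []
    : ["root element is not w:document"];
```

A caller would write the .docx only when `checkBeforeEmit(xml, [hasDocumentRoot]).ok` is true, and otherwise report `errors`.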

cscheid added the docx label Jul 28, 2023
cscheid added this to the Future milestone Jul 28, 2023
cscheid self-assigned this Jul 28, 2023
@cscheid
Collaborator Author

cscheid commented Jul 28, 2023

Hm. This specific docx validator appears to be out of date.

A basic

| foo | bar |
|-----|-----|
| 1   | 2   |

table, rendered to .docx, produces a file that Office for Mac can read (in my case), but that fails validation.

@edwintorok
Contributor

The XML validation errors are mostly due to wrong ordering of XML tags (apparently both the .xsd and RELAX NG schemas demand a particular ordering for some tags).
See jgm/pandoc#9263, where I'm attempting to fix the ordering in the reference document, which now passes validation (there are more validation errors to fix in Pandoc and Quarto itself).
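
To illustrate the kind of ordering error described above, here is a small sketch that compares the direct children of an element against the order a schema expects. The function name `findOrderViolations` is hypothetical, and `TBL_CHILD_ORDER` is a deliberately simplified assumption about `w:tbl` (properties, then grid, then rows); the real schema permits more element kinds.

```typescript
// Assumption: a simplified child order for w:tbl, for illustration only.
const TBL_CHILD_ORDER = ["w:tblPr", "w:tblGrid", "w:tr"];

// Report each child that appears after a child the schema order places later.
// Names that are not in `schemaOrder` are ignored rather than flagged,
// and repeated elements (such as several w:tr rows) are accepted.
function findOrderViolations(children: string[], schemaOrder: string[]): string[] {
  const errors: string[] = [];
  let maxRank = -1;
  let maxName = "";
  for (const name of children) {
    const rank = schemaOrder.indexOf(name);
    if (rank === -1) continue;
    if (rank < maxRank) {
      errors.push(`${name} must come before ${maxName}`);
    } else {
      maxRank = rank;
      maxName = name;
    }
  }
  return errors;
}
```

For example, `findOrderViolations(["w:tblGrid", "w:tblPr", "w:tr"], TBL_CHILD_ORDER)` returns `["w:tblPr must come before w:tblGrid"]`, while the correctly ordered list returns an empty array.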

I've also used a second validator, https://github.com/mikeebowen/OOXML-Validator, which uses the open source https://github.com/dotnet/Open-XML-SDK, and that fails in similar ways.

I don't know whether the ordering matters for any application, but it's probably best to fix these errors so the real ones can be seen.
E.g. the reference doc had an extra > after a tag (now fixed in the PR above), and at first glance the callout tables created by Quarto are missing a tblGrid tag and have a duplicate w:pPr tag (one added by Pandoc itself, the other by Quarto's Lua filter with raw openxml), which is very likely the reason for the linked bug.
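
The two structural problems mentioned above (a `w:tbl` without `w:tblGrid`, and a `w:p` with more than one `w:pPr`) can be spotted without a full schema validator. The sketch below is a rough, regex-based scan under stated assumptions: the input is well-formed, contains no comments or CDATA sections with tag-like text, and uses the conventional `w:` prefix. `findStructuralIssues` and `Issue` are hypothetical names, not part of any existing tool.

```typescript
interface Issue {
  element: string;
  message: string;
}

interface OpenElement {
  name: string;
  children: string[]; // names of direct child elements, in document order
}

// Check one finished element against the two rules discussed in this thread.
function checkElement(node: OpenElement, issues: Issue[]): void {
  if (node.name === "w:tbl" && node.children.indexOf("w:tblGrid") === -1) {
    issues.push({ element: "w:tbl", message: "missing w:tblGrid" });
  }
  if (node.name === "w:p") {
    const count = node.children.filter((c) => c === "w:pPr").length;
    if (count > 1) {
      issues.push({ element: "w:p", message: `${count} w:pPr children, expected at most 1` });
    }
  }
}

// Walk the tags in order, keeping a stack of open elements so that
// each element knows its direct children by the time it closes.
function findStructuralIssues(xml: string): Issue[] {
  const issues: Issue[] = [];
  const stack: OpenElement[] = [];
  // Groups: 1 = "/" for a closing tag, 2 = element name, 3 = "/" for a self-closing tag.
  const tag = /<(\/?)([\w:.-]+)[^>]*?(\/?)>/g;
  let m: RegExpExecArray | null;
  while ((m = tag.exec(xml)) !== null) {
    const closing = m[1] === "/";
    const name = m[2];
    const selfClosing = m[3] === "/";
    if (closing) {
      const finished = stack.pop();
      if (finished) checkElement(finished, issues);
      continue;
    }
    const parent = stack[stack.length - 1];
    if (parent) parent.children.push(name);
    const node: OpenElement = { name, children: [] };
    if (selfClosing) checkElement(node, issues);
    else stack.push(node);
  }
  return issues;
}
```

Issues are reported in the order the offending elements close, so a bad paragraph inside a table is listed before the table itself.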

cscheid modified the milestones: Future, v1.4, v1.5 Dec 18, 2023
@edwintorok
Contributor

FWIW, here is how to run the docx validators in CI (one using xmllint and the other using the open source Open XML SDK from dotnet): https://github.com/jgm/pandoc/blob/main/.github/workflows/docx-validation.yaml

(Using the latter to validate is sufficient, but I find that the error messages from the former are easier to follow, since it gives you line numbers in the XML instead of an XPath.)

The latest Pandoc master should now produce docx that passes validation (Quarto doesn't yet, as discussed in the other linked issue).

@cscheid
Collaborator Author

cscheid commented Dec 19, 2023

Thanks for the update! If Pandoc releases soon enough, we'll try to fix our docx issues as well ahead of the 1.4 release.

cscheid modified the milestones: v1.5, Future Mar 1, 2024
mcanouil added the enhancement label Apr 28, 2024