First cut at p:validate-with-dtd #579

Open: wants to merge 1 commit into base: master

@ndw ndw commented Jun 7, 2024

This is my first attempt. Feedback eagerly solicited. Formatted versions should appear on the xproc.org/dashboard page a few minutes after I create this request.

@ndw ndw requested a review from a team as a code owner June 7, 2024 13:13
@ndw
Collaborator Author

ndw commented Jun 7, 2024

Close #543

'biblio': map { "public-identifier": "-//Bibliograph//EN",
"system-identifier": "bib.xml" }}]]></programlisting>

<para>The <code>system-identifier</code> property must be provided. The
Contributor

Unless the full declaration is given on the doctype port, I guess?


<para>The <tag>p:validate-with-dtd</tag> step does not have an
<option>assert-valid</option> option. If validation fails, a new data model will
not have been constructed. Consequently the step always fails if validation
Contributor

Is this really all that we can offer? I’d envisaged that it works like xmllint --dtdvalid (do a posteriori validation against a given DTD).

root.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root PUBLIC "-//Public//Identifier" "system-identifier.dtd">
<root>
  <a></a>
  <c></c>
</root>

system-identifier.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT root (a, b)>
<!ELEMENT a (#PCDATA)>
<!ELEMENT b (#PCDATA)>

Invoke validation:

$ xmllint --noout --dtdvalid system-identifier.dtd root.xml
root.xml:3: element root: validity error : Element root content does not follow the DTD, expecting (a , b), got (a c )
root.xml:5: element c: validity error : No declaration for element c
Document root.xml does not validate against system-identifier.dtd

And then wrap the error message line into appropriate elements for the requested report format.

When I use Calabash and p:load[@dtd-validate='true'], I get something like this:

Error on line 5 column 6 of root.xml:
  SXXP0003   Error reported by XML parser: Element type "c" must be declared.: Element type
  "c" must be declared.
Error on line 6 column 8 of root.xml:
  SXXP0003   Error reported by XML parser: The content of element type "root" must match
  "(a,b)".: The content of element type "root" must match "(a,b)".
<c:errors xmlns:c="http://www.w3.org/ns/xproc-step"><c:error xmlns:err="http://www.w3.org/ns/xproc-error" code="err:XC0027" href="file:/mnt/c/Users/gerrit/XML/XProc/2024-06_validate-with-dtd/validate-with-dtd.xpl" line="3" column="59">The XML parser reported two validation errors</c:error></c:errors>

The lines before c:error are just written to STDOUT. I think they’d need to be collected instead and put into the report, each message wrapped into something like

<detection severity="error" code="SXXP0003">
  <location line="5" column="6"/>
  <message xml:lang="en">Element type "c" must be declared.</message>
</detection>
<detection severity="error" code="SXXP0003">
  <location line="6" column="8"/>
  <message xml:lang="en">The content of element type "root" must match "(a,b)".</message>
</detection>
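
To make the "collect and wrap" idea concrete, here is a minimal Python sketch. It assumes the parser messages arrive as text in exactly the shape of the Calabash output quoted above (a header line, then indented detail lines in which the message is repeated after a colon). The regular expressions and the function names `dedupe` and `detections` are illustrative assumptions, not part of any processor's API.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shapes, taken from the Calabash output quoted in this comment.
HEADER = re.compile(r"Error on line (\d+) column (\d+) of (.+):$")
DETAIL = re.compile(r"(\S+)\s+Error reported by XML parser: (.*)$")
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def dedupe(text):
    """Collapse 'X: X' (the message repeated after a colon) to 'X'."""
    half = (len(text) - 2) // 2
    if len(text) >= 2 and text == text[:half] + ": " + text[:half]:
        return text[:half]
    return text


def detections(log):
    """Turn parser log text into a list of <detection> elements."""
    entries = []  # each entry: [line, column, accumulated detail text]
    for raw in log.splitlines():
        header = HEADER.match(raw)
        if header:
            entries.append([header.group(1), header.group(2), ""])
        elif entries and raw.startswith(" "):
            # Indented lines continue the most recent error message.
            entries[-1][2] = (entries[-1][2] + " " + raw.strip()).strip()
    result = []
    for line, column, detail in entries:
        match = DETAIL.match(detail)
        if not match:
            continue  # unknown shape: skipped in this sketch
        det = ET.Element("detection", severity="error", code=match.group(1))
        ET.SubElement(det, "location", line=line, column=column)
        message = ET.SubElement(det, "message")
        message.set(XML_LANG, "en")
        message.text = dedupe(match.group(2))
        result.append(det)
    return result
```

A real implementation would more likely register an error handler on the parser than scrape its text output; the sketch only shows the mapping from a message to a `detection` element.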

<para>The resulting text is parsed using a validating XML parser.</para>

<para>Any warning messages produced by the parser will appear on the
<port>report</port> port.</para>
Contributor

See above, I’d also expect the actual validation errors to be listed in the report.

ndw commented Jun 7, 2024

Thank you, Gerrit! You're absolutely right. The step can return the original document unchanged if there was an error. That's much more sensible.

ndw commented Jun 8, 2024

I think we need a document-element option for the case where you send a text document as the source. For example:

<p:identity>
  <p:with-input><p>Paragraph of text.</p></p:with-input>
</p:identity>

<p:validate-with-dtd
  general-entities="map { 'text': 'Hello, world.',
                          'para': . }"
  document-element="doc">
  <p:with-input port="source">
    <p:inline content-type="text/plain"><![CDATA[<doc>
<p>Test</p>
<p>&text;</p>
&para;
</doc>]]></p:inline>
  </p:with-input>
  <p:with-input port="doctype"><p:empty/></p:with-input>
</p:validate-with-dtd>

ndw commented Jun 8, 2024

Having poked at the implementation a bit, I think what I've proposed is way over-the-top. How about:

<p:declare-step type="p:validate-with-dtd">
  <p:input port="source" primary="true" content-types="xml html text"/>
  <p:input port="doctype" content-types="text" sequence="true">
    <p:empty/>
  </p:input>
  <p:output port="result" primary="true" content-types="xml"/>
  <p:output port="report" sequence="true" content-types="xml json"/>
  <p:option name="report-format" select="'xvrl'" as="xs:string"/>
  <p:option name="serialization" as="map(xs:QName,item()*)?"/>
  <p:option name="assert-valid" select="true()" as="xs:boolean"/>
</p:declare-step>

  1. The simple case: you pass a document with a doctype-system serialization property (on the document or on the step). We serialize the document with the necessary doctype declaration and validate it.
  2. You provide a doctype: we serialize the source document (without a doctype declaration or XML declaration), slap the doctype you provided in front of it, and validate it.
  3. If you want to do anything funky with entity replacements or some such, you construct the text of the document you want to parse, by whatever means you want, and we validate it.
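
The three cases above can be sketched as a single "build the text to parse" function. This is a rough Python illustration, not the step's implementation. The names `doctype_from_serialization` and `text_to_validate` are hypothetical, namespaces are ignored, and prepending a supplied doctype to a text source is my assumption about how case 3 combines with the doctype port.

```python
import xml.etree.ElementTree as ET


def doctype_from_serialization(root_name, params):
    """Case 1: build a doctype declaration from serialization properties."""
    system = params.get("doctype-system")
    if system is None:
        return None
    public = params.get("doctype-public")
    if public is not None:
        return f'<!DOCTYPE {root_name} PUBLIC "{public}" "{system}">'
    return f'<!DOCTYPE {root_name} SYSTEM "{system}">'


def text_to_validate(source, doctype=None, serialization=None):
    """Return the text that would be handed to a validating parser."""
    if isinstance(source, str):
        # Case 3: the caller constructed the text; use it as given.
        body, root_name = source, None
    else:
        # Cases 1 and 2: serialize without an XML or doctype declaration.
        body, root_name = ET.tostring(source, encoding="unicode"), source.tag
    if doctype is None and root_name is not None:
        # Case 1: fall back to the serialization properties.
        doctype = doctype_from_serialization(root_name, serialization or {})
    if doctype is None:
        return body
    # Case 2: put the supplied doctype in front of the serialized document.
    return doctype.strip() + "\n" + body
```

For example, calling `text_to_validate(root, doctype='<!DOCTYPE root SYSTEM "system-identifier.dtd">')` on a parsed `root` element yields the declaration followed by the serialized document, which is the text a validating parser would then read.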

@xml-project
Member

I have most probably missed something important, but I am confused about what the report output port is for. If the validation succeeds, nothing “interesting” is in the documents on this port. If it doesn’t, the report document is not available because a dynamic error is raised.
What am I missing?

ndw commented Jun 8, 2024

Several comments back, @gimsieke persuaded me that we should put the assert-valid option back and just pass the original document through if assert-valid is false() and an error occurs.
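
As a rough illustration of that behaviour (a sketch under my reading of the thread, not spec text), the decision can be written as a small Python function. `ValidationError` and `step_outputs` are hypothetical names, and the sketch does not model warnings, which the draft also sends to the report port.

```python
class ValidationError(Exception):
    """Hypothetical stand-in for the dynamic error the step would raise."""


def step_outputs(original, parsed, errors, assert_valid=True):
    """Return (result, report) for one validation run.

    original: the document that arrived on the source port
    parsed:   the document built by the validating parser, or None on failure
    errors:   list of validation error messages (empty when valid)
    """
    if errors and assert_valid:
        # assert-valid is true: a validation failure is a dynamic error.
        raise ValidationError("; ".join(errors))
    if errors:
        # assert-valid is false: pass the original through, report the errors.
        return original, list(errors)
    # Valid: the freshly parsed document is the result; nothing to report.
    return parsed, []
```

With `assert_valid=False`, the report output is the only place where a caller can see why validation failed, which is what makes the port useful.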

@xml-project
Member

@ndw thanks. Now I know what I missed. :-))

@xml-project
Member

@ndw Two questions came up while I was trying to implement the new suggestion:

<p:declare-step type="p:validate-with-dtd">
  <p:input port="source" primary="true" content-types="xml html text"/>
  <p:input port="doctype" content-types="text" sequence="true">
    <p:empty/>
  </p:input>
  <p:output port="result" primary="true" content-types="xml"/>
  <p:output port="report" sequence="true" content-types="xml json"/>
  <p:option name="report-format" select="'xvrl'" as="xs:string"/>
  <p:option name="serialization" as="map(xs:QName,item()*)?"/>
  <p:option name="assert-valid" select="true()" as="xs:boolean"/>
</p:declare-step>

Please excuse these questions if they are stupid; I am not a DTD expert.
1. What is supposed to happen if a text document appears on port "source"?
2. If an HTML document appears on port "source", is the result type "xml" correct?

ndw commented Jun 19, 2024

A text document is allowed so that you could construct something like this:

<doc>
  &chap1;
  &chap2;
</doc>

where presumably the chap1 and chap2 entities are defined in the doctype. There's no way to get unexpanded entities into a parsed XDM, so you'd have to do it this way. I haven't thought very hard about how difficult it will be to make a text document that serializes correctly!
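
A small Python sketch of why the text route is needed: the entity references only resolve once the doctype text and the document text are joined and parsed together. The entity definitions below are made up for illustration, and ElementTree is a non-validating parser, so this shows entity expansion only, not DTD validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical doctype text, as it might arrive on the doctype port.
DOCTYPE = (
    '<!DOCTYPE doc [\n'
    '  <!ENTITY chap1 "<chap>One</chap>">\n'
    '  <!ENTITY chap2 "<chap>Two</chap>">\n'
    ']>\n'
)

# The text document from the comment above, with unexpanded references.
SOURCE_TEXT = "<doc>\n  &chap1;\n  &chap2;\n</doc>"


def chapter_texts(doctype, source_text):
    """Parse doctype + text together and list the expanded chapters."""
    doc = ET.fromstring(doctype + source_text)
    return [chap.text for chap in doc.findall("chap")]


def parses_alone(source_text):
    """Report whether the text parses without any entity declarations."""
    try:
        ET.fromstring(source_text)
    except ET.ParseError:
        return False
    return True
```

Here `parses_alone(SOURCE_TEXT)` is false because the entities are undeclared, while `chapter_texts(DOCTYPE, SOURCE_TEXT)` sees both chapters after expansion.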

DTD validation sort-of implies XML, so I think making the result always be XML makes sense. If you think it makes more sense to give a document with a root element of (X)HTML an HTML content type, I can see how that might make sense too.

@xml-project
Member

@ndw Thank you!


3 participants