[FO] Support parsing HTML #74

Closed
rhdunn opened this issue May 14, 2021 · 16 comments
Labels
Feature A change that introduces a new feature XQFO An issue related to Functions and Operators

Comments

@rhdunn
Contributor

rhdunn commented May 14, 2021

It is common for applications that use an XQuery database engine to want to parse HTML documents when adding content from HTML pages into a database, or in other applications like generating epub documents from HTML source files. Vendors like MarkLogic (xdmp:tidy via HTML Tidy for HTML4), BaseX (html:parse via TagSoup), Saxon (saxon:parse-html via TagSoup), and eXist-db (util:parse-html via Neko) have provided custom methods to support this.

Q: Should there also be functions to list the supported methods and character encodings?

fn:parse-html

Summary

Parses HTML-based input into an XML document.

Signature

fn:parse-html($input as union(xs:string, xs:hexBinary, xs:base64Binary),
              $options as map(*) := map { "method": "html5" }) as document-node()

Properties

This function is ·deterministic·, ·context-independent·, and ·focus-independent·.

Rules

The $options map conforms to record(method as union(enum("html5"), xs:string), encoding as xs:string?, *). A vendor may provide ·implementation-dependent· options that may vary between the different method values.

The method property of $options defines the approach used to convert the HTML document to XML. This specification supports html5 for using the HTML5 parsing rules for HTML content. The exact version of HTML5 used is ·implementation-dependent·.

The encoding property of $options defines the character encoding used to decode binary data. By default, this is an empty sequence. Implementations must support at least utf-8, utf8, ascii, and latin1. Other encoding values are ·implementation-dependent·, but it is recommended that the encodings documented in the WHATWG Encoding specification [3] are supported.

If $input is an xs:string, no character decoding is performed as the input is already decoded.

If $input is an xs:hexBinary or xs:base64Binary, the character encoding used to decode the binary data is determined as follows:

  1. if the binary data has a valid Unicode Byte Order Mark (BOM), the character encoding specified by that BOM is used;
  2. if encoding is specified in $options, that value is used;
  3. if prescanning the first 1024 bytes of data finds a character encoding declaration (using the rules from https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding), the detected encoding is used;
  4. if ·implementation-dependent· heuristics (in line with the HTML5 rules) detect a character encoding, that encoding is used;
  5. otherwise, the encoding is "utf-8".

If the detected character encoding name is not supported, an FO###### error is raised. Otherwise, the character encoding method associated with the character encoding is used.
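As an illustration of the precedence order above, here is a minimal Python sketch. The function name `resolve_encoding`, the use of `ValueError` in place of the unassigned FO###### error, and the regex-based prescan are all assumptions made for illustration; the regex is a deliberate simplification and not the WHATWG prescan algorithm, and step 4 (heuristics) is omitted.

```python
import codecs
import re
from typing import Optional

# BOMs recognised by the HTML encoding sniffing rules (UTF-8, UTF-16 LE/BE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]

# Simplified stand-in for the prescan: look for charset=... inside a <meta> tag.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def _check_supported(name: str) -> str:
    """Return the codec's canonical name, or raise if it is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Stand-in for the FO###### error in the proposal.
        raise ValueError(f"unsupported encoding: {name}")


def resolve_encoding(data: bytes, options: Optional[dict] = None) -> str:
    options = options or {}
    # 1. A BOM takes precedence over everything else.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. An explicit "encoding" option.
    declared = options.get("encoding")
    if declared is not None:
        return _check_supported(declared)
    # 3. Prescan the first 1024 bytes (simplified here).
    match = _META_CHARSET.search(data[:1024])
    if match:
        return _check_supported(match.group(1).decode("ascii"))
    # 4. Implementation-dependent heuristics would go here (omitted).
    # 5. Fall back to UTF-8.
    return "utf-8"
```

Note that the BOM wins even when an `encoding` option is supplied, which matches the ordering of steps 1 and 2.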

If the parsing method is not supported, an FO###### error is raised.

The $input is then parsed according to the specified parsing method, building an intermediate HTML Document object. The XML document-node is then constructed by mapping the HTML document, element, attribute, text, and comment nodes to their XML equivalents.

If a HTML document contains a template element, the contents of that element are added as children of the template element. It is ·implementation-dependent· whether or not a processor ignores this content when evaluating path expressions on these template elements, and how they are represented in any DOM interfaces.

Notes

The character encoding logic follows the https://html.spec.whatwg.org/multipage/parsing.html#encoding-sniffing-algorithm rules.

HTML does not support processing instructions. They are treated as comments in the HTML5 specification.

The HTML template element is complex as the HTML specification defines its content as being part of a separate document that is associated with the template contents property of that element, not its children. The WHATWG specification provides a non-normative guide for XSLT and XPath interacting with these elements (https://html.spec.whatwg.org/#template-XSLT-XPath).

A conforming implementation may choose to parse the HTML into a HTML-based data model (e.g. the HTML DOM) and return that instead of generating an XML infoset or PSVI. This is valid as long as the accessor functions (https://www.w3.org/TR/xpath-datamodel-31/#accessors) and the syntax that works with XML nodes also work for the HTML nodes. That is, expressions like $html/html/body/p instance of element(p) are supported.

Examples

The expression fn:parse-html("<html>") returns an empty html document constructed using the HTML5 document construction rules.

The expression fn:parse-html($html, encoding: "latin2") uses the latin2 character encoding to parse $html, or generates an FO###### error if the processor does not support that encoding.

The expression fn:parse-html($html, method: "html5", encoding: ()) is equivalent to fn:parse-html($html).

The expression fn:parse-html($html, method: "tidy") uses the tidy method (e.g. from the HTML Tidy application) to parse $html into an XML document if supported by the implementation. Otherwise an FO###### error is raised.

The expression fn:parse-html($html, method: "tagsoup", nons: true()) uses the tagsoup method (e.g. from the TagSoup application), passing the --nons option, to parse $html into an XML document if supported by the implementation. Otherwise an FO###### error is raised.

References

  1. HTML 5.2, W3C.
  2. HTML Living Standard, WHATWG.
  3. Encoding Living Standard, WHATWG.
@rhdunn rhdunn added XQFO An issue related to Functions and Operators Feature A change that introduces a new feature labels Sep 14, 2022
@ChristianGruen ChristianGruen added this to the QT 4.0 milestone Oct 14, 2022
@michaelhkay
Contributor

michaelhkay commented Oct 16, 2022

There's definitely a strong case for providing this function, but writing and agreeing a specification isn't going to be easy. Despite what the proposal suggests, the result of the function has to be an XDM node tree. We could give implementations some discretion on how HTML5 is mapped to XDM, or we could define very strict conformance rules (which would give better interoperability, but make it harder to reuse existing parsing libraries). Doing this probably requires a rather detailed technical knowledge of HTML5 (volunteers, please!). There may be an issue with features of HTML5 that don't map readily to anything in XDM - an example that has come up with SaxonJS is support for the Shadow DOM concept. (See also issue #75 regarding the HTML5 template element).

@ChristianGruen
Contributor

When looking at HTML, I quickly get other formats in mind as well (e.g. CSV).
EXPath/EXQuery standards could be an additional option, even if the platforms are not very active today.

@rhdunn
Contributor Author

rhdunn commented Oct 16, 2022

The shadow DOM is not used/accessible from parsing [1]. The shadow DOM and template element both use a DocumentFragment [2], but the template subtree is accessed via the content property on the template element DOM [3].

The key part as I note in the proposal is the handling of the template element [3], [4]. That is discussed in #75 for how we could handle that element correctly -- specifically, having a new "document fragment" node and a content:: axis for accessing the template's content element.

The HTML DOM [5] maps pretty well to the XDM. The difference is in the addition of the DocumentFragment node and the visibility of that in the template content property set during parsing, and the shadow DOM set during JavaScript evaluation. It should be possible to define a precise mapping for the HTML DOM nodes.

[1] https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM
[2] https://dom.spec.whatwg.org/#interface-documentfragment
[3] https://html.spec.whatwg.org/multipage/scripting.html#the-template-element
[4] https://html.spec.whatwg.org/multipage/xhtml.html#parsing-xhtml-documents
[5] https://dom.spec.whatwg.org/

@joewiz

joewiz commented Oct 16, 2022

Could a corresponding fn:html-doc() function be considered too—which takes a URL and applies the same set of tidying options described above in order to shoehorn the HTML document into a well-formed XHTML document?

@benibela

Or just overload fn:doc to return some cleaned-up HTML if the URL specifies a HTML document.

@joewiz

joewiz commented Oct 16, 2022

@benibela For that matter we could add an $options parameter to fn:doc() which takes entries analogous to those for fn:serialize(), e.g., "method": "html" - but we already have parse-xml, parse-xml-fragment, parse-json, etc.

@michaelhkay
Contributor

@joewiz @benibela Defining a function that parses HTML is the easy part; the challenge is specifying exactly what it returns. @rhdunn says "It should be possible to define a precise mapping for the HTML DOM nodes." but I'm not sure that's true without data model extensions. One can envisage data model extensions (e.g. allowing a node to have an arbitrary set of named properties, defined by a map), but any data model extensions end up being disruptive, because of the pervasive impact (e.g. new features in XQuery and XSLT to construct nodes that have such properties).

@rhdunn
Contributor Author

rhdunn commented Oct 17, 2022

I think there is a bit of confusion here. What I mean by "HTML DOM nodes" is the core part of the HTML DOM [1]. I.e.:

  1. Node interface -- Equivalent to node().
  2. Document interface -- Equivalent to document-node() covered by 6.1 Document Nodes of the XDM.
  3. DocumentType -- A DTD node, where the properties are mapped to the XDM document element.
  4. DocumentFragment -- The closest to the XDM here is a document-node(), but this would have slightly different properties (e.g. it could have multiple child elements). This is used by the template element in parsing to bind to a content property. The discussion in Support processing HTML 5 template element content #75 talks about options to support this (e.g. adding document-fragment to the XDM and providing a content:: axis).
  5. ShadowRoot -- These are document fragments. They are only available via JavaScript, so are not needed to parse a HTML document.
  6. Element -- Equivalent to element() covered by 6.2 Element Nodes of the XDM.
  7. Attr -- Equivalent to attribute() covered by 6.3 Attribute Nodes of the XDM.
  8. CharacterData -- This is a base interface for text, comment, and PI nodes, so does not need specific mapping.
  9. Text -- Equivalent to text() covered by 6.7 Text Nodes of the XDM.
  10. CDATASection -- Equivalent to text() covered by 6.7 Text Nodes of the XDM.
  11. ProcessingInstruction -- Equivalent to processing-instruction() covered by 6.5 Processing Instruction Nodes of the XDM.
  12. Comment -- Equivalent to comment() covered by 6.6 Comment Nodes of XDM.

So the work should be defining how those interfaces map to the XDM properties akin to the PSVI and infoset mappings. Ideally, that should be in a "Construction from a HTML DOM Core document" in the XDM, but I'd also be happy with documenting it in the FO spec.

What I'm not suggesting here is mapping all of the HTML element DOM properties to the XDM.

As such, what I'm proposing is the equivalent of:

  1. Parsing the HTML using the designated HTML parser (tidy, html5, tagsoup, etc.).
  2. Serializing that to XHTML.
  3. Parsing that as XML, making sure that HTML entities are correctly mapped if they were preserved in the resulting XHTML.

So in that sense, the template question is answered -- the elements within it are kept as children of the template element.

This is because 1) the HTML parsing algorithm uses tree construction to build a HTML element DOM, and 2) different libraries use their own internal DOM representations. As such, I want the specification to be loose enough to allow an implementor to:

  1. Use the mechanism I outline above by serializing to XHTML.
  2. Skip the XHTML serialization step and construct an XDM tree from the HTML tree.
  3. Return the HTML tree directly and have corresponding XDM accessor implementations for the HTML DOM nodes (e.g. dm:parent).
  4. Construct an XDM tree instead of a HTML tree in the tree construction part of the HTML parsing algorithm -- I think this should be doable. This would allow an implementor to implement their own HTML parser if they wished, instead of using a library that would construct the HTML document in the library-specific DOM tree.

So, for example:


dm:parent

The result of dm:parent($node) for a HTML DOM Node is determined as follows:

  1. If parentNode is null, the result is the empty sequence.
  2. If parentNode is not null, the result is an XDM mapping of the parentNode value.

dm:attributes

The result of dm:attributes($node) for a HTML DOM Node is determined as follows:

  1. If $node is not an instance of HTML DOM Element, an empty sequence is returned.
  2. If $node is an instance of HTML DOM Element, the attributes property is mapped to a sequence as follows:
    1. The length property of the NamedNodeMap is the number of items in the sequence.
    2. The item(n) method of the NamedNodeMap returns the nth element of the sequence, where the resulting Attr is mapped to the XDM properties.

[1] https://dom.spec.whatwg.org/

@ndw
Contributor

ndw commented Oct 17, 2022

My understanding is that the HTML5 parsing algorithm is deterministic and returns a tree. I think the bare minimum needed to declare victory here would be to say that the function returns that tree as an XDM. I think I could probably live with leaving the "template" elements in the tree and letting the XPath, XQuery, or XSLT that's going to process the XDM deal with them, but I'm not opposed in principle to trying to do better than that with templates.

@rhdunn
Contributor Author

rhdunn commented Oct 17, 2022

Here is a draft of mapping the HTML DOM to the XDM. This should ideally be a part of the XDM document, but I'm happy for it to be part of the FO spec.

This is complete as far as I can tell w.r.t. the HTML DOM specification.

Constructing XDM from HTML DOM

Constructing the XDM from a HTML DOM node follows the infoset mapping rules for the DOM nodes with a few differences to handle HTML specific differences in the DOM nodes.

Note:

The HTML DOM ShadowRoot node type is not supported here as that is not used by the HTML tree construction logic. It is only accessible via JavaScript.

dm:attributes

The result of dm:attributes($node) for a HTML DOM Node is as follows:

  1. If the node is an instance of HTML DOM Element then the result is the value of the attributes property mapped to a sequence as described below;
  2. Otherwise, an empty sequence is returned.

The resulting HTML DOM NamedNodeMap is mapped to a sequence as follows:

  1. length is the length of the sequence, where a length of 0 results in an empty sequence;
  2. item(n) is the nth element of the sequence.

That sequence is filtered as follows:

  1. If the namespaceURI property is "http://www.w3.org/2000/xmlns/", the attribute is not included in this sequence;
  2. If the localName property is xmlns, the attribute is not included in this sequence;
  3. If the localName property starts with xmlns:, the attribute is not included in this sequence;
  4. Otherwise, the attribute is included in this sequence using the XDM mapping rules described in this section.
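The filter above can be sketched in Python as follows. The `Attr` dataclass is a hypothetical stand-in for the HTML DOM interface (only the properties the rules mention are modelled), and the function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

XMLNS_NS = "http://www.w3.org/2000/xmlns/"


@dataclass
class Attr:
    """Hypothetical stand-in for a HTML DOM Attr."""
    local_name: str
    value: str
    namespace_uri: Optional[str] = None


def is_namespace_declaration(attr: Attr) -> bool:
    # Mirrors filter rules 1-3: any of these marks a namespace declaration.
    return (
        attr.namespace_uri == XMLNS_NS
        or attr.local_name == "xmlns"
        or attr.local_name.startswith("xmlns:")
    )


def dm_attributes(attrs: List[Attr]) -> List[Attr]:
    # Rule 4: everything that is not a namespace declaration is kept,
    # in the order given by the NamedNodeMap.
    return [a for a in attrs if not is_namespace_declaration(a)]
```

The dm:namespace-nodes draft in this same comment uses the complementary filter, so the same predicate could serve both accessors.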

dm:base-uri

The result of dm:base-uri($node) for a HTML DOM Node is the value of the baseURI property mapped as follows:

  1. If the value is null or an empty string, then the result is an empty sequence;
  2. Otherwise, the string value is cast to an xs:anyURI.

dm:children

The result of dm:children($node) for a HTML DOM Node is as follows:

  1. If the node is an instance of HTML DOM Document then the result is the value of the childNodes property mapped to a sequence;
  2. If the node is an instance of HTML DOM HTMLTemplateElement then the result is determined as follows:
    1. Select the DocumentFragment from the content property;
    2. Map the result of the DocumentFragment's childNodes property to a sequence;
  3. If the node is an instance of HTML DOM Element then the result is the value of the childNodes property mapped to a sequence;
  4. Otherwise, the result is an empty sequence.

The resulting HTML DOM NodeList is mapped to a sequence as follows:

  1. length is the length of the sequence, where a length of 0 results in an empty sequence;
  2. item(n) is the nth element of the sequence.

That sequence is filtered as follows:

  1. If the child is an instance of HTML DOM DocumentType, that child is not included in this sequence.
  2. Otherwise, the HTML DOM Node nodes are mapped to XDM according to the rules in this section.
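A rough Python sketch of these dm:children rules, assuming a hypothetical `Node` class that models only the node kind, name, child list, and the template `content` fragment:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, heavily simplified stand-in for a HTML DOM Node."""
    kind: str  # "document", "element", "doctype", "text", "comment", "fragment"
    name: str = ""
    child_nodes: List["Node"] = field(default_factory=list)
    content: Optional["Node"] = None  # DocumentFragment of a template element


def dm_children(node: Node) -> List[Node]:
    if node.kind == "document":
        candidates = node.child_nodes
    elif node.kind == "element" and node.name == "template" and node.content is not None:
        # Template contents are surfaced as ordinary children.
        candidates = node.content.child_nodes
    elif node.kind == "element":
        candidates = node.child_nodes
    else:
        return []
    # DocumentType nodes never appear among the XDM children.
    return [c for c in candidates if c.kind != "doctype"]
```

The template branch has to be tested before the general element branch, since a template element is also an element.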

Note:

This behaviour for the HTML template element is a willful violation of the HTML specification. This is intentional in order to preserve the XDM and XML document semantics. This undoes the corresponding willful violation of the XML specification in the HTML specification.

dm:document-uri

The result of dm:document-uri($node) for a HTML DOM Node is as follows:

  1. If the node is an instance of HTML DOM Document then the value of the documentURI property is mapped as follows:
    1. If the value is null or an empty string, then the result is an empty sequence;
    2. Otherwise, the string value is cast to an xs:anyURI;
  2. Otherwise, the result is an empty sequence.

dm:is-id

The result of dm:is-id($node) for a HTML DOM Node is as follows:

  1. If the node is an instance of HTML DOM Attr then:
    1. If the name property (its qualified name) is id the result is fn:true();
    2. Otherwise, the result is fn:false();
  2. Otherwise, the result is fn:false().

dm:is-idrefs

The result of dm:is-idrefs($node) for a HTML DOM Node returns an empty sequence.

dm:namespace-nodes

The result of dm:namespace-nodes($node) for a HTML DOM Node is as follows:

  1. If the node is an instance of HTML DOM Element then the result is the value of the attributes property mapped to a sequence as described below;
  2. Otherwise, an empty sequence is returned.

The resulting HTML DOM NamedNodeMap is mapped to a sequence as follows:

  1. length is the length of the sequence, where a length of 0 results in an empty sequence;
  2. item(n) is the nth element of the sequence.

That sequence is filtered as follows:

  1. If the namespaceURI property is "http://www.w3.org/2000/xmlns/", the attribute is included in this sequence using the XDM mapping rules described in this section;
  2. If the localName property is xmlns, the attribute is included in this sequence using the XDM mapping rules described in this section;
  3. If the localName property starts with xmlns:, the attribute is included in this sequence using the XDM mapping rules described in this section;
  4. Otherwise, the attribute is not included in this sequence.

dm:nilled

The result of dm:nilled($node) for a HTML DOM Node returns fn:false().

dm:node-kind

The result of dm:node-kind($node) for a HTML DOM Node is as follows:

  1. If the node is an instance of HTML DOM Document then the result is "document";
  2. If the node is an instance of HTML DOM Element then the result is "element";
  3. If the node is an instance of HTML DOM Attr then:
    1. If the namespaceURI property is "http://www.w3.org/2000/xmlns/", the result is "namespace";
    2. If the localName is xmlns then the result is "namespace";
    3. If the localName property starts with xmlns:, the result is "namespace";
    4. Otherwise, the result is "attribute";
  4. If the node is an instance of HTML DOM ProcessingInstruction then the result is "processing-instruction";
  5. If the node is an instance of HTML DOM Comment then the result is "comment";
  6. If the node is an instance of HTML DOM Text then the result is "text".

Note:

When parsing a document using the HTML parser instead of an XML parser, the localName and name properties of Attr nodes are both set to the qualified name.

dm:node-name

The result of dm:node-name($node) for a HTML DOM Node is as follows:

  1. If the node is an instance of HTML DOM Attr then:
    1. If the localName is xmlns then the value is an empty sequence;
    2. If the localName starts with xmlns: then:
      1. The local name is the value of the localName property after the xmlns: part;
      2. The namespace prefix is empty;
      3. The namespace URI is empty;
    3. If the localName property contains a :, then the QName properties are taken by parsing localName as a QName in the current element context;
    4. Otherwise:
      1. The local name is the value of the localName property;
      2. The namespace prefix is the value of the prefix property, or empty if the value is null;
      3. The namespace URI is the value of the namespaceURI property, or empty if the value is null;
  2. If the node is an instance of HTML DOM Element then:
    1. If the localName property contains a :, then the QName properties are taken by parsing localName as a QName in the current element context;
    2. Otherwise:
      1. The local name is the value of the localName property;
      2. The namespace prefix is the value of the prefix property, or empty if the value is null;
      3. The namespace URI is the value of the namespaceURI property, or "http://www.w3.org/1999/xhtml" if the value is null;
  3. If the node is an instance of HTML DOM ProcessingInstruction then:
    1. The local name is the value of the target property;
    2. The namespace prefix is empty;
    3. The namespace URI is empty;
  4. Otherwise, an empty sequence is returned.
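For the Element case, the rules could look roughly like this in Python. The `QName` tuple and the `in_scope` prefix map are hypothetical helpers, and resolving an unbound prefix to an empty URI is an assumption, since the draft does not say what should happen there.

```python
from typing import Dict, NamedTuple, Optional

XHTML_NS = "http://www.w3.org/1999/xhtml"


class QName(NamedTuple):
    prefix: str
    local: str
    uri: str


def element_node_name(
    local_name: str,
    prefix: Optional[str],
    namespace_uri: Optional[str],
    in_scope: Dict[str, str],
) -> QName:
    if ":" in local_name:
        # The HTML parser leaves "svg:rect" as the local name, so split it
        # and resolve the prefix against the in-scope namespaces.
        pfx, _, local = local_name.partition(":")
        # Assumption: an unbound prefix yields an empty namespace URI.
        return QName(pfx, local, in_scope.get(pfx, ""))
    return QName(
        prefix or "",
        local_name,
        namespace_uri if namespace_uri is not None else XHTML_NS,
    )
```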

Note:

When parsing a document using the HTML parser instead of an XML parser, the localName and name properties of Element and Attr nodes are both set to the qualified name.

Note:

The HTML Interactions with XPath and XSLT section defines this in terms of an implicit default element namespace declaration.

dm:parent

The result of dm:parent($node) for a HTML DOM Node is as follows:

  1. If the parentNode property is null, the result is an empty sequence.
  2. Otherwise, the result is the parentNode property mapped to XDM according to the rules in this section.

dm:string-value

The result of dm:string-value($node) for a HTML DOM Node is as follows:

  1. If the node is an instance of HTML DOM Document then the result is the XDM infoset mapping for document nodes;
  2. If the node is an instance of HTML DOM Element then the result is the XDM infoset mapping for element nodes;
  3. If the node is an instance of HTML DOM Attr then the result is the normalized value of the value property;
  4. If the node is an instance of HTML DOM ProcessingInstruction then the result is the value of the nodeValue property, or "" if that is null;
  5. If the node is an instance of HTML DOM Comment then the result is the value of the nodeValue property, or "" if that is null;
  6. If the node is an instance of HTML DOM Text then the result is the value of the nodeValue property, or "" if that is null.

dm:type-name

The result of dm:type-name($node) for a HTML DOM Node is as follows:

  1. If the node is an instance of HTML DOM Element then the result is xs:untyped;
  2. If the node is an instance of HTML DOM Attr then:
    1. If the namespaceURI property is "http://www.w3.org/2000/xmlns/", the result is an empty sequence;
    2. If the localName property starts with xmlns:, the result is an empty sequence;
    3. Otherwise, the result is xs:untypedAtomic;
  3. If the node is an instance of HTML DOM Text then the result is xs:untypedAtomic;
  4. Otherwise, an empty sequence is returned.

Note:

When parsing a document using the HTML parser instead of an XML parser, the localName and name properties of Attr nodes are both set to the qualified name.

dm:typed-value

The result of dm:typed-value($node) for a HTML DOM Node is as follows:

  1. If the node is an instance of HTML DOM Document then the result is dm:string-value($node) cast to xs:untypedAtomic;
  2. If the node is an instance of HTML DOM Element then the result is dm:string-value($node) cast to xs:untyped;
  3. If the node is an instance of HTML DOM Attr then:
    1. If the namespaceURI property is "http://www.w3.org/2000/xmlns/", the result is dm:string-value($node), preserving the xs:string type;
    2. If the localName property starts with xmlns:, the result is dm:string-value($node), preserving the xs:string type;
    3. If the localName property is xmlns, the result is dm:string-value($node), preserving the xs:string type;
    4. Otherwise, the result is dm:string-value($node) cast to xs:untypedAtomic;
  4. If the node is an instance of HTML DOM Text then the result is dm:string-value($node) cast to xs:untypedAtomic;
  5. Otherwise, the result is dm:string-value($node), preserving the xs:string type.

dm:unparsed-entity-public-id

  1. If the node is an instance of HTML DOM Document then locate the DocumentType child node in the childNodes property:
    1. If no child exists, the result is an empty sequence;
    2. If the DocumentType child's publicId property is an empty string, the result is an empty sequence;
    3. Otherwise, the result is the publicId property value.
  2. Otherwise, an empty sequence is returned.

dm:unparsed-entity-system-id

  1. If the node is an instance of HTML DOM Document then locate the DocumentType child node in the childNodes property:
    1. If no child exists, the result is an empty sequence;
    2. If the DocumentType child's systemId property is an empty string, the result is an empty sequence;
    3. Otherwise, the result is the systemId property value.
  2. Otherwise, an empty sequence is returned.

References

  1. https://dom.spec.whatwg.org/
  2. https://html.spec.whatwg.org/multipage/scripting.html#the-template-element
  3. https://html.spec.whatwg.org/multipage/infrastructure.html#interactions-with-xpath-and-xslt
  4. https://html.spec.whatwg.org/multipage/introduction.html#html-vs-xhtml

@michaelhkay
Contributor

Looks a pretty good start.

A couple of details:

  • There seems to be a need for a top-level description that says Element nodes convert to Element nodes, Attribute nodes convert either to Attribute or Namespace nodes, etc.

  • You can't derive document-uri() from base-uri(), sadly, because the contract for document-uri() says that doc(document-uri(X)) is X, which isn't possible if two different documents have the same base URI (which is often the case).

  • Need to map HTML DocumentFragment nodes to XDM Document nodes.

  • There are phrases like "If the node is an instance of HTML DOM Attr then" but I thought an HTML Attr was not a node.

  • I think your rules for namespaces end up with only the locally-declared namespaces on an element, not all the in-scope namespaces.

  • Special rules are probably needed for the parentNode property of attribute and namespace nodes.

  • I thought the HTML DOM couldn't contain processing instructions?

  • The mapping for unparsed-entity-system/public-id is wrong (it gives you the public ID and system ID of the DTD).

  • Does this account for all node types encountered? Are there any left over like entity nodes or CDATA nodes?

  • Is there ever a need to merge adjacent text nodes?

  • What happens if the DOM was created programmatically and has "inconsistencies" like undeclared namespaces? Are we defining a mapping only for a DOM constructed directly by a conformant HTML5 parser?

  • Do we need to say anything about case normalization of names, and default namespaces, or is that all implicit in the HTML5 parsing algorithm?

@rhdunn
Contributor Author

rhdunn commented Oct 17, 2022

There seems to be a need for a top-level description that says Element nodes convert to Element nodes, Attribute nodes convert either to Attribute or Namespace nodes, etc.

How about the following wording:

HTML DOM nodes (as defined in the DOM Standard) map to XDM nodes as follows:

  1. HTML DOM Document nodes map to XDM Document Nodes.
  2. HTML DOM Element nodes map to XDM Element Nodes.
  3. HTML DOM Attr nodes map to XDM Attribute Nodes.
  4. HTML DOM Text and CDATASection nodes map to XDM Text Nodes. The CDATASection interface is a subclass of Text, so the rules below only mention Text nodes.
  5. HTML DOM ProcessingInstruction nodes map to XDM Processing Instruction Nodes.
  6. HTML DOM Comment nodes map to XDM Comment Nodes.

Note:

The following HTML DOM nodes do not have an equivalent XDM mapping:

  1. HTML DOM DocumentType nodes -- the properties of this are mapped to XDM Document Node properties.
  2. HTML DOM DocumentFragment nodes -- these are handled specifically for the template element, so don't appear in the XDM result set.
  3. HTML DOM ShadowRoot nodes -- these should not appear in a parsed HTML document as they are created via JavaScript.

Constructing the XDM from a HTML DOM node follows the infoset mapping rules for the DOM nodes with a few differences to handle HTML specific differences in the DOM nodes. The specifics for the mapping are detailed in the section below.

[... section as my previous comment ...]


You can't derive document-uri() from base-uri(), sadly, because the contract for document-uri() says that doc(document-uri(X)) is X, which isn't possible if two different documents have the same base URI (which is often the case).

I'm not doing that. I'm mapping documentURI to dm:document-uri and baseURI to dm:base-uri. That should be correct and will work in the case you specify.


Need to map HTML DocumentFragment nodes to XDM Document nodes.

Is this needed? I explicitly avoid doing this to prevent modifying the XDM model. -- This is because a HTML DocumentFragment could have multiple element nodes, which breaks XDM document node constraints.


There are phrases like "If the node is an instance of HTML DOM Attr then" but I thought an HTML Attr was not a node.

If you look at the referenced HTML DOM specification (https://dom.spec.whatwg.org/#interface-attr for Attr), it is defined as:

interface Attr : Node

I think your rules for namespaces end up with only the locally-declared namespaces on an element, not all the in-scope namespaces.

Ok. We will need to be careful to constrain it to the in-scope namespaces of the document, to prevent namespaces defined in XQuery and XSLT from being used.


Special rules are probably needed for the parentNode property of attribute and namespace nodes.

I think that shouldn't be needed as IIUC, in both cases those are the element in which the attribute or namespace was defined (when thinking in the context of the HTML DOM).


I thought the HTML DOM couldn't contain processing instructions?

They are defined in https://dom.spec.whatwg.org/#interface-processinginstruction of the HTML DOM specification.

The HTML spec does not support them when using the HTML parser (for text/html documents -- see https://html.spec.whatwg.org/multipage/parsing.html#parse-error-unexpected-question-mark-instead-of-tag-name) but they are present when using an XML based parser (i.e. for application/xhtml+xml documents).


The mapping for unparsed-entity-system/public-id is wrong (it gives you the public ID and system ID of the DTD).

Then I don't understand what these mean/should be from reading the XDM specification.

From https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode, these come from the DOCTYPE element. For example, given:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

a DocumentType node will be created with publicId set to -//W3C//DTD HTML 4.01 Transitional//EN. -- Is that not what dm:unparsed-entity-public-id returns?
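
To make the distinction in the question above concrete, here is a minimal sketch. Python and xml.dom.minidom are used purely for illustration and are not part of the proposal; the entity name logo and its identifiers are made up. It contrasts the public ID carried by the DOCTYPE with the public ID of an unparsed (NDATA) entity declared in the internal subset, which is what the XDM dm:unparsed-entity-public-id accessor is defined over:

```python
from xml.dom import minidom

# Hypothetical XHTML document: the DOCTYPE carries one public ID, and the
# internal subset declares an unparsed (NDATA) entity with a different one.
SOURCE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" [
  <!NOTATION png SYSTEM "image/png">
  <!ENTITY logo PUBLIC "-//EXAMPLE//logo//EN" "logo.png" NDATA png>
]>
<html xmlns="http://www.w3.org/1999/xhtml"/>"""

doc = minidom.parseString(SOURCE)
doctype = doc.doctype

# Unparsed entities are exposed on the DocumentType node, keyed by name.
logo = doctype.entities.getNamedItem("logo")

print("DOCTYPE public ID:", doctype.publicId)
print("entity public ID: ", logo.publicId)
print("entity system ID: ", logo.systemId)
```

A text/html document has no internal subset, so under this reading a DOM built by the HTML parser would have no unparsed entities to report, only the DOCTYPE identifiers.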


Does this account for all node types encountered? Are there any left over like entity nodes or CDATA nodes?

With the exception of DocumentFragment and ShadowRoot (which is a DocumentFragment), yes. You can check the referenced HTML DOM specification and my extracted list above.


Is there ever a need to merge adjacent text nodes?

Yes. The HTML DOM separates Text and CDATASection nodes, so those would need merging. I'm not sure about other cases.
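
As a rough sketch of the merging meant here (Python's xml.dom.minidom stands in for the HTML DOM, and the helper name merged_children is hypothetical), adjacent Text and CDATASection nodes can be collapsed into a single string, which is the shape an XDM text node would need:

```python
from xml.dom import minidom, Node

def merged_children(element):
    """Return the element's children, with each run of adjacent Text and
    CDATASection nodes collapsed into one plain string."""
    result = []
    for child in element.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            if result and isinstance(result[-1], str):
                result[-1] += child.data      # extend the current text run
            else:
                result.append(child.data)     # start a new text run
        else:
            result.append(child)              # elements, comments, etc. pass through
    return result

doc = minidom.parseString("<p>one<![CDATA[ & two ]]>three<b/>four</p>")
merged = merged_children(doc.documentElement)
print([m if isinstance(m, str) else m.nodeName for m in merged])
```

The DOM's own normalize() only merges Text nodes with each other, which is why the CDATASection case needs explicit handling.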


What happens if the DOM was created programmatically and has "inconsistencies" like undeclared namespaces? Are we defining a mapping only for a DOM constructed directly by a conformant HTML5 parser?

I would suggest for this to limit it to the results of a conformant HTML5 parser. That allows us to avoid dealing with DocumentFragment and ShadowRoot in the XDM.

Additionally, the HTML spec has this note in the definition of the template element:

It is also possible, as a result of DOM manipulation, for a template element to contain Text nodes and element nodes; however, having any is a violation of the template element's content model, since its content model is defined as nothing.


Do we need to say anything about case normalization of names, and default namespaces, or is that all implicit in the HTML5 parsing algorithm?

IIUC, HTML does not handle case normalization of names, so that would need to be specified here.

For default namespaces, AFAICT the HTML specification only mentions the "http://www.w3.org/1999/xhtml" namespace (reference [3]), which I handle in the dm:node-name mapping.

@rhdunn
Contributor Author

rhdunn commented Nov 18, 2022

You can't derive document-uri() from base-uri(), sadly, because the contract for document-uri() says that doc(document-uri(X)) is X, which isn't possible if two different documents have the same base URI (which is often the case).

I see now. That was a typo in the dm:base-uri section. It should be the implementation of dm:base-uri($node), not dm:document-uri($node).

@pgfearo

pgfearo commented Nov 22, 2022

A conforming HTML5 parser generates a DOM that fixes tree-structure problems in a prescribed way, for example by hoisting a tbody element (and its following siblings) to the right level when it is nested within another tbody element within a table element.

Should this 'tree-structure fixup' be performed by the parser function? Or, perhaps there should be a secondary function for this?


Here's an XSpec scenario for the XSLT we use to fix the structure of an HTML table element:

    <x:scenario label="mix of tbody and tr hoisting within added wrapper elements">
      <x:context>
        <table>
          <td id="1">alone</td>
          <tbody id="2">
            <tbody id="3"><td id="4">first</td></tbody>
            <tr id="5">
              <tr id="6"><td id="7">nested-tr</td></tr>
              <td id="8">one<tr id="8.5"><td id="9">two</td></tr></td>
            </tr>
            <td id="10">three</td>
            <tbody id="11"><td id="12">four</td></tbody>
          </tbody>
        </table> 
      </x:context>
      <x:expect label="hoist all children out of tbody element">
        <table>
          <tbody>
            <tr><td id="1">alone</td></tr>
          </tbody>
          <tbody id="2">
          </tbody>
          <tbody id="3">
            <tr><td id="4">first</td></tr>
          </tbody>
          <tbody>
            <tr id="5"></tr>
            <tr id="6"><td id="7">nested-tr</td></tr>
            <tr><td id="8">one</td></tr>
            <tr id="8.5"><td id="9">two</td></tr>
            <tr><td id="10">three</td></tr>
          </tbody>
          <tbody id="11">
            <tr><td id="12">four</td></tr>
          </tbody>
        </table>
      </x:expect>    
    </x:scenario>

@rhdunn
Contributor Author

rhdunn commented Nov 22, 2022

Yes, the tree fix-up should be done by the parse-html function for HTML 5. For HTML 4 and earlier, that behaviour is not defined, so in that case it will be implementation-dependent. For XHTML it shouldn't do any tree fix-up.
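
To give a feel for what such a fix-up involves, here is a deliberately small sketch in Python (illustration only; the helper name wrap_bare_cells is hypothetical). It covers a single rule, wrapping runs of bare td/th children of a tbody in a new tr, and is not the HTML5 tree construction algorithm, which a conforming parser applies while parsing rather than as a post-processing pass:

```python
import xml.etree.ElementTree as ET

def wrap_bare_cells(tbody):
    """Wrap each run of consecutive td/th children of tbody in a new tr,
    leaving existing tr children untouched."""
    new_children = []
    current_row = None
    for child in list(tbody):
        if child.tag in ("td", "th"):
            if current_row is None:
                current_row = ET.Element("tr")   # open an implied row
                new_children.append(current_row)
            current_row.append(child)
        else:
            current_row = None                   # any other child ends the implied row
            new_children.append(child)
    tbody[:] = new_children

tbody = ET.fromstring(
    "<tbody><td>a</td><td>b</td><tr><td>c</td></tr><td>d</td></tbody>"
)
wrap_bare_cells(tbody)
fixed = ET.tostring(tbody, encoding="unicode")
print(fixed)
```

Consecutive bare cells end up in the same implied row, while a cell that follows an explicit tr starts a new one.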

rhdunn added a commit that referenced this issue Jan 15, 2023
Issue #74 - add the fn:parse-html function
@michaelhkay
Contributor

We accepted the parse-html() function into the spec, so this can now be closed.

7 participants