[FO] Support parsing HTML #74
There's definitely a strong case for providing this function, but writing and agreeing a specification isn't going to be easy. Despite what the proposal suggests, the result of the function has to be an XDM node tree. We could give implementations some discretion on how HTML5 is mapped to XDM, or we could define very strict conformance rules (which would give better interoperability, but make it harder to reuse existing parsing libraries). Doing this probably requires a rather detailed technical knowledge of HTML5 (volunteers, please!). There may be an issue with features of HTML5 that don't map readily to anything in XDM - an example that has come up with SaxonJS is support for the Shadow DOM concept. (See also issue #75 regarding the HTML5 template element). |
When looking at HTML, I quickly get other formats in mind as well (e.g. CSV). |
The shadow DOM is not used/accessible from parsing [1]. The shadow DOM and template element both use a DocumentFragment [2], but the template subtree is accessed via the
The key part as I note in the proposal is the handling of the template element [3], [4]. That is discussed in #75 for how we could handle that element correctly -- specifically, having a new "document fragment" node and a
The HTML DOM [5] maps pretty well to the XDM. The difference is in the addition of the DocumentFragment node and the visibility of that in the template content property set during parsing, and the shadow DOM set during JavaScript evaluation. It should be possible to define a precise mapping for the HTML DOM nodes.
[1] https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM |
Could a corresponding |
Or just overload |
@benibela For that matter we could add an |
@joewiz @benibela Defining a function that parses HTML is the easy part; the challenge is specifying exactly what it returns. @rhdunn says "It should be possible to define a precise mapping for the HTML DOM nodes." but I'm not sure that's true without data model extensions. One can envisage data model extensions (e.g. allowing a node to have an arbitrary set of named properties, defined by a map), but any data model extensions end up being disruptive, because of the pervasive impact (e.g. new features in XQuery and XSLT to construct nodes that have such properties). |
I think there is a bit of confusion here. What I mean by "HTML DOM nodes" is the core part of the HTML DOM [1]. I.e.:
So the work should be defining how those interfaces map to the XDM properties akin to the PSVI and infoset mappings. Ideally, that should be in a "Construction from a HTML DOM Core document" in the XDM, but I'd also be happy with documenting it in the FO spec. What I'm not suggesting here is mapping all of the HTML element DOM properties to the XDM. As such, what I'm proposing is the equivalent of:
So in that sense, the This is because 1) the HTML parsing algorithm uses tree construction to build a HTML element DOM, and 2) different libraries use their own internal DOM representations. As such, I want the specification to be loose enough to allow an implementor to:
So, for example:
dm:parent
The result of
dm:attributes
The result of
|
My understanding is that the HTML5 parsing algorithm is deterministic and returns a tree. I think the bare minimum needed to declare victory here would be to say that the function returns that tree as an XDM. I think I could probably live with leaving the "template" elements in the tree and letting the XPath, XQuery, or XSLT that's going to process the XDM deal with them, but I'm not opposed in principle to trying to do better than that with templates. |
Here is a draft of mapping the HTML DOM to the XDM. This should ideally be a part of the XDM document, but I'm happy for it to be part of the FO spec. This is complete as far as I can tell w.r.t. the HTML DOM specification.
Constructing XDM from HTML DOM
Constructing the XDM from a HTML DOM node follows the infoset mapping rules for the DOM nodes with a few differences to handle HTML specific differences in the DOM nodes.
dm:attributes
The result of
The resulting HTML DOM
That sequence is filtered as follows:
dm:base-uri
The result of
dm:children
The result of
The resulting HTML DOM
That sequence is filtered as follows:
dm:document-uri
The result of
dm:is-id
The result of
dm:is-idrefs
The result of
dm:namespace-nodes
The result of
The resulting HTML DOM
That sequence is filtered as follows:
dm:nilled
The result of
dm:node-kind
The result of
dm:node-name
The result of
dm:parent
The result of
dm:string-value
The result of
dm:type-name
The result of
dm:typed-value
The result of
dm:unparsed-entity-public-id
dm:unparsed-entity-system-id
References |
Looks a pretty good start. A couple of details:
|
How about the following wording: HTML DOM nodes (as defined in the DOM Standard) map to XDM nodes as follows:
Constructing the XDM from a HTML DOM node follows the infoset mapping rules for the DOM nodes with a few differences to handle HTML specific differences in the DOM nodes. The specifics for the mapping are detailed in the section below. [... section as my previous comment ...]
I'm not doing that. I'm mapping
Is this needed? I explicitly avoid doing this to prevent modifying the XDM model. -- This is because a HTML
If you look at the referenced HTML DOM specification (https://dom.spec.whatwg.org/#interface-attr for Attr), it is defined as:
Ok. We will need to be careful to constrain it to the in-scope namespaces of the document, to prevent namespaces defined in XQuery and XSLT from being used.
I think that shouldn't be needed as IIUC, in both cases those are the element in which the attribute or namespace was defined (when thinking in the context of the HTML DOM).
They are defined in https://dom.spec.whatwg.org/#interface-processinginstruction of the HTML DOM specification. The HTML spec does not support them when using the HTML parser (for
Then I don't understand what these mean/should be from reading the XDM specification. From https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode, these come from the DOCTYPE element. For example, given:
a
With the exception of
Yes. The HTML DOM separates Text and CDATASection nodes, so those would need merging. I'm not sure about other cases.
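To make the merging concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `Node` class and the `merge_text_nodes` helper are hypothetical names invented for this example, not part of any proposal or library. It treats CDATASection nodes as text, joins adjacent text siblings, and drops empty text nodes, since the XDM allows neither adjacent text siblings nor zero-length text nodes under a parent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM child node (not a real DOM API)."""
    kind: str  # "text", "cdata", "comment", "element", ...
    data: str


def merge_text_nodes(children: List[Node]) -> List[Node]:
    """Merge adjacent Text/CDATASection siblings into single text nodes."""
    merged: List[Node] = []
    for node in children:
        if node.kind in ("text", "cdata"):
            if merged and merged[-1].kind == "text":
                # Extend the previous text node instead of adding a sibling.
                merged[-1] = Node("text", merged[-1].data + node.data)
            else:
                # CDATA sections become ordinary text nodes.
                merged.append(Node("text", node.data))
        else:
            merged.append(node)
    # Drop text nodes that are still empty after merging.
    return [n for n in merged if not (n.kind == "text" and n.data == "")]
```

Other node kinds pass through unchanged, so a comment between two text nodes keeps them separate.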
I would suggest for this to limit it to the results of a conformant HTML5 parser. That allows us to avoid dealing with Additionally, the HTML spec has this note in the definition of the
IIUC, HTML does not handle case normalization of names, so that would need to be specified here. For default namespaces, the HTML specification only mentions the |
I see now. That was a typo in the |
A conforming HTML5 parser generates a DOM that fixes tree-structure problems in a prescribed way. For example, by hoisting a Should this 'tree-structure fixup' be performed by the parser function? Or, perhaps there should be a secondary function for this? Here's an XSpec scenario for XSLT we use that fixes the structure of an HTML table element:
|
Yes, the tree fix-up should be done by the parse-html function for HTML 5. For HTML 4 and earlier, that behaviour is not defined, so in that case will be implementation dependent. For XHTML it shouldn't do any tree fix-up. |
Issue #74 - add the fn:parse-html function
We accepted the parse-html() function into the spec, so this can now be closed. |
It is common for applications that use an XQuery database engine to want to parse HTML documents when adding content from HTML pages into a database, or in other applications like generating epub documents from HTML source files. Vendors like MarkLogic (xdmp:tidy via HTML Tidy for HTML4), BaseX (html:parse via TagSoup), Saxon (saxon:parse-html via TagSoup), and eXist-db (util:parse-html via Neko) have provided custom methods to support this.
Q: Should there also be functions to list the supported methods and character encodings?
fn:parse-html
Summary
Parses HTML-based input into an XML document.
Signature
Properties
This function is ·deterministic·, ·context-independent·, and ·focus-independent·.
Rules
The $options map conforms to record(method as union(enum("html5"), xs:string), encoding as xs:string?, *). A vendor may provide ·implementation-dependent· options that may vary between the different method values.
The method property of $options defines the approach used to convert the HTML document to XML. This specification supports html5 for using the HTML5 parsing rules for HTML content. The exact version of HTML5 used is ·implementation-dependent·.
The encoding property of $options defines the character encoding used to decode binary data. By default, this is an empty sequence. Implementations must support at least utf-8, utf8, ascii, and latin1. Other encoding values are ·implementation-dependent·, but it is recommended that the encodings documented in the WHATWG Encoding specification [3] are supported.
If $input is an xs:string, no character decoding is performed as the input is already decoded.
If $input is an xs:hexBinary or xs:base64Binary, the character encoding used to decode the binary data is determined as follows:
If encoding is specified in $options, that value is used;
If the detected character encoding name is not supported, an FO###### error is raised. Otherwise, the character encoding method associated with the character encoding is used.
If the parsing method is not supported, an FO###### error is raised.
The $input is then parsed according to the specified parsing method, building an intermediate HTML Document object. The XML document-node is then constructed by mapping the HTML document, element, attribute, text, and comment nodes to their XML equivalents.
If a HTML document contains a template element, the contents of that element are added as children of the template element. It is ·implementation-dependent· whether or not a processor ignores this content when evaluating path expressions on these template elements, and how they are represented in any DOM interfaces.
Notes
Examples
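The following is not a normative example, only a rough Python sketch of the decoding rules as far as this proposal states them. The function name `decode_input` is hypothetical, Python `bytes` stands in for xs:hexBinary / xs:base64Binary, Python's codec registry stands in for the set of supported encodings, and a `ValueError` stands in for the unassigned FO###### error. The steps that apply when no encoding option is given are not shown in this text, so the sketch deliberately leaves them unimplemented.

```python
import codecs
from typing import Optional, Union


def decode_input(data: Union[str, bytes], encoding: Optional[str] = None) -> str:
    """Return the character data to hand to the HTML parser (sketch only)."""
    if isinstance(data, str):
        # String input is already decoded, so no character decoding happens.
        return data
    if encoding is None:
        # The detection steps for this case are not given here.
        raise NotImplementedError("encoding detection is outside this sketch")
    try:
        # Assumption: an encoding is "supported" if Python has a codec for it.
        codecs.lookup(encoding)
    except LookupError:
        # Stand-in for the FO###### "unsupported encoding" error.
        raise ValueError(f"unsupported character encoding: {encoding}")
    return data.decode(encoding)
```

The four names the proposal requires (utf-8, utf8, ascii, latin1) are all recognised by Python's codec registry, so a call such as `decode_input(raw_bytes, "latin1")` takes the explicit-option path.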
References