Default namespace for elements; especially in the context of HTML #296

Open
michaelhkay opened this issue Dec 22, 2022 · 9 comments
Labels
Enhancement A change or improvement to an existing feature PR Pending A PR has been raised to resolve this issue XPath An issue related to XPath

Comments

@michaelhkay
Contributor

michaelhkay commented Dec 22, 2022

There can be little doubt that the fact that an unprefixed name in XPath fails to select an unprefixed element in the source document (when that element is in a default namespace) is one of the major gotchas, causing massive bewilderment to all newbie users.

The XPath 2.0 solution of using a default element namespace in the static context is a partial solution; its main drawback is that it doesn't help the newbies who didn't know about the problem or its solution.
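To make the gotcha concrete, here is a minimal sketch using Python's `xml.etree.ElementTree` instead of a full XPath processor. That choice is an assumption made purely for illustration: ElementTree implements only a small subset of XPath. The sample document is made up. The unprefixed path selects nothing, while a default namespace supplied by the caller (the `''` key in the namespace map, available from Python 3.8) plays roughly the role of the XPath 2.0 default element namespace.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Illustrative document: every element is in the XHTML namespace
# because of the default namespace declaration on the root.
doc = ET.fromstring(f'<html xmlns="{XHTML}"><body><p>Hello</p></body></html>')

# Unprefixed names are read as "no namespace", so nothing is selected.
unprefixed = doc.findall("body/p")

# The caller supplies a default namespace up front (Python 3.8+),
# loosely analogous to the static default element namespace.
with_default = doc.findall("body/p", {"": XHTML})

print(len(unprefixed), len(with_default))
```

The second call only works because the caller already knew about the problem and its fix, which is exactly the drawback described above.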

The HTML "living standard" introduces a "wilful violation" of the XPath 1.0 spec to address the issue. Given that most elements in an HTML DOM will be in the XHTML namespace, it states:

If the QName has no prefix and the principal node type of the axis is element, then the default element namespace is used. Otherwise if the QName has no prefix, the namespace URI is null. The default element namespace is a member of the context for the XPath expression. The value of the default element namespace when executing an XPath expression through the DOM3 XPath API is determined in the following way:

If the context node is from an HTML DOM, the default element namespace is "http://www.w3.org/1999/xhtml".
Otherwise, the default element namespace URI is null.
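Read literally, the quoted rule can be sketched as a small resolution function. This is only an interpretation of the quoted text, not code from any browser; the function and parameter names (`name_test_namespace`, `context_is_html_dom`, and so on) are hypothetical.

```python
from typing import Dict, Optional

XHTML = "http://www.w3.org/1999/xhtml"


def default_element_namespace(context_is_html_dom: bool) -> Optional[str]:
    """Default element namespace, decided from the context node's document."""
    return XHTML if context_is_html_dom else None


def name_test_namespace(
    prefix: Optional[str],
    principal_is_element: bool,
    context_is_html_dom: bool,
    prefixes: Dict[str, str],
) -> Optional[str]:
    """Namespace URI that a QName in a name test resolves to (None = null)."""
    if prefix is not None:
        # Prefixed names are resolved through the in-scope prefix bindings.
        return prefixes[prefix]
    if principal_is_element:
        # Unprefixed name on an element axis: use the default element namespace.
        return default_element_namespace(context_is_html_dom)
    # Unprefixed name on any other axis (for example attributes): null.
    return None
```

Note that the outcome depends on `context_is_html_dom`, a property of the data being queried rather than of the expression.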

It then adds a note which is blatantly untrue:

This is equivalent to adding the default element namespace feature of XPath 2.0 to XPath 1.0, and using the HTML namespace as the default element namespace for HTML documents. It is motivated by the desire to have implementations be compatible with legacy HTML content while still supporting the changes that this specification introduces to HTML regarding the namespace used for HTML elements, and by the desire to use XPath 1.0 rather than XPath 2.0.

Since the XPath 2.0 facility picks up the default namespace from the static context, while the HTML "wilful violation" picks it up dynamically from a property of the context node (namely "being from an HTML DOM") there is no way these can be considered equivalent.

(Note also, there's a significant ambiguity in the "wilful violation" rules: what exactly is the "context node" that determines this behaviour? I think they're suggesting it is the context node at the point of XPath API invocation, not the context node for the specific axis step. This makes it rather unclear how the rule is supposed to apply to XSLT. And: if an XSLT stylesheet creates a temporary tree with nodes in the XHTML namespace, do we consider those nodes as being "from an HTML DOM"?)

Nevertheless, the intent of the "violation" is worthy, and it would be nice if we could find a solution to this problem that works both for HTML and for other vocabularies.

Our current proposal for fn:parse-html is that HTML elements should go in the XHTML namespace and this means that users familiar with XPath 1.0 implementations in the browser will trip over this problem. A lot.

@michaelhkay
Contributor Author

A couple of possible ways forward:

(a) we define a mode of operation in which the interpretation of unprefixed element names in paths is decided dynamically. Specifically, when this mode of operation is in force, an unprefixed element name matches an element if the element is in the same namespace as the outermost element of the containing document. This is influenced by the HTML "violation" where the default namespace depends on what document you are processing. (But there's potentially a difficulty here with element names used other than in axis steps, for example element names used in types).
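A rough sketch of option (a), again using Python's `xml.etree.ElementTree` as a stand-in for an XPath processor (an assumption for illustration only). The helper names `namespace_of` and `find_dynamic` are hypothetical, and the two sample documents are made up. Unprefixed names are resolved against whatever namespace the outermost element is in, so the same path works on both documents.

```python
import xml.etree.ElementTree as ET


def namespace_of(tag: str) -> str:
    """Namespace URI part of an ElementTree tag ('{uri}local'), or ''."""
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def find_dynamic(root: ET.Element, path: str) -> list:
    """Resolve unprefixed names against the outermost element's namespace."""
    ns = namespace_of(root.tag)
    # The '' key sets a default namespace for the path (Python 3.8+).
    return root.findall(path, {"": ns} if ns else None)


xhtml_doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><p/></body></html>'
)
plain_doc = ET.fromstring("<html><body><p/></body></html>")

xhtml_hits = find_dynamic(xhtml_doc, "body/p")
plain_hits = find_dynamic(plain_doc, "body/p")
print(len(xhtml_hits), len(plain_hits))
```

The sketch covers path steps only; it says nothing about element names used in types, which is the difficulty noted above.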

(b) we define a mode of operation in which unprefixed element names in axis steps match on local name only. This is something of a radical departure, but I introduced it with the Saxon Gizmo tool which is designed for interactive (and therefore informal) use, and it works very well in that environment. It basically says, if you care about the namespace, use a prefix, and if you don't, just use the local name. I think that probably meets many users' expectations. Even in the rare cases where the same local name is used with multiple namespaces, they often have a semantic relationship, and there's no harm in "//title" selecting any of them. In particular it's better to over-select than to under-select, because the former problem is much easier to diagnose and correct.
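For illustration, option (b) behaves much like the `{*}name` wildcard that Python's `xml.etree.ElementTree` accepts from version 3.8. This is an analogy, not how Gizmo is implemented, and the Atom/Dublin Core sample document is made up for the example.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Illustrative document with "title" in two related namespaces.
doc = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom" '
    f'xmlns:dc="{DC}">'
    "<title>Feed</title>"
    "<entry><dc:title>Entry</dc:title></entry>"
    "</feed>"
)

# Local-name-only matching: "title" in any namespace or none.
any_title = doc.findall(".//{*}title")

# If the namespace matters, use a prefix to narrow the selection.
dc_title = doc.findall(".//dc:title", {"dc": DC})

print([t.text for t in any_title], [t.text for t in dc_title])
```

Here the local-name match over-selects by returning both titles, and narrowing it down with a prefix is a one-line fix.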

@michaelhkay
Contributor Author

Note that in the current drafts I have already made the change that allows the default element namespace to differ from the default type namespace. (There was really no need for them ever to be coupled, but decoupling them creates some backwards compatibility issues that the draft spec addresses.)

Allowing a setting of "any" for the default element namespace is not difficult. I have only found one place that needs special attention: a schema element test schema-element(X) can only match a single element declaration in the schema, so we have to say that in this context, an unprefixed name would refer to a no-namespace name. In all other contexts, interpreting an unprefixed name as *:name seems to work fine.

@benibela

That would help a lot to work with HTML

(a) might be confusing if the outermost element has a prefix. Or if the namespace is redefined, <r><a xmlns="A"> <b xmlns="B"/></a></r> / a / b should perhaps return <b>. And it might be inefficient to search for namespaces in the document.

(b) is then easier to use
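The redefined-namespace example can be tried out with Python's `xml.etree.ElementTree` (a stand-in chosen for illustration; Python 3.8+ is needed for the `{*}` wildcard). Under (a) the outermost element `r` is in no namespace, so unprefixed names would resolve to no namespace; under (b) they match on local name only.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<r><a xmlns="A"> <b xmlns="B"/></a></r>')

# (a): r is in no namespace, so "a" and "b" are looked up in no namespace.
option_a = doc.findall("a/b")

# (b): match on local name only, whatever the namespace.
option_b = doc.findall("{*}a/{*}b")

print(len(option_a), [e.tag for e in option_b])
```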

A third option could be prefix-only matching, where it checks the prefix but ignores the namespace URI. /html would match <html/> and <html xmlns="http://www.w3.org/1999/xhtml"/> and <html xmlns="foo"/> but not <html:html xmlns:html="http://www.w3.org/1999/xhtml"/>
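A small sketch of what this third option would mean, using Python's `xml.dom.minidom` because it keeps the lexical name of each element. The helper `matches_by_lexical_name` is hypothetical and only checks the root element.

```python
from xml.dom import minidom


def matches_by_lexical_name(xml_text: str, name: str) -> bool:
    """True if the root element's name as written (prefix included) equals name."""
    root = minidom.parseString(xml_text).documentElement
    # tagName is the qualified name as written, e.g. "html" or "html:html";
    # the namespace URI is deliberately ignored.
    return root.tagName == name


samples = [
    "<html/>",
    '<html xmlns="http://www.w3.org/1999/xhtml"/>',
    '<html xmlns="foo"/>',
    '<html:html xmlns:html="http://www.w3.org/1999/xhtml"/>',
]
results = [matches_by_lexical_name(s, "html") for s in samples]
print(results)
```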

@michaelhkay
Contributor Author

The idea of treating prefixes as significant goes against the grain simply because we've spent so many years educating people to treat the choice of prefix as insignificant; it would cause great confusion to reverse that. I'm not going to defend the orthodox wisdom because I've always been highly critical of the way namespaces are done, but we need to be very careful to avoid making matters worse.

@ChristianGruen ChristianGruen added XPath An issue related to XPath Enhancement A change or improvement to an existing feature labels Jan 26, 2023
@rhdunn
Contributor

rhdunn commented Oct 24, 2023

Note that the spec changes to separate the element and type namespaces have been reverted following committee review of the behaviour. They would need reinstating or revising to work with the HTML/browser-like XPath matching rules that ignore the element namespace.

@gsnedders

The HTML "living standard" introduces a "wilful violation" of the XPath 1.0 spec to address the issue.

(This is https://html.spec.whatwg.org/#interactions-with-xpath-and-xslt, to give the link here)

@michaelhkay
Contributor Author

michaelhkay commented Oct 24, 2023

@gsnedders Yes, we're well aware of the "wilful violation": that's the crux of this issue, mentioned in the original issue description.

My preferred approach is to have a mode of operation where an unprefixed name in a name test is interpreted as *:local, that is, it matches by local name alone regardless of namespace.

@gsnedders

I was just… making sure we actually had the link to the relevant part of the spec in the issue, nothing more, rather than everyone reading this having to dig up the relevant section.

@michaelhkay
Contributor Author

michaelhkay commented Oct 24, 2023

My preferred solution to this is as follows.

Currently the default namespace for elements and types can be either a namespace URI or absent. I propose that it can take an additional setting, "auto". If the value is "auto", then:

  • An unprefixed name N appearing as a NameTest is interpreted as *:N, that is, it matches local name N in any namespace or none.
  • In any other context where the default namespace for elements and types is used (for example, in a type name, or in schema-element(N)) the "auto" setting has the same effect as "absent", that is, no prefix means no namespace.
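The two rules can be sketched as a single resolution function. This is only an illustration of the proposal as worded above, not spec text: the function name `expand_unprefixed` and its parameters are hypothetical, and results are written in XPath notation (`*:N` for a wildcard, `Q{uri}N` for an expanded name).

```python
from typing import Optional

AUTO = "##auto"


def expand_unprefixed(local: str, default_ns: Optional[str], is_name_test: bool) -> str:
    """Expand an unprefixed name under a given default element/type namespace.

    default_ns is a namespace URI, None (absent), or the "##auto" setting.
    """
    if default_ns == AUTO:
        if is_name_test:
            # NameTest: local name N in any namespace or none.
            return f"*:{local}"
        # Type names, schema-element(N), etc.: same as "absent".
        return f"Q{{}}{local}"
    if default_ns is None:
        # Absent: no prefix means no namespace.
        return f"Q{{}}{local}"
    # An ordinary URI: no prefix means that namespace.
    return f"Q{{{default_ns}}}{local}"


print(expand_unprefixed("p", AUTO, True), expand_unprefixed("p", AUTO, False))
```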

I'm tempted also to suggest that for types, when "auto" is set, an unprefixed name T should mean xs:T. On the one hand that seems to be an unwarranted bundling of two unrelated enhancements. But on the other hand, the two defaults are already bundled together so we might as well make the most of it.

The surface syntax can be

XQuery: declare default element namespace "##auto";
XSLT: xpath-default-namespace="##auto"

For XPath the setting would typically be controlled by the host language API. A browser-based API optimized for HTML could well choose to make this the default, or it could do something akin to the "wilful violation" by making the default depend on whether the context node is XML or HTML.

5 participants