fn:parse-html: Finalization #850

ChristianGruen · 2023-11-21T13:04:45Z

Now that fn:parse-html has been added to the specification, we need test cases for all provided options and input types (including binary input).

Looking at the current set of test cases, it seems unrealistic to use older libraries such as TagSoup for this function. I wonder if we should support ·implementation-defined· parsing algorithms at all. What do others think?

Next, is there any implementation available that supports all given method/html-version variants?

michaelhkay · 2023-11-21T14:51:24Z

I think we've gone over the top in terms of the number of options provided. I'd like to see implementations given a bit more flexibility and users given a bit less (spurious) control.

rhdunn · 2023-11-22T20:42:08Z

I'd be happy to simplify the option set to {"method": "html", "html-version": "5"} plus any other vendor- or implementation-supported values, with the description saying that 5 can refer to any of the W3C HTML 5.x RECs or the WHATWG HTML LS spec. This still allows the vendor to provide any other HTML parser they have (TagSoup, HTML Tidy, etc.), or additional support for things like Microsoft Word flavoured HTML if they want.

Note: A conforming HTML 5/LS parser will be able to parse XHTML 1.0/1.1 documents into an XMLDocument per the https://html.spec.whatwg.org/#parsing-xhtml-documents section, including detecting those from the DOCTYPE/DTD.

I think it's useful to keep the list of older HTML specs in some way for reference.

The encoding option can be useful, and is easy to integrate into the HTML parsing pipeline, as there is specific HTML spec language referenced for this behaviour.

The include-template-content option is also useful/necessary for interpreting the template element content. The HTML spec provides different rules for XSLT and XPath, which this option supports.
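
As a rough illustration of the simplified option set proposed above, a call might look like the sketch below. It assumes a processor that implements the draft XPath 4.0 fn:parse-html with the "method" and "html-version" options as discussed in this thread. The sample input string and the variable names are made up for illustration, and the final option set may differ.

```xquery
(: Sketch only: assumes a processor implementing the draft fn:parse-html. :)
(: The input below is hypothetical, deliberately unclosed markup.         :)
let $html := "<p>Hello<br>world"
let $doc  := fn:parse-html($html, map { "method": "html", "html-version": "5" })
(: The wildcard prefix avoids depending on the namespace of the result. :)
return $doc//*:p
```

Under this proposal, a vendor-specific parser would be selected through additional implementation-supported option values rather than through a separate function.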

ChristianGruen · 2023-12-14T11:29:43Z

If we keep support for non-conforming parsers like TagSoup, I have some concerns that the result of the function will not be comparable to the output of other processors. It is also not testable via the test suite.

Is this something we want to live with, or shouldn’t we rather enforce conformance? For other results, people could still use vendor-specific extensions.

rhdunn · 2023-12-14T11:32:58Z

Parsers like TagSoup and HTML Tidy are intended to be vendor-specific extensions. The mentions in the spec are non-normative notes/examples. I can update the wording to make the relevant sections clearer.

michaelhkay · 2023-12-14T11:33:11Z

I think we should stick to the tradition (cf. regular expressions) where our specifications set high expectations for conformance. That won't always ensure that implementors achieve the high standards we set, but that's their choice.

michaelhkay · 2024-01-17T17:29:35Z

I've been looking at this again. I think it's very unlikely that implementations will offer multiple options on how to parse the HTML, or that users will select the right options if they do.

The choice between HTML and XHTML is real, but reading the spec carefully it's not actually clear what method="xhtml" is expected to do.

I'm submitting a PR that does some editorial tidying up but it's not making any substantive changes.

rhdunn · 2024-01-17T17:40:36Z

Given we are simplifying this, I suggest just using an implementation-defined version from among HTML 5 through 5.2 and the WHATWG HTML Living Standard. That will then take care of parsing the different older versions of HTML, including XML variants. It also allows implementations to use more recent versions of the living standard.

Having a method/parser selection is still useful for implementations that also provide their own HTML parsers -- e.g. MarkLogic support for HTML Tidy -- as it gives a standardized API to access those.

michaelhkay · 2024-01-17T17:43:22Z

So long as we have an options parameter, if we define it to follow the option parameter conventions (see issue #955), we don't need to mention any vendor-specific options because they're covered by the general rules.

kosek · 2024-02-19T15:12:28Z

I think that supporting parsing algorithms other than HTML5 doesn't bring any additional value. Browsers currently support only this algorithm as well, and use it to parse all older versions of HTML and XHTML served with the wrong media type. So supporting anything other than the HTML5 parsing algorithm would just be an additional burden on implementers.

michaelhkay · 2024-06-25T08:20:29Z

Dropping the "PR Pending" tag. PR850 has been accepted, but it claimed that it didn't entirely close this issue.

ChristianGruen added the Editorial, Propose for V4.0, and Tests Needed labels Nov 21, 2023

michaelhkay mentioned this issue Jan 17, 2024

850-partial Editorial improvements to parse-html() #956

Merged

ChristianGruen added the PR Pending label Mar 6, 2024

michaelhkay removed the PR Pending label Jun 25, 2024
