fn:parse-html: Finalization #850

Open
ChristianGruen opened this issue Nov 21, 2023 · 10 comments
Labels
Editorial: Minor typos, wording clarifications, example fixes, etc.
Propose for V4.0: The WG should consider this item critical to 4.0
Tests Needed: Tests need to be written or merged

Comments

@ChristianGruen
Contributor

Now that fn:parse-html has been added to the specification, we need test cases for all provided options and input types (including binary input).

Looking at the current set of test cases, it seems unrealistic to use older libraries such as TagSoup for this function. I wonder if we should support ·implementation-defined· parsing algorithms at all. What do others think?

Next, is there any implementation available that supports all given method/html-version variants?

@ChristianGruen added the Editorial, Propose for V4.0, and Tests Needed labels Nov 21, 2023
@michaelhkay
Contributor

I think we've gone over the top in terms of the number of options provided. I'd like to see implementations given a bit more flexibility and users given a bit less (spurious) control.

@rhdunn
Contributor

rhdunn commented Nov 22, 2023

I'd be happy to simplify the option set to {"method": "html", "html-version": "5"} and any other vendor/implementation supported values with the description saying that 5 can refer to any of the W3C HTML 5.x RECs, or the WHATWG HTML LS spec. -- This then still allows the vendor to provide any other HTML parser they have (tagsoup, html tidy, etc.), or additional support for things like Microsoft Word flavoured HTML if they want.

Note: A conforming HTML 5/LS parser will be able to parse XHTML 1.0/1.1 documents into an XMLDocument per the https://html.spec.whatwg.org/#parsing-xhtml-documents section, including detecting those from the DOCTYPE/DTD.

I think it's useful to keep the list of older HTML specs in some way for reference.

The encoding option can be useful, and is easy to integrate into the HTML parsing pipeline, as there is specific HTML spec language referenced for this behaviour.

The include-template-content option is also useful/necessary for interpreting the template element content. The HTML spec provides different rules for XSLT and XPath, which this option supports.
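
For illustration, a call using the simplified option set might look like the sketch below. The option names "method" and "html-version" are taken from the proposal above; the input string is made up, and the exact signature and accepted values are assumptions about a draft that may still change.

```xquery
(: Sketch only: assumes fn:parse-html accepts a string plus an options map,
   and that "method" and "html-version" keep the names used in this thread. :)
let $html := "<p>Hello<br>world"
return fn:parse-html($html, map {
  "method": "html",
  "html-version": "5"
})
```

Under this proposal, any other "method" or "html-version" value would be a vendor- or implementation-supported extension rather than something the specification requires.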

@ChristianGruen
Contributor Author

If we keep support for non-conforming parsers like TagSoup, I have some concerns that the result of the function will not be comparable to the output of other processors. It is also not testable via the test suite.

Is this something we want to live with, or shouldn’t we rather enforce conformance? For other results, people could still use vendor-specific extensions.

@rhdunn
Contributor

rhdunn commented Dec 14, 2023

The intention is that parsers like TagSoup and HTML Tidy are vendor-specific extensions. The mentions in the spec are non-normative notes/examples. I can update the wording to make the relevant sections clearer.

@michaelhkay
Contributor

I think we should stick to the tradition (cf regular expressions) where our specifications set high expectations for conformance. That won't always ensure that implementors achieve the high standards we set, but that's their choice.

@michaelhkay
Contributor

I've been looking at this again. I think it's very unlikely that implementations will offer multiple options on how to parse the HTML, or that users will select the right options if they do.

The choice between HTML and XHTML is real, but reading the spec carefully it's not actually clear what method="xhtml" is expected to do.

I'm submitting a PR that does some editorial tidying up but it's not making any substantive changes.

@rhdunn
Contributor

rhdunn commented Jan 17, 2024

Given that we are simplifying this, I suggest just using an implementation-defined version of HTML: any of HTML 5 through 5.2, or the WHATWG HTML Living Standard. That will then take care of parsing the different older versions of HTML, including XML variants. It also allows implementations to use more recent versions of the living standard.

Having a method/parser selection is still useful for implementations that also provide their own HTML parsers -- e.g. MarkLogic support for HTML Tidy -- as it gives a standardized API to access those.

@michaelhkay
Contributor

So long as we have an options parameter, and we define it to follow the option parameter conventions (see issue #955), we don't need to mention any vendor-specific options because they're covered by the general rules.

@kosek

kosek commented Feb 19, 2024

I think that supporting parsing algorithms other than HTML5 doesn't bring any additional value. Browsers currently support only this algorithm, and use it to parse all older versions of HTML as well as XHTML served with the wrong media type. So supporting anything other than the HTML5 parsing algorithm would just be an additional burden on implementers.

@ChristianGruen added the PR Pending label Mar 6, 2024
@michaelhkay removed the PR Pending label Jun 25, 2024
@michaelhkay
Contributor

Dropping the "PR Pending" tag. PR850 has been accepted, but it claimed that it didn't entirely close this issue.


4 participants