fn:parse-html: Finalization #850
Comments
I think we've gone over the top in terms of the number of options provided. I'd like to see implementations given a bit more flexibility and users given a bit less (spurious) control.
I'd be happy to simplify the option set to
Note: A conforming HTML 5/LS parser will be able to parse XHTML 1.0/1.1 documents into an XMLDocument per the https://html.spec.whatwg.org/#parsing-xhtml-documents section, including detecting those from the DOCTYPE/DTD. I think it's useful to keep the list of older HTML specs in some way for reference.
If we keep support for non-conforming parsers like TagSoup, I have some concerns that the result of the function will not be comparable to the output of other processors. It is also not testable via the test suite. Is this something we want to live with, or shouldn’t we rather enforce conformance? For other results, people could still use vendor-specific extensions.
Parsers like TagSoup and HTML Tidy are intended to be vendor-specific extensions. The mentions in the spec are non-normative notes/examples. I can update the wording to make the relevant sections clearer.
I think we should stick to the tradition (cf. regular expressions) where our specifications set high expectations for conformance. That won't always ensure that implementors achieve the high standards we set, but that's their choice.
I've been looking at this again. I think it's very unlikely that implementations will offer multiple options on how to parse the HTML, or that users will select the right options if they do. The choice between HTML and XHTML is real, but reading the spec carefully it's not actually clear what
I'm submitting a PR that does some editorial tidying up, but it's not making any substantive changes.
Given we are simplifying this, I suggest just using an implementation-defined version of HTML 5 - 5.2 and the WHATWG HTML Living Standard. That will then take care of parsing the different older versions of HTML, including XML variants. It also allows implementations to use more recent versions of the living standard. Having a method/parser selection is still useful for implementations that also provide their own HTML parsers (e.g. MarkLogic support for HTML Tidy), as it gives a standardized API to access those.
So long as we have an options parameter, if we define it to follow the option parameter conventions (see issue #955), we don't need to mention any vendor-specific options because they're covered by the general rules.
I think that supporting parsing algorithms other than HTML5 doesn't bring any additional value. Browsers currently support only this algorithm and use it to parse all older versions of HTML, as well as XHTML served with the wrong media type. So supporting anything other than the HTML5 parsing algorithm would just be an additional burden on implementers.
Dropping the "PR Pending" tag. PR850 has been accepted, but it claimed that it didn't entirely close this issue.
Now that fn:parse-html has been added to the specification, we need test cases for all provided options and input types (including binary input).
Looking at the current set of test cases, it seems unrealistic to use older libraries such as TagSoup for this function. I wonder if we should support ·implementation-defined· parsing algorithms at all. What do others think?
Next, is there any implementation available that supports all given method/html-version variants?