Make HTML4/XHTML1 Strict doctypes non-conforming #2048

Merged
domenic merged 3 commits into master from doctypes-html4-xhtml1-nonconforming on Nov 17, 2016

sideshowbarker
Contributor

sideshowbarker commented Nov 16, 2016

It was never intended that HTML4 Strict and XHTML1/1.1 Strict doctypes would remain conforming forever. Given that HTML4 is nearly 20 years old (and XHTML1 is just a reformulation of HTML4 in XML), it’s time to consider making the HTML4 Strict and XHTML1/1.1 Strict doctypes non-conforming, just as all other HTML4 and XHTML1/1.1 doctypes (and the HTML 3.2, etc., doctypes) already are.

The spec currently defines the HTML4 Strict and XHTML1/1.1 Strict doctypes as obsolete but still conforming (“obsolete permitted DOCTYPEs”) and says that “Authors should not use obsolete permitted DOCTYPEs, as they are unnecessarily long”.

The reason the spec gives for allowing them in conforming documents is to “help authors transition from HTML4 and XHTML1”.

But at this point, continuing to allow HTML4 Strict and XHTML1/1.1 Strict doctypes as conforming isn’t helping authors transition; instead it seems to be prolonging the use of those doctypes well past what should have been their expiration date.

For the HTML checker I still get issue reports from authors requesting that if they put an HTML4 or XHTML1 doctype on a document, the checker should evaluate it using HTML4/XHTML1 requirements (as the SGML/DTD-based legacy W3C validator does and as validator.nu used to do) instead of requirements in the current HTML spec.

In other words, some authors are continuing to intentionally use the HTML4/XHTML1 doctypes so that their documents can be “valid” even though they contain markup that the current HTML spec defines as non-conforming.

So, it’d be helpful if we made the spec clearly disallow use of all legacy HTML doctypes, including the HTML4 Strict and XHTML1/1.1 Strict doctypes (the only remaining legacy doctypes still allowed).

@@ -98010,62 +98010,6 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
<p>The <span>DOCTYPE legacy string</span> should not be used unless the document is generated from
a system that cannot output the shorter string.</p>

<hr>

<!-- see the parser section before changing this bit -->
Member

What was meant by this?

Also, in XHTML you still need to use a DOCTYPE kinda like this for entities and we still don't have a replacement. But I guess we shouldn't really let that influence what is okay for text/html.
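The entity point above can be illustrated with a minimal sketch. It assumes Python's standard-library XML parser, which does not fetch external DTDs, so a one-entity internal subset stands in for the XHTML1 DTD that a real document would reference. The `parse_text` helper and the sample strings are hypothetical, made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The same XHTML fragment, once without a DOCTYPE and once with an internal
# subset that declares the nbsp entity (a stand-in for the XHTML1 DTD).
WITHOUT_DOCTYPE = '<p xmlns="http://www.w3.org/1999/xhtml">a&nbsp;b</p>'
WITH_SUBSET = (
    '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]>'
    '<p xmlns="http://www.w3.org/1999/xhtml">a&nbsp;b</p>'
)


def parse_text(source):
    """Return the root element's text, or None if the XML parser rejects it."""
    try:
        return ET.fromstring(source).text
    except ET.ParseError:
        # An undeclared named entity is a well-formedness error in XML.
        return None


print(parse_text(WITHOUT_DOCTYPE))      # rejected: nbsp is not declared
print(repr(parse_text(WITH_SUBSET)))    # parsed: nbsp expands to U+00A0
```

Without a declaration, an XML parser has no definition for `&nbsp;`, which is why XHTML authors who want named entities have kept a DOCTYPE around.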

Contributor Author

<!-- see the parser section before changing this bit -->

What was meant by this?

Dunno for sure. Maybe @zcorpan knows better. But anyway I took it as a statement about effects as far as changing the contents of that section—not about dropping the whole thing entirely.

Also, in XHTML you still need to use a DOCTYPE kinda like this for entities and we still don't have a replacement

hmm yeah I had not thought about that, because the spec doesn’t give it as a reason

@zcorpan
Member

zcorpan commented Nov 16, 2016

The parser has parse errors for doctypes other than the permitted ones I believe.

The spec has this:

Conformance checkers may, based on the values (including presence or lack thereof) of the DOCTYPE token's name, public identifier, or system identifier, switch to a conformance checking mode for another language (e.g. based on the DOCTYPE token a conformance checker could recognize that the document is an HTML4-era document, and defer to an HTML4 conformance checker.)

Does that not address the issue for the checker?

I'd like to check the reasons we permitted these doctypes in the first place, and why they are no longer relevant. Also what the effects will be if we change this (and the behavior of the checker). Will people replace all instances of "new" elements with div instead of switching to <!doctype html>?
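As a rough sketch of what the quoted spec paragraph permits, a checker could pick a checking mode from the DOCTYPE token's fields. Everything here is an assumption made for illustration: the `DoctypeToken` type, the `pick_checking_mode` function, the mode names, and the prefix matching on public identifiers. It is not how any real checker is implemented.

```python
from typing import NamedTuple, Optional


class DoctypeToken(NamedTuple):
    """Hypothetical stand-in for the parser's DOCTYPE token."""
    name: Optional[str]
    public_id: Optional[str]
    system_id: Optional[str]


def pick_checking_mode(token: Optional[DoctypeToken]) -> str:
    """Choose a (hypothetical) checking mode from the DOCTYPE token."""
    if token is not None and token.public_id:
        public_id = token.public_id.upper()
        if public_id.startswith("-//W3C//DTD HTML 4"):
            return "html4"
        if public_id.startswith("-//W3C//DTD XHTML 1"):
            return "xhtml1"
    # No DOCTYPE, or one without a recognized legacy public identifier:
    # check against the current HTML requirements.
    return "html"


strict = DoctypeToken(
    "html", "-//W3C//DTD HTML 4.01//EN", "http://www.w3.org/TR/html4/strict.dtd"
)
print(pick_checking_mode(strict))
print(pick_checking_mode(DoctypeToken("html", None, None)))
```

This kind of version-dependent dispatch is the behavior the rest of the thread argues against.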

@annevk
Member

annevk commented Nov 16, 2016

Does that not address the issue for the checker?

I think we basically should not allow that kind of behavior. There should be only one path for checking HTML. Not version-dependent paths.

@sideshowbarker
Contributor Author

The spec has this:

Conformance checkers may, based on the values (including presence or lack thereof) of the DOCTYPE token's name, public identifier, or system identifier, switch to a conformance checking mode for another language (e.g. based on the DOCTYPE token a conformance checker could recognize that the document is an HTML4-era document, and defer to an HTML4 conformance checker.)

Does that not address the issue for the checker?

No, because the checker no longer switches into different modes based on the doctype. I agree with what @annevk said:

There should be only one path for checking HTML. Not version-dependent paths.

@zcorpan
Member

zcorpan commented Nov 16, 2016

OK, so then we should remove that paragraph as well. And change the HTML parser to emit more parse errors.

Are you going to remove support for checking HTML4 from the checker completely?

@sideshowbarker
Contributor Author

OK, so then we should remove that paragraph as well.

Good point—made it so.

And change the HTML parser to emit more parse errors.

I’d prefer to do that in a separate follow-up PR—since changing the parsing algorithm potentially affects browsers and all other parser implementations, while this PR as currently scoped only affects document conformance/authors and conformance checkers.

@sideshowbarker
Contributor Author

Are you going to remove support for checking HTML4 from the checker completely?

Yes, from the HTML checker I’d like to remove any traces of HTML4-related checking that still remain. However, I guess the vnu source still needs to contain an HTML4-checking path as long as the https://validator.nu/ Web UI continues to offer an HTML4-checking option (which https://checker.html5.org/ and https://validator.w3.org/nu/ do not).

The W3C will continue to offer HTML4 and XHTML1 checking through the legacy backend that https://validator.w3.org/ relies on. That is anyway what most people who want HTML4/XHTML1 checking actually use (not the https://validator.nu/ HTML4-checking option).

@domenic
Member

domenic commented Nov 16, 2016

I think we should do the parse errors in this PR too? They won't affect browsers, just checkers, and it seems good for them to be consistent with the requirements changed here.

@sideshowbarker
Contributor Author

[changes to parse errors] won't affect browsers, just checkers

The Gecko HTML parser exposes parse errors in its View Source, but yeah, changes to parse errors otherwise don’t affect Gecko parsing behavior, or behavior in any other browsers.

That said, we do have other parsers that do error reporting—at least two of them I can think of.

I think we should do the parse errors in this PR too… it seems good for them to be consistent with the requirements changed here.

OK, I can add them here.

(FWIW my thinking had been that it would not be ideal to conflate into one PR both (A) document-conformance changes that have no normative requirements for parser implementors and (B) parser changes that do have normative requirements for implementors who have implemented the error-reporting parts of the parsing algorithm).

@annevk
Member

annevk commented Nov 16, 2016

I tend to agree that we want to land conformance changes on both sides. The parser and syntax section ought to be updated together since they rely on each other to some extent. The specification would be inconsistent otherwise.

@sideshowbarker
Contributor Author

change the HTML parser to emit more parse errors.

See 34c4d1b and lemme know if anything more beyond that needs changing in the parsing algorithm.

@sideshowbarker
Contributor Author

Also, in XHTML you still need to use a DOCTYPE kinda like this for entities and we still don't have a replacement

See #2056, which eliminates the need for authors to keep putting obsolete XHTML1 doctypes in HTML documents that are served with XML MIME types.

Instead it just changes the spec to say:

If the document element of a Document is in the HTML namespace, user agents should attempt to retrieve the URL given by this link (this URL is a DTD containing the entity declarations for the names listed in the named character references section), and should not attempt to retrieve any other external entity's content.[XML]

@domenic
Member

domenic commented Nov 17, 2016

Looks great, but can you or someone help work on a nice explanatory commit message for this? To avoid misunderstandings, I think we should stress exactly what this does and does not do, i.e. it removes the legacy XHTML and HTML 4 doctypes as conformant, so that only <!DOCTYPE html> and <!DOCTYPE html SYSTEM "about:legacy-compat"> are conformant. But it does not remove XHTML support or impact the browser processing model.

@annevk
Member

annevk commented Nov 17, 2016

Remove obsolete permitted DOCTYPEs

From now on conformance checkers can only allow <!doctype html> and <!doctype html SYSTEM "about:legacy-compat"> as doctypes in HTML syntax. The HTML4 and XHTML1 DOCTYPEs are no longer allowed.

(XHTML syntax continues to be supported and is not influenced by this change.)
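To make the commit message concrete, here is a minimal sketch of a string-level check for the two doctype forms it names. The function name `is_conforming_doctype` is hypothetical. The whitespace set, the case-insensitive keywords, and the exact-case `about:legacy-compat` string reflect one reading of the HTML syntax rules for DOCTYPEs and are assumptions here, since the thread itself does not spell them out.

```python
import re

# Whitespace allowed between the parts of a DOCTYPE (assumed set).
_WS = r"[\t\n\f\r ]"

# Keywords match case-insensitively; the legacy-compat string and its
# matching quote characters are matched exactly.
_DOCTYPE_RE = re.compile(
    rf"(?i:<!DOCTYPE){_WS}+(?i:html)"
    rf"(?:{_WS}+(?i:SYSTEM){_WS}+(?P<quote>[\"'])about:legacy-compat(?P=quote))?"
    rf"{_WS}*>"
)


def is_conforming_doctype(doctype: str) -> bool:
    """Return True if the string is one of the two allowed doctype forms."""
    return _DOCTYPE_RE.fullmatch(doctype.strip()) is not None


HTML4_STRICT = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">'
)

print(is_conforming_doctype("<!doctype html>"))
print(is_conforming_doctype('<!DOCTYPE html SYSTEM "about:legacy-compat">'))
print(is_conforming_doctype(HTML4_STRICT))
```

A real conformance checker works on the parser's DOCTYPE token, not on raw text, so this only shows which forms remain allowed.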

@domenic domenic merged commit 31c20af into master Nov 17, 2016
@domenic domenic deleted the doctypes-html4-xhtml1-nonconforming branch November 17, 2016 15:22