Make HTML4/XHTML1 Strict doctypes non-conforming #2048

sideshowbarker · 2016-11-16T11:26:27Z

It was never intended that HTML4 Strict and XHTML1/1.1 Strict doctypes would remain conforming forever. Given that HTML4 is nearly 20 years old (and XHTML1 is just a reformulation of HTML4 in XML), it’s time to consider making the HTML4 Strict and XHTML1/1.1 Strict doctypes non-conforming, just as all other HTML4 and XHTML1/1.1 doctypes (and HTML 3.2, etc., doctypes) already are.

The spec currently defines the HTML4 Strict and XHTML1/1.1 Strict doctypes as obsolete but still conforming (“obsolete permitted DOCTYPEs”) and says that “Authors should not use obsolete permitted DOCTYPEs, as they are unnecessarily long”.

The reason the spec gives for allowing them in conforming documents is to “help authors transition from HTML4 and XHTML1”.

But at this point continuing to allow HTML4 Strict and XHTML1/1.1 Strict doctypes as conforming isn’t helping authors transition; instead it seems to be perpetuating the use of those doctypes long past what should have been their expiration date.

For the HTML checker I still get issue reports from authors requesting that if they put an HTML4 or XHTML1 doctype on a document, the checker should evaluate it using HTML4/XHTML1 requirements (as the SGML/DTD-based legacy W3C validator does and as validator.nu used to do) instead of requirements in the current HTML spec.

In other words, some authors are continuing to intentionally use the HTML4/XHTML1 doctypes so that their documents can be “valid” even though they contain markup that the current HTML spec defines as non-conforming.

So, it’d be helpful if we made the spec clearly disallow use of all legacy HTML doctypes, including the HTML4 Strict and XHTML1/1.1 Strict doctypes (the only remaining legacy doctypes still allowed).

annevk · 2016-11-16T11:43:22Z

source

@@ -98010,62 +98010,6 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
  <p>The <span>DOCTYPE legacy string</span> should not be used unless the document is generated from
  a system that cannot output the shorter string.</p>

-  <hr>
-
-  <!-- see the parser section before changing this bit -->


What was meant by this?

Also, in XHTML you still need to use a DOCTYPE kinda like this for entities and we still don't have a replacement. But I guess we shouldn't really let that influence what is okay for text/html.



What was meant by this?

Dunno for sure. Maybe @zcorpan knows better. But anyway I took it as a statement about effects as far as changing the contents of that section—not about dropping the whole thing entirely.

Also, in XHTML you still need to use a DOCTYPE kinda like this for entities and we still don't have a replacement

hmm yeah I had not thought about that, because the spec doesn’t give it as a reason

zcorpan · 2016-11-16T11:48:20Z

The parser has parse errors for doctypes other than the permitted ones I believe.

The spec has this:

Conformance checkers may, based on the values (including presence or lack thereof) of the DOCTYPE token's name, public identifier, or system identifier, switch to a conformance checking mode for another language (e.g. based on the DOCTYPE token a conformance checker could recognize that the document is an HTML4-era document, and defer to an HTML4 conformance checker.)

Does that not address the issue for the checker?

I'd like to check the reasons we permitted these doctypes in the first place, and why they are no longer relevant. Or what the effects will be if we change this (and the behavior of the checker). Will people replace all instances of "new" elements with div instead of switching to <!doctype html>?

annevk · 2016-11-16T11:51:01Z

Does that not address the issue for the checker?

I think we basically should not allow that kind of behavior. There should be only one path for checking HTML. Not version-dependent paths.

sideshowbarker · 2016-11-16T12:13:37Z

The spec has this:

Conformance checkers may, based on the values (including presence or lack thereof) of the DOCTYPE token's name, public identifier, or system identifier, switch to a conformance checking mode for another language (e.g. based on the DOCTYPE token a conformance checker could recognize that the document is an HTML4-era document, and defer to an HTML4 conformance checker.)

Does that not address the issue for the checker?

No, because the checker does not (any longer) switch into any different modes based on the doctype—because I agree with what @annevk said:

There should be only one path for checking HTML. Not version-dependent paths.

zcorpan · 2016-11-16T12:40:47Z

OK, so then we should remove that paragraph as well. And change the HTML parser to emit more parse errors.

Are you going to remove support for checking HTML4 from the checker completely?

sideshowbarker · 2016-11-16T13:05:12Z

OK, so then we should remove that paragraph as well.

Good point—made it so.

And change the HTML parser to emit more parse errors.

I’d prefer to do that in a separate follow-up PR—since changing the parsing algorithm potentially affects browsers and all other parser implementations, while this PR as currently scoped only affects document conformance/authors and conformance checkers.

sideshowbarker · 2016-11-16T13:13:37Z

Are you going to remove support for checking HTML4 from the checker completely?

Yes, from the HTML checker I’d like to remove any traces of HTML4-related checking that still remain. However, I guess the vnu source still needs to contain an HTML4-checking path as long as the https://validator.nu/ Web UI continues to offer an HTML4-checking option (which https://checker.html5.org/ and https://validator.w3.org/nu/ do not).

The W3C will continue to offer HTML4 and XHTML1 checking for those who want it, using the legacy backend that https://validator.w3.org/ relies on. That is anyway what most people who want HTML4/XHTML1 checking actually use (not the https://validator.nu/ HTML4-checking option).

domenic · 2016-11-16T13:34:43Z

I think we should do the parse errors in this PR too? They won't affect browsers, just checkers, and it seems good for them to be consistent with the requirements changed here.

sideshowbarker · 2016-11-16T13:51:12Z

[changes to parse errors] won't affect browsers, just checkers

The Gecko HTML parser exposes parse errors in its View Source, but yeah, changes to parse errors otherwise don’t affect Gecko parsing behavior, or behavior in any other browsers.

That said, we do have other parsers that do error reporting—at least two of them I can think of.

I think we should do the parse errors in this PR too… it seems good for them to be consistent with the requirements changed here.

OK, I can add them here.

(FWIW my thinking had been that it would not be ideal to conflate into one PR both (A) document-conformance changes that have no normative requirements for parser implementors and (B) parser changes that do have normative requirements for implementors who have implemented the error-reporting parts of the parsing algorithm).

annevk · 2016-11-16T13:52:58Z

I tend to agree that we want to land conformance changes on both sides. The parser and syntax section ought to be updated together since they rely on each other to some extent. The specification would be inconsistent otherwise.

sideshowbarker · 2016-11-17T06:38:30Z

change the HTML parser to emit more parse errors.

See 34c4d1b and lemme know if anything more beyond that needs changing in the parsing algorithm.

Relates to #2048

sideshowbarker · 2016-11-17T07:42:39Z

Also, in XHTML you still need to use a DOCTYPE kinda like this for entities and we still don't have a replacement

See #2056, which eliminates the need for authors to keep putting obsolete XHTML1 doctypes in HTML documents that are served with XML MIME types.

Instead it just changes the spec to say:

If the document element of a Document is in the HTML namespace, user agents should attempt to retrieve the URL given by this link (this URL is a DTD containing the entity declarations for the names listed in the named character references section), and should not attempt to retrieve any other external entity's content. [XML]

domenic · 2016-11-17T14:30:32Z

Looks great, but can you or someone help work on a nice explanatory commit message for this? To avoid misunderstandings, I think we should stress exactly what this does and does not do, i.e. it removes the legacy XHTML and HTML 4 doctypes as conformant, so that only <!DOCTYPE html> and <!DOCTYPE html SYSTEM "about:legacy-compat"> are conformant. But it does not remove XHTML support or impact the browser processing model.

annevk · 2016-11-17T14:34:15Z

Remove obsolete permitted DOCTYPEs

From now on conformance checkers can only allow <!doctype html> and <!doctype html SYSTEM "about:legacy-compat"> as doctypes in HTML syntax. The HTML4 and XHTML1 DOCTYPEs are no longer allowed.

(XHTML syntax continues to be supported and is not influenced by this change.)
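As an illustration only (not part of the PR), the rule stated in the commit message above could be sketched as a single, version-independent check. The function name `is_conforming_doctype` and the regular expression are assumptions made for this example; a real checker works on the DOCTYPE token produced by the HTML tokenizer rather than on raw text.

```python
import re

# ASCII whitespace as used in HTML syntax: space, tab, LF, FF, CR.
_WS = r"[ \t\n\f\r]"
# The legacy string may be wrapped in double or single quotes.
_QUOTE = "[\"']"

# Hypothetical pattern: "<!DOCTYPE html>" with an optional
# SYSTEM "about:legacy-compat" part. Keywords are matched
# case-insensitively; the legacy-compat string is case-sensitive.
_CONFORMING_DOCTYPE = re.compile(
    rf"<!(?i:doctype){_WS}+(?i:html)"
    rf"(?:{_WS}+(?i:system){_WS}+(?P<q>{_QUOTE})about:legacy-compat(?P=q))?"
    rf"{_WS}*>"
)


def is_conforming_doctype(doctype: str) -> bool:
    """Return True if the raw doctype text is one of the two allowed forms."""
    return _CONFORMING_DOCTYPE.fullmatch(doctype) is not None


if __name__ == "__main__":
    samples = [
        "<!DOCTYPE html>",
        '<!doctype html SYSTEM "about:legacy-compat">',
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd">',
    ]
    for sample in samples:
        print(is_conforming_doctype(sample), sample)
```

Under this sketch there is one path for every document: an HTML4 or XHTML1 doctype is simply reported as non-conforming rather than switching the checker into a different mode.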

Make HTML4/XHTML1 doctypes non-conforming

e940e20

stevefaulkner mentioned this pull request Nov 16, 2016

Make HTML4/XHTML1 Strict doctypes non-conforming w3c/html#729

Closed

annevk reviewed Nov 16, 2016

zcorpan added the document conformance label Nov 16, 2016

Don’t suggest validator doctype-switching modes

e524de8

zcorpan approved these changes Nov 16, 2016

sideshowbarker mentioned this pull request Nov 16, 2016

Change error reporting in HTML parser to report more errors for HTML4 cases #2050

Closed

Make any HTML4 doctype a parse error

34c4d1b

sideshowbarker added a commit that referenced this pull request Nov 17, 2016

Enable named character refs in XHTML w/o doctype

384f772

Relates to #2048

sideshowbarker mentioned this pull request Nov 17, 2016

Make UAs support named character references in all XML docs #2056

Closed

sideshowbarker added a commit that referenced this pull request Nov 17, 2016

Enable named character refs in XHTML w/o doctype

24d620a

Relates to #2048

annevk approved these changes Nov 17, 2016

domenic merged commit 31c20af into master Nov 17, 2016

domenic deleted the doctypes-html4-xhtml1-nonconforming branch November 17, 2016 15:22

sideshowbarker mentioned this pull request Nov 18, 2016

Drop <!DOCTYPE html SYSTEM "about:legacy-compat"> #2065

Closed
