further thoughts about pip as an HTML5 reader #10880

ppena-LiveData · 2022-02-03T19:49:23Z

Description

As Donald Stufft pointed out, the HTML5 spec has different requirements for HTML5 writers versus parsers, and in fact, the 13.1 Writing HTML documents part of the spec explicitly says that the section "does not apply to conformance checkers", so we were looking at the wrong section when we added a DOCTYPE regex to handle_decl(). If pip is going to continue to want to validate the HTML5 (i.e. be a conformance checker, which some of us have said should not be PIP's job, and which @dstufft said his PEP 503 is ambiguous about), then the 13.2 Parsing HTML documents section is what needs to be followed.

It is a complicated spec, so it will take some careful reading to figure out what exactly is required for HTML5 validation, but one thing I found is that the parser also allows for "PUBLIC" to be in the <!DOCTYPE ...>, which is not mentioned for the writer, so that is definitely a problem in the regex.

Expected behavior

When using a Simple Repository API repo that outputs a <!DOCTYPE ...> that has "PUBLIC" in it. Expected behavior is not to get a warning saying "does not have a proper HTML doctype declaration," but pip would give that warning.

pip version

22.0.3

Code of Conduct

I agree to follow the PSF Code of Conduct.

The text was updated successfully, but these errors were encountered:

notatallshaw · 2022-02-04T04:24:17Z

The relevant part of the spec is: https://html.spec.whatwg.org/#after-doctype-name-state Specifically these lines:

If the six characters starting from the current input character are an ASCII case-insensitive match for the word "PUBLIC", then consume those characters and switch to the after DOCTYPE public keyword state.

Otherwise, if the six characters starting from the current input character are an ASCII case-insensitive match for the word "SYSTEM", then consume those characters and switch to the after DOCTYPE system keyword state.

Otherwise, this is an invalid-character-sequence-after-doctype-name parse error. Set the current DOCTYPE token's force-quirks flag to on. Reconsume in the bogus DOCTYPE state.

So this would add an addition of optionally "PUBLIC" or "SYSTEM" after "html" as part of the "doctype name state". This doesn't look too complicated to add if someone wants to make a PR.

However PEP 503 only says that the simple index "MUST be a valid HTML5 page" and this is not part of the HTML5 writing spec: https://html.spec.whatwg.org/#the-doctype . PEP 503 does not say the tool parsing the HTML5 must conform to the HTML5 parser spec, which was designed to handle non-valid HTML5 docs and presumably has a complicated history to do with what browsers accept).

So actually I think it would be at best optional and at worst incorrect for Pip to check against this, as Pip is not trying to be an HTML5 parser here and doesn't need to deal with non-valid HTML5 documents. Pip's motivation is checking for a valid PEP 503 index, otherwise it would also have to be handling things like the bogus doctype state and many other parts of the HTML5 spec that was designed for browser makers to handle non-valid documents in a standard way.

pradyunsg · 2022-02-04T13:58:44Z

Hiya! Thanks for filing this.

I don't really see much point in arguing at length about fine print here -- things work at the moment, albeit being slightly annoying for a subset of our users due to use of tools/platforms that don't have doctypes (and correspondingly, a valid document structure). I'm not interested in spending more time or energy on this topic anytime soon. If someone wants to file PRs to tweak the logic or remove things, I'm not going to block them (unless they're doing something really bad/stupid, which... is unlikely) and likewise, I'm also unlikely to spend time reviewing such PRs.

There's other maintainers though, and I trust their judgement on what the right thing to do here would be. Please don't @-mention me here, since I'm unsubscribing for now.

ppena-LiveData added S: needs triage Issues/PRs that need to be triaged type: bug A confirmed bug or unintended behavior labels Feb 3, 2022

pradyunsg added state: needs discussion This needs some more discussion type: deprecation Related to deprecation / removal. and removed type: bug A confirmed bug or unintended behavior S: needs triage Issues/PRs that need to be triaged labels Feb 4, 2022

ppena-LiveData mentioned this issue Feb 14, 2022

--find-links should not warn about missing HTML5 doctype #10903

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

further thoughts about pip as an HTML5 reader #10880

further thoughts about pip as an HTML5 reader #10880

ppena-LiveData commented Feb 3, 2022 •

edited

Loading

notatallshaw commented Feb 4, 2022 •

edited

Loading

pradyunsg commented Feb 4, 2022

further thoughts about pip as an HTML5 reader #10880

further thoughts about pip as an HTML5 reader #10880

Comments

ppena-LiveData commented Feb 3, 2022 • edited Loading

Description

Expected behavior

pip version

Code of Conduct

notatallshaw commented Feb 4, 2022 • edited Loading

pradyunsg commented Feb 4, 2022

ppena-LiveData commented Feb 3, 2022 •

edited

Loading

notatallshaw commented Feb 4, 2022 •

edited

Loading