Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

further thoughts about pip as an HTML5 reader #10880

Open
1 task done
ppena-LiveData opened this issue Feb 3, 2022 · 2 comments
Open
1 task done

further thoughts about pip as an HTML5 reader #10880

ppena-LiveData opened this issue Feb 3, 2022 · 2 comments
Labels
state: needs discussion This needs some more discussion type: deprecation Related to deprecation / removal.

Comments

@ppena-LiveData
Copy link

ppena-LiveData commented Feb 3, 2022

Description

As Donald Stufft pointed out, the HTML5 spec has different requirements for HTML5 writers versus parsers, and in fact, the 13.1 Writing HTML documents part of the spec explicitly says that the section "does not apply to conformance checkers", so we were looking at the wrong section when we added a DOCTYPE regex to handle_decl(). If pip is going to continue to want to validate the HTML5 (i.e. be a conformance checker, which some of us have said should not be PIP's job, and which @dstufft said his PEP 503 is ambiguous about), then the 13.2 Parsing HTML documents section is what needs to be followed.

It is a complicated spec, so it will take some careful reading to figure out what exactly is required for HTML5 validation, but one thing I found is that the parser also allows for "PUBLIC" to be in the <!DOCTYPE ...>, which is not mentioned for the writer, so that is definitely a problem in the regex.

Expected behavior

When using a Simple Repository API repo that outputs a <!DOCTYPE ...> that has "PUBLIC" in it. Expected behavior is not to get a warning saying "does not have a proper HTML doctype declaration," but pip would give that warning.

pip version

22.0.3

Code of Conduct

@ppena-LiveData ppena-LiveData added S: needs triage Issues/PRs that need to be triaged type: bug A confirmed bug or unintended behavior labels Feb 3, 2022
@notatallshaw
Copy link
Member

notatallshaw commented Feb 4, 2022

The relevant part of the spec is: https://html.spec.whatwg.org/#after-doctype-name-state Specifically these lines:

If the six characters starting from the current input character are an ASCII case-insensitive match for the word "PUBLIC", then consume those characters and switch to the after DOCTYPE public keyword state.

Otherwise, if the six characters starting from the current input character are an ASCII case-insensitive match for the word "SYSTEM", then consume those characters and switch to the after DOCTYPE system keyword state.

Otherwise, this is an invalid-character-sequence-after-doctype-name parse error. Set the current DOCTYPE token's force-quirks flag to on. Reconsume in the bogus DOCTYPE state.

So this would add an addition of optionally "PUBLIC" or "SYSTEM" after "html" as part of the "doctype name state". This doesn't look too complicated to add if someone wants to make a PR.

However PEP 503 only says that the simple index "MUST be a valid HTML5 page" and this is not part of the HTML5 writing spec: https://html.spec.whatwg.org/#the-doctype . PEP 503 does not say the tool parsing the HTML5 must conform to the HTML5 parser spec, which was designed to handle non-valid HTML5 docs and presumably has a complicated history to do with what browsers accept).

So actually I think it would be at best optional and at worst incorrect for Pip to check against this, as Pip is not trying to be an HTML5 parser here and doesn't need to deal with non-valid HTML5 documents. Pip's motivation is checking for a valid PEP 503 index, otherwise it would also have to be handling things like the bogus doctype state and many other parts of the HTML5 spec that was designed for browser makers to handle non-valid documents in a standard way.

@pradyunsg pradyunsg added state: needs discussion This needs some more discussion type: deprecation Related to deprecation / removal. and removed type: bug A confirmed bug or unintended behavior S: needs triage Issues/PRs that need to be triaged labels Feb 4, 2022
@pradyunsg
Copy link
Member

Hiya! Thanks for filing this.

I don't really see much point in arguing at length about fine print here -- things work at the moment, albeit being slightly annoying for a subset of our users due to use of tools/platforms that don't have doctypes (and correspondingly, a valid document structure). I'm not interested in spending more time or energy on this topic anytime soon. If someone wants to file PRs to tweak the logic or remove things, I'm not going to block them (unless they're doing something really bad/stupid, which... is unlikely) and likewise, I'm also unlikely to spend time reviewing such PRs.

There's other maintainers though, and I trust their judgement on what the right thing to do here would be. Please don't @-mention me here, since I'm unsubscribing for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
state: needs discussion This needs some more discussion type: deprecation Related to deprecation / removal.
Projects
None yet
Development

No branches or pull requests

3 participants