Validator skips second 'charset' in meta content attribute #877

Open
openandclose opened this issue Oct 10, 2019 · 3 comments

@openandclose
openandclose commented Oct 10, 2019

The WHATWG HTML spec uses a very simple algorithm to extract the encoding name
in the http-equiv="Content-Type" content="text/html; charset=..." case.
https://html.spec.whatwg.org/multipage/urls-and-fetching.html#extracting-character-encodings-from-meta-elements

It only searches for the literal characters 'charset'. If a match is not followed by '=', it continues searching for the next 'charset'.

It seems to be used in both the prescan and the main parsing.
https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding
https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inhead

The validator gives up on the first failure
(perhaps in extractCharsetFromContent() in htmlparser/impl/TreeBuilder.java).

Example:

  1. <meta http-equiv="Content-Type" content="text/html; charset charset=iso-8859-2">
  2. <meta http-equiv="Content-Type" content="text/html; charsetxxxxxcharset=iso-8859-2">
  3. <meta http-equiv="Content-Type" content="text/html; charsetcharset=iso-8859-2">

The validator (https://validator.w3.org/nu/#textarea) reports that
The legacy encoding declaration did not contain charset= after the semicolon.

Cf.
When the validator succeeds in getting the encoding, it usually reports e.g.
Internal encoding declaration iso-8859-2 disagrees with the actual encoding of the document (utf-8).

(edited, to make a little easier to understand)

@sideshowbarker
Contributor

Example:

  1. <meta http-equiv="Content-Type" content="text/html; charset charset=iso-8859-2">
  2. <meta http-equiv="Content-Type" content="text/html; charsetxxxxxcharset=iso-8859-2">
  3. <meta http-equiv="Content-Type" content="text/html; charsetcharset=iso-8859-2">

What should the expected behavior for those be per the current spec?

When it gets encoding,
(e.g. <meta http-equiv="Content-Type" content="text/html; xxxxxcharset=iso-8859-2">),
it reports Internal encoding declaration iso-8859-2 disagrees with the actual encoding of the document (utf-8).

What should the expected behavior be per the current spec?

@openandclose
Author

Per spec, parsers should get the encoding name 'ISO-8859-2' from Examples 1, 2 and 3.
So the validator should report:
Internal encoding declaration iso-8859-2 disagrees with the actual encoding of the document (utf-8).

The actual reports, however, are e.g.:
Bad value text/html; charset charset=iso-8859-2 for attribute content on element meta: The legacy encoding declaration did not contain charset= after the semicolon.

@sideshowbarker
Contributor

OK thanks — so yeah, for the record here, I see that the specific steps which are relevant here are these:

  1. Loop: Find the first seven characters in s after position that are an ASCII case-insensitive match for the word "charset". If no such match is found, return nothing.

  2. Skip any ASCII whitespace that immediately follow the word "charset" (there might not be any).

  3. If the next character is not a U+003D EQUALS SIGN (=), then move position to point just before that next character, and jump back to the step labeled loop.
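
For illustration, the loop in the steps above can be sketched roughly as follows. This is only a minimal sketch in Python, not the validator's Java code: the name extract_charset_label is made up here, the handling after the "=" (quoted and unquoted values) follows my reading of the rest of the spec algorithm rather than the steps quoted above, and it returns the raw label instead of resolving it to an encoding.

```python
import string

ASCII_WHITESPACE = "\t\n\x0c\r "
# Length-preserving ASCII-only lowercasing, so indices stay aligned.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def extract_charset_label(content):
    """Hypothetical helper: return the raw charset label, or None."""
    folded = content.translate(_ASCII_LOWER)
    position = 0
    while True:
        # Loop: find the next ASCII case-insensitive "charset".
        found = folded.find("charset", position)
        if found == -1:
            return None
        i = found + len("charset")
        # Skip ASCII whitespace right after the word (there might be none).
        while i < len(content) and content[i] in ASCII_WHITESPACE:
            i += 1
        if i < len(content) and content[i] == "=":
            i += 1
            break
        # Not "=": keep searching from just before that character
        # instead of giving up on the first failure.
        position = i

    # Assumption (rest of the spec algorithm): skip whitespace after "=".
    while i < len(content) and content[i] in ASCII_WHITESPACE:
        i += 1
    if i >= len(content):
        return None
    if content[i] in "\"'":
        # Quoted value: needs a matching closing quote.
        end = content.find(content[i], i + 1)
        return content[i + 1:end] if end != -1 else None
    # Unquoted value: runs up to whitespace, ";" or the end of the string.
    end = i
    while end < len(content) and content[end] not in ASCII_WHITESPACE + ";":
        end += 1
    return content[i:end]
```

With this sketch, all three example values from the report yield the label iso-8859-2, because the search resumes after a "charset" that is not followed by "=".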
