What should the expected behavior for those be per the current spec?
When the validator succeeds in extracting an encoding
(e.g. from <meta http-equiv="Content-Type" content="text/html; xxxxxcharset=iso-8859-2">),
it reports: Internal encoding declaration iso-8859-2 disagrees with the actual encoding of the document (utf-8).
What should the expected behavior be per the current spec?
Per spec, parsers should get the encoding name 'ISO-8859-2' from Examples 1, 2, and 3.
So the validator should report: Internal encoding declaration iso-8859-2 disagrees with the actual encoding of the document (utf-8).
But the actual report is e.g.: Bad value text/html; charset charset=iso-8859-2 for attribute content on element meta: The legacy encoding declaration did not contain charset= after the semicolon.
OK thanks — so yeah, for the record here, I see that the specific steps which are relevant here are these:
1. Loop: Find the first seven characters in s after position that are an ASCII case-insensitive match for the word "charset". If no such match is found, return nothing.
2. Skip any ASCII whitespace that immediately follow the word "charset" (there might not be any).
3. If the next character is not a U+003D EQUALS SIGN (=), then move position to point just before that next character, and jump back to the step labeled loop.
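Those steps, together with the value-parsing steps that follow them in the spec, can be sketched in Python. This is just my reading of the spec algorithm (returning the raw label rather than resolving it to an encoding), not the validator's actual Java code:

```python
ASCII_WS = " \t\n\f\r"

def extract_charset(s):
    """Sketch of the WHATWG "extracting a character encoding from a
    meta element" algorithm; returns the label string, unresolved."""
    position = 0
    while True:
        # Loop: find the next ASCII case-insensitive "charset".
        idx = s.lower().find("charset", position)
        if idx == -1:
            return None  # no further match: return nothing
        position = idx + len("charset")
        # Skip any ASCII whitespace after the word "charset".
        while position < len(s) and s[position] in ASCII_WS:
            position += 1
        # If the next character is not "=", resume the search just
        # before that character (i.e. retry from here).
        if position >= len(s) or s[position] != "=":
            continue
        position += 1
        # Skip any ASCII whitespace after the "=".
        while position < len(s) and s[position] in ASCII_WS:
            position += 1
        if position >= len(s):
            return None
        # Quoted value: everything up to the matching quote.
        if s[position] in "\"'":
            quote = s[position]
            end = s.find(quote, position + 1)
            return s[position + 1:end] if end != -1 else None
        # Unquoted value: up to whitespace, ";", or end of string.
        end = position
        while end < len(s) and s[end] not in ASCII_WS + ";":
            end += 1
        return s[position:end]
```

With this reading, all three reported examples yield "iso-8859-2", because a "charset" not followed by "=" simply restarts the search at the following character.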
The WHATWG HTML spec uses a very simplified algorithm to extract the encoding name in the
http-equiv="Content-Type" content="text/html; charset=...
case:
https://html.spec.whatwg.org/multipage/urls-and-fetching.html#extracting-character-encodings-from-meta-elements
It only searches for the literal characters 'charset'; if the match is not followed by '=', it continues searching for the next 'charset'.
It seems this algorithm is used both in the prescan and in main parsing:
https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding
https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inhead
The validator instead returns on the first failure (perhaps in
extractCharsetFromContent()
in htmlparser/impl/TreeBuilder.java).

Example 1: <meta http-equiv="Content-Type" content="text/html; charset charset=iso-8859-2">
Example 2: <meta http-equiv="Content-Type" content="text/html; charsetxxxxxcharset=iso-8859-2">
Example 3: <meta http-equiv="Content-Type" content="text/html; charsetcharset=iso-8859-2">
For these, the validator (https://validator.w3.org/nu/#textarea) reports:
The legacy encoding declaration did not contain charset= after the semicolon.
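If the spec's retry loop is followed, all three examples should yield the label iso-8859-2, while the reported message suggests the extraction gives up on the first failed match. A rough Python comparison of the two behaviors (the first-failure variant is purely hypothetical, a guess at the observed behavior, not the validator's actual code; value parsing is simplified, with no quote handling):

```python
ASCII_WS = " \t\n\f\r"

def extract_per_spec(s):
    """Retry loop per the spec: a "charset" not followed by "="
    just restarts the search at the next character."""
    pos = 0
    while True:
        idx = s.lower().find("charset", pos)
        if idx == -1:
            return None
        pos = idx + len("charset")
        while pos < len(s) and s[pos] in ASCII_WS:
            pos += 1
        if pos < len(s) and s[pos] == "=":
            end = pos + 1
            while end < len(s) and s[end] not in ASCII_WS + ";":
                end += 1
            return s[pos + 1:end]

def extract_first_failure(s):
    """Hypothetical variant: give up as soon as the first "charset"
    is not immediately followed by "="."""
    idx = s.lower().find("charset")
    if idx == -1 or not s[idx + len("charset"):].startswith("="):
        return None
    rest = s[idx + len("charset") + 1:]
    end = 0
    while end < len(rest) and rest[end] not in ASCII_WS + ";":
        end += 1
    return rest[:end]

examples = [
    "text/html; charset charset=iso-8859-2",
    "text/html; charsetxxxxxcharset=iso-8859-2",
    "text/html; charsetcharset=iso-8859-2",
]
for content in examples:
    # each line prints: iso-8859-2 None
    print(extract_per_spec(content), extract_first_failure(content))
```

So for all three examples the spec reading recovers iso-8859-2, while the first-failure variant finds nothing, which would explain the "did not contain charset= after the semicolon" message.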
Cf. When the validator succeeds in extracting an encoding, it usually reports e.g.:
Internal encoding declaration iso-8859-2 disagrees with the actual encoding of the document (utf-8).

(edited, to make it a little easier to understand)