Validator skips second 'charset' in meta content attribute #877

openandclose · 2019-10-10T20:11:25Z

The WHATWG HTML spec uses a very simplified algorithm to retrieve the encoding name
in the http-equiv="Content-Type" content="text/html; charset=..." case.
https://html.spec.whatwg.org/multipage/urls-and-fetching.html#extracting-character-encodings-from-meta-elements

It only searches for the literal characters 'charset', and if that occurrence is not followed by '=', it continues searching for the next 'charset'.

It seems to be used both in the prescan and in main parsing.
https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding
https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inhead

The validator returns on the first failure
(perhaps in extractCharsetFromContent() in htmlparser/impl/TreeBuilder.java).
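
To illustrate what "returns on the first failure" would mean, here is a minimal Python sketch. It is a hypothetical simplification written for this discussion, not code taken from TreeBuilder.java; the function name extract_charset_first_only and the constant ASCII_WHITESPACE are made up for the example, and quoted values are not handled.

```python
ASCII_WHITESPACE = "\t\n\f\r "


def extract_charset_first_only(content):
    """Hypothetical sketch: inspect only the FIRST 'charset' in content."""
    # Lowercase ASCII letters only, so indexes into `lowered` and `content` agree.
    lowered = "".join(c.lower() if c.isascii() else c for c in content)
    i = lowered.find("charset")
    if i == -1:
        return None
    i += len("charset")
    # Skip whitespace between the word and a possible "=".
    while i < len(content) and content[i] in ASCII_WHITESPACE:
        i += 1
    if i >= len(content) or content[i] != "=":
        # Gives up here instead of looking for a later 'charset'.
        return None
    i += 1
    # Skip whitespace after the "=".
    while i < len(content) and content[i] in ASCII_WHITESPACE:
        i += 1
    end = i
    while end < len(content) and content[end] not in ASCII_WHITESPACE + ";":
        end += 1
    # Unquoted value only; an empty value is treated as "nothing found".
    return content[i:end] or None
```

With a scan like this, any content value whose first 'charset' is not followed by '=' yields nothing, even when a later 'charset=' is present.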

Example:

<meta http-equiv="Content-Type" content="text/html; charset charset=iso-8859-2">
<meta http-equiv="Content-Type" content="text/html; charsetxxxxxcharset=iso-8859-2">
<meta http-equiv="Content-Type" content="text/html; charsetcharset=iso-8859-2">

The validator (https://validator.w3.org/nu/#textarea) reports:
The legacy encoding declarationdid not contain charset= after the semicolon.

Cf.
When the validator succeeds in getting an encoding, it usually reports, e.g.:
Internal encoding declaration iso-8859-2 disagrees with the actual encoding of the document (utf-8).

(edited to make it a little easier to understand)


sideshowbarker · 2019-10-11T04:02:41Z

Example:

<meta http-equiv="Content-Type" content="text/html; charset charset=iso-8859-2">

<meta http-equiv="Content-Type" content="text/html; charsetxxxxxcharset=iso-8859-2">

<meta http-equiv="Content-Type" content="text/html; charsetcharset=iso-8859-2">

What should the expected behavior for those be per the current spec?

When it gets encoding,
(e.g. <meta http-equiv="Content-Type" content="text/html; xxxxxcharset=iso-8859-2">),
it reports Internal encoding declaration iso-8859-2 disagrees with the actual encoding of the document (utf-8)..

What should the expected behavior be per the current spec?

openandclose · 2019-10-11T10:54:58Z

Per spec, parsers should get the encoding name 'ISO-8859-2' from Examples 1, 2 and 3.
So the validator should report:
Internal encoding declaration iso-8859-2 disagrees with the actual encoding of the document (utf-8).

Whereas the actual reports are, e.g.:
Bad value text/html; charset charset=iso-8859-2 for attribute content on element meta: The legacy encoding declarationdid not contain charset= after the semicolon.

sideshowbarker · 2019-10-11T11:05:12Z

OK thanks — so yeah, for the record here, I see that the specific steps which are relevant here are these:

Loop: Find the first seven characters in s after position that are an ASCII case-insensitive match for the word "charset". If no such match is found, return nothing.

Skip any ASCII whitespace that immediately follow the word "charset" (there might not be any).

If the next character is not a U+003D EQUALS SIGN (=), then move position to point just before that next character, and jump back to the step labeled loop.
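
For reference, the loop described in these steps can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the validator's code: the function name extract_charset_label is hypothetical, the handling of quoted and unquoted values follows the remaining steps of the same spec algorithm, and the raw label is returned as-is instead of being resolved to an encoding via the Encoding Standard.

```python
ASCII_WHITESPACE = "\t\n\f\r "


def extract_charset_label(s):
    """Return the raw charset label found in a meta content value, or None."""
    # Lowercase ASCII letters only, so the match is ASCII case-insensitive
    # and indexes into `lowered` and `s` stay aligned.
    lowered = "".join(c.lower() if c.isascii() else c for c in s)
    position = 0
    while True:
        # Loop: find the next "charset" at or after position.
        idx = lowered.find("charset", position)
        if idx == -1:
            return None
        i = idx + len("charset")
        # Skip ASCII whitespace that immediately follows the word.
        while i < len(s) and s[i] in ASCII_WHITESPACE:
            i += 1
        if i < len(s) and s[i] == "=":
            i += 1
            break
        # Not "=": resume the search just before that character.
        position = i
    # Skip ASCII whitespace after the equals sign.
    while i < len(s) and s[i] in ASCII_WHITESPACE:
        i += 1
    if i >= len(s):
        return None
    if s[i] in "\"'":
        # Quoted value: needs a matching closing quote later in s.
        end = s.find(s[i], i + 1)
        return s[i + 1:end] if end != -1 else None
    # Unquoted value: runs up to ASCII whitespace, a semicolon, or the end.
    end = i
    while end < len(s) and s[end] not in ASCII_WHITESPACE + ";":
        end += 1
    return s[i:end]
```

Applied to the content values of the three examples in this issue, the sketch returns "iso-8859-2" in each case, because the search resumes after the first 'charset' that is not followed by '='.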

sideshowbarker added known issue spec-conformance labels Oct 11, 2019

openandclose mentioned this issue Oct 26, 2019:
Fail to parse second 'charset' in meta content attribute zackw/html5-chardet#1 (Open)

openandclose mentioned this issue Mar 26, 2020:
Fix prescan bug: content attribute, second charset html5lib/html5lib-python#434 (Open)
