Require utf-8 when specifying character encoding #3091

Merged
annevk merged 3 commits into master from sideshowbarker/require-utf-8 on Oct 6, 2017

6 participants
@sideshowbarker
Member

commented Oct 3, 2017

This addresses #3006.

@annevk

Member

commented Oct 3, 2017

Is @hsivonen now comfortable with this? When Encoding initially required this there was a little bit of fear it might be too soon. So maybe we should split it out for <script charset> since it seems fine to start there.

@domenic

Member

commented Oct 3, 2017

I am in support of doing this everywhere. E.g. https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg implies a good trend, and in general the "only UTF-8" meme has gotten pretty widespread.

I haven't reviewed the commits yet, but will do so soon, under the assumption that we're gonna go all the way.

@hsivonen

Member

commented Oct 4, 2017

Is @hsivonen now comfortable with this? When Encoding initially required this there was a little bit of fear it might be too soon. So maybe we should split it out for <script charset> since it seems fine to start there.

I think we should nudge authors towards making everything UTF-8. I am still a bit worried about authors reacting to an error in a silly way: making the charset attribute UTF-8 without changing the encoding of the resource to UTF-8.

I guess the exact message that the validator gives matters here. Assuming a message that is worded to complain more about the resource not being UTF-8 than about the value of the attribute per se, I'm OK with this.
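
To illustrate the distinction drawn above, here is a minimal Python sketch of a check that complains first about the resource bytes and only second about the declared label. The function names and message wording are hypothetical and are not taken from the validator sources.

```python
from typing import List


def is_utf8_label(label: str) -> bool:
    """True when the label is an ASCII case-insensitive match for "utf-8"."""
    return label.isascii() and label.lower() == "utf-8"


def check_declared_encoding(raw: bytes, declared_label: str) -> List[str]:
    """Return hypothetical validator-style messages for one resource."""
    messages: List[str] = []

    # Check the actual bytes, independently of what the attribute claims.
    try:
        raw.decode("utf-8", errors="strict")
        bytes_are_utf8 = True
    except UnicodeDecodeError:
        bytes_are_utf8 = False

    if not bytes_are_utf8:
        # Lead with the resource itself, so that editing only the label
        # does not look like a fix.
        messages.append("The resource is not encoded as UTF-8; re-save it as UTF-8.")
    if not is_utf8_label(declared_label):
        messages.append('The declared encoding label is not "utf-8".')
    return messages
```

With this ordering, a legacy-encoded file whose attribute was merely changed to "utf-8" still gets a message about the file's bytes.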

As for script vs. link, I think non-UTF-8 CSS is more harmful than non-UTF-8 JS, because style sheet encoding gets inherited into URL parsing (i.e. URLs become context-dependent and don't work in the URL bar) but JS encoding doesn't get inherited anywhere.
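
A small Python sketch of why encoding-dependent URL handling is a problem: the same character percent-encodes to different bytes depending on the encoding in effect. This uses Python's `urllib.parse.quote` purely as an illustration; it is not the URL Standard's parser, and the helper name is hypothetical.

```python
from urllib.parse import quote


def percent_encode_query_value(value: str, encoding: str) -> str:
    """Percent-encode a value using the bytes of the given encoding."""
    return quote(value, safe="", encoding=encoding)


# The same character yields different URLs under different encodings.
print(percent_encode_query_value("é", "utf-8"))         # %C3%A9
print(percent_encode_query_value("é", "windows-1252"))  # %E9
```

A URL copied out of a non-UTF-8 context therefore may not mean the same thing when pasted somewhere that assumes UTF-8.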

Reviewing the patch:

- <p>Authoring tools should default to using <span>UTF-8</span> for newly-created documents. <ref
-  spec=ENCODING></p>

It seems to me that the patch should upgrade this to a MUST instead of removing it. I fail to locate a normative statement to that effect for the BOM and HTTP cases. (I see it only for the meta case.)

@zcorpan
Member

left a comment

#3006 is a pull request -- I suppose this addresses #3004 and replaces #3006?

source Outdated

<p class="note">A character encoding declaration is required (either in the <span
<div class="note">
<p>A character encoding declaration is required (either in the <span

@zcorpan

zcorpan Oct 4, 2017

Member

indent by a space

source Outdated
data-x="Content-Type">Content-Type metadata</span> or explicitly in the file) even when all
characters are in the ASCII range, because a character encoding is needed to process non-ASCII
characters entered by the user in forms, in URLs generated by scripts, and so forth.</p>
<p>Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings,

@zcorpan

zcorpan Oct 4, 2017

Member

Insert a blank line between paragraphs.

data-x="attr-meta-http-equiv-content-type">Encoding declaration state</span>, then the character
encoding used must be an <span>ASCII-compatible encoding</span>.</p>

<p>Authors should use <span>UTF-8</span>. Conformance checkers may advise authors against using

@zcorpan

zcorpan Oct 4, 2017

Member

Is the idea here that the Encoding Standard already requires utf-8? Maybe put in a statement of fact here mentioning that requirement, so we're clear (since the meta encoding declaration is itself optional and encoding could be specified in HTTP/BOM/XML decl).

@sideshowbarker

sideshowbarker Oct 4, 2017

Author Member

Is the idea here that the Encoding Standard already requires utf-8? Maybe put in a statement of fact here mentioning that requirement, so we're clear

OK, 94517b3 attempts to do that

@annevk

Member

commented Oct 4, 2017

Yeah, most of #3006 ends up withdrawn. I need to create a separate PR for the minor things I fixed on the side.

@sideshowbarker

Member Author

commented Oct 4, 2017

When Encoding initially required this there was a little bit of fear it might be too soon.

I know — but that was nearly 5 years ago (January 2013). So finally requiring UTF-8 in HTML almost 5 years after Encoding initially required it doesn’t seem like we’re exactly rushing things…

So maybe we should split it out for <script charset> since it seems fine to start there.

I’m OK with just merging the <script charset> part for now if that’s all we can get agreement on at the moment, but if we were to do that, I wonder how we then decide what process we follow for deciding when to finally go all the way with the rest?

I assume we’d agree we don’t want to wait, say, another 5 years. But short of that it’s not clear to me how we can measure when it’s no longer too soon and we’re instead finally ready to go forward with it.

So it seems like instead we just need to choose some point at which to do it, and then finally just do it.

@sideshowbarker

Member Author

commented Oct 4, 2017

I am still a bit worried about authors reacting to an error in a silly way: making the charset attribute UTF-8 without changing the encoding of the resource to UTF-8.

Yeah, agreed that would be a counterproductive outcome

I guess the exact message that the validator gives matters here. Assuming a message that is worded to complain more about the resource not being UTF-8 than about the value of the attribute per se, I'm OK with this.

OK, that’s doable of course, but to be clear: in the validator architecture, that message would need to come from the parser code, right?

Reviewing the patch:

- <p>Authoring tools should default to using <span>UTF-8</span> for newly-created documents. <ref
-  spec=ENCODING></p>

It seems to me that the patch should upgrade this to a MUST instead of removing it.

OK, I’ll make that change.

I fail to locate a normative statement to that effect for the BOM and HTTP cases. (I see it only for the meta case.)

Not sure what you mean. I take it you don’t mean a normative statement about authoring tools in relation to the BOM, or a normative statement about authoring tools in relation to the HTTP-delivered charset.

@sideshowbarker

Member Author

commented Oct 4, 2017

#3006 is a pull request -- I suppose this addresses #3004 and replaces #3006?

Yeah (as @annevk noted)

@annevk

Member

commented Oct 4, 2017

@sideshowbarker it seems that everyone who commented here is okay with going ahead with it, so let's (finally) do it.

@sideshowbarker

Member Author

commented Oct 4, 2017

Reviewing the patch:

- <p>Authoring tools should default to using <span>UTF-8</span> for newly-created documents. <ref
-  spec=ENCODING></p>

It seems to me that the patch should upgrade this to a MUST instead of removing it.

OK, made it so in 769d6fe

@hsivonen

Member

commented Oct 4, 2017

OK, that’s doable of course, but to be clear: in the validator architecture, that message would need to come from the parser code, right?

While the parser could make sense for meta, a datatype in the datatype library would make more sense especially for link and script.

@hsivonen

Member

commented Oct 4, 2017

Not sure what you mean. I take it you don’t mean a normative statement about authoring tools in relation to the BOM, or a normative statement about authoring tools in relation to the HTTP-delivered charset.

I meant the same thing as @zcorpan meant in the comment right after mine.

@sideshowbarker

Member Author

commented Oct 4, 2017

OK, that’s doable of course, but to be clear: in the validator architecture, that message would need to come from the parser code, right?

While the parser could make sense for meta, a datatype in the datatype library would make more sense especially for link and script.

Aha yeah OK I’ll add a datatype checker for it that way to the validator sources

@annevk

Member

commented Oct 4, 2017

We should update those too.

@zcorpan

Member

commented Oct 4, 2017

I agree about text/html. But I think we should probably separate accept-charset in order to do proper reasoning and compat analysis for that.

@sideshowbarker
Member Author

left a comment

“Update text/html registration” change LGTM

@domenic

Member

commented Oct 5, 2017

Per #3006 (comment), I was thinking we should make charset="utf-8" on script elements obsolete but conforming (i.e. validators display a warning), since in a UTF-8 document it is redundant, and we've recently been making redundant script attributes obsolete but conforming. This would mean the charset attribute on script gets a treatment similar to type on style.
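
As a rough illustration of the proposed treatment, the following Python sketch reports a warning for a redundant charset="utf-8" on script and an error for any other value. It uses the standard library's `html.parser`, which is not a conforming HTML parser; the class name and message texts are hypothetical and do not come from the validator.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class ScriptCharsetChecker(HTMLParser):
    """Collects (level, message) pairs for charset attributes on script."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        for name, value in attrs:
            if name != "charset":
                continue
            label = value or ""
            if label.isascii() and label.lower() == "utf-8":
                # Obsolete but conforming: redundant in a UTF-8 document.
                self.messages.append(
                    ("warning", "The charset attribute on script is unnecessary.")
                )
            else:
                # Any label other than "utf-8" is non-conforming.
                self.messages.append(
                    ("error", 'The charset attribute on script must be "utf-8".')
                )


checker = ScriptCharsetChecker()
checker.feed('<script charset="utf-8" src="a.js"></script>')
checker.close()
print(checker.messages)
```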

@sideshowbarker sideshowbarker force-pushed the sideshowbarker/require-utf-8 branch from 13efba8 to dfef71a Oct 5, 2017

@sideshowbarker

Member Author

commented Oct 5, 2017

I was thinking we should make charset="utf-8" on script elements obsolete but conforming… This would mean the charset attribute on script gets a treatment similar to type on style.

Yes, will update the source on this branch to do that

@domenic

Member

commented Oct 5, 2017

I was going to do a review but then I thought it'd be easier to just tweak things myself so I got carried away and did a bit more. Let me know what you think :).

@sideshowbarker

Member Author

commented Oct 5, 2017

I was going to do a review but then I thought it'd be easier to just tweak things myself so I got carried away and did a bit more. Let me know what you think :).

Looks beautiful 🎉

@sideshowbarker sideshowbarker force-pushed the sideshowbarker/require-utf-8 branch from d891e56 to 7a64e46 Oct 6, 2017

Require UTF-8
This change adds a “must” requirement for UTF-8 in all but one of the places in
the spec that define a means for specifying a character encoding.

Specifically, it makes UTF-8 required for any “character encoding declaration”,
which includes the HTTP Content-Type header sent with any document, the
`<meta charset>` element, and the `<meta http-equiv=content-type>` element.

Along with those, this change also makes UTF-8 required for `<script charset>`
but also moves `<script charset>` to being obsolete-but-conforming (because now
that both documents and scripts are required to use UTF-8, it’s redundant to
specify `charset` on the `script` element, since it inherits from the document).

To make the normative source of those requirements clear, this change also adds
a specific citation to the relevant requirement from the Encoding standard, and
updates the in-spec IANA registration for text/html media type to indicate that
UTF-8 is required. Finally, it changes an existing requirement for authoring
tools to use UTF-8 from a “should” to a “must”.

The one place where this change doesn’t yet add a requirement for UTF-8 is for
the `form` element’s `accept-charset` attribute. For that, see issue #3097.
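
The three kinds of character encoding declaration named in the commit message can be sketched in Python as follows. This is an approximation under stated assumptions: it parses the Content-Type value with `email.message.Message` and scans markup with `html.parser`, neither of which implements the HTML or Encoding standards' own algorithms, and all function and class names are hypothetical.

```python
from email.message import Message
from html.parser import HTMLParser
from typing import List, Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lower-cased charset parameter, or None when absent."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


class MetaDeclarationFinder(HTMLParser):
    """Collects labels from <meta charset> and <meta http-equiv=content-type>."""

    def __init__(self) -> None:
        super().__init__()
        self.labels: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        if a.get("charset") is not None:
            self.labels.append(a["charset"].lower())
        elif (a.get("http-equiv") or "").lower() == "content-type":
            label = charset_from_content_type(a.get("content") or "")
            if label is not None:
                self.labels.append(label)


def non_utf8_declarations(http_content_type: Optional[str], html_text: str) -> List[str]:
    """List every declared label (HTTP first, then meta) that is not "utf-8"."""
    labels: List[str] = []
    if http_content_type:
        label = charset_from_content_type(http_content_type)
        if label is not None:
            labels.append(label)
    finder = MetaDeclarationFinder()
    finder.feed(html_text)
    finder.close()
    labels.extend(finder.labels)
    return [label for label in labels if label != "utf-8"]
```

An empty result means every declaration found uses the "utf-8" label; it says nothing about whether the bytes themselves are UTF-8.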

@sideshowbarker sideshowbarker force-pushed the sideshowbarker/require-utf-8 branch from 7a64e46 to 4089e5c Oct 6, 2017

@sideshowbarker sideshowbarker referenced this pull request Oct 6, 2017

Closed

Require UTF-8 #1039

@annevk
Member

left a comment

I found a couple more nits. Happy to fix these later today.

source Outdated
<p>The Encoding standard requires use of the <span>UTF-8</span> <span data-x="encoding">character
encoding</span> and requires use of the "<code data-x="">utf-8</code>" <span>encoding label</span>
to identify it. Those requirements necessitate that the document's <span>character encoding
declaration</span>, if it exists, specify an <span>encoding label</span> using an <span>ASCII

@annevk

annevk Oct 6, 2017

Member

specifies?

source Outdated
case-insensitive</span> match for the string "<code data-x="">utf-8</code>". Regardless of whether
a <span>character encoding declaration</span> is present or not, the actual <span
data-x="document's character encoding">character encoding</span> used to store or transmit the
document must be <span>UTF-8</span>. <ref spec=ENCODING></p>

@annevk

annevk Oct 6, 2017

Member

Maybe say "to encode the document". Storage and transmission have little to do with text encoding.

source Outdated
data-x="attr-script-async">async</code>, and <code data-x="attr-script-defer">defer</code>
attributes. Authors should omit the attribute instead of redundantly setting it.</p></li>
<code data-x="attr-script-async">async</code> and <code data-x="attr-script-defer">defer</code>
attributes (as well as the legacy <code data-x="attr-script-charset">charset</code> attribute).

@annevk

annevk Oct 6, 2017

Member

Other legacy attributes can influence the processing model as well, but we don't mention them here. Is this really needed?

source Outdated
changes to the base URL also have no effect -->
<code data-x="attr-script-integrity">integrity</code> attributes dynamically has no direct effect;
these attributes are only used at specific times described below. (The same is true for the legacy
<code data-x="attr-script-charset">charset</code> attribute.</p> <!-- by implication, changes to

@annevk

annevk Oct 6, 2017

Member

Again, I don't think we mention the other legacy attributes here.

source Outdated
<code>script</code> element must <span>reflect</span> the element's
<code data-x="attr-script-event">event</code> content attribute.</p>
<p>The <dfn><code data-x="dom-script-event">event</code></dfn> and <dfn><code
data-x="dom-script-charset">charset</code></dfn> IDL attributes of the <code>script</code> element

@annevk

annevk Oct 6, 2017

Member

I think we do these in alphabetical order normally.

@annevk

annevk approved these changes Oct 6, 2017

Member

left a comment

I think it looks good now, but someone should probably double check my edit.

@sideshowbarker

Member

commented on 716ab5f Oct 6, 2017

716ab5f changes LGTM

@annevk annevk merged commit fae77e3 into master Oct 6, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@annevk annevk deleted the sideshowbarker/require-utf-8 branch Oct 6, 2017

@annevk annevk restored the sideshowbarker/require-utf-8 branch Oct 6, 2017

@annevk annevk deleted the sideshowbarker/require-utf-8 branch Oct 6, 2017

@annevk annevk added the i18n-comment label Dec 14, 2017

@annevk

Member

commented Dec 14, 2017

Sure thing, done.

@duerst


commented Dec 31, 2017

domenic commented on Oct 4:

E.g. https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg implies a good trend , and in general the "only UTF-8" meme has gotten pretty widespread.

It would be good to have some more recent data. The graph on Wikipedia is about 5 years old.

@sideshowbarker

Member Author

commented Dec 31, 2017

It would be good to have some more recent data. The graph on Wikipedia is about 5 years old.

https://w3techs.com/technologies/history_overview/character_encoding/ms/y has up-to-date data:

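| Encoding     | 2012 Jan | 2013 Jan | 2014 Jan | 2015 Jan | 2016 Jan | 2017 Jan | 2017 Dec |
|--------------|----------|----------|----------|----------|----------|----------|----------|
| UTF-8        | 68.0%    | 74.7%    | 78.7%    | 82.3%    | 86.0%    | 88.2%    | 90.5%    |
| ISO-8859-1   | 17.2%    | 13.5%    | 10.8%    | 9.3%     | 6.9%     | 5.5%     | 4.3%     |
| Windows-1251 | 3.3%     | 2.8%     | 2.7%     | 2.2%     | 1.9%     | 1.7%     | 1.5%     |
| Shift JIS    | 1.7%     | 1.4%     | 1.4%     | 1.3%     | 1.1%     | 1.0%     | 0.8%     |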

So the 5-6 year trend is, UTF-8 usage has grown from 68% in January 2012 to over 90% now.

And while it does show the rate of increase leveling off a bit, over the last 3 years it’s still been growing at over 2 percentage points per year.
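
The arithmetic behind that reading can be checked with a short Python sketch. The figures are the UTF-8 row quoted above; the variable names are only illustrative, and the last interval covers 11 months rather than a full year.

```python
# UTF-8 share (percent of sites) at each sample point, from the w3techs figures above.
utf8_share = [
    ("2012 Jan", 68.0),
    ("2013 Jan", 74.7),
    ("2014 Jan", 78.7),
    ("2015 Jan", 82.3),
    ("2016 Jan", 86.0),
    ("2017 Jan", 88.2),
    ("2017 Dec", 90.5),
]

# Change between consecutive sample points, in percentage points.
gains = [
    (f"{start} to {end}", round(after - before, 1))
    for (start, before), (end, after) in zip(utf8_share, utf8_share[1:])
]

for interval, gain in gains:
    print(f"{interval}: +{gain} points")

# Overall change across the whole period.
total_gain = round(utf8_share[-1][1] - utf8_share[0][1], 1)
print(f"Total: +{total_gain} points")
```

The three most recent intervals each show a gain above 2 points, which matches the claim that growth has slowed but not stopped.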
