Require utf-8 when specifying character encoding #3091

Merged
annevk merged 3 commits into master from sideshowbarker/require-utf-8 on Oct 6, 2017

Conversation

6 participants
@sideshowbarker
Member

sideshowbarker commented Oct 3, 2017

This addresses #3006.

@annevk

annevk Oct 3, 2017

Member

Is @hsivonen now comfortable with this? When Encoding initially required this there was a little bit of fear it might be too soon. So maybe we should split it out for <script charset> since it seems fine to start there.

@domenic

domenic Oct 3, 2017

Member

I am in support of doing this everywhere. E.g. https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg implies a good trend, and in general the "only UTF-8" meme has gotten pretty widespread.

I haven't reviewed the commits yet, but will do so soon, under the assumption that we're gonna go all the way.

@hsivonen

hsivonen Oct 4, 2017

Member

> Is @hsivonen now comfortable with this? When Encoding initially required this there was a little bit of fear it might be too soon. So maybe we should split it out for <script charset> since it seems fine to start there.

I think we should nudge authors towards making everything UTF-8. I'm still a bit worried about authors reacting to an error in a silly way: Making the charset attribute UTF-8 without changing the encoding of the resource to UTF-8.

I guess the exact message that the validator gives matters here. Assuming a message that is worded to complain more about the resource not being UTF-8 than about the value of the attribute per se, I'm OK with this.

As for script vs. link, I think non-UTF-8 CSS is more harmful than non-UTF-8 JS, because style sheet encoding gets inherited into URL parsing (i.e. URLs become context-dependent and don't work in the URL bar) but JS encoding doesn't get inherited anywhere.

Reviewing the patch:

- <p>Authoring tools should default to using <span>UTF-8</span> for newly-created documents. <ref
-  spec=ENCODING></p>

It seems to me that the patch should upgrade this to a MUST instead of removing it. I fail to locate a normative statement to that effect for the BOM and HTTP cases. (I see it only for the meta case.)
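
Editorial illustration of the point above that style sheet encoding gets inherited into URL parsing: a minimal Python sketch, assuming that `urllib.parse.quote` is an acceptable stand-in for the query-encoding step of a URL parser. The helper name `encode_query_component` and the sample string are hypothetical; this is not how any browser implements it, only a way to see why the same URL text can yield different bytes depending on the encoding of the resource it came from.

```python
from urllib.parse import quote


def encode_query_component(text: str, encoding: str) -> str:
    """Percent-encode `text` after converting it to bytes in `encoding`.

    Hypothetical helper: it mimics the idea that a URL's query is encoded
    with the encoding of the containing resource.
    """
    return quote(text, safe="", encoding=encoding)


# The same characters, as they might appear in a url(...) query in a style sheet.
utf8_form = encode_query_component("café", "utf-8")
legacy_form = encode_query_component("café", "windows-1252")

print(utf8_form)    # caf%C3%A9
print(legacy_form)  # caf%E9
```

The two results differ, which is the context dependence described above: pasting the original URL text into a URL bar (which uses UTF-8) would not necessarily request the same resource as the legacy-encoded style sheet did.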

@zcorpan

#3006 is a pull request -- I suppose this addresses #3004 and replaces #3006?

data-x="attr-meta-http-equiv-content-type">Encoding declaration state</span>, then the character
encoding used must be an <span>ASCII-compatible encoding</span>.</p>
<p>Authors should use <span>UTF-8</span>. Conformance checkers may advise authors against using

@zcorpan

zcorpan Oct 4, 2017

Member

Is the idea here that the Encoding Standard already requires utf-8? Maybe put in a statement of fact here mentioning that requirement, so we're clear (since the meta encoding declaration is itself optional and encoding could be specified in HTTP/BOM/XML decl).
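
Editorial illustration of why a statement covering HTTP, BOM, and meta declarations matters: a rough Python sketch, assuming a checker that is handed the raw bytes plus whatever charset labels were found in the HTTP header and the meta element. The names `sniff_bom` and `non_utf8_sources` are hypothetical, and treating only the exact label `utf-8` as UTF-8 is a simplification (the XML declaration case is left out entirely).

```python
import codecs
from typing import Dict, List, Optional

# Byte order marks and the encodings they signal.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding signalled by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def non_utf8_sources(
    data: bytes,
    http_charset: Optional[str] = None,
    meta_charset: Optional[str] = None,
) -> List[str]:
    """List every declaration source that names something other than utf-8."""
    declared: Dict[str, Optional[str]] = {
        "BOM": sniff_bom(data),
        "HTTP": http_charset,
        "meta": meta_charset,
    }
    return [
        source
        for source, label in declared.items()
        if label is not None and label.strip().lower() != "utf-8"
    ]
```

For example, `non_utf8_sources(b"\xff\xfeh\x00", http_charset="UTF-8", meta_charset="windows-1252")` would report the BOM and the meta declaration but not the HTTP header, showing that a requirement attached only to the meta declaration leaves the other sources uncovered.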

@sideshowbarker

sideshowbarker Oct 4, 2017

Member

> Is the idea here that the Encoding Standard already requires utf-8? Maybe put in a statement of fact here mentioning that requirement, so we're clear

OK, 94517b3 attempts to do that

@annevk

annevk Oct 4, 2017

Member

Yeah, most of #3006 ends up withdrawn. I need to create a separate PR for the minor things I fixed on the side.

@sideshowbarker

sideshowbarker Oct 4, 2017

Member

> When Encoding initially required this there was a little bit of fear it might be too soon.

I know — but that was nearly 5 years ago (January 2013). So finally requiring UTF-8 in HTML almost 5 years after Encoding initially required it doesn’t seem like we’re exactly rushing things…

> So maybe we should split it out for <script charset> since it seems fine to start there.

I’m OK with just merging the <script charset> part for now if that’s all we can get agreement on at the moment, but if we were to do that, how would we then decide when to finally go all the way with the rest?

I assume we’d agree we don’t want to wait, say, another 5 years. But short of that it’s not clear to me how we can measure when it’s no longer too soon and we’re instead finally ready to go forward with it.

So it seems like instead we just need to choose some point at which to do it, and then finally just do it.

@sideshowbarker

sideshowbarker Oct 4, 2017

Member

> I'm still a bit worried about authors reacting to an error in a silly way: Making the charset attribute UTF-8 without changing the encoding of the resource to UTF-8.

Yeah, agreed that would be a counterproductive outcome.

> I guess the exact message that the validator gives matters here. Assuming a message that is worded to complain more about the resource not being UTF-8 than about the value of the attribute per se, I'm OK with this.

OK, that’s doable of course, but to be clear: in the validator architecture, that message would need to come from the parser code, right?

> Reviewing the patch:
>
> - <p>Authoring tools should default to using <span>UTF-8</span> for newly-created documents. <ref
> -  spec=ENCODING></p>
>
> It seems to me that the patch should upgrade this to a MUST instead of removing it.

OK, I’ll make that change.

> I fail to locate a normative statement to that effect for the BOM and HTTP cases. (I see it only for the meta case.)

Not sure what you mean. I take it you don’t mean a normative statement about authoring tools in relation to the BOM, or a normative statement about authoring tools in relation to the HTTP-delivered charset.

@sideshowbarker

sideshowbarker Oct 4, 2017

Member

> #3006 is a pull request -- I suppose this addresses #3004 and replaces #3006?

Yeah (as @annevk noted)

@annevk

annevk Oct 4, 2017

Member

@sideshowbarker it seems that everyone who commented here is okay with going ahead with it, so let's (finally) do it.

@sideshowbarker

sideshowbarker Oct 4, 2017

Member

> Reviewing the patch:
>
> - <p>Authoring tools should default to using <span>UTF-8</span> for newly-created documents. <ref
> -  spec=ENCODING></p>
>
> It seems to me that the patch should upgrade this to a MUST instead of removing it.

OK, made it so in 769d6fe

@hsivonen

hsivonen Oct 4, 2017

Member

> OK, that’s doable of course, but to be clear: in the validator architecture, that message would need to come from the parser code, right?

While the parser could make sense for meta, a datatype in the datatype library would make more sense especially for link and script.
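
Editorial illustration of what such a datatype-style check might look like: a minimal Python sketch, not the validator's actual code (the validator itself is not written in Python). The function names `check_charset_value` and `is_valid_utf8` and the message wording are hypothetical; the wording follows the suggestion earlier in this thread that the message should complain about the resource's encoding rather than about the attribute value per se.

```python
from typing import Optional


def check_charset_value(element: str, value: str) -> Optional[str]:
    """Return an error message for a charset attribute value, or None if it is fine.

    Hypothetical datatype check: the only accepted value is a
    case-insensitive match for "utf-8".
    """
    if value.lower() == "utf-8":
        return None
    return (
        f"The resource referenced by this {element} element must be encoded "
        f"in UTF-8. Re-encode the resource as UTF-8 instead of only changing "
        f"charset='{value}'."
    )


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes of a resource decode strictly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

The second function addresses the mislabeling worry: a resource saved as windows-1252 and merely relabeled `charset="utf-8"` would pass the attribute check but typically fail `is_valid_utf8` once it contains non-ASCII bytes.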

@hsivonen

hsivonen Oct 4, 2017

Member

> Not sure what you mean. I take it you don’t mean a normative statement about authoring tools in relation to the BOM, or a normative statement about authoring tools in relation to the HTTP-delivered charset.

I meant the same thing as @zcorpan meant in the comment right after mine.

@sideshowbarker

sideshowbarker Oct 4, 2017

Member

> > OK, that’s doable of course, but to be clear: in the validator architecture, that message would need to come from the parser code, right?
>
> While the parser could make sense for meta, a datatype in the datatype library would make more sense especially for link and script.

Aha, yeah, OK. I’ll add a datatype checker for it that way to the validator sources.

@annevk

annevk Oct 4, 2017

Member

We should update those too.

@zcorpan

zcorpan Oct 4, 2017

Member

I agree about text/html. But I think we should probably separate accept-encoding in order to do proper reasoning and compat analysis for that.

@sideshowbarker

“Update text/html registration” change LGTM

@domenic

domenic Oct 5, 2017

Member

Per #3006 (comment), I was thinking we should make charset="utf-8" on script elements obsolete but conforming (i.e. validators display a warning), since in a UTF-8 document it is redundant, and we've recently been making redundant script attributes obsolete but conforming. This would mean the charset attribute on script gets a treatment similar to type on style.

@sideshowbarker

sideshowbarker Oct 5, 2017

Member

> I was thinking we should make charset="utf-8" on script elements obsolete but conforming… This would mean the charset attribute on script gets a treatment similar to type on style.

Yes, will update the source on this branch to do that

@domenic

domenic Oct 5, 2017

Member

I was going to do a review but then I thought it'd be easier to just tweak things myself so I got carried away and did a bit more. Let me know what you think :).

@sideshowbarker

sideshowbarker Oct 5, 2017

Member

> I was going to do a review but then I thought it'd be easier to just tweak things myself so I got carried away and did a bit more. Let me know what you think :).

Looks beautiful 🎉

sideshowbarker added some commits Oct 3, 2017

Require UTF-8
This change adds a “must” requirement for UTF-8 in all but one of the places in
the spec that define a means for specifying a character encoding.

Specifically, it makes UTF-8 required for any “character encoding declaration”,
which includes the HTTP Content-Type header sent with any document, the
`<meta charset>` element, and the `<meta http-equiv=content-type>` element.

Along with those, this change also makes UTF-8 required for `<script charset>`
but also moves `<script charset>` to being obsolete-but-conforming (because now
that both documents and scripts are required to use UTF-8, it’s redundant to
specify `charset` on the `script` element, since it inherits from the document).

To make the normative source of those requirements clear, this change also adds
a specific citation to the relevant requirement from the Encoding standard, and
updates the in-spec IANA registration for text/html media type to indicate that
UTF-8 is required. Finally, it changes an existing requirement for authoring
tools to use UTF-8 from a “should” to a “must”.

The one place where this change doesn’t yet add a requirement for UTF-8 is for
the `form` element’s `accept-charset` attribute. For that, see issue #3097.
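
Editorial illustration of the requirements summarized in this commit message: a small Python sketch of how a conformance checker might report them, assuming the standard-library `html.parser` module is good enough for tokenizing. The class `Utf8DeclarationChecker`, the helper `check_html`, and the message texts are all hypothetical and are not taken from any real validator. Like the change itself, the sketch does not check `accept-charset`.

```python
import re
from html.parser import HTMLParser
from typing import List

# Crude extraction of the charset parameter from a Content-Type style value.
_CHARSET_PARAM = re.compile(r"charset\s*=\s*[\"']?([^\"'\s;]+)", re.IGNORECASE)


class Utf8DeclarationChecker(HTMLParser):
    """Collect messages about encoding declarations that are not UTF-8."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        charset = attributes.get("charset")
        if tag == "meta":
            if charset is not None and charset.lower() != "utf-8":
                self.messages.append(f"error: <meta charset> must be utf-8, got {charset!r}")
            if (attributes.get("http-equiv") or "").lower() == "content-type":
                match = _CHARSET_PARAM.search(attributes.get("content") or "")
                if match and match.group(1).lower() != "utf-8":
                    self.messages.append(
                        f"error: <meta http-equiv=content-type> must declare utf-8, "
                        f"got {match.group(1)!r}"
                    )
        elif tag == "script" and charset is not None:
            if charset.lower() == "utf-8":
                # Obsolete but conforming: redundant, so only a warning.
                self.messages.append("warning: charset on <script> is unnecessary")
            else:
                self.messages.append(f"error: <script charset> must be utf-8, got {charset!r}")


def check_html(markup: str) -> List[str]:
    """Run the checker over a string of markup and return its messages."""
    checker = Utf8DeclarationChecker()
    checker.feed(markup)
    checker.close()
    return checker.messages
```

Under these assumptions, `check_html('<script charset="utf-8" src="a.js"></script>')` yields a single warning, while a non-UTF-8 value on `meta` or `script` yields an error.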

@sideshowbarker sideshowbarker referenced this pull request Oct 6, 2017

Closed

Require UTF-8 #1039

@annevk

I found a couple more nits. Happy to fix these later today.

@annevk

annevk approved these changes Oct 6, 2017

I think it looks good now, but someone should probably double check my edit.

@sideshowbarker

sideshowbarker Oct 6, 2017

Member

716ab5f changes LGTM

@annevk annevk merged commit fae77e3 into master Oct 6, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@annevk annevk deleted the sideshowbarker/require-utf-8 branch Oct 6, 2017

@annevk annevk restored the sideshowbarker/require-utf-8 branch Oct 6, 2017

@annevk annevk deleted the sideshowbarker/require-utf-8 branch Oct 6, 2017

@annevk annevk added the i18n-comment label Dec 14, 2017

@annevk

annevk Dec 14, 2017

Member

Sure thing, done.

@duerst

duerst Dec 31, 2017

domenic commented on Oct 4:

> E.g. https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg implies a good trend, and in general the "only UTF-8" meme has gotten pretty widespread.

It would be good to have some more recent data. The graph on Wikipedia is about 5 years old.

@sideshowbarker

sideshowbarker Dec 31, 2017

Member

> It would be good to have some more recent data. The graph on Wikipedia is about 5 years old.

https://w3techs.com/technologies/history_overview/character_encoding/ms/y has up-to-date data:

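| Encoding | 2012 Jan | 2013 Jan | 2014 Jan | 2015 Jan | 2016 Jan | 2017 Jan | 2017 Dec |
| --- | --- | --- | --- | --- | --- | --- | --- |
| UTF-8 | 68.0% | 74.7% | 78.7% | 82.3% | 86.0% | 88.2% | 90.5% |
| ISO-8859-1 | 17.2% | 13.5% | 10.8% | 9.3% | 6.9% | 5.5% | 4.3% |
| Windows-1251 | 3.3% | 2.8% | 2.7% | 2.2% | 1.9% | 1.7% | 1.5% |
| Shift JIS | 1.7% | 1.4% | 1.4% | 1.3% | 1.1% | 1.0% | 0.8% |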

So the 5-6 year trend is, UTF-8 usage has grown from 68% in January 2012 to over 90% now.

And while it does show the rate of increase leveling off a bit, over the last 3 years it’s still been growing at over 2 percentage points per year.
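
Editorial illustration of the arithmetic behind that growth claim: a short Python sketch using only the UTF-8 figures quoted in this comment. The helper names are hypothetical, and the rates are in percentage points per year, annualized by the number of months between samples.

```python
from typing import List, Tuple

# (sample month, UTF-8 share in percent), as quoted above.
UTF8_SHARE: List[Tuple[str, float]] = [
    ("2012-01", 68.0),
    ("2013-01", 74.7),
    ("2014-01", 78.7),
    ("2015-01", 82.3),
    ("2016-01", 86.0),
    ("2017-01", 88.2),
    ("2017-12", 90.5),
]


def months_between(start: str, end: str) -> int:
    """Number of months from one YYYY-MM label to another."""
    start_year, start_month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    return (end_year - start_year) * 12 + (end_month - start_month)


def annualized_gain(first: Tuple[str, float], last: Tuple[str, float]) -> float:
    """Growth in percentage points per year between two samples."""
    months = months_between(first[0], last[0])
    return (last[1] - first[1]) / months * 12


whole_period = annualized_gain(UTF8_SHARE[0], UTF8_SHARE[-1])
last_three_years = annualized_gain(UTF8_SHARE[3], UTF8_SHARE[-1])
print(round(whole_period, 1))      # 3.8
print(round(last_three_years, 1))  # 2.8
```

From January 2015 to December 2017 the share rose 8.2 points over 35 months, or about 2.8 points per year, and each individual interval in that span also comes out above 2 points per year when annualized.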
