Consider restricting <form accept-charset> to utf-8 #3097

Open
zcorpan opened this Issue Oct 4, 2017 · 13 comments

Comments

5 participants
@zcorpan
Member

zcorpan commented Oct 4, 2017

From #3091 (comment)

Should look into how accept-charset is used, and ponder what the consequences would be if we only allowed <form accept-charset="utf-8"> as conforming.

cc @sideshowbarker @hsivonen

sideshowbarker added a commit that referenced this issue Oct 5, 2017

Require UTF-8
This change adds a “must” requirement for UTF-8 in most places in the spec that
define a means for specifying a character encoding.

Specifically, it makes UTF-8 required for any “character encoding declaration”,
which includes the HTTP Content-Type header sent with any document, the `<meta
charset>` element, and the `<meta http-equiv=content-type>` element.

Along with those, this change also makes UTF-8 required for `<script charset>`.

To make the normative source of those requirements clear, this change also adds
a specific citation to the relevant requirement from the Encoding standard, and
updates the in-spec IANA registration for text/html media type to indicate that
UTF-8 is required. Finally, it changes an existing requirement for authoring
tools to use UTF-8 from a “should” to a “must”.

One place where this change doesn’t yet add a requirement for UTF-8 is for the
`form` element’s `accept-charset` attribute. For that, see issue #3097.
Member

annevk commented Oct 5, 2017

I think we basically have to do this. You get data loss and interoperability issues with other encodings. And fetch() and XMLHttpRequest already force it.
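To make the data loss concrete: when a form payload is encoded with a legacy encoding, characters that encoding cannot represent are replaced with HTML numeric character references, which is irreversible. A minimal Python sketch (Python's xmlcharrefreplace error handler mirrors that fallback; the sample string is made up):

```python
def form_encode(value: str, encoding: str) -> bytes:
    # Characters the chosen encoding cannot represent fall back to HTML
    # numeric character references, indistinguishable from a user who
    # literally typed "&#8364;" -- i.e. silent data loss.
    return value.encode(encoding, errors="xmlcharrefreplace")

print(form_encode("€5", "utf-8"))       # b'\xe2\x82\xac5'
print(form_encode("€5", "iso-8859-1"))  # b'&#8364;5'
```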

Member

sideshowbarker commented Oct 5, 2017

I think we basically have to do this. You get data loss and interoperability issues with other encodings. And fetch() and XMLHttpRequest already force it.

So then we can (should) just add a change for it to #3091, right?

Member

annevk commented Oct 5, 2017

I think so, but @zcorpan wants to do research first as I understand it.

Member

sideshowbarker commented Oct 5, 2017

I think so, but @zcorpan wants to do research first as I understand it.

Ah, OK. If so, then I think we should go ahead and do what we seem to have already been planning: merge #3091 first (without this, but with a note about it in the commit message) and then handle this as a separate change later.

Member

annevk commented Oct 5, 2017

Yeah, I think that's fine.

sideshowbarker added a commit that referenced this issue Oct 6, 2017

Require UTF-8
This change adds a “must” requirement for UTF-8 in all but one of the places in
the spec that define a means for specifying a character encoding.

Specifically, it makes UTF-8 required for any “character encoding declaration”,
which includes the HTTP Content-Type header sent with any document, the `<meta
charset>` element, and the `<meta http-equiv=content-type>` element.

Along with those, this change also makes UTF-8 required for `<script charset>`
but also moves `<script charset>` to being obsolete-but-conforming (because now
that both documents and scripts are required to use UTF-8, it’s redundant to
specify `charset` on the `script` element, since it inherits from the document).

To make the normative source of those requirements clear, this change also adds
a specific citation to the relevant requirement from the Encoding standard, and
updates the in-spec IANA registration for text/html media type to indicate that
UTF-8 is required. Finally, it changes an existing requirement for authoring
tools to use UTF-8 from a “should” to a “must”.

The one place where this change doesn’t yet add a requirement for UTF-8 is for
the `form` element’s `accept-charset` attribute. For that, see issue #3097.

annevk added a commit that referenced this issue Oct 6, 2017

Require UTF-8
This change adds a “must” requirement for UTF-8 in all but one of the places in the standard that define a means for specifying a character encoding.

Specifically, it makes UTF-8 required for any “character encoding declaration”, which includes the HTTP Content-Type header sent with any document, the `<meta charset>` element, and the `<meta http-equiv=content-type>` element.

Along with those, this change also makes UTF-8 required for `<script charset>` but also moves `<script charset>` to being obsolete-but-conforming (because now that both documents and scripts are required to use UTF-8, it’s redundant to specify `charset` on the `script` element, since it inherits from the document).

To make the normative source of those requirements clear, this change also adds a specific citation to the relevant requirement from the Encoding standard, and updates the in-document IANA registration for text/html media type to indicate that UTF-8 is required. Finally, it changes an existing requirement for authoring tools to use UTF-8 from a “should” to a “must”.

The one place where this change doesn’t yet add a requirement for UTF-8 is for the `form` element’s `accept-charset` attribute. For that, see issue #3097.

Closes #3004.
rviscomi commented Dec 15, 2017

Is this still something you need compat analysis help with? If so, are you just interested in the frequency of accept-charset values?

SELECT
  COUNT(0) AS frequency,
  value
FROM
  `httparchive.har.2017_12_01_chrome_requests_bodies`,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'accept-charset=[\'"]([^\'"]+)')) AS value
GROUP BY
  value
ORDER BY
  frequency DESC
frequency value
64185 utf-8
459 iso-8859-1
113 windows-1251
61 gbk
38 euc-jp
35 gb2312
32 utf8
32 euc-kr
26 shift_jis
24 shift-jis
21 iso-8859-9
18 iso-8859-2
17 unknown
16 tis-620
12 urf-8
11 utf-8,*
9 x-euc-jp
8 iso-8859-15
6 big5
5 cp1251
5 {{config.busca.acceptcharset}}
5 iso-8859-1;utf-8
5 windows-1250
4 windows-1256
3 {{vm.charset}}
3 ,
2 iso-8859-5
2 windows-1253
2 windows-1252
2 sjis
2 us-ascii
2 index.php
2 cb_form_accept_charset
2 iso-8859-1, utf-8
1 iso 8859-9
1 utf-10
1 swedish
1 + chat.cfg.charset +
1 ###form.accept-charset###
1 u
1 latin-1
1 cp-1251
1 gb18030
1 utf-8, utf-8
1 utf-8
1 windows-31j
1 iso-8859-1, iso-8859-2, utf-8
1 character_set
1 false
1 utf-8,us-ascii
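For anyone who wants to sanity-check the extraction, the regex the query applies to response bodies behaves like this in Python (a small illustration with a made-up body, not part of the original analysis):

```python
import re

# Same pattern as the REGEXP_EXTRACT_ALL in the BigQuery query above;
# note it matches the attribute anywhere in the body, not only inside
# <form> tags.
pattern = re.compile(r'accept-charset=[\'"]([^\'"]+)')

body = '<form accept-charset="ISO-8859-1" action="/search">'
print(pattern.findall(body.lower()))  # ['iso-8859-1']
```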
Member

annevk commented Dec 15, 2017

Thank you @rviscomi, that's very useful. Those results suggest to me that restricting this further might affect some folks, but should overall be largely uncontroversial. (Note that we're not talking about implementations here. Just about what the validator says.)

Member

zcorpan commented Dec 17, 2017

Thank you @rviscomi! Now we know how many resources in the HTTP Archive data set would be affected, but not necessarily the percentage of sites (since one site might use several resources that match).

I think it would be useful to know how many of those that use a value other than utf-8 have an action that points to a third party (so the affected page's author might not be able to fix the action resource). That's probably a bit convoluted to write a query for, though! 😁

Another option could be to experiment with emitting a warning in the validator and listen for feedback/complaints on SO. If there are many complaints, we back off; if there aren't many and the count declines over time, we can update the standard and upgrade the warning to an error.

Thoughts?

rviscomi commented Dec 18, 2017

Now we know how many resources in the HTTP Archive data set would be affected, but not necessarily the percentage of sites (since one site might use several resources that match).

Modified the query to count the number of websites that include at least one charset that is not utf-8:

SELECT
  COUNT(DISTINCT page) AS affected_sites
FROM
  `httparchive.har.2017_12_01_chrome_requests_bodies`,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'accept-charset=[\'"]([^\'"]+)')) AS value
WHERE
  value != 'utf-8'

The result is 624. So out of the 431,851 sites analyzed this is merely 0.14%.

I think it would be useful to know how many of those that use a value other than utf-8 have an action that points to a third party

Yeah that's definitely a tricky one :)

SELECT
  COUNT(DISTINCT page)
FROM
  `httparchive.har.2017_12_01_chrome_requests_bodies`,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'accept-charset=[\'"]([^\'"]+)')) AS value,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'action=[\'"]([^\'"]+)')) AS action
WHERE
  value != 'utf-8' AND
  NET.REG_DOMAIN(action) != NET.REG_DOMAIN(page)

I'm cutting a corner here for the sake of simplicity; I'm assuming that the action attribute corresponds to the same form element as the accept-charset attribute. The result is 165 websites (~26% of the 624).
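The corner-cut described above (cross-joining every accept-charset with every action on the page) could be tightened by pairing the two attributes within the same form tag. A rough Python sketch of that idea, using made-up markup:

```python
import re

def forms(body):
    # Pair accept-charset and action extracted from the *same* <form> tag,
    # rather than cross-joining all values found anywhere on the page.
    for tag in re.findall(r'<form[^>]*>', body, re.I):
        charset = re.search(r'accept-charset=[\'"]([^\'"]+)', tag, re.I)
        action = re.search(r'action=[\'"]([^\'"]+)', tag, re.I)
        yield (charset and charset.group(1).lower(),
               action and action.group(1))

body = ('<form accept-charset="GBK" action="https://third.example/s">'
        '<form action="/local">')
print(list(forms(body)))
# [('gbk', 'https://third.example/s'), (None, '/local')]
```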

bsittler commented Dec 18, 2017

Are you also considering cases where a form does not explicitly declare accept-charset, but inherits a non-UTF-8 charset from the page of which it is part, to be non-conforming? And would this change be separate from or concurrent with #3091?

bsittler commented Dec 18, 2017

Also, I could see separately disallowing (or at least treating as non-conforming) the use of non-UTF-8 character encodings in cross-site form submissions and href constructions, and in form submissions and navigations originating from pages fetched over unsecured connections (a non-https: page containing the form or link). These cases are inherently riskier and more prone to injection attacks than same-registered-domain, https:-only navigations and form submissions, given the added risk of interference by bad actors. Is it worthwhile to consider nonconformance and deprecation for those cases on a more aggressive schedule than for same-site https: cases?

rviscomi commented Dec 18, 2017

Are you also considering cases where a form does not explicitly declare accept-charset, but inherits a non-UTF-8 charset from the page of which it is part?

No. We can change the query to look for attribute names ending in "charset", which should cover both inherited charsets and explicit form accept-charsets. Again, this is just an approximation so take it with a grain of salt.

SELECT
  COUNT(DISTINCT page)
FROM
  `httparchive.har.2017_12_01_chrome_requests_bodies`,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'charset=[\'"]([^\'"]+)')) AS value,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'action=[\'"]([^\'"]+)')) AS action
WHERE
  value IS NOT NULL AND
  action IS NOT NULL AND
  value != 'utf-8' AND
  NET.REG_DOMAIN(action) != NET.REG_DOMAIN(page)

Result: 1545 (out of ~430,000 distinct pages in the dataset).

Member

zcorpan commented Dec 19, 2017

@bsittler

Are you also considering cases where a form does not explicitly declare accept-charset, but inherits a non-UTF-8 charset from the page of which it is part, to be non-conforming?

What I had in mind when reporting this issue was that the form itself wouldn't be non-conforming (but the page already is, since using and specifying utf-8 for the page is required as of #3091).

But, I don't know, would it be a good idea to require an explicit accept-charset="utf-8" if the page itself isn't utf-8? I hadn't thought about that. That would also affect a much higher percentage of pages (though those pages are already non-conforming). Would the end goal be to make it possible for browser engines to eventually force utf-8 for all form submissions, once the number of non-utf-8 form submissions gets close to 0?

@rviscomi's query in the previous comment doesn't look for this case, I believe; it would need to look for HTML pages that are in utf-8 (ideally checking the specified Content-Type or meta) and that also have a form element without an accept-charset attribute.

And would this change be separate from or concurrent with #3091 ?

This is a separate change.
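The check zcorpan describes could be sketched roughly like this (a hypothetical illustration that only looks at <meta charset>; a real query would also need to consider the Content-Type header):

```python
import re

def has_bare_form_on_utf8_page(body: str) -> bool:
    # True when the page declares utf-8 via <meta charset> and contains
    # at least one form element without an accept-charset attribute.
    lowered = body.lower()
    is_utf8 = bool(re.search(r'<meta\s+charset=[\'"]?utf-8', lowered))
    bare_form = any('accept-charset' not in tag
                    for tag in re.findall(r'<form[^>]*>', lowered))
    return is_utf8 and bare_form

page = '<meta charset="utf-8"><form action="/go"><input name=q></form>'
print(has_bare_form_on_utf8_page(page))  # True
```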
