Suggest GBK instead of gb18030 for Simplified Chinese fallback #4557

Closed
hsivonen opened this issue Apr 23, 2019 · 3 comments · Fixed by #4714
Labels
i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

Comments

@hsivonen
Member

In the table under step 8 of https://html.spec.whatwg.org/#determining-the-character-encoding , change gb18030 to GBK. Both decode the same way per the Encoding Standard, but GBK doesn't generate 4-byte sequences in form submission. Sites that are legacy enough to have unlabeled content might not be able to deal with the 4-byte sequences.

(Firefox uses GBK instead of gb18030.)
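
To make the size difference concrete, here is a minimal sketch using Python's built-in codecs. It assumes that Python's `gbk` and `gb18030` codecs are close enough to the Encoding Standard's encoders for this purpose; they are not identical in every detail. The helper name `encoded_length` is hypothetical.

```python
from typing import Optional


def encoded_length(text: str, encoding: str) -> Optional[int]:
    """Return the byte length of `text` in `encoding`, or None if it cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


common = "\u4e2d"       # U+4E2D, a character covered by GBK
outside = "\U0001F600"  # U+1F600, a supplementary-plane character outside GBK

# A GBK-representable character takes two bytes in both encodings.
print(encoded_length(common, "gbk"), encoded_length(common, "gb18030"))

# A character outside GBK has no GBK encoding; gb18030 emits a 4-byte sequence.
print(encoded_length(outside, "gbk"), encoded_length(outside, "gb18030"))
```

Under these assumptions the first line prints `2 2` and the second prints `None 4`. The 4-byte result is the kind of sequence a GBK-only form handler may not expect.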

@annevk annevk added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Apr 23, 2019
@TimothyGu
Member

What happens if the form submission contains a character that cannot be represented in GBK though, and requires GB 18030?

@hsivonen
Member Author

hsivonen commented Jun 17, 2019

> What happens if the form submission contains a character that cannot be represented in GBK though, and requires GB 18030?

The characters not representable in GBK will be converted to decimal numeric character references.
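
A rough illustration of that fallback, assuming Python's `xmlcharrefreplace` error handler as a stand-in for the browser's behavior (it also emits decimal numeric character references). The function name `encode_form_value` is hypothetical, and the percent-encoding step of real form submission is left out.

```python
def encode_form_value(value: str, encoding: str = "gbk") -> bytes:
    """Encode a form value, replacing unencodable characters with decimal
    numeric character references (&#N;) instead of raising an error."""
    return value.encode(encoding, errors="xmlcharrefreplace")


text = "\u4e2d\U0001F600"  # one GBK character followed by one non-GBK character

# With GBK, the second character becomes the ASCII bytes of "&#128512;".
print(encode_form_value(text, "gbk"))

# With gb18030, the same character is sent as a 4-byte sequence instead.
print(encode_form_value(text, "gb18030"))
```

The GBK output stays within byte sequences a GBK-only handler can parse, at the cost of the server receiving a character reference rather than the character itself.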

Failing to label the encoding is a legacy authoring error, so the affected sites are old ones. It seems implausible that more of those sites would break because non-GBK characters arrive as numeric character references instead of gb18030 bytes than would break because 4-byte gb18030 sequences arrive at a form handler that expects only GBK.

We don't have actual data for this. The decision extrapolates from Firefox's experience with problems caused by submitting 3-byte EUC-JP sequences. That experience led to making the Big5 encoder asymmetric with its decoder, and to keeping GBK as an encoding distinct from gb18030, with an encoder that is asymmetric with its decoder.

@hsivonen
Member Author

Anyway, if we believe that the Encoding Standard made the right call on GBK, what the HTML Standard says makes no sense.

annevk added a commit that referenced this issue Jun 18, 2019
It's equivalent for decoding, but gives more conservative encoding that's likely to be more compatible.

Fixes #4557.
annevk added a commit that referenced this issue Oct 13, 2021
It's equivalent for decoding, but gives more conservative encoding that's likely to be more compatible.

Fixes #4557.
dandclark pushed a commit to dandclark/html that referenced this issue Dec 4, 2021
It's equivalent for decoding, but gives more conservative encoding that's likely to be more compatible.

Fixes whatwg#4557.
mfreed7 pushed a commit to mfreed7/html that referenced this issue Jun 3, 2022
It's equivalent for decoding, but gives more conservative encoding that's likely to be more compatible.

Fixes whatwg#4557.