Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use GBK as fallback, not gb18030 #4714

Merged
merged 2 commits into from
Oct 13, 2021
Merged

Use GBK as fallback, not gb18030 #4714

merged 2 commits into from
Oct 13, 2021

Conversation

annevk
Copy link
Member

@annevk annevk commented Jun 18, 2019

It's equivalent for decoding, but gives more conservative encoding that's likely to be more compatible.

Fixes #4557.

(See WHATWG Working Mode: Changes for more details.)


/parsing.html ( diff )

@annevk annevk requested a review from hsivonen June 18, 2019 13:21
@inexorabletash
Copy link
Member

@JinsukKim ?

@TimothyGu
Copy link
Member

I don’t think Chrome puts too much weight on UI language for reasons of predictability. In fact it puts any weight at all only when we have a file scheme URL, from what I can tell. As such, I don’t think this change would have too much of an effect either way as far as Chrome is concerned.

Copy link
Member

@domenic domenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will approve this as editorially it looks good, but I guess we need to figure out multi-implementer interest. Maybe @rniwa has some thoughts.

@domenic domenic added the needs implementer interest Moving the issue forward requires implementers to express interest label Jul 2, 2019
@annevk
Copy link
Member Author

annevk commented Sep 14, 2020

@rniwa ping?

@domenic I guess we should also evaluate interest for the status quo as that's clearly not it either, right?

@hsivonen
Copy link
Member

FWIW, the only case I'm aware of where Firefox still guesses the encoding from the UI language is the display of non-ASCII file names in FTP directory listings. However, when Firefox guesses something GBK/gb18030-ish from TLD or content, though, it guesses GBK, so I'm in favor of this change.

@domenic
Copy link
Member

domenic commented Sep 14, 2020

If we were to merge this change, I'd feel more comfortable if we had a second implementer say something similar to @hsivonen, along the lines of "The current spec doesn't match our implementation, but in spirit this change makes more sense than the current spec". So, @inexorabletash or @JinsukKim, ping again for Blink :).

That said, if this languishes for another week or so without such indications, I'd be OK merging as-is.

I guess we should also evaluate interest for the status quo as that's clearly not it either, right?

Yeah, if we want to sink real time into this, an overhaul of this section or table is needed. Maybe a good next step would be a TLD mapping to replace the UI language mapping.

Base automatically changed from master to main January 15, 2021 07:57
Copy link
Member

@hsivonen hsivonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+ with the comment fixed. Sorry about the delay.

source Outdated
@@ -103206,7 +103206,7 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
- ISO-8859-9 and windows-1254 are the same (supported by encoding.spec.whatwg.org)
- windows-31J and Shift_JIS are the same (supported by encoding.spec.whatwg.org)
- windows-932 is close enough to Shift_JIS to be treated as equivalent (supported by wikipedia)
- windows-936 is a basically a subset of GBK which is basically a subset of gb18030 (supported by wikipedia)
- windows-936 is a basically a subset of GBK which is basically a subset of gb18030, but use GBK for its conservative encoder (supported by wikipedia)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be:
"windows-936 and GBK are the same"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://en.wikipedia.org/wiki/Code_page_936_(Microsoft_Windows) does not support this assertion. Do you have a better source?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it say "basically the same".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I missed this question. My source is comparing the behavior of GBK in encoding_rs to the code page 936 behavior of the kernel32.dll converter by trying all two-byte combinations. IIRC, the differences relate to ill-formed byte sequences and to Encoding Standard GBK decoder accepting four-byte sequences, and the mapped two-byte sequences matched exactly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I guess GBK as per Encoding has 0x80 mapped to the Euro sign, so perhaps it's not quite the same as the "original" GBK. Oh well, I think I improved this comment to a good enough state.

It's equivalent for decoding, but gives more conservative encoding that's likely to be more compatible.

Fixes #4557.
@hsivonen
Copy link
Member

Both Firefox and Chrome say GBK here: https://hsivonen.com/test/moz/sniff-zh-hans.htm

@annevk annevk force-pushed the annevk/gbk-instead-of-gb18030 branch from 9d3413e to ff74f6e Compare October 13, 2021 07:20
@annevk annevk merged commit 9fa2a91 into main Oct 13, 2021
@annevk annevk deleted the annevk/gbk-instead-of-gb18030 branch October 13, 2021 07:29
@hsivonen
Copy link
Member

Thanks for merging.

Both Firefox and Chrome say GBK here: https://hsivonen.com/test/moz/sniff-zh-hans.htm

It turns out that Safari says gb18030 after setting the primary language of the OS to Simplified Chinese and rebooting. (Note: Setting Safari's language to Simplified Chinese and relaunching Safari is not enough.)

@annevk
Copy link
Member Author

annevk commented Oct 13, 2021

I filed https://bugs.webkit.org/show_bug.cgi?id=231660 against WebKit.

@annevk annevk removed the needs implementer interest Moving the issue forward requires implementers to express interest label Oct 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Suggest GBK instead of gb18030 for Simplified Chinese fallback
5 participants