Use GBK as fallback, not gb18030 #4714
I don’t think Chrome puts too much weight on UI language for reasons of predictability. In fact it puts any weight at all only when we have a |
I will approve this as editorially it looks good, but I guess we need to figure out multi-implementer interest. Maybe @rniwa has some thoughts.
FWIW, the only case I'm aware of where Firefox still guesses the encoding from the UI language is the display of non-ASCII file names in FTP directory listings. However, when Firefox guesses something GBK/gb18030-ish from TLD or content, it guesses GBK, so I'm in favor of this change.
If we were to merge this change, I'd feel more comfortable if we had a second implementer say something similar to @hsivonen, along the lines of "The current spec doesn't match our implementation, but in spirit this change makes more sense than the current spec". So, @inexorabletash or @JinsukKim, ping again for Blink :). That said, if this languishes for another week or so without such indications, I'd be OK merging as-is.
Yeah, if we want to sink real time into this, an overhaul of this section or table is needed. Maybe a good next step would be a TLD mapping to replace the UI language mapping.
r+ with the comment fixed. Sorry about the delay.
source
@@ -103206,7 +103206,7 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
  - ISO-8859-9 and windows-1254 are the same (supported by encoding.spec.whatwg.org)
  - windows-31J and Shift_JIS are the same (supported by encoding.spec.whatwg.org)
  - windows-932 is close enough to Shift_JIS to be treated as equivalent (supported by wikipedia)
- - windows-936 is a basically a subset of GBK which is basically a subset of gb18030 (supported by wikipedia)
+ - windows-936 is a basically a subset of GBK which is basically a subset of gb18030, but use GBK for its conservative encoder (supported by wikipedia)
Should be:
"windows-936 and GBK are the same"
https://en.wikipedia.org/wiki/Code_page_936_(Microsoft_Windows) does not support this assertion. Do you have a better source?
I made it say "basically the same".
Oops, I missed this question. My source is comparing the behavior of GBK in encoding_rs to the code page 936 behavior of the kernel32.dll converter by trying all two-byte combinations. IIRC, the differences relate to ill-formed byte sequences and to the Encoding Standard GBK decoder accepting four-byte sequences, and the mapped two-byte sequences matched exactly.
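For anyone wanting to repeat this kind of check, here is a rough sketch of the brute-force comparison described above. It is only an illustration: it uses Python's built-in codecs as stand-ins, not encoding_rs or the kernel32.dll converter, and the helper names `try_decode` and `compare_two_byte` are made up for this example.

```python
from typing import List, Optional, Tuple


def try_decode(data: bytes, label: str) -> Optional[str]:
    """Decode strictly; return None when the bytes are ill-formed for this codec."""
    try:
        return data.decode(label)
    except UnicodeDecodeError:
        return None


def compare_two_byte(
    label_a: str, label_b: str
) -> List[Tuple[bytes, Optional[str], Optional[str]]]:
    """Try every two-byte combination and collect inputs where the codecs disagree."""
    mismatches = []
    for lead in range(0x100):
        for trail in range(0x100):
            data = bytes((lead, trail))
            a = try_decode(data, label_a)
            b = try_decode(data, label_b)
            if a != b:
                mismatches.append((data, a, b))
    return mismatches


# Hypothetical usage: list the two-byte inputs where Python's "gbk" and
# "gb18030" codecs disagree. Results for other implementations may differ.
diffs = compare_two_byte("gbk", "gb18030")
```

Treating decode errors as `None` keeps ill-formed sequences in the comparison, which matters here because that is where the reported differences were.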
Ah, I guess GBK as per Encoding has 0x80 mapped to the Euro sign, so perhaps it's not quite the same as the "original" GBK. Oh well, I think I improved this comment to a good enough state.
It's equivalent for decoding, but gives more conservative encoding that's likely to be more compatible. Fixes #4557.
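To illustrate what "more conservative encoding" means in practice, here is a minimal sketch using Python's built-in `gbk` and `gb18030` codecs. These are stand-ins and not the Encoding Standard's encoders, and `encode_for_legacy` is a hypothetical helper; the character reference fallback only approximates how unmappable characters are handled in form submission.

```python
# Illustration only: Python's codecs stand in for the Encoding Standard's
# encoders and may differ from them in edge cases.


def encode_for_legacy(text: str, label: str) -> bytes:
    """Encode text, replacing unmappable characters with numeric character references."""
    return text.encode(label, errors="xmlcharrefreplace")


common = "中文"        # inside GBK's two-byte repertoire
rare = "\U00020000"    # CJK Extension B, outside GBK's two-byte repertoire

# Text within the two-byte repertoire encodes to the same bytes either way.
same_bytes = encode_for_legacy(common, "gbk") == encode_for_legacy(common, "gb18030")

# gb18030 emits a four-byte sequence that a GBK-only consumer may not
# understand; gbk never produces four-byte sequences and falls back instead.
gb18030_rare = encode_for_legacy(rare, "gb18030")
gbk_rare = encode_for_legacy(rare, "gbk")

print(same_bytes, len(gb18030_rare), gbk_rare)
```

The point of the sketch is that the two labels only diverge on the encoder side for characters outside the two-byte repertoire, which is where GBK output is more likely to be compatible with legacy consumers.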
Both Firefox and Chrome say GBK here: https://hsivonen.com/test/moz/sniff-zh-hans.htm
9d3413e to ff74f6e
Thanks for merging.
It turns out that Safari says gb18030 after setting the primary language of the OS to Simplified Chinese and rebooting. (Note: Setting Safari's language to Simplified Chinese and relaunching Safari is not enough.)
I filed https://bugs.webkit.org/show_bug.cgi?id=231660 against WebKit.