"gb18030 ranges" have problematic definitions #17
Comments
Why would the gb18030 encoder algorithm ever hit the gb18030 ranges index for U+8000? It's in the gb18030 index.
Well then, here is another example of a problematic code point, and this time it doesn't appear in "gb18030 index": U+E5E5.
but:
and 69292 (F5F9 differs from E5E5).
Hmm, not immediately sure why that would fail. I assume the decoder/encoder has the same problem here? Doesn't seem like the byte conversion part should matter much.
@r12a any ideas? It seems that for your tests at http://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en browsers mostly pass.
i didn't test all the BMP for gb18030, just selected ranges, so i missed that particular character. However, there are now two new tests that do test it, and find it (and only that one in the PUA) to be problematic on Firefox and Chrome (didn't try any others). i won't integrate these tests into the i18n test suite fully until we decide what needs to be done about this. i don't have any clear idea about why this fails, although peter's suggestion seems plausible (note that the problem character is the first one in the PUA that isn't in the index). (http://r12a.github.io/apps/encodings/ also shows the behaviour @peteroupc describes)
Also, if #22 does not change the mapping for 0x8135F437 from U+1E3F to U+E7C7, another code point would have the problem.
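For readers following the byte arithmetic, here is a minimal sketch of how a four-byte gb18030 sequence relates to a ranges pointer, based on my reading of the Encoding Standard's decoder and encoder steps. The function names are made up for illustration; only the arithmetic is meant to reflect the spec. It shows that 0x8135F437 corresponds to pointer 7457, and that the conversion itself is reversible, which fits the earlier remark that the byte conversion part is unlikely to be the culprit.

```python
def four_bytes_to_pointer(b1, b2, b3, b4):
    # Hypothetical helper: pointer for a four-byte sequence, following the
    # decoder arithmetic as I understand it from the Encoding Standard.
    return ((b1 - 0x81) * (10 * 126 * 10)
            + (b2 - 0x30) * (10 * 126)
            + (b3 - 0x81) * 10
            + (b4 - 0x30))


def pointer_to_four_bytes(pointer):
    # Hypothetical helper: the inverse direction, as used when encoding.
    b1, rem = divmod(pointer, 10 * 126 * 10)
    b2, rem = divmod(rem, 10 * 126)
    b3, b4 = divmod(rem, 10)
    return bytes([b1 + 0x81, b2 + 0x30, b3 + 0x81, b4 + 0x30])


pointer = four_bytes_to_pointer(0x81, 0x35, 0xF4, 0x37)
print(pointer)                               # 7457
print(pointer_to_four_bytes(pointer).hex())  # 8135f437
```

Since bytes and pointers convert back and forth without loss, any round-trip failure has to come from the pointer/code point lookup, not from this step.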
@r12a I'm confused by the outcome of your tests since it suggests browsers simply do not encode U+E5E5 at all despite gb18030 supposedly being a UTF (both Chrome and Firefox emit an "HTML entity").
For Firefox, this is intentional because some sites expected a space for gbk 0xA3A0.
Ah, that is the problem. "This matches the GB18030-2000 standard for code points encoded as two bytes, except for 0xA3 0xA0 which maps to U+3000 to be compatible with deployed content." Except we never took care of mapping U+E5E5 to an error in the encoder.
I don't agree with emitting an encoder error for the code point U+E5E5; like "vyv03354", I would rather change the mapping for this code point back, if that is indeed what the GB18030 standard says.
Since most browsers emit an error, it seems safer to just do that. Especially since WebKit "fixed" this in 2008, six years after Gecko did. Seems likely it might still be problematic.
Because of deployed content, index gb18030 maps 0xA3 0xA0 to U+3000 rather than U+E5E5 when decoding. Therefore encoding U+E5E5 cannot work either.
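A small sketch of why the override makes U+E5E5 unencodable through the two-byte index. The pointer formula follows my reading of the Encoding Standard's two-byte step; `TOY_INDEX` is a hypothetical one-entry excerpt containing only the mapping discussed in this thread, not the real index, and the helper names are invented for illustration.

```python
def two_bytes_to_pointer(lead, trail):
    # Two-byte pointer arithmetic as I understand it from the Encoding Standard.
    offset = 0x40 if trail < 0x7F else 0x41
    return (lead - 0x81) * 190 + (trail - offset)


# Hypothetical excerpt: the single overridden entry, 0xA3 0xA0 -> U+3000.
TOY_INDEX = {two_bytes_to_pointer(0xA3, 0xA0): 0x3000}


def toy_decode(lead, trail):
    # Decoding direction: bytes -> pointer -> code point.
    return TOY_INDEX.get(two_bytes_to_pointer(lead, trail))


def toy_index_pointer(code_point):
    # Encoding direction: first pointer whose entry is this code point.
    for pointer, mapped in TOY_INDEX.items():
        if mapped == code_point:
            return pointer
    return None


print(two_bytes_to_pointer(0xA3, 0xA0))   # 6555
print(hex(toy_decode(0xA3, 0xA0)))        # 0x3000
print(toy_index_pointer(0xE5E5))          # None
```

With the override in place, nothing in the index produces U+E5E5, so the reverse lookup comes back empty. The encoder then has to either fall through to the ranges lookup or report an error for that code point.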
I created a PR for my proposal in #25. I would appreciate review before landing this.
It seems that for some code points, gb18030 ranges doesn't
have a round-trip mapping. Take the code point U+8000 for example.
When we apply the "index gb18030 ranges pointer" we get:
but when we apply the "index gb18030 ranges code point" from
31843 we get:
and that differs from our original 32768. I think the reason is that each range is poorly defined; it's not clear where each range starts and ends in "index-gb18030-ranges.txt".
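To make the round-trip question concrete, here is a minimal sketch of the two ranges lookups using a toy table. `TOY_RANGES` is entirely hypothetical and is not data from "index-gb18030-ranges.txt"; the lookups follow the "last entry less than or equal to" rule as I understand it, and the special cases in the real algorithms are left out. In the toy table, the first range is imagined to cover only code points 0x80 to 0x89, with 0x8A to 0xFF living in the main index instead.

```python
import bisect

# Hypothetical (pointer, code point) range starts; NOT the real index data.
TOY_RANGES = [(0, 0x80), (10, 0x100), (20, 0x200)]
POINTERS = [p for p, _ in TOY_RANGES]
CODE_POINTS = [c for _, c in TOY_RANGES]


def ranges_pointer(code_point):
    # Last range whose starting code point is <= code_point.
    i = bisect.bisect_right(CODE_POINTS, code_point) - 1
    return POINTERS[i] + code_point - CODE_POINTS[i]


def ranges_code_point(pointer):
    # Last range whose starting pointer is <= pointer.
    i = bisect.bisect_right(POINTERS, pointer) - 1
    return CODE_POINTS[i] + pointer - POINTERS[i]


inside = 0x105   # lies inside a toy range
outside = 0x90   # imagined to be covered by the main index, not a range

print(ranges_code_point(ranges_pointer(inside)) == inside)    # True
print(ranges_code_point(ranges_pointer(outside)) == outside)  # False
```

In this toy setup the lookups only round-trip for code points that actually fall inside a range. Feeding in a code point that belongs to the main index gives a pointer that decodes to something else, which looks like the same kind of mismatch reported here for U+8000.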