"gb18030 ranges" have problematic definitions #17

Closed
peteroupc opened this issue Nov 22, 2015 · 12 comments

Comments

@peteroupc
Contributor

It seems that for some code points, gb18030 ranges doesn't
have a round-trip mapping. Take the code point U+8000 for example.

When we apply the "index gb18030 ranges pointer" we get:

32768 ---> 18962, 0x4DAF --> 18962 + 32768 - 19887 --> 31843

but when we apply the "index gb18030 ranges code point" from
31843 we get:

31843 ---> 19043, 0x9FA6 -->  40870 + 31843 - 19043 --> 53670

and that differs from our original 32768. I think the reason is that each range is poorly defined; it's not clear where each range starts and ends in "index-gb18030-ranges.txt".

@annevk
Member

annevk commented Nov 23, 2015

Why would the gb18030 encoder algorithm ever hit the gb18030 ranges index for U+8000? It's in the gb18030 index.

@peteroupc
Contributor Author

Well then, here is another example of a problematic code point, and this time it doesn't appear in "gb18030 index": U+E5E5.

58853 ---> 19043, 0x9FA6 ---> 19043 + 58853 - 40870 ---> 37026

but:

37026 ---> 33550, 0xE865 ---> 59493 + 37026 - 33550 ---> 62969

and 62969 (U+F5F9) differs from the original 58853 (U+E5E5).
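
The two round trips worked through above can be reproduced with a minimal sketch. It assumes a table holding only the three (pointer, code point) entries quoted in this thread; the real "index-gb18030-ranges.txt" has many more entries, and the function names are illustrative rather than taken from the spec. As noted in the reply above, a real encoder would not reach this lookup for U+8000, because that code point is in "index gb18030".

```python
from bisect import bisect_right

# Excerpt only: the three (pointer, code point) entries quoted in this thread.
# Assumption: entries are sorted and each lookup picks the last entry that is
# less than or equal to the value being looked up.
RANGES = [
    (18962, 0x4DAF),
    (19043, 0x9FA6),
    (33550, 0xE865),
]


def ranges_pointer(code_point):
    """Map a code point to a pointer ("index gb18030 ranges pointer")."""
    code_points = [cp for _, cp in RANGES]
    i = bisect_right(code_points, code_point) - 1
    if i < 0:
        raise ValueError("code point is below the first entry of the excerpt")
    pointer_offset, code_point_offset = RANGES[i]
    return pointer_offset + code_point - code_point_offset


def ranges_code_point(pointer):
    """Map a pointer to a code point ("index gb18030 ranges code point")."""
    pointers = [p for p, _ in RANGES]
    i = bisect_right(pointers, pointer) - 1
    if i < 0:
        raise ValueError("pointer is below the first entry of the excerpt")
    pointer_offset, code_point_offset = RANGES[i]
    return code_point_offset + pointer - pointer_offset


def round_trip(code_point):
    """Encode to a pointer, then decode that pointer back to a code point."""
    return ranges_code_point(ranges_pointer(code_point))


for cp in (0x8000, 0xE5E5):
    print(f"U+{cp:04X} -> {ranges_pointer(cp)} -> U+{round_trip(cp):04X}")
```

Neither code point survives the round trip: the pointer computed from the U+9FA6 (or U+4DAF) entry lands in a region that the reverse lookup attributes to a later entry, which is what you would expect if each entry covers only part of the span up to the next one.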

@annevk
Member

annevk commented Nov 23, 2015

Hmm, not immediately sure why that would fail. I assume the decoder/encoder has the same problem here? Doesn't seem like the byte conversion part should matter much.

@annevk
Member

annevk commented Dec 16, 2015

@r12a any ideas? It seems that for your tests at http://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en browsers mostly pass.

@r12a
Collaborator

r12a commented Dec 16, 2015

i didn't test all the BMP for gb18030, just selected ranges, so i missed that particular character. However, there are now two new tests that do test it, and find it (and only that one in the PUA) to be problematic on Firefox and Chrome (didn't try any others).

http://www.w3.org/International/tests/repo/encoding/legacy-mb-schinese/gb18030/gb18030-encode-form-other-pua.html

http://www.w3.org/International/tests/repo/encoding/legacy-mb-schinese/gb18030/gb18030-decode-other-pua.html

i won't integrate these tests into the i18n test suite fully until we decide what needs to be done about this.

i don't have any clear idea about why this fails, although peter's suggestion seems plausible (note that the problem character is the first one in the PUA that isn't in the index).

(http://r12a.github.io/apps/encodings/ also shows the behaviour @peteroupc describes)

@vyv03354
Collaborator

Also, if #22 does not change the mapping for 0x8135F437 from U+1E3F to U+E7C7, another code point would have the problem.

@annevk
Member

annevk commented Jan 5, 2016

@r12a I'm confused by the outcome of your tests since it suggests browsers simply do not encode U+E5E5 at all despite gb18030 supposedly being a UTF (both Chrome and Firefox emit an "HTML entity").

@vyv03354
Collaborator

vyv03354 commented Jan 5, 2016

For Firefox, this is intentional because some sites expected a space for gbk 0xA3A0.
https://bugzilla.mozilla.org/show_bug.cgi?id=131837
But this bug is archaic. I'm fine with changing back the mapping to U+E5E5 to align with GB18030 spec/IE/Edge.

@annevk
Member

annevk commented Jan 5, 2016

Ah, that is the problem. "This matches the GB18030-2000 standard for code points encoded as two bytes, except for 0xA3 0xA0 which maps to U+3000 to be compatible with deployed content." Except we never took care of mapping U+E5E5 to an error in the encoder.

@peteroupc
Contributor Author

I don't agree with emitting an encoder error for the code point U+E5E5; like @vyv03354, I would rather change the mapping for this code point back, if that is indeed what the GB18030 standard says.

@annevk
Member

annevk commented Jan 5, 2016

Since most browsers emit an error, it seems safer to just do that. Especially since WebKit "fixed" this in 2008, six years after Gecko did. Seems likely it might still be problematic.

annevk added a commit that referenced this issue Jan 6, 2016
Because of deployed content index gb18030 maps 0xA3 0xA0 to U+3000
rather than U+E5E5 when decoding. Therefore encoding U+E5E5 cannot work
either.
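
A minimal sketch of what that commit message describes, under stated assumptions: the two-entry table stands in for "index gb18030" (the 0xA1 0xA1 entry is the usual ideographic space and is not quoted in this thread), and `EncoderError`, `decode_two_bytes` and `encode_code_point` are hypothetical names, not the spec's algorithm steps.

```python
class EncoderError(Exception):
    """Raised when a code point cannot be encoded (hypothetical name)."""


# Tiny stand-in for "index gb18030": (lead byte, trail byte) -> code point.
# 0xA3 0xA0 maps to U+3000 instead of U+E5E5, as stated in the thread.
TWO_BYTE_INDEX = {
    (0xA1, 0xA1): 0x3000,
    (0xA3, 0xA0): 0x3000,
}


def decode_two_bytes(lead, trail):
    """Return the code point for a two-byte sequence, or None if unmapped."""
    return TWO_BYTE_INDEX.get((lead, trail))


def encode_code_point(code_point):
    """Return (lead, trail) for a code point, rejecting U+E5E5 up front."""
    # No byte sequence decodes to U+E5E5 any more, so encoding it is an error
    # instead of falling through to the ranges lookup.
    if code_point == 0xE5E5:
        raise EncoderError(f"U+{code_point:04X} has no gb18030 encoding")
    # First matching entry wins, so U+3000 encodes as 0xA1 0xA1.
    for byte_pair, mapped in TWO_BYTE_INDEX.items():
        if mapped == code_point:
            return byte_pair
    raise NotImplementedError("four-byte ranges are outside this sketch")
```

With this shape, decoding 0xA3 0xA0 yields U+3000, encoding U+3000 yields 0xA1 0xA1, and encoding U+E5E5 fails rather than producing bytes that decode to a different code point.
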
@annevk
Member

annevk commented Jan 6, 2016

I created a PR for my proposal in #25. I would appreciate review before landing this.
