gb18030 encoding/decoding support #57

r12a · 2016-06-20T14:42:03Z

Results for a series of tests for gb18030 encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#gb18030

The tests can be run from that page (select the link in the left-most column) or obtained from the WPT repo. There is a PR at
web-platform-tests/wpt#3195

The tests check whether:

the browser produces the expected byte sequences for all characters in the gb18030 index after 0x9F when encoding bytes for a URL produced by a form, using the encoder steps in the specification.
the browser produces the expected byte sequences for miscellaneous characters not in the gb18030 index when encoding bytes for a URL produced by a form, using the encoder steps in the specification. (tests for several ranges)
same two types of test when writing characters to an href value
the browser decodes all characters as expected from a file generated by encoding all pointers in the gb18030 index per the encoder steps in the specification.
the browser decodes all characters as expected from a file generated by encoding miscellaneous characters not in the gb18030 index per the encoder steps in the specification. (tests for several ranges)
when decoding gb18030 text, the browser uses replacement characters as described by the algorithm in the Encoding spec.
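
As a rough illustration of the first kind of check, the sketch below encodes a character as gb18030 and percent-encodes the resulting bytes, which approximates what ends up in a URL. It relies on Python's built-in `gb18030` codec, which is not the Encoding Standard's encoder and may differ for some code points, and `urllib.parse.quote` is not identical to form URL encoding (for example in how it treats spaces). The helper name `form_url_bytes` is hypothetical.

```python
from urllib.parse import quote


def form_url_bytes(text: str) -> str:
    """Percent-encode text as gb18030 bytes (approximation of a form-built URL)."""
    # safe="" escapes everything except unreserved ASCII characters.
    return quote(text, safe="", encoding="gb18030")


print(form_url_bytes("中"))      # two-byte sequence D6 D0
print(form_url_bytes("\u0080"))  # four-byte sequence 81 30 81 30
```

Note that the four-byte sequence contains ASCII digit bytes (0x30), which are not percent-escaped.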

The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.

Notes:

all href tests fail for Edge because characters are not converted to percent-escapes

Can we please investigate the failures to ascertain whether:

the browser needs to be changed
the spec needs to be changed
the test is at fault

The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/

r12a · 2016-06-23T18:43:25Z

I updated the table. Something went wrong with the scoring for Safari in the previous version. Should now reflect reality.

r12a · 2016-09-15T17:14:15Z

List of bugs raised:

inexorabletash · 2016-09-16T00:48:12Z

Note to self: Chrome shows the 3 failures for decode errors, but when the API (TextDecoder) is used it produces the correct result. I believe this is because Chrome doesn't "flush" at the end of resource (i.e. non-API) streams... might be the same in other browsers.
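
The effect of a missing flush can be sketched with Python's incremental gb18030 decoder, used here only as an analogy (it is not Chrome's code path, and Python's codec is not the Encoding Standard's decoder). The function `decode_stream` is hypothetical. A dangling lead byte is only reported as an error once the decoder is told that input has ended.

```python
import codecs


def decode_stream(chunks, flush: bool) -> str:
    """Decode chunks incrementally; signal end-of-input only when flush is True."""
    decoder = codecs.getincrementaldecoder("gb18030")(errors="replace")
    out = [decoder.decode(chunk, final=False) for chunk in chunks]
    if flush:
        # final=True means no more bytes are coming, so a pending
        # incomplete sequence is turned into U+FFFD.
        out.append(decoder.decode(b"", final=True))
    return "".join(out)


truncated = [b"\xd6\xd0", b"\xd6"]  # "中" followed by a lone lead byte
print(repr(decode_stream(truncated, flush=False)))
print(repr(decode_stream(truncated, flush=True)))
```

Without the flush the trailing lead byte is silently dropped; with it, a replacement character appears at the end.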

jungshik · 2016-09-16T06:24:41Z

FYI, Chromium may soon change the decoding table to map 28 byte sequences that used to be mapped to PUA code points (completely useless and even harmful on platforms where there's no font to cover those PUA code points) to regular Unicode characters (see #22, #27 and http://crbug.com/645783 ).

r12a · 2016-11-18T17:23:00Z

wrt Safari support for GB 18030 (and probably other encodings), there's a discussion at https://bugs.webkit.org/show_bug.cgi?id=159891 that people following the Encoding issues may be able to help with.

inexorabletash · 2016-11-18T17:34:49Z

I'm loath to jump in on that bug, but ISTM the answer for WebKit should be "normalize on input". Once it's in the DOM, normalization should not happen, as @r12a points out.

annevk · 2016-11-18T18:18:20Z

I added a comment to that effect.

hsivonen · 2017-04-27T15:22:22Z

The tests seem to disagree with the spec on the handling of ASCII bytes as part of a malformed sequence when decoding:
Fail step 2: 82 30 C3 assert_equals: expected "�" but got "�0�"
Fail step 5.7: 82 FF C3 33 assert_equals: expected "��" but got "��3"
Fail step 9: FF 30 C3 33 assert_equals: expected "�0�" but got "�0�3"

r12a · 2017-06-15T07:42:43Z

@hsivonen just so you know, i'm still intending to check the above and change the tests (and results) where needed, when i get a moment. Same goes for similar comments elsewhere. I've had even less time than normal lately because of various distractions.

hsivonen · 2017-06-15T09:48:39Z

> i'm still intending to check the above and change the tests (and results) where needed, when i get a moment. Same goes for similar comments elsewhere.

Great. Thank you.

domenic · 2017-06-15T22:01:00Z

So, I've been helping rebase @r12a's pull requests, fix lint errors, and address some review comments from web-platform-tests/wpt#3194 that apply to all of them. I am happy to continue doing that as I have it down to a pretty fast process. Which means if @r12a can just edit his remaining WPT PRs with the normative changes and then ping the appropriate thread, I am happy to carry things through to the finish line. Woohoo!

r12a · 2017-08-15T17:56:20Z

@hsivonen wrt #57 (comment), i have stepped through the conversion for the first two tests you mentioned several times, using the debugger alongside the spec text, and i still come up with the results expected by the test, rather than the results i get from Firefox (nightly). Are you able to point out for me why the test produces a different result from FF?

Here's a link to the test: https://www.w3.org/International/tests/repo/encoding/legacy-mb-schinese/gb18030/gb18030-decode-errors.html Thanks.

annevk · 2017-08-16T07:23:35Z

I looked at the step 2 test. Below the iterations and the results:

gb18030 first = 0x82
gb18030 first = 0x82, gb18030 second = 0x30
gb18030 first = 0x82, gb18030 second = 0x30, gb18030 third = 0xC3
Conditional in step 2 is true
Conditional in step 1 is true

So @r12a is correct.

r12a · 2017-10-04T12:39:26Z

@hsivonen ping wrt #57 (comment)

hsivonen · 2017-11-06T10:58:16Z

Sorry about the delay.

Firefox, Chrome and Safari agree with each other on the 3 remaining failures. Edge is closer to the other browsers than to the spec.

So I think this is a spec bug. (And the tests reflect the spec.)

hsivonen · 2017-11-06T11:28:17Z

Hmm. The Firefox situation might be confused somewhere between the Chrome/Safari behavior and the spec behavior. I need to investigate this more.

hsivonen · 2017-11-07T07:22:57Z

OK. Here's what's happening:

Firefox implements the spec, but the test case doesn't test the spec. The test case expectations are written as if there was EOF after each examined sequence. However, the test input is not TextDecoder calls with just the sequences tested (each ending in EOF). Instead, it's an HTML file containing all the sequences within spans, so what comes after each test sequence isn't an EOF but a less-than sign (of the span end tag).

I will need to test what Chrome, Safari and Edge do when the sequences actually end in EOF, but my tentative opinion is that it's bad for the spec to collapse a bogus sequence of bytes to a different output when the bogus sequence is followed by EOF vs. when it's followed by something else that's not a valid continuation of the sequence.
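
The EOF versus less-than-sign difference can be reproduced with a minimal sketch of the error handling in the spec's gb18030 decoder. This is an assumption-laden simplification: the function name `decode_errors_only` is hypothetical, and index lookups are stubbed to always fail, so only malformed input such as the three sequences from the test is modelled faithfully.

```python
REPLACEMENT = "\ufffd"


def decode_errors_only(data: bytes) -> str:
    """Error path of the gb18030 decoder; index lookups are stubbed as 'not found'."""
    queue = list(data)
    first = second = third = 0
    out = []
    while True:
        if not queue:  # end-of-queue
            if first or second or third:
                out.append(REPLACEMENT)  # one error for the whole pending sequence
            return "".join(out)
        byte = queue.pop(0)
        if third:
            if not 0x30 <= byte <= 0x39:
                queue[:0] = [second, third, byte]  # restore bytes for re-processing
                first = second = third = 0
                out.append(REPLACEMENT)
                continue
            first = second = third = 0
            out.append(REPLACEMENT)  # stub: four-byte lookup always fails
        elif second:
            if 0x81 <= byte <= 0xFE:
                third = byte
                continue
            queue[:0] = [second, byte]
            first = second = 0
            out.append(REPLACEMENT)
        elif first:
            if 0x30 <= byte <= 0x39:
                second = byte
                continue
            first = 0
            if byte <= 0x7F:
                queue.insert(0, byte)  # an ASCII byte is restored, not swallowed
            out.append(REPLACEMENT)  # stub: two-byte lookup always fails
        elif byte <= 0x7F:
            out.append(chr(byte))
        elif byte == 0x80:
            out.append("\u20ac")
        elif 0x81 <= byte <= 0xFE:
            first = byte
        else:
            out.append(REPLACEMENT)


for seq in (b"\x82\x30\xc3", b"\x82\xff\xc3\x33", b"\xff\x30\xc3\x33"):
    # Same bytes, first ending in EOF, then followed by "<" as in the HTML test file.
    print(seq.hex(" "), repr(decode_errors_only(seq)), repr(decode_errors_only(seq + b"<")))
```

Under these assumptions the EOF-terminated inputs give the strings the test expects, while the inputs followed by `<` give the strings reported for Firefox (plus the `<` itself).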

hsivonen · 2017-11-07T08:00:27Z

I wrote demos that exercise both the followed-by-end-tag case and the followed-by-EOF case.

The spec, Firefox, Chrome and Safari agree on these, so I think it's best not to change the spec even though it is rather unfortunate for the treatment of the bogus byte sequence to differ depending on what comes after.

In conclusion, this is a test case bug after all.

annevk · 2018-04-25T09:03:48Z

@r12a are you planning on updating the tests?

peteroupc · 2018-07-08T15:46:51Z

Some time ago, the Encoding Standard started mapping the two bytes "0xA3 0xA0" to "U+3000" rather than U+E5E5 "to be compatible with deployed content".

Do the benefits of this mapping still outweigh the disadvantages even today? The answer depends largely on:

how commonly GB18030 (as opposed to other encodings) is used in Web sites today compared to when the mapping was made,
whether, today compared to when the mapping was made, the GB18030 bytes "0xA3 0xA0" are still commonly used instead of the proper mapping for U+3000 to represent a space character in Web sites,
how frequently Web sites that use "0xA3 0xA0" instead of the proper mapping for U+3000 are seen today compared to when the mapping was made,
whether GB18030 encoders/decoders that convert "0xA3 0xA0" to or from U+3000 (besides implementations of the Encoding Standard) are still widely deployed today,
the effort required for Encoding Standard implementations to update their GB18030 mapping tables,
the need and desirability to comply strictly with GB18030-2005, and
other factors weighing for or against this mapping.
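
For reference, the two byte sequences involved can be located in the index with the two-byte pointer formula from the Encoding Standard's gb18030 decoder. The sketch below is illustrative only: `two_byte_pointer` is a hypothetical helper, and the last line uses Python's built-in codec (not the Encoding Standard's encoder) merely to show the usual bytes for U+3000.

```python
def two_byte_pointer(lead: int, trail: int) -> int:
    """Index pointer for a two-byte sequence (trail in 0x40-0x7E or 0x80-0xFE)."""
    offset = 0x40 if trail < 0x7F else 0x41
    return (lead - 0x81) * 190 + (trail - offset)


print(two_byte_pointer(0xA3, 0xA0))      # pointer of the disputed sequence
print(two_byte_pointer(0xA1, 0xA1))      # pointer of the usual U+3000 bytes
print("\u3000".encode("gb18030").hex())  # bytes Python emits for U+3000
```

The two sequences land on different pointers (6555 and 6176), so the question above is whether both should decode to U+3000.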

inexorabletash · 2018-07-10T20:26:20Z

Given that Firefox/Chrome/Safari have the same behavior, we'd need data indicating that changing implementations strongly improves compatibility with deployed content.

This is a variation of your 2nd point. If the number of sites using "0xA3 0xA0" intending U+E5E5 has significantly increased, it would be a consideration.

I don't think the other points would be directly relevant to implementors making a decision.

annevk · 2018-10-17T07:48:26Z

Same problem as with the gbk tests, upstreaming never completed: web-platform-tests/wpt#20361.

annevk added the tests label Nov 16, 2016

hsivonen mentioned this issue Nov 6, 2017

gb18030 encoding and decoding tests web-platform-tests/wpt#3195

Closed

annevk mentioned this issue Dec 21, 2018

Differences between tests and specification #169

Closed

KatyMawhood mentioned this issue Mar 12, 2019

[encoding] Webkit support for GB18030 w3c/i18n-activity#238

Closed

stevecheckoway mentioned this issue Feb 24, 2021

End-of-queue during decoding of GB18030 should not mask ASCII characters. #253

Closed

annevk mentioned this issue Aug 10, 2021

[GB18030] Wrong codepoint at index 7533 #271

Closed

annevk mentioned this issue Oct 27, 2022

If gb18030 is revised, consider aligning the Encoding Standard #27

Open

domenic mentioned this issue Jun 7, 2023

Reflect changes in GB 18030-2022 #312

Open
