EUC-jp encoding/decoding support #59

Closed
r12a opened this issue Jun 20, 2016 · 14 comments

@r12a
Collaborator

r12a commented Jun 20, 2016

Results for a series of tests for EUC-jp encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#eucjp

The tests can be run from that page (select the link in the left-most column), or you can get them from the WPT repo. There is a PR at
web-platform-tests/wpt#3198

The tests check whether:

  1. the browser produces the expected byte sequences for all characters in the euc-jp encoding after 0x9F when encoding bytes for a URL produced by a form, using the encoder steps in the specification.
  2. the browser produces percent-escaped character references for a URL produced by a form when encoding miscellaneous characters that are not in the euc-jp encoding. (tests for several ranges)
  3. same two types of test when writing characters to an href value
  4. the browser decodes all characters as expected from a file generated by encoding all pointers in the euc-jp encoding per the encoder steps in the specification.
  5. the browser decodes characters that are not recognised from the euc-jp encoding as replacement characters.
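
As a rough illustration of what test types 1 and 2 look for, here is a minimal sketch in Python. It assumes that Python's built-in `euc_jp` codec agrees with the Encoding Standard encoder for the sample characters used, and `form_encode_euc_jp` is a hypothetical helper name, not part of the tests.

```python
from urllib.parse import quote


def form_encode_euc_jp(text: str) -> str:
    """Percent-encode text the way a form submission in EUC-JP roughly would.

    Assumption: Python's "euc_jp" codec stands in for the Encoding Standard
    encoder. Characters it cannot encode become decimal character references
    (&#NNNN;), which are then percent-escaped along with everything else.
    """
    raw = text.encode("euc_jp", errors="xmlcharrefreplace")
    return quote(raw, safe="")


# Characters that exist in the encoding become percent-escaped EUC-JP bytes.
print(form_encode_euc_jp("日本"))        # %C6%FC%CB%DC
# A character outside the encoding becomes a percent-escaped reference.
print(form_encode_euc_jp("\U0001F600"))  # %26%23128512%3B
```

The real tests compare browser output against the Encoding Standard index, so this is only meant to show the shape of the expected values.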

The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.

[Image: screenshot of the results table, 2016-06-20 at 16:53:26]

Notes:

  • Edge fails all href encode tests because characters are not converted to percent-escapes in the href attribute.
  • Firefox fails all href encode tests for characters not in the encoding because it converts characters to percent-escaped Unicode values instead.
  • eucjp-decode-index: Edge fails on all and only the JIS-X-0212 characters, because it doesn't recognise 0x8F as the first byte of a 3-byte sequence.
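
For readers unfamiliar with the multi-byte forms, the following sketch shows the two prefixed sequence types in EUC-JP. It uses Python's built-in `euc_jp` codec as a stand-in for a browser decoder, which is an assumption; in the Encoding Standard, 0x8E introduces half-width katakana and 0x8F introduces a three-byte JIS X 0212 sequence.

```python
# 0x8E followed by one byte: half-width katakana.
half_width = bytes([0x8E, 0xB1]).decode("euc_jp")

# 0x8F followed by two bytes: a JIS X 0212 character.
jis0212 = bytes([0x8F, 0xB0, 0xA1]).decode("euc_jp")

print(len(half_width), hex(ord(half_width)))
print(len(jis0212))
```

A decoder that does not treat the prefix byte as the start of a three-byte sequence cannot map these bytes to a single character.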

Can we please investigate the failures to ascertain whether:

  1. the browser needs to be changed
  2. the spec needs to be changed
  3. the test is at fault

The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/

@jungshik

jungshik commented Sep 16, 2016

As in the case of EUC-KR (form; misc; #62), Chromium's failure in 'form (misc)' in EUC-JP encoding is likely to be caused NOT by Chromium's encoder BUT by Blink's handling of Cf / Default_Ignorable characters.

@hsivonen
Member

One test showed that the code I had written differed from the spec. However, it seems the test expectation differs from the spec, too:
trail byte outside 0xA1-0xFE: B0 A0 assert_equals: expected "� " but got "�"

Curiously, the spec makes EUC-JP differ from the other two-byte encodings when it comes to the handling of a non-ASCII bogus trail byte. (0xA0 and 0xFF get prepended in addition to ASCII getting prepended.)

Why is EUC-JP different? Is it intentional?

@annevk
Member

annevk commented Apr 27, 2017

More concretely, you're saying that

If byte is not in the range 0xA1 to 0xFE, inclusive, prepend byte to stream.

should be

If byte is an ASCII byte, prepend byte to stream.

which I think is reasonable and this was probably just an oversight.

Confirmation from @jungshik would be good, but I'm happy to fix this. If I were to fix this, should I make sure I land web-platform-tests at the same time or are you sorting out web-platform-tests @hsivonen?
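
The difference between the two wordings can be sketched with a toy decoder. This is a simplified model, not the specification's algorithm: `decode_sketch` and its `ascii_only_prepend` flag are hypothetical names, valid two-byte pairs are emitted as a `?` placeholder because no index table is included, and the 0x8E/0x8F prefixes are deliberately left out.

```python
REPLACEMENT = "\uFFFD"


def decode_sketch(data: bytes, ascii_only_prepend: bool) -> str:
    """Toy model of the two-byte error path of an EUC-JP decoder.

    ascii_only_prepend=False models "prepend any trail byte outside 0xA1-0xFE";
    ascii_only_prepend=True models "prepend only ASCII trail bytes".
    """
    out = []
    lead = 0
    i = 0
    while i < len(data):
        byte = data[i]
        i += 1
        if lead:
            lead = 0
            if 0xA1 <= byte <= 0xFE:
                out.append("?")  # placeholder for a valid pair (no index here)
                continue
            # Bogus trail byte: decide whether it is put back on the stream.
            prepend = byte <= 0x7F if ascii_only_prepend else True
            if prepend:
                i -= 1  # "prepend byte to stream": process it again
            out.append(REPLACEMENT)
            continue
        if byte <= 0x7F:
            out.append(chr(byte))
        elif 0xA1 <= byte <= 0xFE:
            lead = byte
        else:
            out.append(REPLACEMENT)  # 0x8E/0x8F handling omitted in this sketch
    if lead:
        out.append(REPLACEMENT)  # stream ended in the middle of a pair
    return "".join(out)


print(len(decode_sketch(b"\xB0\xA0", ascii_only_prepend=False)))  # 2
print(len(decode_sketch(b"\xB0\xA0", ascii_only_prepend=True)))   # 1
```

Under the first wording, 0xA0 is put back and then fails again on its own, giving two replacement characters for B0 A0; under the second it is consumed with the lead byte, giving one. An ASCII trail byte is preserved in both cases.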

@hsivonen
Member

> More concretely, you're saying that
>
> If byte is not in the range 0xA1 to 0xFE, inclusive, prepend byte to stream.
>
> should be
>
> If byte is an ASCII byte, prepend byte to stream.
>
> which I think is reasonable and this was probably just an oversight.

I'm saying that that would be consistent with the other two-byte decoders. I haven't investigated legacy decoder behavior on this point, so I can't at this time say whether it should be consistent that way.

> If I were to fix this, should I make sure I land web-platform-tests at the same time or are you sorting out web-platform-tests @hsivonen?

I don't have a plan to be sorting out Web Platform Tests. However, I might have to for tests that have already been imported to mozilla-central. (It seems that this one is still in the PR stage on the WPT side.)

@hsivonen
Member

Firefox (uconv) and Chrome think B0 A0 should result in one REPLACEMENT CHARACTER. Microsoft browsers say two. I'd appreciate it if someone else could test Safari.

@annevk
Member

annevk commented Apr 28, 2017

Safari Technology Preview yields "X�X".

@annevk
Member

annevk commented Apr 28, 2017

(Chrome and Safari don't seem to pick a different font by the way, but that's a different class of bugs.)

@hsivonen
Member

OK, in that case, I think the EUC-JP decoder should prepend ASCII bytes only.

annevk added a commit that referenced this issue Apr 28, 2017
@annevk
Member

annevk commented Apr 28, 2017

Prepared a PR.

annevk added a commit that referenced this issue May 5, 2017
annevk added a commit that referenced this issue May 5, 2017
@vyv03354
Collaborator

Firefox Nightly 56 fixed the encoder, but one decoder test still fails.

@r12a
Collaborator Author

r12a commented Jun 15, 2017

Today and yesterday I updated the results at https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#eucjp for Firefox, Firefox Nightly, Chrome, and Canary. The latest summary is:

[Image: screenshot of the updated results table, 2017-06-15 at 08:37:50]

@hsivonen
Member

> Firefox Nightly 56 fixed the encoder, but one decoder test still fails.

See discussion upthread. Firefox is correct per spec as amended in May.

ricea pushed a commit to ricea/encoding that referenced this issue Nov 16, 2017
ricea pushed a commit to ricea/encoding that referenced this issue Nov 16, 2017
@annevk
Member

annevk commented Oct 17, 2018

Now that Firefox passes all these tests and a year has passed, I'm happy to consider this done. A new issue would also be less noisy at this point, were one warranted.

@annevk annevk closed this as completed Oct 17, 2018