Half-width Katakana should be representable in ISO-2022-JP #105

hsivonen · 2017-05-05T08:57:14Z

A query string-based (and, therefore, IE/Edge-incompatible) test shows that Gecko, WebKit, Blink and Presto can encode half-width Katakana as ISO-2022-JP without NCRs.

The spec should be amended to match (both encoder and decoder side).

annevk · 2017-05-05T10:07:16Z

It seems this is a special feature of the encoder only: "ﾐ" and "ミ" both encode to 0x25 0x5F. I wonder if all Japanese encoders first convert halfwidth to fullwidth now.
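
To make the observation concrete, here is a minimal sketch of why the two characters end up as the same byte pair. It is not the spec's algorithm. It assumes the encoder folds halfwidth to fullwidth first (approximated here with NFKC) and relies on the katakana U+30A1 to U+30F6 sitting in row 5 of JIS X 0208 in code point order, starting at 0x25 0x21. The names `katakanaToJis0208` and `hex` are hypothetical.

```javascript
// Hypothetical helper: JIS X 0208 byte pair for a single katakana character.
// Assumption: halfwidth input is folded to fullwidth first; NFKC stands in
// for whatever mapping table a real encoder uses.
function katakanaToJis0208(ch) {
  const cp = ch.normalize("NFKC").codePointAt(0);
  if (cp < 0x30A1 || cp > 0x30F6) {
    throw new RangeError("not a row-5 katakana: U+" + cp.toString(16).toUpperCase());
  }
  // Row 5 lead byte is 0x25; the trail byte counts up from 0x21 at U+30A1.
  return [0x25, 0x21 + (cp - 0x30A1)];
}

// Format bytes as "0xNN 0xNN" for display.
const hex = (bytes) =>
  bytes.map((b) => "0x" + b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hex(katakanaToJis0208("\uFF90"))); // halfwidth ﾐ, prints 0x25 0x5F
console.log(hex(katakanaToJis0208("\u30DF"))); // fullwidth ミ, prints 0x25 0x5F
```

In a full ISO-2022-JP stream these two bytes would be preceded by ESC $ B to switch to JIS X 0208 and followed by ESC ( B to return to ASCII.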

vyv03354 · 2017-05-05T10:11:35Z

> I wonder if all Japanese encoders first convert halfwidth to fullwidth now.

No, ISO-2022-JP only.

annevk · 2017-05-05T10:29:58Z

You'd think someone would have already written and published an algorithm for this conversion. I guess I'll just find the mapping for each code point myself.

annevk · 2017-05-05T10:42:43Z

Okay, so I guess what we want to do is to apply Unicode Normalization Form KC on any code point in the range U+FF61 to U+FF9F, inclusive.

annevk · 2017-05-05T10:52:56Z

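// Halfwidth code points to convert: U+FF61 through U+FF9F, inclusive.
const start = 0xFF61,
      end = 0xFF9F + 1; // exclusive upper bound
for (let i = start; i < end; i++) {
  const cp = String.fromCodePoint(i),
        fullwidthCP = cp.normalize("NFKC"); // compatibility-normalized form
  // ...
}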

If I write those out and use @hsivonen's demo I get the results I was expecting per the above analysis.
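
For readers who want to see the written-out table, the loop can be fleshed out roughly as follows. This is a sketch of one way to do it, not necessarily the script used for the change; the `buildTable` and `fmt` names and the `Map` shape are assumptions. One thing worth checking in the output: NFKC maps the halfwidth sound marks U+FF9E and U+FF9F to the combining marks U+3099 and U+309A, so those two rows may deserve a manual review.

```javascript
// Sketch: build a halfwidth -> NFKC table for U+FF61..U+FF9F.
// Function and variable names are illustrative only.
function buildTable(start = 0xFF61, end = 0xFF9F) {
  const table = new Map();
  for (let i = start; i <= end; i++) {
    // Each code point in this range normalizes to a single code point.
    const normalized = String.fromCodePoint(i).normalize("NFKC");
    table.set(i, normalized.codePointAt(0));
  }
  return table;
}

// Format a code point as U+XXXX.
const fmt = (cp) => "U+" + cp.toString(16).toUpperCase().padStart(4, "0");

for (const [from, to] of buildTable()) {
  console.log(`${fmt(from)} -> ${fmt(to)}`);
}
```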

Fixes #105.

aphillips · 2017-05-05T16:45:15Z

Correct me if I'm wrong, but wouldn't halfwidth katakana involve a switch to JIS X 0201 (Katakana) mode? There's no need to destroy the round-trip by normalizing to fullwidth.

annevk · 2017-05-05T16:55:59Z

@aphillips that is not what implementations do.

aphillips · 2017-05-05T17:28:47Z

@annevk Yes, although that seems like a bug in the coders. I saw this thread this morning and was surprised, since I recall having to implement this when I was writing an ISO-2022 coder about 15 years ago. I can't imagine that the encoding's formal definition has changed, so I'm surprised to see implementations doing this.

annevk · 2017-05-05T17:53:54Z

Sure, but after such a long time bugs become features.

jungshik · 2017-05-06T22:10:25Z

@aphillips @annevk
I wouldn't call it a bug.

On the Internet/Web, only the original ISO-2022-JP defined in RFC 1468 was "widely" used (relative to its successors); the later variants, ISO-2022-JP-[123], never got much traction. Why would anybody use ISO-2022-JP-* to encode Chinese, Korean, Latin beyond ASCII, and Greek? And JIS X 0212 (supported in ISO-2022-JP-1 or later) is not critical enough to Japanese users (Shift_JIS does not support it, either).

The original ISO-2022-JP does not support Halfwidth Katakana. That's why ICU has a fallback encoding for Halfwidth Katakana for the original ISO-2022-JP.

It's only ISO-2022-JP-3 that supports Halfwidth Katakana. ICU supports ISO-2022-JP-3 as defined and does not have fallback encoding for Halfwidth Katakana in ISO-2022-JP-3.
Note that ISO-2022-JP-2 defined in RFC 1554 does not support them either.

annevk · 2017-05-07T04:43:17Z

Note that we do support decoding halfwidth Katakana: https://encoding.spec.whatwg.org/#iso-2022-jp-decoder-katakana. Should we remove that then?

aphillips · 2017-05-07T05:07:16Z

Why? If we see the byte sequence and it isn't invalid, why not decode it?

Note that this encoding is primarily used for email, not web pages.

annevk · 2017-05-07T05:37:34Z

Sorry, that suggestion was rather flippant and I should have looked at https://w3c-test.org/encoding/iso-2022-jp-decoder.html in various browsers first, which shows it's supported (though not sure about Edge).

It just shows that @jungshik's story above is not really complete as browsers support ISO-2022-JP-3's halfwidth Katakana extension on the decoder side (in what they call ISO-2022-JP).

To be 100% clear: suggestion retracted.

jungshik · 2017-05-07T16:24:13Z

Sorry for the confusion. It turns out that ICU's ISO-2022-JP converter (and other converters used in browsers) supports Halfwidth Katakana ("ESC ( I") in the spirit of 'be lenient in what you accept and be strict in what you emit'. For instance, this is explicitly noted in a comment in ICU's ucnv2022.cpp:

 Note: The converter uses some leniency:
 - The escape sequence ESC ( I for half-width 7-bit Katakana is recognized in
    all versions, not just JIS7 and JIS8.
.....
static const uint16_t jpCharsetMasks[MAX_JA_VERSION+1]={
    CSM(ASCII)|CSM(JISX201)|CSM(JISX208)|CSM(HWKANA_7BIT),  <== ISO-2022-JP version 0 still has HWKANA_7BIT
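
The lenient decoding described here can be illustrated with a small sketch. It models only ASCII and the Katakana state: after ESC ( I, bytes 0x21 to 0x5F map to U+FF61 plus the byte's offset from 0x21, and ESC ( B returns to ASCII. Everything else about ISO-2022-JP is left out, and `decodeKatakanaOnly` is a hypothetical name, not an ICU or browser API.

```javascript
// Sketch of a decoder that understands only ASCII and the ESC ( I Katakana state.
// Not a full ISO-2022-JP decoder; unmodeled input becomes U+FFFD.
function decodeKatakanaOnly(bytes) {
  let katakana = false;
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    const isEscape = b === 0x1B && bytes[i + 1] === 0x28 &&
      (bytes[i + 2] === 0x49 || bytes[i + 2] === 0x42);
    if (isEscape) {
      // ESC ( I enters the Katakana state, ESC ( B returns to ASCII.
      katakana = bytes[i + 2] === 0x49;
      i += 2;
    } else if (katakana && b >= 0x21 && b <= 0x5F) {
      out += String.fromCodePoint(0xFF61 + (b - 0x21));
    } else if (!katakana && b <= 0x7F && b !== 0x1B) {
      out += String.fromCodePoint(b);
    } else {
      out += "\uFFFD"; // anything this sketch does not model
    }
  }
  return out;
}

// ESC ( I, 0x50, ESC ( B, "!"
console.log(decodeKatakanaOnly([0x1B, 0x28, 0x49, 0x50, 0x1B, 0x28, 0x42, 0x21])); // prints ﾐ!
```

Note the asymmetry with the encoder change discussed in this thread: input that arrives as ESC ( I decodes to halfwidth code points, but encoding those code points again produces fullwidth JIS X 0208 bytes.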

annevk added a commit that referenced this issue May 5, 2017

ISO-2022-JP encoder: convert halfwidth Katakana to fullwidth

9f92654

Fixes #105.

annevk mentioned this issue May 5, 2017

ISO-2022-JP encoder: convert halfwidth Katakana to fullwidth #106

Merged

annevk added a commit that referenced this issue May 5, 2017

ISO-2022-JP encoder: convert halfwidth Katakana to fullwidth

44decb1

Fixes #105.

annevk added the normative label May 5, 2017

hsivonen added a commit to hsivonen/encoding_rs that referenced this issue May 8, 2017

Map half-width katakana to full-width in ISO-2022-JP encoder per what…

03376fe

…wg/encoding#105.

annevk closed this as completed in #106 May 8, 2017

annevk added a commit that referenced this issue May 8, 2017

ISO-2022-JP encoder: convert halfwidth katakana to fullwidth

5a09856

Fixes #105.

ricea pushed a commit to ricea/encoding that referenced this issue Nov 16, 2017

ISO-2022-JP encoder: convert halfwidth katakana to fullwidth

d05d165

Fixes whatwg#105.

ricea pushed a commit to ricea/encoding that referenced this issue Nov 16, 2017

ISO-2022-JP encoder: convert halfwidth katakana to fullwidth

b095b8c

Fixes whatwg#105.
