Half-width Katakana should be representable in ISO-2022-JP #105

Closed
hsivonen opened this issue May 5, 2017 · 14 comments

@hsivonen
Member

hsivonen commented May 5, 2017

A query string-based (and, therefore, IE/Edge-incompatible) test shows that Gecko, WebKit, Blink and Presto can encode half-width Katakana as ISO-2022-JP without NCRs.

The spec should be amended to match (both encoder and decoder side).

@annevk
Member

annevk commented May 5, 2017

It seems this is a special feature of the encoder only: halfwidth "ﾐ" (U+FF90) and fullwidth "ミ" (U+30DF) both encode to 0x25 0x5F. I wonder if all Japanese encoders first convert halfwidth to fullwidth now.
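
To make that observation concrete, here is a minimal sketch, not what any browser actually runs. It assumes the halfwidth form is folded with NFKC before encoding, and it covers only the Katakana letters U+30A1 to U+30F6, which sit in row 5 of JIS X 0208 in the same order as in Unicode. The names `katakanaToJis0208` and `toHex` are hypothetical.

```javascript
// Hypothetical helper: returns the two JIS X 0208 bytes for a Katakana letter,
// folding a halfwidth form to fullwidth first (using NFKC is an assumption).
function katakanaToJis0208(ch) {
  const folded = ch.normalize("NFKC");
  const cp = folded.codePointAt(0);
  // Only U+30A1 (ァ) to U+30F6 (ヶ) are handled by this sketch.
  if (folded.length !== 1 || cp < 0x30a1 || cp > 0x30f6) {
    return null;
  }
  // Row 5 of JIS X 0208: lead byte 0x25, trail byte counts up from 0x21.
  return [0x25, 0x21 + (cp - 0x30a1)];
}

const toHex = (bytes) =>
  bytes.map((b) => "0x" + b.toString(16).toUpperCase()).join(" ");

console.log(toHex(katakanaToJis0208("\uFF90"))); // halfwidth ﾐ: 0x25 0x5F
console.log(toHex(katakanaToJis0208("\u30DF"))); // fullwidth ミ: 0x25 0x5F
```

A real ISO-2022-JP encoder would also emit the ESC $ B escape sequence before these bytes; the sketch leaves that out to keep the focus on the two inputs collapsing to one output.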

@vyv03354
Collaborator

vyv03354 commented May 5, 2017

> I wonder if all Japanese encoders first convert halfwidth to fullwidth now.

No, ISO-2022-JP only.

@annevk
Member

annevk commented May 5, 2017

You'd think someone would have already written and published an algorithm for this conversion. I guess I'll just find the mapping for each code point myself.

@annevk
Member

annevk commented May 5, 2017

Okay, so I guess what we want to do is to apply Unicode Normalization Form KC on any code point in the range U+FF65 to U+FF9F, inclusive.

@annevk
Member

annevk commented May 5, 2017

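// Walk the halfwidth range U+FF61..U+FF9F (end is exclusive).
const start = 0xFF61,
      end = 0xFF9F + 1;
for (let i = start; i < end; i++) {
  const cp = String.fromCodePoint(i),
        // NFKC folds each halfwidth form to its compatibility equivalent.
        fullwidthCP = cp.normalize("NFKC");
  // ...
}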

If I write those out and use @hsivonen's demo I get the results I was expecting per the above analysis.
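
The loop body in the snippet above is elided. As one possible way to write the results out, the following sketch collects the NFKC form of every code point in U+FF61 to U+FF9F into a lookup table and prints it. The names `buildHalfwidthFoldTable` and `hex` are hypothetical, and whether NFKC matches browser behavior for every entry is an assumption: for instance, NFKC turns the halfwidth voiced sound mark U+FF9E into the combining mark U+3099 rather than a spacing fullwidth mark.

```javascript
// Hypothetical helper: maps each code point in U+FF61..U+FF9F to its NFKC form.
function buildHalfwidthFoldTable() {
  const table = new Map();
  for (let cp = 0xff61; cp <= 0xff9f; cp++) {
    table.set(cp, String.fromCodePoint(cp).normalize("NFKC"));
  }
  return table;
}

// Formats a code point as U+XXXX.
const hex = (cp) => "U+" + cp.toString(16).toUpperCase().padStart(4, "0");

// Print one line per halfwidth code point, listing the code points it folds to.
for (const [cp, folded] of buildHalfwidthFoldTable()) {
  const targets = Array.from(folded, (c) => hex(c.codePointAt(0))).join(" ");
  console.log(`${hex(cp)} -> ${targets}`);
}
```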

@aphillips
Contributor

Correct me if I'm wrong, but wouldn't halfwidth katakana involve a switch to JIS X 0201 (Roman) mode? There's no need to destroy the round-trip by normalizing to fullwidth.

@annevk
Member

annevk commented May 5, 2017

@aphillips that is not what implementations do.

@aphillips
Contributor

@annevk Yes, although that seems like a bug in the coders. I saw this thread this morning and was surprised, since I recall having to implement this when I was writing an ISO-2022 coder about 15 years ago. I can't imagine that the encoding's formal definition has changed, so I'm surprised to see implementations doing this.

@annevk
Member

annevk commented May 5, 2017

Sure, but after such a long time bugs become features.

@jungshik

jungshik commented May 6, 2017

@aphillips @annevk
I wouldn't call it a bug.

On the Internet/Web, only the original ISO-2022-JP defined in RFC 1468 was "widely" used (relative to subsequent versions); the later variants, ISO-2022-JP-[123], never got much traction. Why would anybody use ISO-2022-JP-* to encode Chinese, Korean, Latin beyond ASCII, and Greek? And JIS X 0212 (supported in ISO-2022-JP-1 or later) is not critical enough to Japanese users (Shift_JIS does not support it, either).

The original ISO-2022-JP does not support Halfwidth Katakana. That's why ICU has a fallback encoding for Halfwidth Katakana for the original ISO-2022-JP.

It's only ISO-2022-JP-3 that supports Halfwidth Katakana. ICU supports ISO-2022-JP-3 as defined and does not have fallback encoding for Halfwidth Katakana in ISO-2022-JP-3.
Note that ISO-2022-JP-2 defined in RFC 1554 does not support them either.

@annevk
Member

annevk commented May 7, 2017

Note that we do support decoding halfwidth Katakana: https://encoding.spec.whatwg.org/#iso-2022-jp-decoder-katakana. Should we remove that then?

@aphillips
Contributor

Why? If we see the byte sequence and it isn't invalid, why not decode it?

Note that this encoding is primarily used for email, not web pages.

@annevk
Member

annevk commented May 7, 2017

Sorry, that suggestion was rather flippant and I should have looked at https://w3c-test.org/encoding/iso-2022-jp-decoder.html in various browsers first, which shows it's supported (though not sure about Edge).

It just shows that @jungshik's story above is not really complete as browsers support ISO-2022-JP-3's halfwidth Katakana extension on the decoder side (in what they call ISO-2022-JP).

To be 100% clear: suggestion retracted.

@jungshik

jungshik commented May 7, 2017

Sorry for the confusion. It turned out that ICU's ISO-2022-JP converter (and other converters used in browsers) supports Halfwidth Katakana ("ESC ( I") in the spirit of 'be lenient in what you accept and be strict in what you emit'. For instance, it's explicitly commented in ICU's ucnv2022.cpp:

 Note: The converter uses some leniency:
 - The escape sequence ESC ( I for half-width 7-bit Katakana is recognized in
    all versions, not just JIS7 and JIS8.
.....
static const uint16_t jpCharsetMasks[MAX_JA_VERSION+1]={
    CSM(ASCII)|CSM(JISX201)|CSM(JISX208)|CSM(HWKANA_7BIT),  <== ISO-2022-JP version 0 still has HWKANA_7BIT
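
The decoder-side leniency discussed here can be sketched as follows. This is a deliberately reduced illustration, not ICU's or any browser's code: it understands only the ESC ( B (ASCII) and ESC ( I (halfwidth Katakana) escape sequences, and the function name `decodeAsciiAndKatakana` is hypothetical. In the Katakana state it maps bytes 0x21 to 0x5F onto U+FF61 to U+FF9F, as the Encoding Standard's ISO-2022-JP decoder does.

```javascript
// Hypothetical helper: decodes only the ASCII and halfwidth Katakana states.
function decodeAsciiAndKatakana(bytes) {
  const ESC = 0x1b;
  let state = "ascii";
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    // Recognize the three-byte escapes ESC ( B and ESC ( I.
    if (b === ESC && bytes[i + 1] === 0x28) {
      const final = bytes[i + 2];
      if (final === 0x42) { state = "ascii"; i += 2; continue; }
      if (final === 0x49) { state = "katakana"; i += 2; continue; }
    }
    if (state === "katakana" && b >= 0x21 && b <= 0x5f) {
      // Katakana state: 0x21..0x5F maps linearly onto U+FF61..U+FF9F.
      out += String.fromCodePoint(0xff61 + (b - 0x21));
    } else if (state === "ascii" && b <= 0x7f && b !== ESC && b !== 0x0e && b !== 0x0f) {
      out += String.fromCodePoint(b);
    } else {
      // Anything this sketch does not understand becomes U+FFFD.
      out += "\uFFFD";
    }
  }
  return out;
}

// ESC ( I, 0x50, ESC ( B, "A"
const sample = [0x1b, 0x28, 0x49, 0x50, 0x1b, 0x28, 0x42, 0x41];
console.log(decodeAsciiAndKatakana(sample)); // halfwidth ﾐ (U+FF90) followed by "A"
```

Accepting ESC ( I on input while never emitting it on output is the "lenient in what you accept, strict in what you emit" split described above.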

ricea pushed a commit to ricea/encoding that referenced this issue Nov 16, 2017