Refresh big5hkscs mapping to HKSCS-2016 #93271

sorcio · 2022-05-26T20:13:44Z

While working on #84508 I noticed that the mapping for the big5hkscs codec has not been updated in a while. The current version in CPython reflects the Big-5 mappings for HKSCS-2004.

Since then, there have been some updates:

- HKSCS-2008 adds 68 code points to the Big-5 encoding scheme
- HKSCS-2016 adds no code points to Big-5 (it's Unicode-only), but since new characters have been added to Unicode, the mapping can change
- after 2016, at least one mapped code point has been changed in an amendment

I can update the script and generate the mapping using the latest data available on the CCLI website, since I was already looking into this.

If we care about refreshing big5hkscs at all, there are a couple of questions about compatibility. Suppose a Big-5 code X used to map to Unicode code point A (in HKSCS-2004) and now maps to B (in later versions):

should we: decode X to A, or to B?
should we: encode B to X, A to X, or both?

E.g. right now the Big-5 sequence 9D73 round-trips:

>>> x = bytes.fromhex('9D73')
>>> x.decode('big5hkscs') == '\u4ca4'
True
>>> '\u4ca4'.encode('big5hkscs') == x
True

If we followed the new HKSCS-2016 mapping with no compatibility provisions, this round-trip would instead go through the newly mapped character \u9fd0. This might be fine for some users, but it might break compatibility for others. So the questions are about what kind of compatibility we want to guarantee.
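The options can be made concrete with a toy model of that single code. This is only a sketch: the two dictionaries are hypothetical stand-ins for the real mapping tables, and `make_tables` is an illustrative helper, not something in CPython.

```python
# Values taken from the 9D73 example above.
X = bytes.fromhex("9D73")
A = "\u4ca4"  # character mapped in HKSCS-2004
B = "\u9fd0"  # newly mapped character


def make_tables(decode_to, encode_from):
    """Build toy decode/encode tables for one policy (hypothetical helper)."""
    decode_table = {X: decode_to}
    encode_table = {ch: X for ch in encode_from}
    return decode_table, encode_table


# One possible policy: decode X to B, but accept both A and B when encoding.
decode_table, encode_table = make_tables(B, [A, B])

# Text containing B round-trips through X unchanged.
roundtrip_new = decode_table[encode_table[B]]

# Text containing the old character A still encodes, but comes back as B.
roundtrip_old = decode_table[encode_table[A]]
```

Under this policy old data stays decodable and old text stays encodable, at the cost that A no longer survives a round trip.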

A related question, which should not block this issue: for the web platform, WHATWG defines a Big5 encoding that includes HKSCS extensions. It overlaps 99% with big5hkscs but is incompatible in some cases. Since html5lib is one of the users of the CPython CJK codecs, html5lib does not comply with the web platform specifications here. Should CPython be concerned with this, given that it already provides the codec and the mapping tables and could provide a web-compatible codec with just a few fixups? Or does this belong in third-party libraries?


corona10 · 2022-05-28T03:50:28Z

What about following these processes?

1. Provide encode/decode('big5hkscs-2016') and encode/decode('big5hkscs-2004') in Python 3.12
2. Keep encode/decode('big5hkscs') using 'big5hkscs-2004' in Python 3.12 and switch it to 'big5hkscs-2016' in Python 3.13
3. Deprecate encode/decode('big5hkscs-2004') in Python 3.13 and remove it in Python 3.15
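As a rough illustration of the first step, a versioned name can be tried out today with a codec search function. This is a sketch under assumptions: 'big5hkscs-2004' does not exist in CPython, and here it simply points at the current `big5hkscs` codec (which, per the issue description, reflects HKSCS-2004). `_search` and `_ALIASES` are hypothetical names.

```python
import codecs

# Hypothetical versioned name -> existing codec it should resolve to.
_ALIASES = {"big5hkscs_2004": "big5hkscs"}


def _search(name):
    """Codec search function: resolve the hypothetical versioned name."""
    # Lookup names arrive lowercased; newer Python versions also turn
    # hyphens into underscores, so normalize here to cover both cases.
    target = _ALIASES.get(name.replace("-", "_"))
    if target is None:
        return None  # not ours; let other search functions try
    return codecs.lookup(target)


codecs.register(_search)

# The versioned name now behaves exactly like the current codec.
encoded = "中文".encode("big5hkscs-2004")
```

A 'big5hkscs-2016' entry would need its own mapping tables, which is where the compatibility questions above come in.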

sorcio · 2022-05-28T11:36:14Z

Is the question for me or for a topic expert?

In the solution @corona10 suggested, I would still recommend updating from 2004 to 2008, because it only adds characters, so compatibility is trivial. I would even suggest making it happen in the 3.11 beta.

Regarding HKSCS-2016, things are more complex. It formally doesn't specify a Big-5 mapping, and they don't publish files for Big-5 mappings like they did for versions up to 2008. So what I had in mind was more like "Big-5 based on HKSCS-2008 but fixed up to use modern characters", but that's exactly what introduces compatibility questions.

Moreover, this is a moving target. The published data for HKSCS is a JSON file which is continuously updated. There is no 2016 file¹; it gets refreshed whenever there is an amendment. By the time Python 3.13 is out, new amendments may have appeared or a new version of the standard may have been released. And with each change to HKSCS, Python would need to make even more decisions about compatibility with Big-5. It probably doesn't fit the usual deprecation cycle.

I don't know if the users of big5hkscs care about following the latest fixups closely, or rather they need a more stable target. If the issue never came up, it's possible that stability is more important. If we decide that stability comes first, we can re-target this issue to only address HKSCS-2008.

Personally I would care more about a decoder that is compatible with the notion of Big-5 in the web platform (I say "decoder" because the encoding is not fully specified), so I'm ok with dropping the idea of following HKSCS 2016-and-later.

Reformulating the proposal:

1. update big5hkscs to HKSCS-2008 (in 3.11 or 3.12)
2. if there is demand/interest, add a big5hkscs-latest (or better name) that is based on the 2008 Big-5 mapping, but follows the latest available JSON data, with less regard for backwards compatibility
3. if it's deemed to belong in CPython, add a big5web (or better name) codec for web platform compatibility

(The three points are a bit different in scope, so I would repurpose this issue for 1, and maybe open new ones for 2 and 3)
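To give an idea of what point 2 could look like, here is a minimal sketch that layers fixups on top of the stock codec. It is an assumption-laden illustration, not a proposed implementation: `FIXUPS`, `decode_latest` and `encode_latest` are hypothetical names, and the only entry is the 9D73 example from the issue description. A real codec would patch the mapping tables rather than translate after decoding.

```python
# Hypothetical fixup table: code point in the current mapping -> code point
# in the latest data. The single entry is the 9D73 example from this thread.
FIXUPS = {0x4CA4: 0x9FD0}

# Reverse table, used so that the new character can still be encoded.
REVERSE_FIXUPS = {new: old for old, new in FIXUPS.items()}


def decode_latest(data, errors="strict"):
    """Decode with the stock codec, then replace remapped characters."""
    return data.decode("big5hkscs", errors).translate(FIXUPS)


def encode_latest(text, errors="strict"):
    """Map new characters back to their old ones, then encode.

    Old and new characters both end up on the same Big-5 code.
    """
    return text.translate(REVERSE_FIXUPS).encode("big5hkscs", errors)
```

Every amendment to the JSON data would add or change entries in such a table, which is the moving-target problem described above.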

¹ Historical versions of the JSON file are available, but I cannot confirm if there is one that corresponds exactly to "HKSCS-2016 as specified in the document published in May 2017".

NightFurySL2001 · 2024-02-24T07:57:13Z

I think either updating big5hkscs to HKSCS-2008 or adding a new big5hkscs-latest/big5hkscs-2008 would be fine. This encoding shouldn't be used for HKSCS-2016 anyway, and it's pretty much frozen at this point given the prevalent use of Unicode.

sorcio added the type-feature label May 26, 2022

sorcio mentioned this issue May 26, 2022

gh-84508: tool to generate cjk traditional chinese mappings #93272

Merged

AA-Turner added the topic-unicode label May 27, 2022

corona10 self-assigned this May 28, 2022
