Description
Let index be index Big5 excluding all entries whose pointer is less than (0xA1 - 0x81) × 157.
Avoid returning Hong Kong Supplementary Character Set extensions literally.
As became apparent in my attempts to chart different Big5 and CNS 11643 variants: if the intention is to make the encoder purely Big5-ETEN, excluding all further extensions that Big5-HKSCS adds on top of it, then lead bytes 0xFA–FE need to be excluded, not just 0x81–A0.
The only-partial exclusion of HKSCS in the encoder defined by the current standard actually creates some truly bizarre corner cases, owing to how it interacts with index-big5's inclusion of the duplicate mappings inherited from GCCS (which many Big5 codecs, even HKSCS-equipped ones such as Python's big5-hkscs, do not accept). Some of these duplicate other GCCS/HKSCS codes, rather than standard Big5 codes. In four cases, one of these GCCS duplicates has a lead byte in 0xFA–FE, while its standard HKSCS code has a lead byte in 0x81–A0. Hence, the WHATWG-described behaviour ends up decoding them from both, but encoding them to their GCCS duplicates, as follows:
0x9DEF → 嘅 U+5605 ↔ 0xFB48
0x9DFB → 廐 U+5ED0 ↔ 0xFBF9
0xA0DC → 悤 U+60A4 ↔ 0xFC6C
0x9975 → 猪 U+732A ↔ 0xFE52
Accepting these GCCS duplicates is probably fine, but generating them (when not even all HKSCS-equipped implementations will accept them) is probably inappropriate, even assuming (for the sake of argument) that the encoder's current halfway house between Big5-ETEN and Big5-HKSCS was deliberately chosen.
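To make the arithmetic concrete, here is a minimal Python sketch of the pointer calculation applied to the four pairs listed above. It assumes the pointer formula used by the Big5 decoder and encoder in the standard, (lead − 0x81) × 157 plus the trail byte minus an offset. The function and constant names are illustrative only, and SUGGESTED_UPPER_BOUND is merely one possible way of expressing the additional exclusion of lead bytes 0xFA–FE, not wording taken from the standard.

```python
# Illustrative sketch only: names below are hypothetical, not from any library.

def pointer_from_bytes(lead: int, trail: int) -> int:
    """Map a two-byte Big5 sequence to its index-big5 pointer."""
    if not 0x81 <= lead <= 0xFE:
        raise ValueError(f"invalid lead byte 0x{lead:02X}")
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        raise ValueError(f"invalid trail byte 0x{trail:02X}")
    offset = 0x40 if trail < 0x7F else 0x62
    return (lead - 0x81) * 157 + (trail - offset)


def bytes_from_pointer(pointer: int) -> tuple[int, int]:
    """Inverse mapping, as used on the encoder side."""
    lead = pointer // 157 + 0x81
    trail = pointer % 157
    offset = 0x40 if trail < 0x3F else 0x62
    return lead, trail + offset


# Threshold the encoder currently applies (lead bytes 0x81-A0 excluded).
CURRENT_LOWER_BOUND = (0xA1 - 0x81) * 157    # 5024
# Assumed additional threshold that would also exclude lead bytes 0xFA-FE.
SUGGESTED_UPPER_BOUND = (0xFA - 0x81) * 157  # 18997


def kept_by_current_encoder(pointer: int) -> bool:
    return pointer >= CURRENT_LOWER_BOUND


def kept_by_suggested_encoder(pointer: int) -> bool:
    return CURRENT_LOWER_BOUND <= pointer < SUGGESTED_UPPER_BOUND


# (standard HKSCS code, GCCS duplicate) pairs from the list above.
PAIRS = [(0x9DEF, 0xFB48), (0x9DFB, 0xFBF9), (0xA0DC, 0xFC6C), (0x9975, 0xFE52)]

for hkscs, gccs in PAIRS:
    p_hkscs = pointer_from_bytes(hkscs >> 8, hkscs & 0xFF)
    p_gccs = pointer_from_bytes(gccs >> 8, gccs & 0xFF)
    print(
        f"0x{hkscs:04X}: pointer {p_hkscs}, kept now: {kept_by_current_encoder(p_hkscs)} | "
        f"0x{gccs:04X}: pointer {p_gccs}, kept now: {kept_by_current_encoder(p_gccs)}, "
        f"kept if 0xFA-FE also excluded: {kept_by_suggested_encoder(p_gccs)}"
    )
```

In each pair, the standard HKSCS code falls below the current threshold and is dropped from the encoder's index, while the GCCS duplicate lies above it and so becomes the only remaining pointer for that code point. Under the assumed upper bound, both pointers would be excluded and the encoder would emit neither.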