Corner cases arising from Big5 encoder not excluding HKSCS codes with lead bytes 0xFA–FE

https://encoding.spec.whatwg.org/commit-snapshots/4d54adce6a871cb03af3a919cbf644a43c22301a/#visualization

> Let index be index Big5 excluding all entries whose pointer is less than (0xA1 - 0x81) × 157.
> 
> Avoid returning Hong Kong Supplementary Character Set extensions literally.

As became apparent in my attempts to [chart different Big5 and CNS 11643 variants](https://harjit.moe/cns-conc.html): if the intention is to make the encoder purely [Big5-ETEN](https://moztw.org/docs/big5/table/eten.txt), excluding all further extensions that Big5-HKSCS adds on top of it, then lead bytes 0xFA–FE need to be excluded, not just 0x81–A0.
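To make the arithmetic concrete, here is a minimal Python sketch of the pointer calculation and the two exclusion rules being compared. The pointer formula follows the Big5 decoder in the Encoding Standard; the function names, and the upper cutoff `(0xFA - 0x81) × 157`, are my own illustration of the suggestion above and are not text from the standard.

```python
# Sketch only: helper names are hypothetical, not from the Encoding Standard.

def big5_pointer(lead: int, trail: int) -> int:
    """Pointer into index-big5 for a lead/trail byte pair."""
    # Trail bytes 0x40-0x7E use offset 0x40; 0xA1-0xFE use offset 0x62.
    offset = 0x40 if trail < 0x7F else 0x62
    return (lead - 0x81) * 157 + (trail - offset)

# Current rule: drop pointers for lead bytes 0x81-A0.
LOW_END = (0xA1 - 0x81) * 157      # 5024
# Assumed extra rule: also drop pointers for lead bytes 0xFA-FE.
HIGH_START = (0xFA - 0x81) * 157   # 18997

def excluded_by_current_rule(pointer: int) -> bool:
    return pointer < LOW_END

def excluded_by_suggested_rule(pointer: int) -> bool:
    return pointer < LOW_END or pointer >= HIGH_START
```

Under this sketch, a pointer for lead byte 0xFB passes the current rule but is rejected by the suggested one, while the last Big5-ETEN code, 0xF9FE, passes both.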

The only-partial exclusion of HKSCS in the encoder defined by the current standard actually creates some truly bizarre corner cases, owing to how it interacts with index-big5's inclusion of the duplicate mappings inherited from GCCS (which a lot of even HKSCS-equipped Big5 codecs, e.g. Python's `big5-hkscs`, do not accept). Some of these duplicate other GCCS/HKSCS codes, rather than standard Big5 codes. In four cases, one of these GCCS duplicates has a lead byte in 0xFA–FE, while its standard HKSCS code has a lead byte in 0x81–A0. Hence, the WHATWG-described behaviour ends up decoding them from both codes, but encoding them to their GCCS duplicates, as follows.

```
0x9DEF → 嘅 U+5605 ↔ 0xFB48
0x9DFB → 廐 U+5ED0 ↔ 0xFBF9
0xA0DC → 悤 U+60A4 ↔ 0xFC6C
0x9975 → 猪 U+732A ↔ 0xFE52
```
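The four rows above can be reproduced with a toy model of the encoder. This is a sketch under stated assumptions: `TOY_INDEX` holds only the eight pointers from the table rather than the real index-big5, all names are hypothetical, and the standard's special "last pointer" handling for a handful of other code points is left out because it does not affect these four.

```python
# Sketch only: a toy index with just the eight codes from the table.

def pointer_from_code(code: int) -> int:
    lead, trail = code >> 8, code & 0xFF
    offset = 0x40 if trail < 0x7F else 0x62
    return (lead - 0x81) * 157 + (trail - offset)

def code_from_pointer(pointer: int) -> int:
    lead = pointer // 157 + 0x81
    trail = pointer % 157
    offset = 0x40 if trail < 0x3F else 0x62
    return (lead << 8) | (trail + offset)

# (standard HKSCS code, code point, GCCS duplicate code)
PAIRS = [
    (0x9DEF, 0x5605, 0xFB48),
    (0x9DFB, 0x5ED0, 0xFBF9),
    (0xA0DC, 0x60A4, 0xFC6C),
    (0x9975, 0x732A, 0xFE52),
]
TOY_INDEX = {}
for hkscs, code_point, gccs in PAIRS:
    TOY_INDEX[pointer_from_code(hkscs)] = code_point
    TOY_INDEX[pointer_from_code(gccs)] = code_point

LOW_END = (0xA1 - 0x81) * 157     # current exclusion: lead bytes 0x81-A0
HIGH_START = (0xFA - 0x81) * 157  # assumed extra exclusion: lead bytes 0xFA-FE

def decode(code: int):
    """The decoder consults the whole index, so both codes are accepted."""
    return TOY_INDEX.get(pointer_from_code(code))

def encode(code_point: int, exclude_high: bool = False):
    """Return the first non-excluded code for code_point, or None."""
    for pointer in sorted(TOY_INDEX):
        if pointer < LOW_END:
            continue
        if exclude_high and pointer >= HIGH_START:
            continue
        if TOY_INDEX[pointer] == code_point:
            return code_from_pointer(pointer)
    return None

for hkscs, code_point, gccs in PAIRS:
    print(f"U+{code_point:04X}", encode(code_point), encode(code_point, True))
```

In this model, each of the four code points decodes from either code, encodes to its GCCS duplicate under the current rule, and becomes unencodable (as in plain Big5-ETEN) once lead bytes 0xFA–FE are excluded as well.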

Accepting these GCCS duplicates is probably fine, but generating them (when not even all HKSCS-equipped implementations will accept them) is probably inappropriate, even assuming (for the sake of argument) that the encoder's current halfway house between Big5-ETEN and Big5-HKSCS was deliberately chosen.
