Fix/utf8 fh decoding #28
Previously decodeFH() converted each hex pair to a character via
String.fromCharCode, mangling multi-byte UTF-8 sequences (e.g. _C3_A4
became 'Ã¤' instead of 'ä'). Since the generator emits ^CI28, third-party
ZPL using ^FH for non-ASCII glyphs round-tripped through mojibake.
Collect contiguous escape pairs into a Uint8Array and decode the run via
TextDecoder('utf-8'); invalid byte sequences fall back to U+FFFD.
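The difference between the two decoding strategies can be illustrated with a minimal sketch. The variable names are illustrative only and do not come from the PR; the sketch assumes a runtime with a global TextDecoder (browsers, recent Node.js).

```javascript
// Bytes produced by the ^FH escape run _C3_A4 (the UTF-8 encoding of "ä").
const bytes = Uint8Array.from([0xc3, 0xa4]);

// Old behaviour: one character per byte, so each byte is read as a code unit.
const perByte = Array.from(bytes, (b) => String.fromCharCode(b)).join("");

// New behaviour: decode the whole run as a single UTF-8 byte sequence.
const asUtf8 = new TextDecoder("utf-8").decode(bytes);

// A lone 0xE4 is not valid UTF-8; the default (non-fatal) decoder
// substitutes U+FFFD instead of throwing.
const invalid = new TextDecoder("utf-8").decode(Uint8Array.from([0xe4]));

console.log(JSON.stringify([perByte, asUtf8, invalid]));
```

Here perByte is the two-character string U+00C3 U+00A4, asUtf8 is the single character U+00E4, and invalid is U+FFFD.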
Replace inner matchAll + impossible-case fallback with a stride loop:
the outer regex already guarantees the run is a sequence of fixed-width
{delim}XX pairs, so byte offsets are computable directly. One regex
allocation per call, fixed Uint8Array, no defensive ?? fallback.
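A stride loop of this kind might look roughly like the sketch below. It is not the PR's code: the signature of decodeFH, the escapeRegExp helper, and the default delimiter argument are assumptions made for illustration.

```javascript
// Hypothetical sketch; names and signature are assumptions, not the PR's code.
const utf8 = new TextDecoder("utf-8");

// Escape regex metacharacters so any delimiter character can be used.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function decodeFH(text, delim = "_", decoder = utf8) {
  // One regex per call: a run is one or more {delim}XX pairs.
  const d = escapeRegExp(delim);
  const runs = new RegExp(`(?:${d}[0-9A-Fa-f]{2})+`, "g");
  const stride = delim.length + 2; // width of one {delim}XX pair

  return text.replace(runs, (run) => {
    // The regex guarantees run.length is a multiple of stride,
    // so the byte count and every offset can be computed directly.
    const bytes = new Uint8Array(run.length / stride);
    for (let i = 0; i < bytes.length; i++) {
      const at = i * stride + delim.length; // offset of the two hex digits
      bytes[i] = parseInt(run.slice(at, at + 2), 16);
    }
    return decoder.decode(bytes);
  });
}

console.log(decodeFH("K_C3_A4se"));
```

Because the outer regex only matches well-formed pairs, parseInt never sees non-hex input, which is why no defensive fallback is needed inside the loop.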
Code Review
This pull request updates the decodeFH function to support multi-byte UTF-8 characters by grouping contiguous hex escape sequences and decoding them via TextDecoder, and adds corresponding test cases. However, feedback indicates that hardcoding the UTF-8 decoder introduces a regression for labels using single-byte encodings (such as CP1252), where valid characters may be incorrectly replaced with the Unicode replacement character. It is recommended to track the active encoding from the ^CI command to maintain compatibility across different ZPL configurations.
The ^FH decoder was hardcoded to UTF-8, which broke single-byte encodings (^CI27 / ^CI0..13) where bytes like 0xE4 (= ä in CP1252) are valid but invalid as standalone UTF-8 → U+FFFD.

Track ^CI state in the parser, map known values to TextDecoder labels:
- ^CI28 → utf-8
- ^CI27 → windows-1252
- ^CI0..^CI13 → windows-1252 (ASCII-compatible legacy variants)
- others → keep current decoder, surface as partial import

Decoders are cached to avoid per-field allocation. Default remains UTF-8 to preserve round-trip fidelity for this app's own generator output.
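The mapping and the decoder cache described above could be sketched as follows. The return shape of ciToEncoding ({ label, supported }) and the getDecoder helper are assumptions for illustration, and the sketch assumes the runtime's TextDecoder supports the windows-1252 label (browsers do; Node.js needs full ICU).

```javascript
// Sketch only: the { label, supported } shape and getDecoder are assumptions.
function ciToEncoding(n) {
  if (n === 28) return { label: "utf-8", supported: true };
  // ^CI27 and the legacy ^CI0..^CI13 variants are treated as CP1252.
  if (n === 27 || (Number.isInteger(n) && n >= 0 && n <= 13)) {
    return { label: "windows-1252", supported: true };
  }
  // Anything else is unsupported; the caller surfaces a partial import.
  return { label: "utf-8", supported: false };
}

// One TextDecoder per label, created lazily and reused across fields.
const decoderCache = new Map();

function getDecoder(label) {
  let decoder = decoderCache.get(label);
  if (!decoder) {
    decoder = new TextDecoder(label);
    decoderCache.set(label, decoder);
  }
  return decoder;
}

// The same byte decodes differently depending on the active ^CI value.
const e4 = Uint8Array.from([0xe4]);
const underCI27 = getDecoder(ciToEncoding(27).label).decode(e4);
const underCI28 = getDecoder(ciToEncoding(28).label).decode(e4);
console.log(JSON.stringify([underCI27, underCI28]));
```

Under ^CI27 the byte 0xE4 decodes to U+00E4, while under ^CI28 it is an incomplete UTF-8 sequence and decodes to U+FFFD.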
/gemini review
Code Review
This pull request enhances the ZPL parser's ^FH hex escape decoding by supporting multi-byte UTF-8 sequences and character set selection via the ^CI command. It introduces a TextDecoder cache and updates the decoding logic to process contiguous hex escape runs as byte sequences. Feedback was provided regarding the ^CI command handler, which currently fails to reset the decoder to UTF-8 when an unsupported encoding is encountered, contradicting the intended fallback behavior described in the documentation.
Previously an unknown ^CI N kept whatever decoder was active before, contradicting the comment that promised a UTF-8 fallback. If ^CI27 preceded the unknown command, CP1252 stayed silently active. Always rebind to ciToEncoding's label (which is 'utf-8' for the unsupported branch) so behaviour matches the documentation and is predictable regardless of prior state.
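The always-rebind behaviour might be sketched like this. Only the name ciToEncoding appears in the commit message; the parser state object, the handleCI function, and the partial flag are hypothetical stand-ins.

```javascript
// Hypothetical parser state and handler; only ciToEncoding is named in the PR.
function ciToEncoding(n) {
  if (n === 28) return { label: "utf-8", supported: true };
  if (n === 27 || (Number.isInteger(n) && n >= 0 && n <= 13)) {
    return { label: "windows-1252", supported: true };
  }
  return { label: "utf-8", supported: false }; // unsupported branch
}

function handleCI(state, n) {
  const { label, supported } = ciToEncoding(n);
  // Always rebind, even for unsupported values, so that an earlier
  // ^CI27 cannot stay silently active after an unknown ^CI.
  state.encoding = label;
  if (!supported) state.partial = true; // surfaced as a partial import
}

const state = { encoding: "utf-8", partial: false };
handleCI(state, 27); // CP1252 becomes active
const afterKnown = state.encoding;
handleCI(state, 99); // unsupported: back to UTF-8, flagged as partial
console.log(afterKnown, state.encoding, state.partial);
```

With the unconditional assignment, the active encoding after any ^CI depends only on that command's argument, not on what preceded it.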