US7ASCII database: charset conversion corrupts multi-byte pass-through data #2

@rophy

Description

Version: dabeb4b1 (master)

Legacy Oracle databases created with the US7ASCII character set often store multi-byte characters (Big5, GB2312) as raw bytes, a common practice known as "pass-through". OLR reads nls-character-set: US7ASCII from the schema and applies a charset conversion that strips the high bit from every byte >= 0x80, destroying the original data.

A config option to skip charset conversion and emit the raw bytes as-is would solve this.
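For illustration only, the proposed switch could be a per-source flag in OLR's JSON config. The key name skip-charset-conversion is purely hypothetical (no such option exists today), and the surrounding layout is a minimal sketch, not a complete config:

```json
{
  "source": [{
    "skip-charset-conversion": true
  }]
}
```

When enabled, OLR would copy column bytes into the JSON output verbatim instead of applying the US7ASCII mapping.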

Steps to reproduce

  1. Oracle XE 21c with NLS_CHARACTERSET = US7ASCII
  2. Insert Big5-encoded Chinese characters as raw bytes:
CREATE TABLE TEST_MULTIBYTE (id NUMBER PRIMARY KEY, name VARCHAR2(200));
ALTER TABLE TEST_MULTIBYTE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
-- Big5: 台北 = A578 A55F
INSERT INTO TEST_MULTIBYTE VALUES (1, UTL_RAW.CAST_TO_VARCHAR2(HEXTORAW('A578A55F')));
COMMIT;
  3. Capture redo logs and run OLR in batch mode
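As a sanity check on the repro data, the hex literal inserted in step 2 really is Big5 for 台北; a minimal Python snippet (using only the built-in "big5" codec) confirms it:

```python
# The same bytes passed to HEXTORAW('A578A55F') in the repro SQL.
raw = bytes.fromhex("A578A55F")

# Decoding as Big5 recovers the original two Chinese characters.
print(raw.decode("big5"))  # 台北
```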

Expected result

Raw bytes preserved in JSON output: "NAME" contains bytes A5 78 A5 5F.

Actual result

OLR strips the high bit (& 0x7F) from every byte >= 0x80:

Big5 input: A5 78 A5 5F  (台北)
OLR output: 25 78 25 5F  (%x%_)
{"after":{"ID":1,"NAME":"%x%_"}}

The original Big5 data is unrecoverable from OLR's output.
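The corruption is easy to reproduce outside OLR. This is a minimal Python sketch of the same & 0x7F mapping described above, not OLR's actual code:

```python
raw = bytes.fromhex("A578A55F")          # Big5 bytes for 台北

# Clear the high bit of every byte, as the US7ASCII conversion does.
# 0xA5 & 0x7F -> 0x25 ('%'); 0x78 ('x') and 0x5F ('_') are already < 0x80.
stripped = bytes(b & 0x7F for b in raw)

print(stripped.decode("ascii"))  # %x%_
```

The mapping is many-to-one (0xA5 and 0x25 both come out as 0x25), which is why the original bytes cannot be recovered from the output.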
