### Encoding a string to a byte stream

In [2]:
import typing

def PrintLine(label: str, content: str, labelWidth: int = 16, alignSymbol: str = "<"):
    # Build a format spec such as "{label:<16} : {content}" so the labels line up.
    formatString = f"{{label:{alignSymbol}{labelWidth}}} : {{content}}"
    ret = formatString.format(label=label, content=content)
    print(ret)

def PrintTextCode(s: str, codePages: typing.List[str]):
    # Print the hex bytes of s under each code page, or the codec error if s cannot be encoded.
    PrintLine("String", s)
    for cp in codePages:
        try:
            PrintLine(f"Encoded by {cp}", s.encode(cp).hex())
        except UnicodeEncodeError as e:
            print(str(e))

TEST_CODE_PAGES = ['utf-8', 'gbk', 'cp936', 'cp932', 'cp1252']
s = "test 测试 テスト"

PrintTextCode(s, TEST_CODE_PAGES)

String           : test 测试 テスト
Encoded by utf-8 : 7465737420e6b58be8af9520e38386e382b9e38388
Encoded by gbk   : 7465737420b2e2cad420a5c6a5b9a5c8
Encoded by cp936 : 7465737420b2e2cad420a5c6a5b9a5c8
'cp932' codec can't encode character '\u6d4b' in position 5: illegal multibyte sequence
'charmap' codec can't encode characters in position 5-6: character maps to <undefined>
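
In the output above, the `gbk` and `cp936` rows are byte-for-byte identical, and the last error names `charmap` rather than `cp1252`. The following is a minimal sketch, assuming CPython's standard `codecs` registry, that prints how each requested name resolves. `CanonicalName` is a hypothetical helper and not part of the original notebook.

```python
import codecs

def CanonicalName(codePage: str) -> str:
    # codecs.lookup() returns a CodecInfo whose .name is the canonical codec name.
    return codecs.lookup(codePage).name

for cp in ['utf-8', 'gbk', 'cp936', 'cp932', 'cp1252']:
    print(f"{cp:<8} -> {CanonicalName(cp)}")
```

In CPython `cp936` is registered as an alias of `gbk`, which is why the two rows match. The `charmap` in the error message comes from the table-driven machinery that single-byte codecs such as `cp1252` are built on.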


### Which character is '\u6d4b'?

In [3]:
PrintTextCode('\u6d4b', ['utf-8'])

String           : 测
Encoded by utf-8 : e6b58b
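
Another way to identify an escaped character is to ask the Unicode database for its code point and name. This is a small sketch using the standard `unicodedata` module; `DescribeChar` is a hypothetical helper introduced only for illustration.

```python
import unicodedata

def DescribeChar(ch: str) -> str:
    # ord() gives the code point; unicodedata.name() gives the official character name.
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

print(DescribeChar('\u6d4b'))
print(DescribeChar('測'))
```

CJK ideographs carry algorithmic names, so the name mostly confirms the block a character belongs to; printing the character itself, as the cell above does, remains the quickest check.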


### Is “测” missing from the Japanese kanji?

In [4]:
PrintTextCode('測', TEST_CODE_PAGES)

String           : 測
Encoded by utf-8 : e6b8ac
Encoded by gbk   : 9c79
Encoded by cp936 : 9c79
Encoded by cp932 : 91aa
'charmap' codec can't encode character '\u6e2c' in position 0: character maps to <undefined>
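
The codec errors report only the first position that fails, so it can help to list every character a code page cannot represent. The sketch below does that one character at a time; `UnencodableChars` is a hypothetical helper, not part of the original notebook.

```python
import typing

def UnencodableChars(s: str, codePage: str) -> typing.List[str]:
    # Try each character on its own and collect the distinct ones that fail.
    missing: typing.List[str] = []
    for ch in s:
        try:
            ch.encode(codePage)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

print(UnencodableChars("test 测试 テスト", "cp932"))
print(UnencodableChars("test 測試 テスト", "cp932"))
```

The second call prints an empty list: cp932 carries the traditional form 測 but not the simplified 测, which matches the encoded output above.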


### Applying the change

In [5]:
s2 = "test 測試 テスト"
PrintTextCode(s2, TEST_CODE_PAGES)

String           : test 測試 テスト
Encoded by utf-8 : 7465737420e6b8ace8a9a620e38386e382b9e38388
Encoded by gbk   : 74657374209c79d48720a5c6a5b9a5c8
Encoded by cp936 : 74657374209c79d48720a5c6a5b9a5c8
Encoded by cp932 : 746573742091aa8e8e20836583588367
'charmap' codec can't encode characters in position 5-6: character maps to <undefined>


### Where does mojibake come from?

In [10]:
s3 = "test 測試 テスト 시험하는것"
def Mojibake(s: str, encCp: str, decCp: str):
    # Encode s with encCp, then decode the same bytes with a different code page decCp.
    print("----------------------------------------------------------")
    PrintLine("String", s)
    # Named "encoded" so the built-in bytes type is not shadowed.
    encoded = s.encode(encCp, errors='replace')
    PrintLine(f"{encCp} Enc2Bytes", encoded.hex(":"))
    newS = encoded.decode(decCp, errors='replace')
    PrintLine(f"{decCp} Dec2Str", newS)
    return newS

# 乱码 (Chinese) == mojibake == 文字化け (Japanese, "character transformation")
moji1 = Mojibake(s3, 'cp932', 'cp936')
moji2 = Mojibake(s3, 'cp936', 'cp932')

moji3 = Mojibake(moji1, 'cp936', 'cp932')
moji4 = Mojibake(moji2, 'cp932', 'cp936')

moji5 = Mojibake(s3, 'utf-8', 'cp1252')
moji6 = Mojibake(moji5, 'cp1252', 'utf-8')

moji7 = Mojibake("一隻憂鬱的台灣烏龜", 'cp950', 'cp936')

----------------------------------------------------------
String           : test 測試 テスト 시험하는것
cp932 Enc2Bytes  : 74:65:73:74:20:91:aa:8e:8e:20:83:65:83:58:83:67:20:3f:3f:3f:3f:3f
cp936 Dec2Str    : test 應帋 僥僗僩 ?????
----------------------------------------------------------
String           : test 測試 テスト 시험하는것
cp936 Enc2Bytes  : 74:65:73:74:20:9c:79:d4:87:20:a5:c6:a5:b9:a5:c8:20:3f:3f:3f:3f:3f
cp932 Dec2Str    : test 忱ﾔ� ･ﾆ･ｹ･ﾈ ?????
----------------------------------------------------------
String           : test 應帋 僥僗僩 ?????
cp936 Enc2Bytes  : 74:65:73:74:20:91:aa:8e:8e:20:83:65:83:58:83:67:20:3f:3f:3f:3f:3f
cp932 Dec2Str    : test 測試 テスト ?????
----------------------------------------------------------
String           : test 忱ﾔ� ･ﾆ･ｹ･ﾈ ?????
cp932 Enc2Bytes  : 74:65:73:74:20:9c:79:d4:3f:20:a5:c6:a5:b9:a5:c8:20:3f:3f:3f:3f:3f
cp936 Dec2Str    : test 測�? テスト ?????
----------------------------------------------------------
String           : test 測試 テスト 시험하는것
utf-8 Enc2Bytes  : 74:6
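
The runs above suggest that mojibake can be undone only when the wrong decoder happened to accept every byte. Below is a minimal sketch of that check, assuming the same `errors='replace'` policy as `Mojibake`; `RoundTrip` is a hypothetical helper introduced for illustration.

```python
def RoundTrip(s: str, encCp: str, decCp: str) -> str:
    # Garble: encode with encCp, then decode with the wrong code page decCp.
    garbled = s.encode(encCp).decode(decCp, errors='replace')
    # Attempt the repair by reversing both steps.
    return garbled.encode(decCp, errors='replace').decode(encCp, errors='replace')

print(RoundTrip("テスト", "cp932", "cp936") == "テスト")
print(RoundTrip("測試", "cp936", "cp932") == "測試")
```

The first comparison holds because every cp932 byte pair of "テスト" is also a valid cp936 pair. The second fails because one byte of "測試" has no cp932 reading and is replaced by U+FFFD, after which the original byte cannot be recovered.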

### “烫烫烫”, “屯屯屯”, and “锟斤拷”

In [9]:
PrintTextCode("烫烫烫烫", TEST_CODE_PAGES)
PrintTextCode("屯屯屯屯", TEST_CODE_PAGES)
PrintTextCode("锟斤拷", TEST_CODE_PAGES)
PrintTextCode("锘", TEST_CODE_PAGES)

String           : 烫烫烫烫
Encoded by utf-8 : e783abe783abe783abe783ab
Encoded by gbk   : cccccccccccccccc
Encoded by cp936 : cccccccccccccccc
'cp932' codec can't encode character '\u70eb' in position 0: illegal multibyte sequence
'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
String           : 屯屯屯屯
Encoded by utf-8 : e5b1afe5b1afe5b1afe5b1af
Encoded by gbk   : cdcdcdcdcdcdcdcd
Encoded by cp936 : cdcdcdcdcdcdcdcd
Encoded by cp932 : 93d493d493d493d4
'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
String           : 锟斤拷
Encoded by utf-8 : e9949fe696a4e68bb7
Encoded by gbk   : efbfbdefbfbd
Encoded by cp936 : efbfbdefbfbd
'cp932' codec can't encode character '\u951f' in position 0: illegal multibyte sequence
'charmap' codec can't encode characters in position 0-2: character maps to <undefined>
String           : 锘
Encoded by utf-8 : e99498
Encoded by gbk   : efbb
Encoded by cp936 : efbb
'cp932' codec can't encode 

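* `0xCC`: the x86/x64 `int 3` (breakpoint) instruction; MSVC debug builds fill uninitialized stack memory with it.
* `0xCD`: the value the MS CRT debug heap writes into newly allocated, still uninitialized heap memory (freed memory is filled with `0xDD` instead).
* `EF:BB`: the first two bytes of the UTF-8 BOM (`EF:BB:BF`).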
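
Going in the other direction makes the link concrete: decoding runs of these bytes as GBK reproduces the familiar strings. This is a small sketch; `DecodeFill` is a hypothetical helper and `gbk` is assumed as the display code page.

```python
import codecs

def DecodeFill(byteValue: int, count: int, codePage: str = 'gbk') -> str:
    # Interpret a run of one repeated fill byte as text in the given code page.
    return bytes([byteValue] * count).decode(codePage, errors='replace')

print(DecodeFill(0xCC, 8))                   # run of 0xCC fill bytes
print(DecodeFill(0xCD, 8))                   # run of 0xCD fill bytes
print(codecs.BOM_UTF8[:2].decode('gbk'))     # first two bytes of the UTF-8 BOM
```

Each GBK character here consumes two bytes, so eight fill bytes display as four characters.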

### What is "锟斤拷"?

In [8]:
moji6 = Mojibake("锟斤拷", 'cp936', 'utf-8')

----------------------------------------------------------
String           : 锟斤拷
cp936 Enc2Bytes  : ef:bf:bd:ef:bf:bd
utf-8 Dec2Str    : ��
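
The bytes `ef:bf:bd` are the UTF-8 encoding of U+FFFD, the replacement character. One commonly described path to "锟斤拷" is sketched below under that assumption: bytes that are not valid UTF-8 are decoded with replacement, the result is saved as UTF-8, and that output is later read as GBK. `KunJinKao` is a hypothetical helper name.

```python
REPLACEMENT = "\ufffd"

def KunJinKao(raw: bytes) -> str:
    # Step 1: decode as UTF-8; every undecodable byte becomes U+FFFD.
    replaced = raw.decode("utf-8", errors="replace")
    # Step 2: save the result as UTF-8; each U+FFFD becomes EF BF BD.
    reencoded = replaced.encode("utf-8")
    # Step 3: read those bytes as GBK, two bytes per character.
    return reencoded.decode("gbk", errors="replace")

print((REPLACEMENT * 2).encode("utf-8").hex(":"))
print(KunJinKao("测试".encode("gbk")))
```

Two replacement characters occupy six bytes, which GBK reads as three two-byte characters; that is why the string tends to appear in groups of three.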


### Other values used to fill memory to aid diagnosis
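* `0xFD`: guard bytes ("no man's land") that the MS CRT debug heap places on either side of an allocation.
* `0xDD`: the MS CRT debug heap's fill for freed heap memory.
* `0xBAADF00D`: used by Windows `LocalAlloc(LMEM_FIXED)` to mark allocated but uninitialized heap memory when the debug heap is active.
* `0xDEADBEEF`: a magic debug value used on many systems, typically to mark freed or uninitialized memory.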