The current code does not check the upper bound 0xFE for the 1st and 3rd bytes.
As a result, 8630 illegal 4-byte sequences can be decoded by the GB18030 codec. Here is an example:
# legal sequence b'\x81\x31\x81\x30' is decoded to U+060A, it's fine.
uchar = b'\x81\x31\x81\x30'.decode('gb18030')
# illegal sequence 0x8130FF30 can be decoded to U+060A as well, this should not happen.
uchar = b'\x81\x30\xFF\x30'.decode('gb18030')
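To illustrate why the two sequences above collide, here is a minimal sketch in pure Python. It does not call the real codec. All function names (`linear_index`, `is_wellformed`, `lenient_accepts`, `count_aliases`) are hypothetical. `lenient_accepts` is only my assumption of how a decoder without the 0xFE checks behaves: it range-checks the linear index instead of the bytes, treating lead bytes 0x81..0x84 as the BMP block and 0x90 and above as the supplementary block. Under that assumption the count matches the 8630 figure.

```python
BMP_LIMIT = 39420        # assumed number of 4-byte BMP slots (lead bytes 0x81..0x84)
SUPP_BASE = 15 * 12600   # linear index of 0x90308130, assumed to map to U+10000


def linear_index(seq):
    """Linear index of a 4-byte sequence, computed without any range checks."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def is_wellformed(seq):
    """Byte-range check that includes the 0xFE upper bound on bytes 1 and 3."""
    b1, b2, b3, b4 = seq
    return (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39)


def lenient_accepts(seq):
    """Assumed behaviour of a decoder that only range-checks the linear index."""
    idx = linear_index(seq)
    if seq[0] <= 0x84:                      # BMP block
        return idx < BMP_LIMIT
    if seq[0] >= 0x90:                      # supplementary block
        return idx - SUPP_BASE <= 0xFFFFF   # up to U+10FFFF
    return False


def count_aliases():
    """Count ill-formed sequences (byte 1 or byte 3 is 0xFF) that are still accepted."""
    count = 0
    for b1 in range(0x81, 0x100):
        for b2 in range(0x30, 0x3A):
            for b3 in range(0x81, 0x100):
                if b1 != 0xFF and b3 != 0xFF:
                    continue  # well-formed, not of interest here
                for b4 in range(0x30, 0x3A):
                    if lenient_accepts(bytes([b1, b2, b3, b4])):
                        count += 1
    return count


legal = b'\x81\x31\x81\x30'
illegal = b'\x81\x30\xFF\x30'
print(linear_index(legal), linear_index(illegal))    # both share one linear index
print(is_wellformed(legal), is_wellformed(illegal))
print(count_aliases())
```

A third byte of 0xFF sits one past the end of its 126-value range, so it carries over into the second byte: 0x8130FF30 lands on the same linear index as 0x81318130. A first byte of 0xFF pushes the index past U+10FFFF, so in this sketch those sequences are rejected anyway.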
I suppose the English edition is not the final release of GB18030-2000.
The end of the official Chinese edition of GB18030-2005 clearly lists the differences between GB18030-2000 and GB18030-2005. It doesn't mention 0x80 (€), so GB18030-2000 should not have 0x80 either.
Why does 0x80 (€) appear in the English edition?
I searched on Google, and this thread says 0x80 appears in the *draft* of GB18030-2000: http://www.pkucn.com/thread-304395-1-1.html
So maybe the English edition is a translation of the GB18030-2000 draft; this reasoning seems OK.
Anyway, 0x80 is another story and does not conflict with this issue.
Zhang, do I need to make separate PRs for 3.6/3.5/2.7?
This is a very trivial bug; it's hard to imagine a scenario in which someone tries to decode those 8630 illegal 4-byte sequences with the GB18030 decoder.
And I think this bug can't lead to security vulnerabilities.
As far as I can see, the GB2312/GBK/GB18030 codecs are bug-free apart from this bug, though of course I may be wrong.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.