-
-
Notifications
You must be signed in to change notification settings - Fork 31.6k
Range checking in GB18030 decoder #74176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
This issue is split from bpo-24117, that issue became a soup of small issues, so I'm going to close it. For 4-byte GB18030 sequence, the legal range is: The current code forgets to check 0xFE for the 1st and 3rd byte. # legal sequence b'\x81\x31\x81\x30' is decoded to U+060A, it's fine.
uchar = b'\x81\x31\x81\x30'.decode('gb18030')
print(hex(ord(uchar)))
# illegal sequence 0x8130FF30 can be decoded to U+060A as well, this should not happen.
uchar = b'\x81\x30\xFF\x30' .decode('gb18030')
print(hex(ord(uchar))) |
The table in wikipedia is somewhat complex. I find ftp://ftp.software.ibm.com/software/globalization/documents/gb18030m.pdf and the table in it is same as https://pan.baidu.com/share/link?shareid=2606985291&uk=3341026630 (except 0x80) but in English. I agree with Ma Lin bytes sequences like b'\x81\x30\xFF\x30' are invalid. For current implementation, you could see: >>> invalid = b'\x81\x30\xff\x30'
>>> invalid.decode('gb18030').encode('gb18030') == invalid
False |
I suppose the English edition is not the final release of GB18030-2000. At the end of official Chinese edition of GB18030-2005, listed the difference between GB18030-2000 and GB18030-2005 clearly, it doesn't mention 0x80 (€), so GB18030-2000 should not has 0x80 as well. Why 0x80 (€) appear in English edition? Anyway, 0x80 is another story, not conflict with this issue. |
Yes, 0x80 doesn't matter here. It's nice to make the backporting PRs. But let's wait some time for ezio and haypo's comments and reviews. Get the master PR merged first and then continue on backporting. :-) |
An incorrect implementation of a decoder might lead to security vulnerabilities: *But* UTF-8 decoder of Python 2 is *not* strict and nobody complained. I suggest that, once the changed is merged in master, backport the fix to 3.6 and 3.5. But I'm not sure that it's worth it to backport it to 2.7? Is there a risk to break an application? |
This is a very trivial bug, it's hard to imagine a scene that someone trying to decode those 8630 illegal 4-byte sequences with GB18030 decoder. As far as I can see, GB2312/GBK/GB18030 codecs are bugfree except this bug, of course maybe I'm wrong. |
Thanks Ma Lin for finding the problem! Don't know why you close the PR but anyway, we solve it finally. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: