-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrong behavior for '\xff\n'.decode('gb2312', 'ignore') #56225
Comments
let s='\xff\n' |
u'' in 2.7.1 also, on winxp |
So the correct result for b'\xff\n'.decode('gb2312', 'replace') is u'?\n'? |
I think it should be so. This behavior does not leave out possible information, has no side-effect on later decodings, and should the '\n' indeed be redundant, an output of u'?\n' would unlikely cause confusions. Though, I have no knowledge on this subject code-wise. If a change of the behavior will have an impact on performance, maybe the change should not come in. |
_codecs_cn implements different multibyte encodings: gb2312, gbkext, gbcommon, gb18030ext, gbk, gb18030. And there are other Asian multibyte encodings: big5 family, ISO 2202 family, JIS family, korean encodings (KSX1001, EUC_KR, CP949, ...), Big5, CP950, ... All of them ignore the all bytes if one byte of a multibyte sequence is invalid (lile 0xFF 0x0A: replaced by ? instead of ?\n using replace error handler). I don't think that you can/should patch only one encoding: we should use the same rule for all encodings. By the way, do you have any document explaining which result is the good one (? or ?\n)? For UTF-8, we have well defined standards explaining exactly what to do with invalid byte sequences => see issue bpo-8271. It is easy to fix the decoders, but I would like to be sure that your proposed change is the right way to decode these encodings. Change the multibyte encodings can also concern the security. Read for example the following section "Check byte strings before decoding them to character strings" of my book: |
I do not have documents on this subject. Though, I found that GNU iconv(1) behaves the same as my proposed behavior. My reading of the source code suggests that iconv(1) treat all encodings equally, which I think should also be true for python. As of security concerns, I do not think the change in decoding function itself would introduce any security vulnerabilities. If a security issue arises because of the proposed change, there must be improper code out side of python, which is out of python's control. That said, the proposed change is unlikely to introduce new security vulnerability, as all it does in effect is retaining a few ascii characters in the string to the output as opposed to removing. In the issue of wordpress, if we suppose that wordpress was written in python, and that the attacker was using gb2312 encoded strings instead of gbk, then my proposed change would by chance fix the issue, as the backslash would be retained when we decode the string. |
I asked if the change is correct on iconv mail list. Here is a copy of an answer. De: Bruno Haible Hi,
For UTF-8 the recommended way of handling malformed input is written down In particular, normally, if the first byte that is considered "wrong" For an example implementation, see
It is better to replace only the 0xFF byte with U+FFFD, because 0x61 is a
Generally, yes.
For stateful encodings of the ISO 2202 family, you may want to ignore/replace BrunoIn memoriam Siegfried Rädel <http://en.wikipedia.org/wiki/Siegfried_Rädel\> |
Oh, the HZ codec has no test! And what is this horrible BLOB, Lib/test/cjkencodings_test.py? |
I can now work confidently on this issue. I will try to patch all CJK decoders to only replace 1 invalid byte by U+FFFD (and not 2, 3 or 4 bytes) and try to write a test for each case (each byte sequence generating a different error). |
New changeset 3610841f7357 by Victor Stinner in branch '3.2': New changeset aa07c1237f4e by Victor Stinner in branch 'default': New changeset 685351d65592 by Victor Stinner in branch '2.7': |
New changeset 8572bf1b56ec by Victor Stinner in branch '3.2': New changeset c3dc94d53ef8 by Victor Stinner in branch 'default': New changeset 53912b58eee6 by Victor Stinner in branch '2.7': |
cjk_decode.patch:
Because I consider this issue as a bug, I would like to apply this patch to 2.7, 3.2 and 3.3. |
New changeset 16cbd84de848 by Victor Stinner in branch 'default': |
It is maybe a bug but it is also an important change on Python behaviour, so finally I prefer to only change (fix) Python 3.3. Thanks for reporting the bug zy (cdqzzy). Tell me if it now behaves as you expected. I'm closing this issue because the initial issue is now fixed. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: