Segfault in UTF-7 incremental decoder #64737
The UTF-7 incremental decoder can crash in a debug build when it decodes an unfinished base-64 section. In a non-debug build it just produces an inconsistent unicode string. Minimal examples:
$ ./python -c "import codecs; codecs.utf_7_decode(b'a+AIA', 'strict')"
python: Objects/unicodeobject.c:403: _PyUnicode_CheckConsistency: Assertion `maxchar >= 128' failed.
Aborted (core dumped)
$ ./python -c "import codecs; codecs.utf_7_decode(b'+AIA-+AQA', 'strict')"
python: Objects/unicodeobject.c:410: _PyUnicode_CheckConsistency: Assertion `maxchar >= 0x100' failed.
Aborted (core dumped)
$ ./python -c "import codecs; codecs.utf_7_decode(b'+AQA-+2ADcAA', 'strict')"
python: Objects/unicodeobject.c:414: _PyUnicode_CheckConsistency: Assertion `maxchar >= 0x10000' failed.
Aborted (core dumped)

This happens because _PyUnicodeWriter reverts its position back to before the unfinished base-64 section, but its buffer was already widened for the characters in that section:

if (inShift) {
    writer.pos = shiftOutStart; /* back off output */
    *consumed = startinpos;
}

As a result, _PyUnicodeWriter generates a string with a kind larger than needed for the decoded characters. This bug causes a lot of crashes on buildbots. E.g: |
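For illustration, here is a minimal sketch of the same situation driven through the public incremental decoder API rather than `codecs.utf_7_decode` directly. The helper name `decode_in_chunks` and the chunk boundaries are assumptions made up for this example. On an interpreter that has the fix, the unfinished `+AIA` section should be held back until the closing `-` arrives; on an affected debug build the first call would be expected to hit the assertion shown above.

```python
import codecs


def decode_in_chunks(chunks, encoding="utf-7"):
    """Feed byte chunks to an incremental decoder, one at a time.

    Hypothetical helper for illustration only; it is not part of the
    patch discussed in this issue.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    pieces = []
    for i, chunk in enumerate(chunks):
        # Only the last chunk is marked final, so earlier chunks may end
        # in the middle of a base-64 section.
        final = i == len(chunks) - 1
        pieces.append(decoder.decode(chunk, final))
    return pieces


# The first chunk ends inside an unfinished base-64 section ("+AIA").
pieces = decode_in_chunks([b"a+AIA", b"-"])
incremental = "".join(pieces)
one_shot = b"a+AIA-".decode("utf-7")
print(pieces, incremental == one_shot)
```

Comparing against the one-shot decode avoids relying on how the decoder splits its output between calls.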
Note that I added a skip for test_readline in bpo-20542 before realising this bug had already been filed. |
Here are patches for 3.3 and 3.4 (this is a 3.3+ only bug). |
Patches look good to me. |
Maybe you can add a new truncate operation to the unicode writer? As you want. The patch looks good to me. |
New changeset 8d40d9cee409 by Serhiy Storchaka in branch '3.3':
New changeset e988661e458c by Serhiy Storchaka in branch 'default': |
Thanks Nick and Victor for your reviews. Since there is only one place where truncating the unicode writer is needed, I don't think this is worth a special function. |
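To make the "truncate operation" idea concrete, here is a toy Python model of a writer that tracks the widest character it has been sized for. Everything in it (`ToyUnicodeWriter`, `rewind`, `truncate`, `is_consistent`) is hypothetical and only mimics the behaviour described in this issue; it is not the C `_PyUnicodeWriter` API and not the fix that was committed.

```python
class ToyUnicodeWriter:
    """Hypothetical model of a writer that remembers the widest char seen."""

    def __init__(self):
        self.chars = []    # characters written so far
        self.maxchar = 0   # widest code point the buffer was sized for

    def write(self, ch):
        self.chars.append(ch)
        self.maxchar = max(self.maxchar, ord(ch))

    def rewind(self, pos):
        # Models the problematic back-off: the position moves back,
        # but maxchar stays widened.
        del self.chars[pos:]

    def truncate(self, pos):
        # Models a truncate that also recomputes maxchar from what is left.
        del self.chars[pos:]
        self.maxchar = max(map(ord, self.chars), default=0)


def is_consistent(writer):
    """True when maxchar is exactly what the remaining characters need."""
    return writer.maxchar == max(map(ord, writer.chars), default=0)


buggy = ToyUnicodeWriter()
buggy.write("a")
shift_out_start = len(buggy.chars)
buggy.write("\x80")             # character from the unfinished section
buggy.rewind(shift_out_start)   # back off output, maxchar left at 0x80

fixed = ToyUnicodeWriter()
fixed.write("a")
fixed.write("\x80")
fixed.truncate(shift_out_start)

print(is_consistent(buggy), is_consistent(fixed))
```

In this model the rewound writer ends up holding only ASCII while still claiming a wider range, which is the kind of mismatch `_PyUnicode_CheckConsistency` asserts against.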
This checkin appears to be causing a regression in the Windows buildbots.
http://buildbot.python.org/all/builders/AMD64%20Windows7%20SP1%203.x/builds/4040

test_streamreaderwriter (test.test_codecs.WithStmtTest) ... test test_codecs failed
======================================================================
Traceback (most recent call last):
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_codecs.py", line 157, in test_readline
self.assertEqual(readalllines("".join(vw), True), "|".join(vw))
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_codecs.py", line 136, in readalllines
line = reader.readline(size=size, keepends=keepends)
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\codecs.py", line 548, in readline
data = self.read(readsize, firstline=True)
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\codecs.py", line 494, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'CP_UTF8' codec can't decode bytes in position 0--1: No mapping for the Unicode character exists in the target code page.

Ran 206 tests in 5.912s |
And to be clear: I'm currently waiting on this before tagging 3.4rc1. If someone who understands the issue could fix this soon, I would appreciate it. |
Marking as closed and opening a new issue as per Serhiy's suggestion. |