Many incremental codecs don’t handle fragmented data #64331
Many of the incremental codecs do not handle fragmented data very well. In the past I think I was interested in using the base-64 and quoted-printable codecs, and playing with other codecs today reveals many more issues. A lot of the issues reflect missing functionality, so maybe the simplest solution would be to document the codecs that don't work.

Incremental decoding issues:

```python
>>> str().join(codecs.iterdecode(iter((b"\\", b"u2013")), "unicode-escape"))
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 0: \ at end of string
# Same deal for raw-unicode-escape.

>>> bytes().join(codecs.iterdecode(iter((b"3", b"3")), "hex-codec"))
binascii.Error: Odd-length string

>>> bytes().join(codecs.iterdecode(iter((b"A", b"A==")), "base64-codec"))
binascii.Error: Incorrect padding

>>> bytes().join(codecs.iterdecode(iter((b"=", b"3D")), "quopri-codec"))
b'3D'  # Should return b"="

>>> codecs.getincrementaldecoder("uu-codec")().decode(b"begin ")
ValueError: Truncated input data
```

Incremental encoding issues:

```python
>>> e = codecs.getincrementalencoder("base64-codec")()
>>> codecs.decode(e.encode(b"1") + e.encode(b"2", final=True), "base64-codec")
b'1'  # Should be b"12"

>>> e = codecs.getincrementalencoder("quopri-codec")()
>>> e.encode(b"1" * 50) + e.encode(b"2" * 50, final=True)
b'1111111111111111111111111111111111111111111111111122222222222222222222222222222222222222222222222222'
# I suspect the line should have been split in two

>>> e = codecs.getincrementalencoder("uu-codec")()
>>> codecs.decode(e.encode(b"1") + e.encode(b"2", final=True), "uu-codec")
b'1'  # Should be b"12"
```

I also noticed iterdecode() does not work with "uu-codec":

```python
>>> bytes().join(codecs.iterdecode(iter((b"begin 666 <data>\n \nend\n",)), "uu-codec"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/codecs.py", line 1032, in iterdecode
    output = decoder.decode(b"", True)
  File "/usr/lib/python3.3/encodings/uu_codec.py", line 80, in decode
    return uu_decode(input, self.errors)[0]
  File "/usr/lib/python3.3/encodings/uu_codec.py", line 45, in uu_decode
    raise ValueError('Missing "begin" line in input data')
ValueError: Missing "begin" line in input data
```

And iterencode() does not work with any of the byte encoders, because it does not know what kind of empty string to pass to IncrementalEncoder.encode(final=True):

```python
>>> bytes().join(codecs.iterencode(iter(()), "base64-codec"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/codecs.py", line 1014, in iterencode
    output = encoder.encode("", True)
  File "/usr/lib/python3.3/encodings/base64_codec.py", line 31, in encode
    return base64.encodebytes(input)
  File "/usr/lib/python3.3/base64.py", line 343, in encodebytes
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str
```

Finally, incremental UTF-7 encoding is suboptimal, and the decoder seems to buffer unlimited data, both defeating the purpose of using an incremental codec:

```python
>>> bytes().join(codecs.iterencode("\xA9" * 2, "utf-7"))
b'+AKk-+AKk-'  # b"+AKkAqQ-" would be better

>>> d = codecs.getincrementaldecoder("utf-7")()
>>> d.decode(b"+")
''
>>> any(d.decode(b"AAAAAAAA" * 100000) for _ in range(10))
False  # No data returned: everything must be buffered
>>> d.decode(b"-") == "\x00" * 3000000
True  # Returned all buffered data in one go
```
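For contrast, a codec whose incremental decoder really is stateful handles this kind of fragmentation cleanly. This quick check (mine, not part of the original report) splits the three-byte UTF-8 encoding of U+2013 across two decode() calls:

```python
import codecs

# U+2013 EN DASH encodes to the three-byte UTF-8 sequence b"\xe2\x80\x93".
dec = codecs.getincrementaldecoder("utf-8")()
first = dec.decode(b"\xe2\x80")           # incomplete sequence is buffered
second = dec.decode(b"\x93", final=True)  # buffered bytes complete the character
print(repr(first))   # ''
print(repr(second))  # '\u2013'
```

This is the behaviour the report is asking for from the bytes-to-bytes codecs: partial input is held over rather than raising an error or producing wrong output.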
---

The best solution IMHO would be to implement real incremental codecs for all of those.

Maybe iterencode() with an empty iterator should never call encode()? (But IMHO it would be better to document that iterencode()/iterdecode() should only be used with "real" codecs.)

Note that the comment before PyUnicode_DecodeUTF7Stateful() in unicodeobject.c reads:

```c
/* The decoder.  The only state we preserve is our read position,
```

Changing that would have to introduce a state object that the codec updates and from which it can be restarted. Also, the encoder does not buffer anything. To implement the suggested behaviour, the encoder might have to buffer unlimited data.
---

I think calling iterencode() with an empty iterator is a side issue. Even with a non-empty iterator, it tries to encode an empty *text* string to finalise the encoder:

```python
>>> bytes().join(codecs.iterencode(iter((b"data",)), "base64-codec"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/codecs.py", line 1014, in iterencode
    output = encoder.encode("", True)
  File "/usr/lib/python3.3/encodings/base64_codec.py", line 31, in encode
    return base64.encodebytes(input)
  File "/usr/lib/python3.3/base64.py", line 343, in encodebytes
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str
```

Similarly, iterdecode(encoding="rot-13") doesn't work. I agree it would be good to document that iterencode() is limited to text encoders and iterdecode() is limited to byte decoders.
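One way around the finalisation type error is to drive the incremental encoder directly and pass an empty *bytes* string at the end. This sketch (mine, not from the report; the helper name iterencode_bytes is made up) shows the idea:

```python
import codecs

def iterencode_bytes(chunks, encoding, errors="strict"):
    """Sketch of iterencode() for bytes-to-bytes codecs: finalise with b""."""
    encoder = codecs.getincrementalencoder(encoding)(errors)
    for chunk in chunks:
        output = encoder.encode(chunk)
        if output:
            yield output
    # Unlike codecs.iterencode(), finalise with an empty *bytes* object.
    output = encoder.encode(b"", final=True)
    if output:
        yield output

result = b"".join(iterencode_bytes([b"data"], "base64-codec"))
print(result)  # b'ZGF0YQ==\n'
```

Note this only sidesteps the str/bytes mismatch; the per-chunk encoding problem shown in the first comment (each chunk encoded as an independent base64 document) is still there.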
---

Stream reader interfaces suffer the same problem. Test case:

```python
codecs.getreader("unicode-escape")(BytesIO(br"\u2013")).read(1)
```

Even though the "zlib" incremental decoder is okay, its stream reader still suffers this problem. I was going to check for zip-bomb behaviour when you call read() with a limited number of input "characters", but got this instead:

```python
>>> codecs.getreader("zlib-codec")(bomb).read(1, 1)
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
```

Similar problems with stream writers, for instance:

```python
>>> w = codecs.getwriter("base64-codec")(BytesIO())
>>> w.write(b"1"); w.write(b"2"); w.reset(); w.getvalue()
b'MQ==\nMg==\n'  # Should be b"MTI=\n"
```

I also noticed StreamWriter.writelines() assumes it is encoding text, and does not work at least with "base64-codec":

```python
>>> codecs.getwriter("base64-codec")(BytesIO()).writelines((b"1", b"2"))
TypeError: sequence item 0: expected str instance, bytes found
```
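The claim above that the zlib *incremental decoder* itself is fine can be checked directly. This snippet (mine, not from the report) splits a compressed stream at an arbitrary byte boundary; the underlying zlib.decompressobj keeps the state between calls:

```python
import codecs
import zlib

payload = b"hello world" * 100
data = zlib.compress(payload)

dec = codecs.getincrementaldecoder("zlib-codec")()
# Split mid-stream: the decompressor carries its state across the calls.
out = dec.decode(data[:7]) + dec.decode(data[7:], final=True)
print(out == payload)  # True
```

This is why the stream-reader failure is a wrapper problem (the base StreamReader logic) rather than a codec problem in this particular case.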
---

The "unicode-escape" and "utf-7" cases affect the more widely used TextIOWrapper interface:

```python
>>> TextIOWrapper(BytesIO(br"\u2013" * 2000), "unicode-escape").read(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/encodings/unicode_escape.py", line 26, in decode
    return codecs.unicode_escape_decode(input, self.errors)[0]
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 8190-8191: truncated \uXXXX escape

>>> w = TextIOWrapper(BytesIO(), "utf-7")
>>> w.writelines("\xA9\xA9")  # Write one character at a time
>>> w.detach().getvalue()
b'+AKk-+AKk-'

>>> r = TextIOWrapper(BytesIO(b"+" + b"AAAAAAAA" * 100000 + b"-"), "utf-7")
>>> r.read(1)  # Short delay as all 800 kB are decoded to read one character
'\x00'
>>> r.buffer.tell()
800002
```

For UTF-7 decoding to work optimally, I think the amount of data buffering necessary would be limited to only a few bytes.
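The suboptimal UTF-7 encoding is easy to see next to the one-shot codec. This comparison (mine, not from the report) uses the same U+00A9 characters as the transcripts above:

```python
import codecs

# One-shot encoding merges both characters into a single base64 run.
print(codecs.encode("\xa9\xa9", "utf-7"))  # b'+AKkAqQ-'

# The incremental encoder closes the shift sequence on every call, so
# feeding one character at a time yields the longer form reported above.
enc = codecs.getincrementalencoder("utf-7")()
out = enc.encode("\xa9") + enc.encode("\xa9", final=True)
print(out)  # b'+AKk-+AKk-'
```

A genuinely stateful encoder could keep the base64 run open between calls and emit the compact form, at the cost of buffering a few bytes of pending output.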
---

I opened bpo-23231 about fixing iterencode() and iterdecode() in the general case. I added a patch to bpo-13881 to fix StreamWriter for zlib and bz2, and to fix StreamWriter.writelines() in general. I am adding a patch here to clarify the StreamReader API and fix the StreamReader for the zlib-codec.

Plan of other things to do:

* bz2 StreamReader: should be implemented similarly to the zlib patch above, after bpo-15955 is resolved and we have a max_length parameter to use. Or it could be based on BZ2File now.
* hex decoder: shouldn't be too hard to hack a stateful IncrementalDecoder that saves the leftover digit if given an odd number of digits.
* Create a generic codecs._IncrementalStreamReader class that uses an IncrementalDecoder and buffers unread decoded data, similar to my _IncrementalStreamWriter for bpo-13881.
* base64 encoder: IncrementalEncoder could encode in base64.MAXBINSIZE chunks.
* base64 decoder: IncrementalDecoder could strip non-alphabet characters using regular expressions and decode in multiples of four characters.
* quopri encoder: would require a new implementation or major refactoring of the quopri module.
* quopri decoder: check for incomplete trailing escape codes (=, =<single hex digit>, =\r) and newlines (\r).
* uu encoder: write the header and trailer via the uu module; encode using b2a_uu().
* uu decoder: factor out header parsing from the uu module; buffer and decode line by line based on the encoded length.
* unicode-escape, raw-unicode-escape: stateful decoding would probably require a new -Stateful() function at the C level, though it might be easy to build from the existing implementation. I suggest documenting that stateful decoding is not supported for the time being.
* utf-7: as Walter said, a proper stateful codec is not supported by the C API, despite PyUnicode_DecodeUTF7Stateful(); fixing this would probably require significant changes. I suggest documenting a warning for stateful mode (including with TextIOWrapper) about suboptimal encoding and unlimited data buffering for the time being.
* punycode, unicode_internal: according to test_codecs.py, these also don't work in stateful mode. Not sure of the details though.
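The hex-decoder item in the plan is small enough to sketch. This is a rough illustration of the idea only (my code, not taken from any attached patch): carry the leftover digit between calls, and only complain about an odd-length input when final=True:

```python
import binascii
import codecs

class HexIncrementalDecoder(codecs.IncrementalDecoder):
    """Sketch: buffer a leftover hex digit when fed an odd number of digits."""

    def __init__(self, errors="strict"):
        super().__init__(errors)
        self.pending = b""

    def decode(self, input, final=False):
        data = self.pending + input
        self.pending = b""
        if len(data) % 2:
            if final:
                raise binascii.Error("Odd-length string")
            # Hold the dangling digit until the next call supplies its pair.
            data, self.pending = data[:-1], data[-1:]
        return binascii.unhexlify(data)

    def reset(self):
        self.pending = b""

d = HexIncrementalDecoder()
out = d.decode(b"3") + d.decode(b"3", final=True)
print(out)  # b'3' -- the two halves of 0x33 joined across calls
```

With the stock hex-codec, the same split input raises binascii.Error ("Odd-length string"), as shown in the first comment.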
---

On 15.01.2015 23:46, Martin Panter wrote: […]

Martin, I must be missing some context: what's your master plan with […]? They are very hard to follow, and you're making design changes which […].

Regarding the patch: the doc patch seems to just change the ordering of sentences and paragraphs. The "strict" error change is not correct. Most codecs raise a UnicodeError […]
This addition is wrong as well:

```diff
 The *stream* argument must be a file-like object open for reading
-text or binary data, as appropriate for the specific codec.
+text or binary data, as appropriate for the specific codec. This stream is
+assumed to have ended as soon as a read from it returns no data.
```

Where did you get this idea from?
---

My "master plan" is basically to make most of the bytes-to-bytes codecs work as documented in the incremental (stateful) modes. I'm less interested in fixing the text codecs, and the quopri and uu codecs might be too hard, so I was going to propose some documentation warnings for those. If you have a suggestion on how to go about this better, let me know.

With my doc change to StreamReader, I wanted to document the different modes that I saw in the base codecs.StreamReader.read() implementation: […]

Previously the case of read(-1, chars) was ambiguous. Also, I did not find the description "an approximate maximum number of decoded bytes" very helpful, considering more than this maximum was read if necessary to get enough *chars*.

Regarding the end-of-stream behaviour, I made an assumption, but I now realize it was wrong. Experimenting with the UTF-8 codec shows that its StreamReader.read() keeps returning "" when the underlying stream returns no data. But if it was in the middle of a multi-byte sequence, no "end of data" error is raised, and the multi-byte sequence can be completed if the underlying stream later returns more data. I think the lack of end-of-data checking should be documented.

The different cases of ValueError and UnicodeError that you describe make sense. I think the various references to ValueError and UnicodeError should be updated (or replaced with pointers) to match.
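The end-of-stream behaviour described above can be reproduced with a small fake stream. ChunkedStream here is a throwaway helper of mine, not an API; it hands out one queued chunk per read() call, including an empty chunk to simulate the stream temporarily running dry mid-sequence:

```python
import codecs

class ChunkedStream:
    """Throwaway helper: return one queued chunk per read() call."""

    def __init__(self, chunks):
        self.chunks = list(chunks)

    def read(self, size=-1):
        # Ignores size for simplicity; StreamReader copes with that.
        return self.chunks.pop(0) if self.chunks else b""

# b"\xe2\x80\x93" is UTF-8 for U+2013; split it around a "no data" gap.
reader = codecs.getreader("utf-8")(ChunkedStream([b"\xe2\x80", b"", b"\x93"]))
first = reader.read(1)   # stream ran dry mid-sequence: no error, just ""
second = reader.read(1)  # data arrived later; the character completes
print(repr(first), repr(second))
```

No "end of data" error is ever raised for the dangling two bytes; the reader silently returns "" and then finishes the character once more bytes appear, which is exactly the undocumented behaviour being discussed.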
---

There is a flaw with inheriting the readline() method in my patch, and I have decided to give up fixing the StreamReader classes. I did update the documentation in my copy of the patch based on Marc-Andre Lemburg's feedback, if anyone is interested in it; otherwise I am going to focus on some of the incremental codecs now.
---

Here is a new patch which fixes the bytes-to-bytes incremental codecs. It depends on my patches for these other issues being applied first: […]

In addition, without the fix for bpo-20121 (quoted-printable soft line breaking), the patch will apply, but there will be test failures.

Summary of the changes in this patch: […]

This patch is actually generated from a series of patches folded together. If anyone is interested, I could try pushing the original series to a Bitbucket repository or something.
---

Split off bpo-27799 for the base-64 incremental decoder.