More informative error handling when encoding and decoding #62028
Passing the wrong types to codecs can currently lead to rather confusing exceptions, like:
====================
>>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python3.2/encodings/base64_codec.py", line 20, in base64_decode
return (base64.decodebytes(input), len(input))
File "/usr/lib64/python3.2/base64.py", line 359, in decodebytes
raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not memoryview
====================
>>> codecs.decode("example", "utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python3.2/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
TypeError: 'str' does not support the buffer interface
====================
This situation could be improved by having the affected APIs use the exception chaining system to wrap these errors in a more informative exception that also displays information on the codec involved.
Note that UnicodeEncodeError and UnicodeDecodeError are not appropriate, as those are specific to text encoding operations, while these new wrappers will apply to arbitrary codecs, regardless of whether or not they use the unicode error handlers. Furthermore, for backwards compatibility with existing exception handling, it is probably necessary to limit ourselves to specific exception types and ensure that the wrapper exceptions are subclasses of those types.
These new wrappers would have __cause__ set to the exception raised by the codec, but emit a message more along the lines of the following:
==============
Wrapping TypeError and ValueError should cover most cases, which would mean four new exception types in the codecs module:
Raised by codecs.decode, bytes.decode and bytearray.decode:
Raised by codecs.encode, str.encode:
Instances of UnicodeError wouldn't be wrapped, since they already contain codec information. |
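A minimal sketch of the wrapper idea described above, covering only the TypeError case on decoding. The names `DecodeTypeError` and `decode_with_context` are hypothetical and are not part of the codecs module; the message wording is borrowed from later comments in this thread.

```python
import codecs


class DecodeTypeError(TypeError):
    """Hypothetical wrapper; subclassing TypeError keeps existing handlers working."""


def decode_with_context(data, encoding):
    """Decode via codecs.decode, naming the codec if the input type is rejected."""
    try:
        return codecs.decode(data, encoding)
    except TypeError as exc:
        # UnicodeError is a ValueError subclass, so it is never caught here
        # and escapes unwrapped, as proposed.
        message = "decoding with {!r} codec failed ({}: {})".format(
            encoding, type(exc).__name__, exc)
        # "from exc" stores the codec's own exception on __cause__.
        raise DecodeTypeError(message) from exc
```

Because the wrapper is a TypeError subclass, an existing `except TypeError:` block would still catch it, which is the backwards-compatibility constraint mentioned above.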
There may also be some specific improvement to be made to str.encode, bytes.decode and bytearray.decode in relation to the additional type checks they do to enforce the appropriate input and output types (see the bizarre "expected bytes, not memoryview" example) |
I tracked down the proximate cause of the weird exception in the bytes.decode case: the base64 module only accepts bytes and bytearray objects, instead of using memoryview to accept anything that supports the buffer API and provides a C-contiguous 8-bit view of the underlying data. Raised as bpo-17839. |
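As a rough illustration of the memoryview-based approach mentioned above, the following sketch normalizes any buffer-API object to a flat view of unsigned bytes. The helper name `as_byte_view` is hypothetical, and this is not the code from bpo-17839.

```python
import array


def as_byte_view(obj):
    """Return a flat unsigned-byte memoryview over any buffer-API object."""
    view = memoryview(obj)  # raises TypeError if obj has no buffer interface
    if not view.c_contiguous:
        raise ValueError("expected a C-contiguous buffer")
    # cast("B") reinterprets the underlying memory as 8-bit items.
    return view.cast("B")


# bytes, bytearray, memoryview and array.array inputs are all accepted.
print(bytes(as_byte_view(bytearray(b"ZXhhbXBsZQ=="))))
print(as_byte_view(array.array("H", [1, 2])).nbytes)
```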
Here's an example of the specific type errors raised by additional checks in the text-encoding specific methods. I believe the main improvement needed here is to mention the encoding name in the exception message:
>>> "example".encode("rot_13")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: encoder did not return a bytes object (type=str)
>>> b'BZh91AY&SY\xc1uvK\x00\x00\x01F\x80\x00\x10\x00"\x04\x00\x00\x10 \x000\xcd\x00\xc1\xa0P\xe2\xeeH\xa7\n\x12\x18.\xae\xc9`'.decode("bz2_codec")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: decoder did not return a str object (type=bytes) |
Ezio pointed out on IRC that the extra type checks in str.encode, bytes.decode and bytearray.decode should reference the appropriate codecs module function in addition to the codec in use. So if str.encode produces something other than bytes, it should reference codecs.encode, while the binary decoding methods should mention codecs.decode if they produce something other than str. |
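The output check described above can be sketched in pure Python. The real check lives in the C implementation of str.encode; the helper name `encode_to_bytes` is hypothetical and the message wording is only illustrative.

```python
import codecs


def encode_to_bytes(text, encoding):
    """Mimic the str.encode output-type check in pure Python (illustrative)."""
    result = codecs.encode(text, encoding)
    if not isinstance(result, bytes):
        # Name both the codec and the function that allows arbitrary outputs.
        raise TypeError(
            "{!r} encoder returned {!r} instead of 'bytes'; "
            "use codecs.encode to encode to arbitrary types".format(
                encoding, type(result).__name__))
    return result
```

With a text encoding such as "utf-8" the helper behaves like str.encode; with a str->str codec such as "rot_13" it raises a TypeError that names the codec and points the caller at codecs.encode.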
The attached patch changes the error message of str.encode/bytes.decode when the codec returns the wrong type:
>>> import codecs
>>> 'example'.encode('rot_13')
TypeError: encoder returned 'str' instead of 'bytes', use codecs.decode for str->str conversions
>>> codecs.encode('example', 'rot_13')
'rknzcyr'
>>>
>>> b'000102'.decode('hex_codec')
TypeError: decoder returned 'bytes' instead of 'str', use codecs.encode for bytes->bytes conversions
>>> codecs.decode(b'000102', 'hex_codec')
b'\x00\x01\x02'
This only solves part of the problem though, because individual codecs might raise other errors if the input type is wrong:
>>> 'example'.encode('hex_codec')
Traceback (most recent call last):
File "/home/wolf/dev/py/py3k/Lib/encodings/hex_codec.py", line 16, in hex_encode
return (binascii.b2a_hex(input), len(input))
TypeError: 'str' does not support the buffer interface |
To summarize:
The things that could go wrong are:
My patch only covers 3. The four new exceptions suggested by Nick in msg187704 would cover the first 2 cases. |
The attached proof of concept catches Type/ValueError in str.encode and raises another exception with a better message:
>>> 'example'.encode('hex_codec')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: invalid input type for hex_codec codec ('str' does not support the buffer interface)
(Note: the patch doesn't handle the exception chaining yet and probably leaks.)
If Nick's proposal in msg187704 is accepted, this should become a codecs.EncodeTypeError. The same should then be done for bytes.decode and for codecs.encode/decode. |
Updated patch. The results of this suggest to me that the input wrappers are likely infeasible at this point in time, but improving the errors for the wrong *output* type is entirely feasible. Since the main conversion we need to prompt is things like "binary_object.decode(binary_codec)" -> "codecs.decode(binary_object, binary_codec)", I suggest we limit the scope of this issue to that part of the problem.
>>> import codecs
>>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types
>>> "hello".encode("bz2_codec")
TypeError: 'str' does not support the buffer interface
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: invalid input type for 'bz2_codec' codec (TypeError: 'str' does not support the buffer interface)
>>> "hello".encode("rot_13")
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: invalid input type for 'rot_13' codec (TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types) |
Ah, came up with a relatively simple solution based on an internal helper function with an optional output flag:
>>> import codecs
>>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types
>>> "hello".encode("bz2_codec")
TypeError: 'str' does not support the buffer interface
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: invalid input type for 'bz2_codec' codec (TypeError: 'str' does not support the buffer interface)
>>> "hello".encode("rot_13")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types |
The other thing is that this patch doesn't wrap AttributeError. I'm OK with that, since I believe the only codec in the standard library that currently throws that for a bad input type is rot_13. |
It would be simpler to just drop these custom codecs (rot13, bz2, hex, etc.) instead of helping to use them :-) |
On 04.11.2013 14:30, STINNER Victor wrote:
-1 for the same reasons I keep repeating over and over and over again :-)
The codec system was designed to work obj->obj. Python 3 limits the types
+1 on having better error messages.
In the long run, we should add |
I think I figured out a better way to structure this that avoids the need for the output flag and is more easily expanded to whitelist additional exception types as safe to wrap. I'll try to come up with a new patch tonight. |
New and improved implementation attached that extracts the exception chaining to a helper function and calls it only when it is the call into the codecs machinery that failed (eliminating the need for the output flag, and covering decoding as well as encoding).
TypeError, AttributeError and ValueError are all wrapped with chained exceptions that mention the codec that failed. (Annoyingly, bz2_codec throws OSError instead of ValueError for bad input data, but wrapping OSError safely is a pain due to the extra state potentially carried on instances. So letting it escape unwrapped is the simpler and more conservative option at this point.)
>>> import codecs
>>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types
>>> b"hello".decode("rot_13")
AttributeError: 'memoryview' object has no attribute 'translate'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: decoding with 'rot_13' codec failed (AttributeError: 'memoryview' object has no attribute 'translate')
>>> "hello".encode("bz2_codec")
TypeError: 'str' does not support the buffer interface
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: encoding with 'bz2_codec' codec failed (TypeError: 'str' does not support the buffer interface)
>>> "hello".encode("rot_13")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types |
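A pure-Python sketch of the helper-based approach described in this comment. The actual patch is written in C; the names `WRAPPABLE`, `wrap_codec_error` and `encode_with_context` are hypothetical, and restricting the whitelist to exact types is an assumption made here for safety.

```python
import codecs

# Assumption: only these exact types are rebuilt. Subclasses may need extra
# constructor arguments (UnicodeError) or carry extra state (OSError).
WRAPPABLE = (TypeError, AttributeError, ValueError)


def wrap_codec_error(operation, encoding, exc):
    """Return a same-type exception naming the codec, or None if unsafe."""
    if type(exc) not in WRAPPABLE:
        return None
    return type(exc)("{} with {!r} codec failed ({}: {})".format(
        operation, encoding, type(exc).__name__, exc))


def encode_with_context(obj, encoding):
    """Call codecs.encode and chain a codec-naming exception on failure."""
    try:
        return codecs.encode(obj, encoding)
    except Exception as exc:
        wrapped = wrap_codec_error("encoding", encoding, exc)
        if wrapped is None:
            raise  # unknown exception types escape unchanged
        raise wrapped from exc
```

Calling `encode_with_context("hello", "hex_codec")` raises a TypeError whose message names the codec and whose `__cause__` is the codec's own exception. On an interpreter that already chains codec errors this way, the inner message will itself mention the codec.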
Checking the other binary<->binary and str<->str codecs with input type and value restrictions:
For bad value input, "uu_codec" is the only one that throws a normal ValueError. I couldn't figure out a way to get "quopri_codec" to complain about the input value, and the others throw a module-specific error:
As with the OSError that escapes from bz2_codec, I think the simplest and most conservative option is to not worry about those at this point. |
Updated patch adds systematic tests for the new error handling to test_codecs.TransformTests.
I also moved the codecs changes up to a "Codec handling improvements" section. My rationale for doing that is that this is actually a pretty significant usability enhancement and Python 3 codec model clarification for heavy users of binary codecs coming from Python 2, and because I also plan to follow up on this issue by bringing back the shorthand aliases for these codecs that were removed in bpo-10807 (thus closing bpo-7475). If bpo-15216 gets finished (changing stream encodings after creation), that would also be a substantial enhancement worth mentioning here. |
Updated patch (v5) with a more robust chaining mechanism provided as a private "_PyErr_TrySetFromCause" API. This version eliminates the previous whitelist in favour of checking directly for the ability to replace the exception with another instance of the same type without losing information. This version also has more direct tests of the exception wrapping behaviour as a dedicated test class. If I don't hear any objections in the next couple of days, I plan to commit this version. |
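The idea of checking whether an exception can be replaced without losing information can be approximated in Python. This is only a rough sketch: the real `_PyErr_TrySetFromCause` is written in C and its exact checks are not shown in this thread, and the name `looks_safe_to_replace` is hypothetical.

```python
def looks_safe_to_replace(exc):
    """Rough check: could type(exc)(new_message) stand in for exc?"""
    # More than one argument (e.g. OSError(errno, strerror)) would be lost.
    if len(exc.args) > 1:
        return False
    # A single non-string argument cannot be folded into a new message.
    if exc.args and not isinstance(exc.args[0], str):
        return False
    # Extra attributes set on the instance would not survive replacement.
    if getattr(exc, "__dict__", None):
        return False
    return True
```

Under these assumptions a plain `TypeError("bad input")` qualifies, while `OSError(2, "No such file")` or an exception with attributes assigned after creation does not, so those would be left to propagate unwrapped.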
On 10.11.2013 14:03, Nick Coghlan wrote:
This doesn't look right:
diff -r 1ee45eb6aab9 Include/pyerrors.h
BTW: Why don't we make that API a public one? It could be useful
In the error messages, I'd use "codecs.encode()" and "codecs.decode()" |
On 10 November 2013 23:21, Marc-Andre Lemburg <report@bugs.python.org> wrote:
The signature? That API doesn't currently let you change the exception
Because I'm not sure it's a good idea in general and hence am wary of
bpo-18861 also makes me wonder if there's an underlying structural |
On 10 November 2013 23:21, Marc-Andre Lemburg <report@bugs.python.org> wrote:
After sending my previous reply, I realised you may have been
Forgot to reply to this part - I like it, will switch it over before committing. |
On 10.11.2013 15:39, Nick Coghlan wrote:
Sorry about the false warning. After looking at those lines
Also note that it's not clear whether the "ASCII"
That's a separate ticket, though.
Thanks. |
Patch for the final version that I'm about to commit.
|
New changeset 854a2cea31b9 by Nick Coghlan in branch 'default': |
New changeset 99ba1772c469 by Christian Heimes in branch 'default': |
New changeset 26121ae22016 by Christian Heimes in branch 'default': |
Coverity has found two issues in your patch. I have fixed them both. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.