Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode(exception) and str(exception) should return the same message on Py2.6 #50358

Closed
ezio-melotti opened this issue May 26, 2009 · 25 comments
Closed
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-bug An unexpected behavior, bug, or error

Comments

@ezio-melotti
Copy link
Member

BPO 6108
Nosy @warsaw, @ncoghlan, @pitrou, @rbtcollins, @ezio-melotti
Files
  • issue6108_testcase.diff: Testcase that checks if str() and unicode() return the same message
  • output_on_py26.txt: Output of unicode_exceptions.py on Python 2.6
  • unicode_exceptions.py: Example script that shows the str() and unicode() of some exceptions
  • issue6108.diff: Proof of concept that fixes UnicodeDecodeException
  • issue6108-2.patch: Patch that makes all the tests in issue6108_testcase pass (except for KeyError).
  • issue6108-3.patch: Patch that makes all the tests in issue6108_testcase pass.
  • issue6108-4.patch: Patch + unittests against trunk
  • issue6108-5.patch: Patch + unittests against trunk
  • issue6108-6.patch: Patch + unittests against trunk
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2009-12-24.23:03:08.792>
    created_at = <Date 2009-05-26.04:53:13.017>
    labels = ['interpreter-core', 'type-bug', 'release-blocker']
    title = 'unicode(exception) and str(exception) should return the same message on Py2.6'
    updated_at = <Date 2009-12-24.23:03:08.789>
    user = 'https://github.com/ezio-melotti'

    bugs.python.org fields:

    activity = <Date 2009-12-24.23:03:08.789>
    actor = 'ezio.melotti'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2009-12-24.23:03:08.792>
    closer = 'ezio.melotti'
    components = ['Interpreter Core']
    creation = <Date 2009-05-26.04:53:13.017>
    creator = 'ezio.melotti'
    dependencies = []
    files = ['15314', '15315', '15316', '15529', '15534', '15535', '15632', '15650', '15651']
    hgrepos = []
    issue_num = 6108
    keywords = ['patch']
    message_count = 25.0
    messages = ['88330', '92561', '92568', '93313', '95158', '96281', '96297', '96313', '96314', '96315', '96318', '96319', '96321', '96322', '96323', '96324', '96325', '96331', '96717', '96742', '96755', '96756', '96761', '96762', '96872']
    nosy_count = 7.0
    nosy_names = ['barry', 'exarkun', 'ncoghlan', 'pitrou', 'rbcollins', 'ezio.melotti', 'cvrebert']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue6108'
    versions = ['Python 2.6', 'Python 2.7']

    @ezio-melotti
    Copy link
    Member Author

    On Python 2.5 str(exception) and unicode(exception) return the same text:
    >>> err
    UnicodeDecodeError('ascii', '\xc3\xa0', 0, 1, 'ordinal not in range(128)')
    >>> str(err)
    "'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in
    range(128)"
    >>> unicode(err)
    u"'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in
    range(128)"
    
    On Python 2.6 unicode(exception) returns unicode(exception.args):
    >>> err
    UnicodeDecodeError('ascii', '\xc3\xa0', 0, 1, 'ordinal not in range(128)')
    >>> str(err)
    "'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in
    range(128)"
    >>> unicode(err)
    u"('ascii', '\\xc3\\xa0', 0, 1, 'ordinal not in range(128)')"

    This seems to affect only exceptions with more than 1 arg (e.g.
    UnicodeErrors and SyntaxErrors). KeyError is also different (the '' are
    missing with unicode()).

    Note that when an exception like ValueError() is instantiated with more
    than 1 arg even str() returns str(exception.args) on both Py2.5 and Py2.6.

    Probably __str__() checks the number of args before returning a specific
    message and if it doesn't match it returns str(self.args). __unicode__()
    instead seems to always return unicode(self.args) on Py2.6.

    Attached there's a script that prints the repr(), str() and unicode() of
    some exceptions, run it on Py2.5 and Py2.6 to see the differences.

    @ezio-melotti ezio-melotti added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels May 26, 2009
    @exarkun
    Copy link
    Mannequin

    exarkun mannequin commented Sep 13, 2009

    Perhaps also worth noting is that in Python 2.4 as well, str(exception)
    and unicode(exception) returned the same thing. Unlike some other
    exception changes in 2.6, this doesn't seem to be a return to older
    behavior, but just a new behavior. (Or maybe no one cares about that;
    just wanted to point it out, though.)

    @pitrou
    Copy link
    Member

    pitrou commented Sep 13, 2009

    Looks like a potentially annoying bug to me.

    @warsaw
    Copy link
    Member

    warsaw commented Sep 29, 2009

    Since we do not yet have a patch for this, I'm knocking it off the list
    for 2.6.3. It seems like an annoying loss of compatibility, but do we
    have any reports of it breaking real-world code?

    @ezio-melotti
    Copy link
    Member Author

    I added the output of unicode_exceptions.py on Py2.6 and a testcase
    (against the trunk) that fails for 5 different exceptions, including the
    IOError mentioned in bpo-6890 (also added to unicode_exceptions.py).
    The problem has been introduced by bpo-2517.

    @ezio-melotti ezio-melotti changed the title unicode(exception) behaves differently on Py2.6 when len(exception.args) > 1 unicode(exception) and str(exception) should return the same message on Py2.6 Nov 12, 2009
    @ezio-melotti
    Copy link
    Member Author

    In r64791, BaseException gained a new __unicode__ method that does the
    equivalent of the following things:

    • if the number of args is 0, returns u''
    • if it's 1 returns unicode(self.args[0])
    • if it's >1 returns unicode(self.args)

    Before this, BaseException only had a __str__ method, so unicode(e)
    (with e being an exception derived from BaseException) called:

    • e.__str__().decode(), if e didn't implement __unicode__
    • e.__unicode__(), if e implemented an __unicode__ method

    Now, all the derived exceptions that don't implement their own
    __unicode__ method inherit the "generic" __unicode__ of BaseException,
    and they use that instead of falling back on __str__.
    This is generally ok if the numbers of args is 0 or 1, but if there are
    more args, there's usually some specific formatting in the __str__
    method that is lost when BaseException.__unicode__ returns
    unicode(self.args).

    Possible solutions:

    1. implement a __unicode__ method that does the equivalent of calling
      unicode(str(self)) (i.e. converting to unicode the message returned by
      __str__ instead of converting self.args);
    2. implement a __unicode__ method that formats the message as __str__
      for all the exceptions with a __str__ that does some specific formatting;

    Attached there's a proof of concept (bpo-6108.diff) where I tried to
    implement the first method with UnicodeDecodeError. This method can be
    used as long as __str__ always returns only ascii.

    The patch seems to work fine for me (note: this is my first attempt to
    use the C API). If the approach is correct I can do the same for the
    other exceptions too and submit a proper patch.

    @ezio-melotti ezio-melotti self-assigned this Dec 12, 2009
    @pitrou
    Copy link
    Member

    pitrou commented Dec 12, 2009

    In r64791, BaseException gained a new __unicode__ method that does the
    equivalent of the following things:

    It remains to be seen why that behaviour was chosen. Apparently Nick
    implemented it.
    IMO __unicode__ should have the same behaviour as __str__. There's no
    reason to implement two different formatting algorithms.

    @ncoghlan
    Copy link
    Contributor

    Following this down the rabbit hole a little further: Issue bpo-2517 (the
    origin of my checkin) was just a restoration of the __unicode__ slot
    implementation that had been ripped out in r51837 due to Issue bpo-1551432.

    At the time of the r64791 checkin, BaseException_str and
    BaseException_unicode were identical aside from the type of object
    returned (checking SVN head shows they're actually still identical).

    However, it looks like several exceptions with __str__ overrides (i.e.
    Unicode[Encode/Decode/Translate]Error_str, EnvironmentError_str,
    WindowsError_str. SyntaxError_str, KeyError_str) are missing
    corresponding __unicode__ overrides, so invoking unicode() on them falls
    back to the BaseException_unicode implementation instead of using the
    custom formatting behaviour of the subclass.

    @ezio-melotti
    Copy link
    Member Author

    IMO __unicode__ should have the same behaviour as __str__. There's no
    reason to implement two different formatting algorithms.

    If BaseException has both the methods they have to be both overridden by
    derived exceptions in order to have the same behaviour. The simplest way
    to do it is to convert the string returned by __str__ to unicode, as I
    did in bpo-6108.diff.
    If you have better suggestions let me know.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 13, 2009

    Well the obvious problem with this approach is that it won't work if
    __str__() returns a non-ascii string. The only working solution would be
    to replicate the functioning of __str__() in each __unicode__()
    implementation.

    @ncoghlan
    Copy link
    Contributor

    As Antoine said, there's a reason BaseException now implements both
    __str__ and __unicode__ and doesn't implement the latter in terms of the
    former - it's the only way to consistently support Unicode arguments
    that can't be encoded to an 8-bit ASCII string:

    >>> str(Exception(u"\xc3\xa0"))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position
    0-1: ordinal not in range(128)
    >>> unicode(Exception(u"\xc3\xa0"))
    u'\xc3\xa0'

    For some of the exception subclasses that will always return ASCII (e.g.
    KeyError, which calls repr() on its arguments) then defining __unicode__
    in terms of __str__ as Ezio suggests will work.

    For others (as happened with BaseException itself), the __unicode__
    method will need to be a reimplementation that avoids trying to encode
    potentially non-ASCII characters into an 8-bit ASCII string.

    @ezio-melotti
    Copy link
    Member Author

    What you said is only a special case, and I agree that the solution
    introduced with r64791 is correct for that. However, that fix has the
    side effect of breaking the code in other situations.

    To summarize the possible cases and the behaviours I prepared the
    following list (odd numbers -> BaseException; even numbers -> any
    exception with overridden __str__ and no __unicode__.):

    1. 0 args, e = Exception():
      py2.5 : str(e) -> ''; unicode(e) -> u''
      py2.6 : str(e) -> ''; unicode(e) -> u''
      desired: str(e) -> ''; unicode(e) -> u''
      Note: this is OK

    2. 0 args, e = MyException(), with overridden __str__:
      py2.5 : str(e) -> 'ascii' or error; unicode(e) -> u'ascii' or error;
      py2.6 : str(e) -> 'ascii' or error; unicode(e) -> u''
      desired: str(e) -> 'ascii' or error; unicode(e) -> u'ascii' or error;
      Note: py2.5 behaviour is better: if __str__ returns an ascii string
      (including ''), unicode(e) should return the same string decoded, if
      __str__ returns a non-ascii string, both should raise an error.

    3a) 1 str arg, e = Exception('foo'):
    py2.5 : str(e) -> 'foo'; unicode(e) -> u'foo'
    py2.6 : str(e) -> 'foo'; unicode(e) -> u'foo'
    desired: str(e) -> 'foo'; unicode(e) -> u'foo'
    Note: this is OK

    3b) 1 non-ascii unicode arg, e = Exception(u'föö'):
    py2.5 : str(e) -> error; unicode(e) -> error
    py2.6 : str(e) -> error; unicode(e) -> u'föö'
    desired: str(e) -> error; unicode(e) -> u'föö'
    Note: py2.6 behaviour is better: unicode(e) should return u'föö'

    1. 1 unicode arg, e = MyException(u'föö'), with overridden __str__:
      py2.5 : str(e) -> error or 'ascii'; unicode(e) -> error or u'ascii'
      py2.6 : str(e) -> error or 'ascii'; unicode(e) -> u'föö'
      desired: str(e) -> error or 'ascii'; unicode(e) -> error or u'ascii'
      Note: py2.5 behaviour is better: if __str__ returns an ascii string
      str(e) should work, otherwise it should raise an error. unicode(e)
      should return the ascii string decoded or an error, not the arg.

    2. >1 args of any type, e = Exception('foo', u'föö', 5):
      py2.5 : str(e) -> "('foo', u'f\\xf6\\xf6', 5)";
      unicode(e) -> u"('foo', u'f\\xf6\\xf6', 5)";
      py2.6 : str(e) -> "('foo', u'f\\xf6\\xf6', 5)";
      unicode(e) -> u"('foo', u'f\\xf6\\xf6', 5)";
      desired: str(e) -> "('foo', u'f\\xf6\\xf6', 5)";
      unicode(e) -> u"('foo', u'f\\xf6\\xf6', 5)";
      Note: this is OK

    3. >1 args of any type, e = MyException('foo', u'föö', 5), with
      overridden __str__:
      py2.5 : str(e) -> 'ascii' or error; unicode(e) -> u'ascii' or error;
      py2.6 : str(e) -> 'ascii' or error; unicode(e) -> u"('foo',
      u'f\\xf6\\xf6', 5)";
      desired: str(e) -> 'ascii' or error; unicode(e) -> u'ascii' or error;
      Note: py2.5 behaviour is better: if __str__ returns an ascii string,
      unicode(e) should return the same string decoded, if __str__ returns a
      non-ascii string, both should raise an error.

    As you can see, your example corresponds just to the case 3b) (now
    fixed), but cases 2, 4, 6 are now broken.

    Making this list allowed me to come out with a new patch, that seems to
    solve all the problems (2, 4 and 6 while leaving 3b as it is now). The
    only exception is for KeyError, if we want it to print the repr, then
    KeyError_unicode should be implemented, but I think that Python only
    calls str() so it's probably not necessary.

    Attached new patch that passes all the tests in issue6108_testcase
    except for KeyError. Unless you disagree with the 'desired behaviours'
    that I listed, this patch should fix the issue.

    @rbtcollins
    Copy link
    Member

    "2) 0 args, e = MyException(), with overridden __str__:
    py2.5 : str(e) -> 'ascii' or error; unicode(e) -> u'ascii' or error;
    py2.6 : str(e) -> 'ascii' or error; unicode(e) -> u''
    desired: str(e) -> 'ascii' or error; unicode(e) -> u'ascii' or error;
    Note: py2.5 behaviour is better: if __str__ returns an ascii string
    (including ''), unicode(e) should return the same string decoded, if
    __str__ returns a non-ascii string, both should raise an error.
    "

    I'm not sure how you justify raising an unnecessary error when trying to
    stringify an exception as being 'better'.

    __str__ should not decode its arguments if they are already strings:
    they may be valid data for the user even if they are not decodable (and
    note that an implicit decode may try to decode('ascii') which is totally
    useless.

    __str__ and __unicode__ are /different/ things, claiming they have to
    behave the same is equivalent to claiming either that we don't need
    unicode, or that we don't need binary data.

    Surely there is space for both things, which does imply that
    unicode(str(e)) != unicode(e).

    Why _should_ that be the same anyway?

    @ncoghlan
    Copy link
    Contributor

    I agree the 2.6 implementation creates backwards compatibility problems
    with subclasses that only override __str__ that we didn't recognise at
    the time.

    An alternative approach that should work even for the KeyError case is
    for BaseException_unicode to check explicitly for the situation where
    the __str__ slot has been overridden but __unicode__ is still the
    BaseException version and invoke "PyObject_Unicode(PyObject_Str(self))"
    when it detects that situation.

    That way subclasses that only override __str__ would continue to see the
    old behaviour, while subclasses that don't override either would
    continue to benefit from the new behaviour.

    @ezio-melotti
    Copy link
    Member Author

    Assume the case of e = MyException() (note: 0 args) with a __str__ that
    returns a default message. Now, if the message is ascii, str(e) works
    and the user see the default message but unicode(e) returns a
    not-so-useful empty string.
    On the other hand, if __str__ returns a non-ascii string, then it's
    wrong in the first place, because str(e) will fail and returning an
    empty string with unicode(e) is not going to help.

    @ezio-melotti
    Copy link
    Member Author

    An alternative approach that should work even for the KeyError case is
    for BaseException_unicode to check explicitly for the situation where
    the __str__ slot has been overridden but __unicode__ is still the
    BaseException version and invoke "PyObject_Unicode(PyObject_Str(self))"
    when it detects that situation.

    This is even better, I'll try to do it.

    @ezio-melotti
    Copy link
    Member Author

    Here is a new patch (bpo-6108-3.patch) that checks if __str__ has been
    overridden and calls PyObject_Unicode(PyObject_Str(self)).

    All the tests (including the one with KeyError) in
    issue6108_testcase.diff now pass.

    If the patch is OK I'll make sure that the tests cover all the possible
    cases that I listed and possibly add a few more before the commit.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 13, 2009

    You should check the return value from PyObject_Str().

    @ezio-melotti
    Copy link
    Member Author

    I created a comprehensive set of tests to check all the possibilities
    that I listed in msg96319 and updated the patch for Object/exceptions.c.
    Without patch all the test_*with_overridden__str_ and
    test_builtin_exceptions fail, both on 2.6 and on trunk, with the patch
    all the tests pass.
    The code in exceptions.c now does the equivalent of unicode(e.__str__())
    instead of unicode(str(e)). If e.__str__() returns a non-ascii unicode
    string, unicode() now shows the message instead of raising an error.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 21, 2009

    I created a comprehensive set of tests to check all the possibilities
    that I listed in msg96319 and updated the patch for Object/exceptions.c.

    Great!
    Small thing: in tests, you should use setUp() to initialize test data
    rather than __init__().

    @ezio-melotti
    Copy link
    Member Author

    I updated the patch and moved the helper class outside the __init__.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 21, 2009

    This looks fine, module the slight style issue mentioned on IRC. Please
    commit after you fix it.
    (this is assuming all tests pass, of course!)

    @ezio-melotti
    Copy link
    Member Author

    This should be the final patch (bpo-6108-6.patch). I update the
    comments, checked that (some of) the tests fail without the patch, that
    they (all) pass with it and that there are no leaks.
    I plan to backport this on 2.6 and possibly port the tests to py3k and 3.1.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 21, 2009

    It's ok for me.

    @ezio-melotti
    Copy link
    Member Author

    Fixed in r77045 (trunk) and r77046 (release26-maint). No need to port it
    to py3k since unicode() is gone.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants