utf-7 inconsistent with surrogates #57542

Closed
pitrou opened this issue Nov 3, 2011 · 11 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@pitrou
Member

pitrou commented Nov 3, 2011

BPO 13333
Nosy @loewis, @pitrou, @ezio-melotti, @akheron
Files
  • utf7.patch
  • utf7-nogit.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-11-15.00:58:41.640>
    created_at = <Date 2011-11-03.12:13:46.638>
    labels = ['interpreter-core', 'type-bug', 'expert-unicode']
    title = 'utf-7 inconsistent with surrogates'
    updated_at = <Date 2011-11-15.00:58:41.638>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2011-11-15.00:58:41.638>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-11-15.00:58:41.640>
    closer = 'pitrou'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2011-11-03.12:13:46.638>
    creator = 'pitrou'
    dependencies = []
    files = ['23686', '23688']
    hgrepos = []
    issue_num = 13333
    keywords = ['patch']
    message_count = 11.0
    messages = ['146919', '146951', '147457', '147635', '147639', '147640', '147643', '147646', '147647', '147648', '147649']
    nosy_count = 5.0
    nosy_names = ['loewis', 'pitrou', 'ezio.melotti', 'python-dev', 'petri.lehtinen']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue13333'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @pitrou
    Member Author

    pitrou commented Nov 3, 2011

    The utf-7 codec happily encodes lone surrogates, but it won't decode them:

    >>> "\ud801".encode("utf-7")
    b'+2AE-'
    >>> "\ud801\ud801".encode("utf-7")
    b'+2AHYAQ-'
    >>> "\ud801".encode("utf-7").decode("utf-7")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
        return codecs.utf_7_decode(input, errors, True)
    UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-4: second surrogate missing at end of shift sequence
    >>> "\ud801\ud801".encode("utf-7").decode("utf-7")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
        return codecs.utf_7_decode(input, errors, True)
    UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-6: second surrogate missing

    I don't know which behaviour is better but round-tripping is certainly a desirable property of any codec.
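As a minimal sketch of the round-trip property described above: the helper `utf7_round_trips` and the sample strings are illustrative assumptions, not code from the patch. On an interpreter that includes the fix for this issue, every sample is expected to round-trip; on the unpatched versions shown in the report, the lone-surrogate samples would raise `UnicodeDecodeError` instead.

```python
def utf7_round_trips(text: str) -> bool:
    """Return True if text survives encode -> decode through the utf-7 codec."""
    return text.encode("utf-7").decode("utf-7") == text


# Lone surrogates, a lone surrogate between ASCII, and a non-BMP character.
samples = ["\ud801", "\ud801\ud801", "a\ud801b", "\U00010000"]
results = {ascii(s): utf7_round_trips(s) for s in samples}
print(results)
```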

    @pitrou pitrou added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error labels Nov 3, 2011
    @loewis
    Mannequin

    loewis mannequin commented Nov 3, 2011

    RFC 2152 talks about encoding 16-bit Unicode, and clarifies:

        "Surrogate pairs (UTF-16) are converted by treating each half
        of the pair as a separate 16 bit quantity (i.e., no special
        treatment)."

    So lone surrogates clearly should be supported.

    This text could be interpreted as saying that decoding surrogate pairs should also keep them (rather than combining them). However, the RFC also assumes that the decoded form will use 16-bit code units; for Python, I think we should continue combining surrogate pairs on decoding UTF-7 when we find them.
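The RFC rule quoted above can be sketched by hand: encode the text as big-endian UTF-16, then as Base64 without padding, wrapped in `+` and `-`. The helper `utf7_shift_sequence` is hypothetical and deliberately simplified (it shifts every character, whereas the real codec leaves most ASCII unencoded); it assumes a Python version where the `surrogatepass` error handler is available for `utf-16-be`.

```python
import base64


def utf7_shift_sequence(text: str) -> bytes:
    """Encode all of `text` as one UTF-7 shifted sequence (hypothetical helper).

    Unlike the real codec, this never leaves ASCII characters unencoded; it
    only illustrates the UTF-16 -> modified Base64 step.
    """
    # "surrogatepass" lets lone surrogates through as ordinary 16-bit units.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    # Modified Base64 is ordinary Base64 without the trailing "=" padding.
    b64 = base64.b64encode(utf16).rstrip(b"=")
    return b"+" + b64 + b"-"


print(utf7_shift_sequence("\ud801"))        # lone surrogate
print(utf7_shift_sequence("\ud801\ud801"))  # two lone surrogates
print(utf7_shift_sequence("\U00010000"))    # non-BMP character, sent as a pair
```

For the two lone-surrogate inputs, the result matches the `b'+2AE-'` and `b'+2AHYAQ-'` outputs shown in the original report. The non-BMP character is encoded as two 16-bit halves with no special treatment, and the built-in decoder recombines them into a single code point.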

    @ezio-melotti
    Member

    FWIW Wikipedia says "Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into surrogates) and then in modified Base64."

    So one possible interpretation is that, when encoding a non-BMP character, it should first be converted into a surrogate pair, and then each surrogate should be encoded just like any other 16-bit code unit.
    When decoding, it seems reasonable to do the opposite, i.e. recombine the surrogate pair.

    The RFC doesn't say anything about lone surrogates, but I think that the fact that surrogates are used internally doesn't necessarily mean that the codec should be able to encode/decode them when they are not paired. The other UTF-* codecs reject them, but that's because it is explicitly forbidden by their respective standards.

    So I'm +1 about recombining them while decoding, and ±0 about allowing lone surrogates.
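To illustrate the contrast with the other UTF-* codecs mentioned above, here is a small sketch; the helper name `rejects_lone_surrogate` is an assumption made for this example. It assumes a reasonably recent Python 3, where the strict `utf-8`, `utf-16` and `utf-32` encoders refuse lone surrogates while `utf-7` encodes them.

```python
def rejects_lone_surrogate(encoding: str) -> bool:
    """Return True if the codec refuses to encode a lone surrogate."""
    try:
        "\ud801".encode(encoding)
    except UnicodeEncodeError:
        return True
    return False


for name in ("utf-8", "utf-16", "utf-32", "utf-7"):
    print(name, rejects_lone_surrogate(name))
```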

    @pitrou
    Member Author

    pitrou commented Nov 14, 2011

    Here is a patch.

    @loewis
    Mannequin

    loewis mannequin commented Nov 14, 2011

    Can you please regenerate the patch against default's head?

    @pitrou
    Member Author

    pitrou commented Nov 14, 2011

    It's a patch for 3.2.

    @loewis
    Mannequin

    loewis mannequin commented Nov 14, 2011

    Please don't use git-style diffs then, since otherwise the review tool can't figure out what the patch applies to (and neither could I).

    @pitrou
    Member Author

    pitrou commented Nov 14, 2011

    Here is a non-git diff then :)

    @loewis
    Mannequin

    loewis mannequin commented Nov 15, 2011

    LGTM.

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 15, 2011

    New changeset ddfcb0de564f by Antoine Pitrou in branch '3.2':
    Issue bpo-13333: The UTF-7 decoder now accepts lone surrogates
    http://hg.python.org/cpython/rev/ddfcb0de564f

    New changeset 250091e60f28 by Antoine Pitrou in branch 'default':
    Issue bpo-13333: The UTF-7 decoder now accepts lone surrogates
    http://hg.python.org/cpython/rev/250091e60f28

    New changeset 050772822bde by Antoine Pitrou in branch '2.7':
    Issue bpo-13333: The UTF-7 decoder now accepts lone surrogates
    http://hg.python.org/cpython/rev/050772822bde

    @pitrou
    Member Author

    pitrou commented Nov 15, 2011

    I made a little fix to the patch for wide unicode builds and then committed it. Thank you!

    @pitrou pitrou closed this as completed Nov 15, 2011
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022