utf-7 inconsistent with surrogates #57542

pitrou · 2011-11-03T12:13:47Z

BPO	13333
Nosy	@loewis, @pitrou, @ezio-melotti, @akheron
Files	utf7.patch utf7-nogit.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2011-11-15.00:58:41.640>
created_at = <Date 2011-11-03.12:13:46.638>
labels = ['interpreter-core', 'type-bug', 'expert-unicode']
title = 'utf-7 inconsistent with surrogates'
updated_at = <Date 2011-11-15.00:58:41.638>
user = 'https://github.com/pitrou'

bugs.python.org fields:

activity = <Date 2011-11-15.00:58:41.638>
actor = 'pitrou'
assignee = 'none'
closed = True
closed_date = <Date 2011-11-15.00:58:41.640>
closer = 'pitrou'
components = ['Interpreter Core', 'Unicode']
creation = <Date 2011-11-03.12:13:46.638>
creator = 'pitrou'
dependencies = []
files = ['23686', '23688']
hgrepos = []
issue_num = 13333
keywords = ['patch']
message_count = 11.0
messages = ['146919', '146951', '147457', '147635', '147639', '147640', '147643', '147646', '147647', '147648', '147649']
nosy_count = 5.0
nosy_names = ['loewis', 'pitrou', 'ezio.melotti', 'python-dev', 'petri.lehtinen']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue13333'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

pitrou · 2011-11-03T12:13:46Z

The utf-7 codec happily encodes lone surrogates, but it won't decode them:

>>> "\ud801".encode("utf-7")
b'+2AE-'
>>> "\ud801\ud801".encode("utf-7")
b'+2AHYAQ-'
>>> "\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-4: second surrogate missing at end of shift sequence
>>> "\ud801\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-6: second surrogate missing

I don't know which behaviour is better but round-tripping is certainly a desirable property of any codec.
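The round-trip property mentioned above can be checked with a short script. This is only a sketch: `utf7_round_trips` and the `samples` list are illustrative names, not part of the codec or its test suite, and the results depend on the interpreter version. An interpreter without the fix discussed in this issue is expected to reject the lone-surrogate samples on decoding.

```python
def utf7_round_trips(text: str) -> bool:
    """Return True if text survives an encode/decode cycle through UTF-7."""
    try:
        return text.encode("utf-7").decode("utf-7") == text
    except UnicodeError:
        # A decoder that rejects lone surrogates ends up here.
        return False


# Hypothetical sample inputs: lone surrogates, a surrogate between ASCII
# characters, and one non-BMP character for comparison.
samples = ["\ud801", "\ud801\ud801", "a\ud801b", "\U00010400"]
results = {s: utf7_round_trips(s) for s in samples}

for s, ok in results.items():
    print(ascii(s), "round-trips" if ok else "does NOT round-trip")
```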

loewis · 2011-11-03T17:28:59Z

RFC 2152 talks about encoding 16-bit Unicode, and clarifies:

Surrogate pairs (UTF-16) are converted by treating each half
of the pair as a separate 16 bit quantity (i.e., no special
treatment).

So lone surrogates clearly should be supported.

This text could be interpreted as saying that decoding surrogate pairs should also keep them (rather than combining them). However, the RFC also assumes that the decoded form will use 16-bit code units; for Python, I think we should continue combining surrogate pairs on decoding UTF-7 when we find them.
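To make the "each half as a separate 16-bit quantity" reading concrete, here is a rough sketch that builds a UTF-7 shift sequence by hand. It assumes an interpreter where the `surrogatepass` error handler is available for UTF-16 and where the decoder accepts lone surrogates. The helper `utf7_shift_sequence` is hypothetical and deliberately simplified: it puts every character into a single shifted section and ignores the directly encoded ASCII characters that a real encoder leaves alone.

```python
import base64


def utf7_shift_sequence(text: str) -> bytes:
    """Encode text as a single '+...-' UTF-7 shift sequence (simplified)."""
    # Each UTF-16 code unit, including either half of a surrogate pair or a
    # lone surrogate, is written as a plain 16-bit big-endian quantity.
    units = text.encode("utf-16-be", "surrogatepass")
    # Modified Base64 as used by UTF-7: standard alphabet, no '=' padding.
    b64 = base64.b64encode(units).rstrip(b"=")
    return b"+" + b64 + b"-"


astral = "\U00010400"               # one non-BMP code point
encoded = utf7_shift_sequence(astral)
decoded = encoded.decode("utf-7")   # the decoder recombines the pair

print(encoded, len(decoded))
print(utf7_shift_sequence("\ud801"))  # lone surrogate, same 16-bit treatment
```

For these inputs the hand-built bytes should match what the built-in codec produces, which suggests the encoder treats surrogates like any other code unit while the decoder recombines valid pairs.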

ezio-melotti · 2011-11-12T01:55:54Z

FWIW Wikipedia says "Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into surrogates) and then in modified Base64."

So one possible interpretation is that while encoding a non-BMP char, it should first be converted into a surrogate pair and then each of the surrogates should be encoded just like any other 16-bit code unit.
While decoding, it seems reasonable to do the opposite, i.e. recombine the surrogate pair.

The RFC doesn't say anything about lone surrogates, but I think that the fact that surrogates are used internally doesn't necessarily mean that the codec should be able to encode/decode them when they are not paired. The other UTF-* codecs reject them, but that's because it is explicitly forbidden by their respective standards.

So I'm +1 about recombining them while decoding, and ±0 about allowing lone surrogates.

pitrou · 2011-11-14T21:33:26Z

Here is a patch.

loewis · 2011-11-14T23:26:29Z

Can you please regenerate the patch against default's head?

pitrou · 2011-11-14T23:29:37Z

It's a patch for 3.2.

loewis · 2011-11-14T23:32:58Z

Please don't use git-style diffs then, since otherwise the review can't figure out what the patch applies to (and neither could I figure that out).

pitrou · 2011-11-14T23:42:40Z

Here is a non-git diff then :)

loewis · 2011-11-15T00:16:37Z

LGTM.

python-dev · 2011-11-15T00:55:07Z

New changeset ddfcb0de564f by Antoine Pitrou in branch '3.2':
Issue bpo-13333: The UTF-7 decoder now accepts lone surrogates
http://hg.python.org/cpython/rev/ddfcb0de564f

New changeset 250091e60f28 by Antoine Pitrou in branch 'default':
Issue bpo-13333: The UTF-7 decoder now accepts lone surrogates
http://hg.python.org/cpython/rev/250091e60f28

New changeset 050772822bde by Antoine Pitrou in branch '2.7':
Issue bpo-13333: The UTF-7 decoder now accepts lone surrogates
http://hg.python.org/cpython/rev/050772822bde

pitrou · 2011-11-15T00:58:41Z

I made a little fix to the patch for wide unicode builds and then committed it. Thank you!

pitrou added the interpreter-core (Objects, Python, Grammar, and Parser dirs), topic-unicode, and type-bug (An unexpected behavior, bug, or error) labels Nov 3, 2011

pitrou closed this as completed Nov 15, 2011

ezio-melotti transferred this issue from another repository Apr 10, 2022
