Codecs: "surrogateescape" error handler in Python 2.7 #52685

ysjray · 2010-04-18T08:25:32Z

BPO	8438
Nosy	@malemburg, @loewis, @pitrou, @vstinner, @benjaminp, @ezio-melotti
Files	surrogateescape.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2010-04-19.19:38:54.620>
created_at = <Date 2010-04-18.08:25:31.636>
labels = ['type-bug', 'expert-IO']
title = 'Codecs: "surrogateescape" error handler in Python 2.7'
updated_at = <Date 2010-04-19.19:38:54.619>
user = 'https://bugs.python.org/ysjray'

bugs.python.org fields:

activity = <Date 2010-04-19.19:38:54.619>
actor = 'pitrou'
assignee = 'none'
closed = True
closed_date = <Date 2010-04-19.19:38:54.620>
closer = 'pitrou'
components = ['IO']
creation = <Date 2010-04-18.08:25:31.636>
creator = 'ysj.ray'
dependencies = []
files = ['16977']
hgrepos = []
issue_num = 8438
keywords = ['patch']
message_count = 15.0
messages = ['103470', '103478', '103479', '103480', '103481', '103483', '103484', '103506', '103508', '103509', '103513', '103561', '103562', '103567', '103622']
nosy_count = 7.0
nosy_names = ['lemburg', 'loewis', 'pitrou', 'vstinner', 'benjamin.peterson', 'ezio.melotti', 'ysj.ray']
pr_nums = []
priority = 'high'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue8438'
versions = ['Python 2.7']

ysjray · 2010-04-18T08:25:30Z

According to PEP 383, the new "surrogateescape" codec error handler was introduced in Python 3.1, but I found that some code in trunk already uses it:

Modules/_io/fileio.c:

static int
fileio_init(PyObject *oself, PyObject *args, PyObject *kwds){
    ...
    stringobj = PyUnicode_AsEncodedString(
		u, Py_FileSystemDefaultEncoding, "surrogateescape");
    ...

Obviously, the "surrogateescape" error handler does not exist there.

Some test code:
===========================

import io

file_name = u'\udc80.txt'
f = io.FileIO(file_name)

===========================

Running this code on a machine whose file system default encoding is gb2312 raises an exception:

LookupError: unknown error handler name 'surrogateescape'
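
Whether an interpreter knows a given error handler can be checked up front with `codecs.lookup_error`, which raises the same kind of `LookupError` for unregistered names. A minimal sketch (the helper name `has_error_handler` is made up for illustration and is not part of the report):

```python
import codecs


def has_error_handler(name):
    """Return True if `name` is a registered codec error handler."""
    try:
        codecs.lookup_error(name)
    except LookupError:
        # Raised for handler names the interpreter does not know about.
        return False
    return True


print(has_error_handler("strict"))
print(has_error_handler("surrogateescape"))
```

On Python 3.1 and later the second call should report the handler as present; on the 2.7 trunk described here it would presumably report it as missing.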

I don't know whether this is a bug.

Thanks.

pitrou · 2010-04-18T10:57:30Z

Certainly a bug indeed.

malemburg · 2010-04-18T11:26:29Z

I think it would be best to backport the handler (even though it is not needed in Python 2.7), since it makes porting apps to 3.x easier.

loewis · 2010-04-18T11:29:37Z

-1 on backporting. The handler isn't really meant to be used in applications, plus 2.7 is in feature-freeze.

malemburg · 2010-04-18T11:37:50Z

Martin v. Löwis wrote:

> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> -1 on backporting. The handler isn't really meant to be used in applications, plus 2.7 is in feature-freeze.

Since 2.7 is meant to be the last release of the 2.x series,
we have to make sure that it has all the bits necessary to make
porting apps to 3.x easy.

As a result, omissions such as the new handler which became
necessary after the change to the UTF-8 codec in 3.x deserve
special attention, overriding such self-imposed restrictions.

The handler is not meant to be used internally only. In fact,
it was the prerequisite for me to be +1 on the UTF-8 codec
change in 3.x.

loewis · 2010-04-18T12:03:38Z

> Since 2.7 is meant to be the last release of the 2.x series,
> we have to make sure that it has all the bits necessary to make
> porting apps to 3.x easy.

Any new features in 2.7 require approval from the release manager now.

pitrou · 2010-04-18T12:05:33Z

> Any new features in 2.7 require approval from the release manager now.

Not only that, they also need someone to provide a patch :)
Removing any surrogateescape use from the io module would be comparatively much easier.

benjaminp · 2010-04-18T17:33:55Z

The 2.x io lib should use the same encoding principles as the rest of 2.x.

pitrou · 2010-04-18T18:12:14Z

Here is a fix + test.

vstinner · 2010-04-18T18:20:15Z

> I think it would be best to backport the handler (even though
> it is not needed in Python 2.7), since it makes porting apps
> to 3.x easier.

surrogateescape should not be used directly by applications. It's used by Python 3 internals, which use unicode by default.

I don't know if it would help porting applications from Python 2 to Python 3. I don't know of a use case for surrogateescape in Python 2. By default, Python 2 uses byte strings everywhere, especially for filenames, so it doesn't need any unicode error handler.

Another point to consider is that utf8 encoder rejects surrogates in Python3, whereas surrogates are accepted by the Python2 utf8 encoder.

I don't have a strong opinion. But if I have to choose, I would say that surrogateescape should not go to Python2. It's a solution to problem specific to Python3.

(... and surrogates introduce a lot of new issues ...)

pitrou · 2010-04-18T18:32:54Z

New patch fixing Windows compatibility.

malemburg · 2010-04-19T08:45:19Z

STINNER Victor wrote:

> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
>> I think it would be best to backport the handler (even though
>> it is not needed in Python 2.7), since it makes porting apps
>> to 3.x easier.
>
> surrogateescape should not be used directly by applications. It's used by Python 3 internals, which use unicode by default.
>
> I don't know if it would help porting applications from Python 2 to Python 3. I don't know of a use case for surrogateescape in Python 2. By default, Python 2 uses byte strings everywhere, especially for filenames, so it doesn't need any unicode error handler.
>
> Another point to consider is that utf8 encoder rejects surrogates in Python3, whereas surrogates are accepted by the Python2 utf8 encoder.

Sorry, I think I need to correct myself: I mixed up the handlers
surrogateescape and surrogatepass. I was actually thinking of the
surrogatepass handler, which makes the Python 3 UTF-8 codec behave like the
Python 2 UTF-8 codec (without an extra handler), not the surrogateescape
handler, which implements the UTF-8b logic of escaping non-decodable
bytes to lone surrogates.
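
To make the surrogatepass behaviour concrete, here is a small Python 3 sketch (the variable names are illustrative only): the strict UTF-8 encoder rejects a lone surrogate, while `surrogatepass` lets it through in both directions.

```python
lone = "\udc80"  # a lone low surrogate

# Python 3's strict UTF-8 encoder refuses lone surrogates.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With "surrogatepass" the code point is encoded as an ordinary
# three-byte sequence and can be decoded back unchanged.
passed = lone.encode("utf-8", "surrogatepass")
restored = passed.decode("utf-8", "surrogatepass")

print(strict_ok, passed, restored == lone)
```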

The surrogatepass handler is needed in Python 2.7 to make it
possible to write applications that work in both 2.7 and 3.x
without changing the code.

I consider this an important missing backport for 2.7, since
without this handler, the UTF-8 codecs in 2.7 and 3.x are
incompatible and there's no way to work around this
other than to make the choice of error handler conditionally
depend on the Python version.
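
Such a version-dependent workaround might look roughly like the following sketch. The helper names are hypothetical, and it assumes, as described in this thread, that the 2.x UTF-8 codec accepts lone surrogates under "strict" while 3.x needs "surrogatepass".

```python
import sys

# Assumption from the discussion: Python 2's UTF-8 codec handles lone
# surrogates with the default "strict" handler, Python 3 needs "surrogatepass".
UTF8_SURROGATE_ERRORS = "surrogatepass" if sys.version_info[0] >= 3 else "strict"


def encode_utf8_keep_surrogates(text):
    """Encode unicode text to UTF-8, letting lone surrogates through."""
    return text.encode("utf-8", UTF8_SURROGATE_ERRORS)


def decode_utf8_keep_surrogates(data):
    """Decode UTF-8 bytes, accepting encoded lone surrogates."""
    return data.decode("utf-8", UTF8_SURROGATE_ERRORS)


sample = u"a\udc80"
print(decode_utf8_keep_surrogates(encode_utf8_keep_surrogates(sample)) == sample)
```

With a backported handler the conditional would be unnecessary, which is the point being argued.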

As such, it's a bug rather than a new feature.

The surrogateescape handler implements the UTF-8b escaping
logic:

b'\x91\x92'
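
As a rough Python 3 illustration of that escaping (a sketch, not necessarily the example originally given in this comment): each undecodable byte is mapped to a lone surrogate in the range U+DC80 to U+DCFF, and encoding with the same handler restores the original bytes.

```python
raw = b"\x91\x92"  # not valid UTF-8

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler turns the surrogates back into bytes.
back = text.encode("utf-8", "surrogateescape")

print(ascii(text), back == raw)
```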

In Python 3.x this is needed to work around problems with
wrong I/O encoding settings or situations where you have mixed
encoding settings used in external resources such as environment
variable content, filesystems using different encodings than
the system one, remote shell output, pipes which don't carry
any encoding information, etc. etc.
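
For file names specifically, Python 3 exposes this round trip through `os.fsdecode` and `os.fsencode`. The sketch below assumes a POSIX system, where these functions use the filesystem encoding together with surrogateescape; the file name is invented for illustration and is never opened.

```python
import os

# A byte-string file name that may not be decodable in the
# filesystem encoding (hypothetical name, nothing touches the disk).
name = b"report-\xff.txt"

# Undecodable bytes are escaped to lone surrogates on POSIX...
as_text = os.fsdecode(name)

# ...and os.fsencode reverses the mapping to recover the exact bytes.
round_tripped = os.fsencode(as_text)

print(round_tripped == name)
```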

Backporting this handler would be useful for Python 2.7 as
well, since it allows preparing 2.7 applications for use in
3.x and again allows using the same code for 2.7 and 3.x.

Not having this handler in 2.7 is not as serious as missing the
surrogatepass handler, but it would still be useful for applications
that are meant to run unchanged in 2.7 and 3.x.

ezio-melotti · 2010-04-19T08:55:47Z

> I consider this an important missing backport for 2.7, since
> without this handler, the UTF-8 codecs in 2.7 and 3.x are
> incompatible and there's no other way to work around this
> other than to make use of the errorhandler conditionally
> depend on the Python version.

FWIW I tried to update the UTF-8 codec on trunk from RFC 2279 to RFC 3629 while working on bpo-8271, and found this difference in the handling of surrogates (they are invalid only on 3.x).
I didn't change the behavior of the codec in the patch I attached to bpo-8271 because it was out of the scope of the issue, but I consider the fact that surrogates can be encoded in Python 2.x a bug, because it doesn't follow RFC 3629.
IMHO Python 2.x should provide an RFC-3629-compliant UTF-8 codec; however, I haven't had time yet to investigate how Python 3 handles this and what the best solution is (e.g. adding another codec or changing the default behavior).

malemburg · 2010-04-19T09:15:28Z

Ezio Melotti wrote:

> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
>
>> I consider this an important missing backport for 2.7, since
>> without this handler, the UTF-8 codecs in 2.7 and 3.x are
>> incompatible and there's no other way to work around this
>> other than to make use of the errorhandler conditionally
>> depend on the Python version.
>
> FWIW I tried to update the UTF-8 codec on trunk from RFC 2279 to RFC 3629 while working on bpo-8271, and found this difference in the handling of surrogates (they are invalid only on 3.x).
> I didn't change the behavior of the codec in the patch I attached to bpo-8271 because it was out of the scope of the issue, but I consider the fact that surrogates can be encoded in Python 2.x a bug, because it doesn't follow RFC 3629.
> IMHO Python 2.x should provide an RFC-3629-compliant UTF-8 codec; however, I haven't had time yet to investigate how Python 3 handles this and what the best solution is (e.g. adding another codec or changing the default behavior).

We have good reasons to allow lone surrogates in the UTF-8
codec.

Please remember that Python is a programming language
meant to allow writing applications, which also includes constructing
Unicode data from scratch, rather than an application which is
only meant to work with UTF-8 data.

Also note that lone surrogates were considered valid UTF-8 at the
time of adding Unicode support to Python and many years after that.

Since the codec is used in lots of applications, following the
Unicode consortium change in 2.7 is not possible.

This is why it was done in the 3.x branch and then only with
the additional surrogatepass handler to get back the old behavior
where needed.

But this is getting offtopic for the issue in question... I'll
open a new ticket for the backports.

pitrou · 2010-04-19T18:53:13Z

Patch committed to trunk in r80215. I'm going to watch the buildbots; I suspect OS X might dislike surrogates in the filename.

ysjray mannequin added topic-unicode type-bug labels Apr 18, 2010

pitrou added topic-IO and removed topic-unicode labels Apr 18, 2010

pitrou closed this as completed Apr 19, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022
