Codecs: "surrogateescape" error handler in Python 2.7 #52685
Comments
According to PEP 383, the new "surrogateescape" codec error handler should only be available from Python 3.1 onwards, but I found that some code in the trunk already uses it:

    static int
    fileio_init(PyObject *oself, PyObject *args, PyObject *kwds){
    ...
    stringobj = PyUnicode_AsEncodedString(
    u, Py_FileSystemDefaultEncoding, "surrogateescape");
    ...

However, the "surrogateescape" error handler does not exist there. Some test code:

    import io
    file_name = u'\udc80.txt'
    f = io.FileIO(file_name)

When this code is run on a machine whose default file system encoding is gb2312, it raises an exception:

    LookupError: unknown error handler name 'surrogateescape'

I don't know whether this is a bug. Thanks. |
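For readers who want to check this on their own interpreter, here is a minimal sketch of how to test whether an error handler name is registered. `codecs.lookup_error` is a standard library function; the helper name `has_error_handler` is made up for illustration. The round trip at the end assumes Python 3.1 or later, where the handler exists.

```python
import codecs

def has_error_handler(name):
    """Return True if `name` is a registered codec error handler.

    `has_error_handler` is a hypothetical helper, not part of any library.
    """
    try:
        codecs.lookup_error(name)
    except LookupError:
        return False
    return True

# Assumes Python 3.1+: an undecodable byte becomes a lone surrogate on
# decoding and is turned back into the same byte on encoding.
if has_error_handler("surrogateescape"):
    text = b"\x80.txt".decode("ascii", "surrogateescape")
    raw = text.encode("ascii", "surrogateescape")
```

On an interpreter without the handler, `has_error_handler("surrogateescape")` returns False, and passing the name to an encode call would raise the same LookupError as in the report.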
Certainly a bug indeed. |
I think it would be best to backport the handler (even though it is not needed in Python 2.7), since it makes porting apps to 3.x easier. |
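As a rough illustration of what such a handler does on the decoding side, the sketch below registers a custom error handler with `codecs.register_error`. It is not the patch discussed in this thread; the handler name "escape_sketch" and the function name `escape_undecodable` are hypothetical, and the code is written for Python 3.

```python
import codecs

def escape_undecodable(exc):
    """Map each undecodable byte 0x80-0xFF to a lone surrogate U+DC80-U+DCFF.

    Hypothetical decode-only sketch of the idea behind "surrogateescape".
    """
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    chars = []
    for byte in bytearray(bad):
        if byte < 0x80:
            # ASCII bytes are never escaped; re-raise the original error.
            raise exc
        chars.append(chr(0xDC00 + byte))
    # Return the replacement text and the position where decoding resumes.
    return "".join(chars), exc.end

# "escape_sketch" is an illustrative name chosen to avoid clashing
# with the built-in handler.
codecs.register_error("escape_sketch", escape_undecodable)

decoded = b"\x80.txt".decode("ascii", "escape_sketch")
```

A complete handler would also need the encoding direction, turning U+DC80-U+DCFF back into the original bytes.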
-1 on backporting. The handler isn't really meant to be used in applications, plus 2.7 is in feature-freeze. |
Martin v. Löwis wrote:
Since 2.7 is meant to be the last release of the 2.x series, [...]
As a result, omissions such as the new handler which became [...]
The handler is not meant to be used internally only. In fact, [...] |
Any new features in 2.7 require approval from the release manager now. |
Not only, but they also need someone to provide a patch :) |
The 2.x io lib should use the same encoding principles as the rest of 2.x. |
Here is a fix + test. |
surrogateescape should not be used directly by applications. It's used by Python3 internals, which use unicode by default. I don't know if it would help porting applications from Python2 to Python3, and I don't know of a use case for surrogateescape in Python2. By default, Python2 uses byte strings everywhere, especially for filenames, so it doesn't need any unicode error handler. Another point to consider is that the utf8 encoder rejects surrogates in Python3, whereas surrogates are accepted by the Python2 utf8 encoder. I don't have a strong opinion, but if I had to choose, I would say that surrogateescape should not go into Python2. It's a solution to a problem specific to Python3. (... and surrogates introduce a lot of new issues ...) |
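The encoder behaviour mentioned above can be observed with a short sketch, assuming Python 3.1 or later. The variable names are illustrative only.

```python
lone = "\udc80"  # a lone low surrogate

# The strict UTF-8 encoder in Python 3 refuses lone surrogates.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogateescape" turns U+DC80 back into the raw byte 0x80.
escaped = lone.encode("utf-8", "surrogateescape")

# "surrogatepass" emits the three-byte UTF-8 form of the surrogate itself.
passed = lone.encode("utf-8", "surrogatepass")
```

Here `escaped` is `b'\x80'` and `passed` is `b'\xed\xb2\x80'`, while the strict call fails.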
New patch fixing Windows compatibility. |
STINNER Victor wrote:
Sorry, I think I need to correct myself: I mixed up the handlers [...]
I consider this an important missing backport for 2.7, since [...] As such, it's a bug rather than a new feature.
b'\x91\x92' [...] In Python 3.x this is needed to work around problems with [...] Backporting this handler would be useful for Python 2.7 as [...] Not having this handler in 2.7 is not as serious as the [...] |
FWIW I tried to update the UTF-8 codec on trunk from RFC 2279 to RFC 3629 while working on bpo-8271, and found this difference in the handling of surrogates (they are invalid only on 3.x). |
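The decoding side of this difference can be sketched as follows, assuming Python 3.1 or later. RFC 3629 prohibits encoding code points in the range U+D800 to U+DFFF, so the strict decoder rejects the byte sequence below. The variable names are illustrative.

```python
# The three-byte UTF-8 form of the lone surrogate U+DC80.
encoded_surrogate = b"\xed\xb2\x80"

try:
    encoded_surrogate.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False

# "surrogatepass" tells the codec to let the surrogate through.
recovered = encoded_surrogate.decode("utf-8", "surrogatepass")
```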
Ezio Melotti wrote:
We have good reasons to allow lone surrogates in the UTF-8 [...]
Please remember that Python is a programming language [...]
Also note that lone surrogates were considered valid UTF-8 at the [...]
Since the codec is used in lots of applications, following the [...]
This is why it was done in the 3.x branch and then only with [...]
But this is getting offtopic for the issue in question... I'll [...] |
Patch committed to trunk in r80215. I'm going to watch the buildbots; I suspect OS X might dislike surrogates in the filename. |