bpo-28180: Implementation for PEP 538 #659

ncoghlan · 2017-03-13T06:06:48Z

Reference implementation for PEP 538, which is in turn a partial fix for bpo-28180.

This updates the CPython CLI to attempt to set LC_CTYPE to a suitable UTF-8
based locale before loading the runtime when it detects that it is running in
the C locale.

It also updates the CPython runtime to emit a compatibility warning on stderr
when running in the C locale.

Remaining work:

- Fix the Windows compatibility issue reported by AppVeyor
- What's New entry
- Add a note in the What's New entry regarding the current locale coercion warning that indicates we're actively seeking feedback on it during the 3.7 pre-release cycle. Our experience with the Fedora 26 backport so far has been that the warning is useful in spotting places where we should probably change the configured locale to C.UTF-8 (e.g. in RPM build environments), but I'm still starting to wonder if it might be better for us to disable it before the F26 final release.
- NEWS entry

- new PYTHONCOERCECLOCALE config setting
- coerces legacy C locale to C.UTF-8, C.utf8 or UTF-8 by default

TODO:
- configure option to disable locale coercion at build time
- configure option to disable C locale warning at build time
- skip runtime locale warning on Mac OS X

* --with(out)-c-locale-coercion for PY_COERCE_C_LOCALE
* --with(out)-c-locale-warning for PY_WARN_ON_C_LOCALE
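The coercion order described in the commit notes above can be sketched in Python. This is only an illustration of the selection logic, not the C code in this PR: `choose_target` and `can_set_lc_ctype` are hypothetical helper names, and the target list is copied from the notes.

```python
import locale

# Candidate targets, in the order given in the commit notes above.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def choose_target(current, can_set):
    """Return the first usable coercion target, or None.

    `current` is the active LC_CTYPE locale name; `can_set` is a
    callable reporting whether the platform accepts a locale name.
    Both are hypothetical stand-ins for the checks the CLI performs.
    """
    if current != "C":
        # Only the legacy C locale is coerced.
        return None
    for target in COERCION_TARGETS:
        if can_set(target):
            return target
    return None


def can_set_lc_ctype(name):
    """Probe whether LC_CTYPE can be set to `name`, then restore it."""
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    else:
        return True
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)
```

Calling `choose_target("C", can_set_lc_ctype)` reports which of the three names, if any, the current platform provides.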

methane · 2017-03-13T12:15:03Z

Programs/python.c

+    /* Set PYTHONIOENCODING if not already set */
+    if (setenv("PYTHONIOENCODING", "utf-8:surrogateescape", 0)) {
+        fprintf(stderr,
+                "Error setting PYTHONIOENCODING during C locale coercion\n");


This may break older Python 2 interpreters that are run as subprocesses.

    $ cat t.py
    # encoding: utf-8
    print(u"こんにちは")
    $ /usr/bin/python2.6 t.py
    こんにちは
    $ export PYTHONIOENCODING=utf-8:surrogateescape
    $ /usr/bin/python2.6 t.py
    TypeError: an integer is required

I'm sorry, my environment was wrong.
I accidentally ran the test above in the Lib/ directory of a CPython checkout.
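For context on why subprocesses are affected at all: `setenv(..., 0)` in the diff above only sets the variable when it is not already present, and whatever ends up in the environment is inherited by child processes. A rough Python analogue follows, where `set_if_unset` and `child_sees` are hypothetical helper names used only for this illustration.

```python
import os
import subprocess
import sys


def set_if_unset(env, name, value):
    """Mimic setenv(name, value, 0): keep any existing value."""
    env.setdefault(name, value)
    return env


def child_sees(env, name):
    """Run a child interpreter and report the value it inherits."""
    code = "import os; print(os.environ.get(%r, ''))" % name
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Start from the real environment, minus any existing setting.
base = dict(os.environ)
base.pop("PYTHONIOENCODING", None)
env = set_if_unset(dict(base), "PYTHONIOENCODING", "utf-8:surrogateescape")
```

With this setup, `child_sees(env, "PYTHONIOENCODING")` returns the coerced value, which is how a Python 2 child could end up with a setting it does not fully understand.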

It still raises a good question though, as that setting does affect Python 2 differently from the way it affects Python 3 - it changes the implicit encoding step on stdout, but stdin still relies on passing the raw bytes through without interpretation:

    $ LANG=C python2
    Python 2.7.13 (default, Jan 12 2017, 17:59:37)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print(u"こんにちは")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
    >>> print("こんにちは")
    こんにちは
    >>>
    $ PYTHONIOENCODING=utf-8:surrogateescape LANG=C python2
    Python 2.7.13 (default, Jan 12 2017, 17:59:37)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print(u"こんにちは")
    こんにちは
    >>> print("こんにちは")
    こんにちは
    >>>
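The "position 0-14" in the first traceback is worth a note: the string has five characters but fifteen UTF-8 bytes. One reading consistent with that number (an assumption on my part, not something stated in this thread) is that Python 2 in the C locale treats each input byte of the unicode literal as a separate Latin-1 character. The Python 3 sketch below reproduces the span under that assumption; `ascii_error_span` is a hypothetical helper.

```python
text = "こんにちは"

# Assumption: each UTF-8 byte was read as one Latin-1 character.
utf8_bytes = text.encode("utf-8")
as_read_bytewise = utf8_bytes.decode("latin-1")


def ascii_error_span(s):
    """Return the inclusive (first, last) positions ASCII cannot encode."""
    try:
        s.encode("ascii")
    except UnicodeEncodeError as exc:
        # exc.end is exclusive; the traceback message shows it inclusive.
        return (exc.start, exc.end - 1)
    return None


print(len(utf8_bytes))                     # 15
print(ascii_error_span(as_read_bytewise))  # (0, 14)
```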

It's also a potential problem that 'surrogateescape' doesn't exist in Python 2, so it may be better to just use Py_SetStandardStreamEncoding in PEP 538, and leave enabling surrogateescape in subprocesses as well to PEP 540 (via PYTHONUTF8=1 in the parent environment).
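To make the `utf-8:surrogateescape` value concrete, here is a small Python 3 sketch of how the `encoding:errors` form splits and what the error handler does with undecodable bytes. `split_io_encoding` is a hypothetical helper written for this illustration, not the parser CPython uses.

```python
import codecs


def split_io_encoding(value):
    """Split an 'encoding:errors' string; either part may be empty."""
    encoding, _, errors = value.partition(":")
    return (encoding or None, errors or None)


encoding, errors = split_io_encoding("utf-8:surrogateescape")

# The handler is registered in Python 3; Python 2 has no such handler.
codecs.lookup_error(errors)

# A byte that is not valid UTF-8 survives a decode/encode round trip
# by being smuggled through as a lone surrogate code point.
raw = b"caf\xe9"
decoded = raw.decode(encoding, errors)
round_tripped = decoded.encode(encoding, errors)
```

Here `decoded` is `"caf\udce9"` and `round_tripped` equals `raw`, which is the property that makes the handler attractive for the standard streams.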

It also turns out that LANG=C python2 is an easy way to demonstrate that GNU readline just plain doesn't handle UTF-8 properly in the C locale - attempting to edit the print(u"こんにちは") line at the interactive prompt to remove the u prefix or add it back results in nonsense:

    >>> print(u"こんにちは")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
    >>> print(�こんにちは")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
    >>> print("こんにちは")
    こんにちは
    >>> print(u��にちは") ") こ�u��にちは

warsaw

Leaving out for the moment whether this is a good idea or not (we'll use the PEP process for that), I have some questions about the code.

warsaw · 2017-03-14T20:37:14Z

Doc/using/cmdline.rst

+   to skip coercing the legacy ASCII-based C locale to a more capable UTF-8
+   based alternative. Note that this setting is checked even when the
+   :option:`-E` or :option:`-I` options are used, as it is handled prior to
+   the processing of command line options.


Am I reading this right? It seems odd that setting PYTHONCOERCECLOCALE would disable coercing the legacy locale. The sense is exactly opposite of what I'd expect.

Should the envar be named PYTHONNOCOERCECLOCALE?

Also, what if the environment variable is set to "0"? Given the above description, that should still skip coercion.

Oops, that's a holdover from when the setting was PYTHONALLOWCLOCALE - presumably I changed the title of the section, but then got distracted by something else before updating the body.

Fixed in 1c3a270
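For readers following the question about `"0"`: the sketch below assumes the semantics documented for the setting as it later shipped in Python 3.7 (`0` skips coercion, `warn` additionally emits a warning, and anything else leaves coercion on). That may differ from the state of this PR at the time of the review, and `coercion_mode` is a hypothetical helper, not a function in CPython.

```python
def coercion_mode(environ):
    """Classify a PYTHONCOERCECLOCALE setting.

    Assumed semantics (Python 3.7 documentation, not this revision):
    "0" disables coercion, "warn" coerces and warns, and any other
    value, including unset, leaves coercion enabled.
    """
    value = environ.get("PYTHONCOERCECLOCALE")
    if value == "0":
        return "off"
    if value == "warn":
        return "warn"
    return "on"
```

Under this reading the variable name keeps its positive sense: coercion is the default, and only an explicit `0` opts out.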

warsaw · 2017-03-14T20:43:08Z