New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-28180: Implementation for PEP 538 #659

Merged
merged 43 commits into from Jun 11, 2017

Conversation

Projects
None yet
7 participants
@ncoghlan
Contributor

ncoghlan commented Mar 13, 2017

Reference implementation for PEP 538, which is in turn a partial fix for bpo-28180.

This updates the CPython CLI to attempt to set LC_CTYPE to a suitable UTF-8
based locale before loading the runtime when it detects that it is running in
the C locale.

It also updates the CPython runtime to emit a compatibility warning on stderr
when running in the C locale.

Remaining work:

  • Fix the Windows compatibility issue reported by Appveyor
  • What's New entry
  • Add a note in the What's New entry regarding the current locale coercion warning that indicates we're actively seeking feedback on it during the 3.7 pre-release cycle. Our experience with the Fedora 26 backport so far has been that the warning is useful in spotting places where we should probably change the configured locale to C.UTF-8 (e.g. in RPM build environments), but I'm still starting to wonder if it might be better for us to disable it before the F26 final release.
  • NEWS entry

ncoghlan added some commits Mar 5, 2017

WIP: PEP 538 reference implementation
- new PYTHONCOERCECLOCALE config setting
- coerces legacy C locale to C.UTF-8, C.utf8 or UTF-8 by default

TODO:

- configure option to disable locale coercion at build time
- configure option to disable C locale warning at build time
- skip runtime locale warning on Mac OS X
Add C locale coercion and warning build flags
* --with(out)-c-locale-coercion for PY_COERCE_C_LOCALE
* --with(out)-c-locale-warning for PY_WARN_ON_C_LOCALE

@ncoghlan ncoghlan requested a review from warsaw Mar 13, 2017

/* Set PYTHONIOENCODING if not already set */
if (setenv("PYTHONIOENCODING", "utf-8:surrogateescape", 0)) {
fprintf(stderr,
"Error setting PYTHONIOENCODING during C locale coercion\n");

This comment has been minimized.

@methane

methane Mar 13, 2017

Member

This may break old Python 2 in subprocess.

$ cat t.py
# encoding: utf-8
print(u"こんにちは")

$ /usr/bin/python2.6 t.py
こんにちは

$ export PYTHONIOENCODING=utf-8:surrogateescape
$ /usr/bin/python2.6 t.py
TypeError: an integer is required

This comment has been minimized.

@methane

methane Mar 13, 2017

Member

I'm sorry, my environment was wrong.
I accidently test above in Lib/ directory in cpython's checkout.

This comment has been minimized.

@ncoghlan

ncoghlan Mar 14, 2017

Contributor

It still raises a good question though, as that setting does affect Python 2 differently from the way it affects Python 3 - it changes the implicit encoding step on stdout, but stdin still relies on passing the raw bytes through without interpretation:

$ LANG=C python2
Python 2.7.13 (default, Jan 12 2017, 17:59:37) 
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print(u"こんにちは")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
>>> print("こんにちは")
こんにちは
>>> 
$ PYTHONIOENCODING=utf-8:surrogateescape LANG=C python2
Python 2.7.13 (default, Jan 12 2017, 17:59:37) 
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print(u"こんにちは")
こんにちは
>>> print("こんにちは")
こんにちは
>>> 

It's also a potential problem that 'surrogateescape' doesn't exist in Python 2, so it may be better to just use Py_SetStandardStreamEncoding in PEP 538, and leave enabling surrogateescape in subprocesses as well to PEP 540 (via PYTHONUTF8=1 in the parent environment).

It also turns out that LANG=C python2 is an easy way to demonstrate that GNU readline just plain doesn't handle UTF-8 properly in the C locale - attempting to edit the print(u"こんにちは") line at the interactive prompt to remove the u prefix or add it back results in nonsense:

>>> print(u"こんにちは")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
>>> print(�こんにちは")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
>>> print("こんにちは")
こんにちは
>>> print(u��にちは") ")
こ�u��にちは
@warsaw

Leaving out for the moment whether this is a good idea or not (we'll use the PEP process for that), I have some questions about the code.

to skip coercing the legacy ASCII-based C locale to a more capable UTF-8
based alternative. Note that this setting is checked even when the
:option:`-E` or :option:`-I` options are used, as it is handled prior to
the processing of command line options.

This comment has been minimized.

@warsaw

warsaw Mar 14, 2017

Member

Am I reading this right? It seems odd that setting PYTHONCOERCECLOCALE would disable coercing the legacy locale. The sense is exactly opposite of what I'd expect.

Should the envar be named PYTHONNOCOERCECLOCALE?

Also, what if the environment variable is set to "0"? Given the above description, that should still skip coercion.

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

Oops, that's a holdover from when the setting was PYTHONALLOWCLOCALE - presumably I changed the title of the section, but then got distracted by something else before updating the body.

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

Fixed in 1c3a270

# Details of the CLI warning emitted at runtime
CLI_COERCION_WARNING_FMT = (
"Python detected LC_CTYPE=C: {} coerced to {} (set another locale "
"or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour)."

This comment has been minimized.

@warsaw

warsaw Mar 14, 2017

Member

Does the value of the envar matter or not? It just needs to be any string value, right?

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

Docs bug - the warning matches the behaviour, the docs don't.

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

Docs fixed in 1c3a270

"LC_CTYPE": "",
"LC_ALL": "",
}
for env_var in ("LC_ALL", "LC_CTYPE", "LANG"):

This comment has been minimized.

@warsaw

warsaw Mar 14, 2017

Member

Does the order matter? If not, what about using for env_var in base_var_dict:?

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

Done in d12b412

"LC_CTYPE": "",
"LC_ALL": "",
}
for env_var in ("LC_ALL", "LC_CTYPE", "LANG"):

This comment has been minimized.

@warsaw

warsaw Mar 14, 2017

Member

And here...

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

Done in d12b412

self._check_c_locale_coercion("utf-8", coerce_c_locale=None)
def test_PYTHONCOERCECLOCALE_not_zero(self):
# *Any* string other that "0" is considered "set" for our purposes

This comment has been minimized.

@warsaw

warsaw Mar 14, 2017

Member

Doesn't this contradict the documentation at the top of this PR?

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

Docs fixed in 1c3a270

os.chdir(basepath)
def tearDown(self):
os.chdir(self.oldcwd)

This comment has been minimized.

@warsaw

warsaw Mar 14, 2017

Member

What about using self.addCleanup(os.chdir, self.oldcwd) in the setUp() instead and getting rid of the tearDown()?

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

Good point - I stole the basic structure from test_capi (simplified a bit since this doesn't need to run on Windows), so it's missing some modern niceties. With addCleanup, we wouldn't even need self.oldcwd as a record, we can just add the cleanup operation before changing the directory:

self.addCleanup(os.chdir, os.getcwd())
os.chdir(basepath)

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

Done in d12b412

p = subprocess.Popen(cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
universal_newlines=True)

This comment has been minimized.

@warsaw

warsaw Mar 14, 2017

Member

Does it make any sense to run this without universal_newlines also? Since the locale changes should only affect text mode, that additional test would just ensure that nothing's changed when running subprocesses in bytes mode.

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

For this particular test, I don't think so, as it's just checking that the runtime warning gets emitted when we bypass the locale coercion in the standard Python CLI, rather than doing anything with the standard streams.

However, it does make sense for test_locale_coercion itself (which I should rename to test_c_locale_coercion): that should check the behaviour when the Python subprocess is run with the -u option for unbuffered binary streams.

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

So it turns out that even -u doesn't change anything of interest to this PEP any more, and there's just an error in the --help docs that makes it sound like it still does: http://bugs.python.org/issue28647

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

However, now that the PEP potentially also affects the way the standard streams are configured, I expanded on the test cases to also cover that: 4e6d502

@@ -15,6 +15,110 @@ wmain(int argc, wchar_t **argv)
}
#else
/* Helpers to better handle the legacy C locale

This comment has been minimized.

@warsaw

warsaw Mar 14, 2017

Member

There's been some discussion about the "legacy C locale" nomenclature. What about "bare C locale"?

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

Even glibc devs consider 7-bit ASCII a legacy encoding at this point, it's just an enormous ecosystem to try to migrate to a new default way of doing things.

My view would likely be different if we were planning to invest time and energy in getting Python 3 to "work properly" in the C locale, but the underlying assumptions are sufficiently incompatible that it isn't even clear what "working properly" would even mean. (PEP 540 gets closer than the status quo, but it's definition of "working properly" is "ignoring the declared locale encoding as almost certainly being wrong")

#ifdef PY_COERCE_C_LOCALE
static const char *_C_LOCALE_COERCION_WARNING =
"Python detected LC_CTYPE=C: %.20s coerced to %.20s (set another locale "
"or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).\n";

This comment has been minimized.

@warsaw

warsaw Mar 14, 2017

Member

This seems to come up in every project I work on. Do we use UK or US spelling? :)

I think in Python we use US spellings.

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

Yeah, I'll typically use my native spellings when writing PEPs and email, but anything in the actual code or docs should be using the US spelling.

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

Fixed in ccfc83f

* For consistency with the corresponding check in Programs/python.c
* we ignore the Python -E and -I flags here.
*/
if (coerce_c_locale == NULL || strncmp(coerce_c_locale, "0", 2) != 0) {

This comment has been minimized.

@warsaw

warsaw Mar 14, 2017

Member

You do this same check in two places. Would it make sense to refactor that into a call or macro so there's no chance in getting out of sync? (See my question at the top of the PR.)

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

An alternative would be to check a C global here, rather than checking the environment variable directly, and have a _Py_DisableCLocaleWarning() helper that the CLI calls when PYTHONCOERCECLOCALE=0 is set.

This comment has been minimized.

@ncoghlan

ncoghlan Mar 15, 2017

Contributor

The downside of that approach would be that it means embedding applications can't use PYTHONCOERCECLOCALE=0 to silence the warning without changing the configured locale, and I don't want to define a new public embedding configuration API just to avoid checking the environment variable twice.

So I'll unconditionally export a private helper function from pylifecycle.c (so the ABI doesn't depend on the configure flag setting), and re-use that from the CLI implementation.

ncoghlan added some commits Mar 15, 2017

Use Py_SetStandardStreamEncoding instead of PYTHONIOENCODING
- setting PYTHONIOENCODING has unintended side effects on Python 2
  instances run in a subprocess (since Python 2 has no
  `surrogateescape` error handler
- Py_SetStandardStreamEncoding enables surrogateescape for the
  current process without any side effects on subprocesses
Restore Windows _testembed compatibility
Windows doesn't use setenv to set environment variables,
so set PYTHONIOENCODING from test_capi instead of
_testembed when running the forced_io_encoding test.
Update to latest version of PEP 538
- move all required logic inside the shared library
- explicitly setting one of the coercion target locales
  now also automatically enables "surrogateescape" on
  sys.stdin and sys.stdout
Change locale coercion to always respect LC_ALL
Locale coercion no longer has any effect if LC_ALL is
explicitly set in the environment.

When locale coercion triggers, it sets either both
LC_CTYPE & LANG (for full locales) or only LC_CTYPE
(for partial locales).

This change also eliminated the need for a custom
test case for the locale coercion warning - instead,
the test suite is able to check for that just by
setting LC_ALL in the child process environment.
@ncoghlan

This comment has been minimized.

Contributor

ncoghlan commented May 28, 2017

@zooba @brettcannon Would one of you have some time to look at the Windows failure here? I suspect what's happened is that I've changed some code that I thought was *nix specific, and it needs a preprocessor guard to get it to retain the old behaviour on Windows.

* ``C.UTF-8`` (``LC_ALL``)
* ``C.utf8`` (``LC_ALL``)
* ``UTF-8`` (``LC_CTYPE``)

This comment has been minimized.

@methane

methane May 29, 2017

Member

These LC_ALL should be updated.

@zooba

This comment has been minimized.

Member

zooba commented May 31, 2017

@ncoghlan Is there some state involved in checking/configuring the locale override that must be run before Py_Initialize? On Windows, main.c is exactly one line long and you haven't updated it to call the new functions, so it may be related to that. I don't see any changes to common code that could be a problem, but maybe I missed it.

As an aside, I'm somewhat uncomfortable about the "let's ignore -E -S" parts, but if you're sure it's not going to enable loading arbitrary code (via an encoding) then I won't push back too hard.

if (ctype_loc != NULL) {
/* "surrogateescape" is the default in the legacy C locale */
if (strcmp(ctype_loc, "C") == 0) {
return "surrogateescape";

This comment has been minimized.

@ncoghlan

ncoghlan Jun 2, 2017

Contributor

@zooba This is the bit that I suspect may be unintentionally changing the interpreter startup behaviour on Windows and thus causing the Appveyor test failures: there's currently nothing restricting this particular code path to only run on non-Windows systems, so I suspect it may be triggering and activating surrogateescape by default on the Windows standard streams, even though Windows uses its own code page system, rather than the POSIX locale system. If I'm right, then the entire body of this function should be inside some kind of #ifdef, such that the Windows case degenerates to return NULL;

This comment has been minimized.

@zooba

zooba Jun 2, 2017

Member

Possibly. The only thing that is available for coercion on Windows is locale.getpreferredencoding(), and there are no results from setlocale() that you can reasonably interpret as "wrong".

When/if the "force UTF-8" option comes along, that should apply to Windows, but I don't think we need to determine the error handler here at all for Windows.

This comment has been minimized.

@ncoghlan

ncoghlan Jun 3, 2017

Contributor

Comparing this to the old code (as a result of the test_sys failure), it seems we were setting surrogateescape by default in the C locale, even on Windows. So I've had a second go at fixing this by keeping that, and only skipping the new checks for the coercion target locales.

This comment has been minimized.

@ncoghlan

ncoghlan Jun 4, 2017

Contributor

It turned out this wasn't the problem, but rather the earlier call to setlocale(LC_CTYPE, ""), where I had removed a HAVE_SETLOCALE guard as being redundant in 2017. It turns out that even though Windows does have setlocale available:

  1. Passing an empty string doesn't work the same way it does in glibc (presumably it syncs the active C locale with the active Windows code page rather than looking at environment variables)
  2. The Windows builds don't define HAVE_SETLOCALE, so that guard was previously serving as a cryptic #ifndef MS_WINDOWS check

So rather than restoring the original guard, I replaced it with an explicit #ifndef MS_WINDOWS check instead.

* to give end users a way to force even scripts that are otherwise
* isolated from their environment to use the legacy ASCII-centric C
* locale.
*/

This comment has been minimized.

@ncoghlan

ncoghlan Jun 2, 2017

Contributor

Per @zooba: I need to explain the (lack of) security implications here, since:

  • the envvar can only be used to turn off locale coercion, it doesn't do anything else (i.e. it's a boolean toggle, not input data)
  • if an attacker is in a position to corrupt the behaviour of the ascii encoding, they're likely to be able to attack the utf-8 encoding just as effectively

This comment has been minimized.

@zooba

zooba Jun 2, 2017

Member

👍

ncoghlan added some commits Jun 3, 2017

Address CI failure and review comments
- avoid unintended side effects on Windows behaviour
- remove a single-use function that made the code harder to follow
- clarify the security considerations around ignoring -E and -I
  when checking PYTHONCOERCECLOCALE
OK, two-use function :)
The inline check for "Is this env var exactly zero?" is still more
self-explanatory than factoring out the helper function.
New theory regarding the Windows problem
A HAVE_SETLOCALE guard was removed when adding a check for __ANDROID__,
and that may be affecting the default locale reported on Windows.
@ncoghlan

This comment has been minimized.

Contributor

ncoghlan commented Jun 4, 2017

Huzzah, finally sorted out all the subtle inconsistencies with the behaviour on Windows :)

@methane, @zooba, @warsaw Do any of you want to take a further look at the code changes while I finish up with the What's New entry and adding some caveats around the current locale coercion warning?

@ncoghlan

This comment has been minimized.

Contributor

ncoghlan commented Jun 4, 2017

The What's New entry has been added, so I consider this PR to be feature complete now - to reduce the chance of conflicts, I won't add the NEWS entry until we're just about to merge it.

@ncoghlan

This comment has been minimized.

Contributor

ncoghlan commented Jun 11, 2017

AppVeyor seems to be MIA, but the last Windows run passed, and I've only made docs and metadata changes since then, so I'll be going ahead with the merge and marking the PEP as Final once the Travis run completes.

@ncoghlan ncoghlan merged commit 6ea4186 into python:master Jun 11, 2017

2 checks passed

bedevere/issue-number Issue number 28180 found.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@@ -70,7 +103,11 @@ main(int argc, char **argv)
/* Force again malloc() allocator to release memory blocks allocated
before Py_Main() */
#ifdef Py_DEBUG

This comment has been minimized.

@ronaldoussoren

ronaldoussoren Jun 11, 2017

Contributor

This appears to be an unrelated change.

This comment has been minimized.

@ncoghlan

ncoghlan Jun 12, 2017

Contributor

You're right that it's technically unrelated now, but an earlier iteration of the patch failed in debug mode without it - that iteration was using Py_SetStandardStreamEncoding, and when you do that, you allocate some memory that gets freed during interpreter startup after the default allocator has been configured. Without this change, that free attempt failed in debug mode due to the allocator mismatch.

So these guards mean that the "safe to call before Py_Initialize" APIs actually are safe to call from the CLI entry point.

mlouielu added a commit to mlouielu/cpython that referenced this pull request Jun 15, 2017

bpo-28180: Implementation for PEP 538 (python#659)
- new PYTHONCOERCECLOCALE config setting
- coerces legacy C locale to C.UTF-8, C.utf8 or UTF-8 by default
- always uses C.UTF-8 on Android
- uses `surrogateescape` on stdin and stdout in the coercion
  target locales
- configure option to disable locale coercion at build time
- configure option to disable C locale warning at build time
@teufelhunde3119

This comment has been minimized.

teufelhunde3119 commented on .gitignore in 6a00ce6 Dec 1, 2017

do you read

@ncoghlan ncoghlan deleted the ncoghlan:pep538-coerce-c-locale branch Mar 30, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment