Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-28180: Implementation for PEP 538 #659

Merged
merged 43 commits into from
Jun 11, 2017

Conversation

ncoghlan
Copy link
Contributor

@ncoghlan ncoghlan commented Mar 13, 2017

Reference implementation for PEP 538, which is in turn a partial fix for bpo-28180.

This updates the CPython CLI to attempt to set LC_CTYPE to a suitable UTF-8
based locale before loading the runtime when it detects that it is running in
the C locale.

It also updates the CPython runtime to emit a compatibility warning on stderr
when running in the C locale.

Remaining work:

  • Fix the Windows compatibility issue reported by Appveyor
  • What's New entry
  • Add a note in the What's New entry regarding the current locale coercion warning that indicates we're actively seeking feedback on it during the 3.7 pre-release cycle. Our experience with the Fedora 26 backport so far has been that the warning is useful in spotting places where we should probably change the configured locale to C.UTF-8 (e.g. in RPM build environments), but I'm still starting to wonder if it might be better for us to disable it before the F26 final release.
  • NEWS entry

- new PYTHONCOERCECLOCALE config setting
- coerces legacy C locale to C.UTF-8, C.utf8 or UTF-8 by default

TODO:

- configure option to disable locale coercion at build time
- configure option to disable C locale warning at build time
- skip runtime locale warning on Mac OS X
* --with(out)-c-locale-coercion for PY_COERCE_C_LOCALE
* --with(out)-c-locale-warning for PY_WARN_ON_C_LOCALE
/* Set PYTHONIOENCODING if not already set */
if (setenv("PYTHONIOENCODING", "utf-8:surrogateescape", 0)) {
fprintf(stderr,
"Error setting PYTHONIOENCODING during C locale coercion\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may break old Python 2 in subprocess.

$ cat t.py
# encoding: utf-8
print(u"こんにちは")

$ /usr/bin/python2.6 t.py
こんにちは

$ export PYTHONIOENCODING=utf-8:surrogateescape
$ /usr/bin/python2.6 t.py
TypeError: an integer is required

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, my environment was wrong.
I accidently test above in Lib/ directory in cpython's checkout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still raises a good question though, as that setting does affect Python 2 differently from the way it affects Python 3 - it changes the implicit encoding step on stdout, but stdin still relies on passing the raw bytes through without interpretation:

$ LANG=C python2
Python 2.7.13 (default, Jan 12 2017, 17:59:37) 
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print(u"こんにちは")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
>>> print("こんにちは")
こんにちは
>>> 
$ PYTHONIOENCODING=utf-8:surrogateescape LANG=C python2
Python 2.7.13 (default, Jan 12 2017, 17:59:37) 
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print(u"こんにちは")
こんにちは
>>> print("こんにちは")
こんにちは
>>> 

It's also a potential problem that 'surrogateescape' doesn't exist in Python 2, so it may be better to just use Py_SetStandardStreamEncoding in PEP 538, and leave enabling surrogateescape in subprocesses as well to PEP 540 (via PYTHONUTF8=1 in the parent environment).

It also turns out that LANG=C python2 is an easy way to demonstrate that GNU readline just plain doesn't handle UTF-8 properly in the C locale - attempting to edit the print(u"こんにちは") line at the interactive prompt to remove the u prefix or add it back results in nonsense:

>>> print(u"こんにちは")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
>>> print(�こんにちは")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
>>> print("こんにちは")
こんにちは
>>> print(u��にちは") ")
こ�u��にちは

Copy link
Member

@warsaw warsaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving out for the moment whether this is a good idea or not (we'll use the PEP process for that), I have some questions about the code.

to skip coercing the legacy ASCII-based C locale to a more capable UTF-8
based alternative. Note that this setting is checked even when the
:option:`-E` or :option:`-I` options are used, as it is handled prior to
the processing of command line options.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I reading this right? It seems odd that setting PYTHONCOERCECLOCALE would disable coercing the legacy locale. The sense is exactly opposite of what I'd expect.

Should the envar be named PYTHONNOCOERCECLOCALE?

Also, what if the environment variable is set to "0"? Given the above description, that should still skip coercion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, that's a holdover from when the setting was PYTHONALLOWCLOCALE - presumably I changed the title of the section, but then got distracted by something else before updating the body.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1c3a270

# Details of the CLI warning emitted at runtime
CLI_COERCION_WARNING_FMT = (
"Python detected LC_CTYPE=C: {} coerced to {} (set another locale "
"or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour)."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the value of the envar matter or not? It just needs to be any string value, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs bug - the warning matches the behaviour, the docs don't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs fixed in 1c3a270

"LC_CTYPE": "",
"LC_ALL": "",
}
for env_var in ("LC_ALL", "LC_CTYPE", "LANG"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the order matter? If not, what about using for env_var in base_var_dict:?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in d12b412

"LC_CTYPE": "",
"LC_ALL": "",
}
for env_var in ("LC_ALL", "LC_CTYPE", "LANG"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in d12b412

self._check_c_locale_coercion("utf-8", coerce_c_locale=None)

def test_PYTHONCOERCECLOCALE_not_zero(self):
# *Any* string other that "0" is considered "set" for our purposes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this contradict the documentation at the top of this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs fixed in 1c3a270

os.chdir(basepath)

def tearDown(self):
os.chdir(self.oldcwd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about using self.addCleanup(os.chdir, self.oldcwd) in the setUp() instead and getting rid of the tearDown()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - I stole the basic structure from test_capi (simplified a bit since this doesn't need to run on Windows), so it's missing some modern niceties. With addCleanup, we wouldn't even need self.oldcwd as a record, we can just add the cleanup operation before changing the directory:

self.addCleanup(os.chdir, os.getcwd())
os.chdir(basepath)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in d12b412

p = subprocess.Popen(cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
universal_newlines=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make any sense to run this without universal_newlines also? Since the locale changes should only affect text mode, that additional test would just ensure that nothing's changed when running subprocesses in bytes mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this particular test, I don't think so, as it's just checking that the runtime warning gets emitted when we bypass the locale coercion in the standard Python CLI, rather than doing anything with the standard streams.

However, it does make sense for test_locale_coercion itself (which I should rename to test_c_locale_coercion): that should check the behaviour when the Python subprocess is run with the -u option for unbuffered binary streams.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it turns out that even -u doesn't change anything of interest to this PEP any more, and there's just an error in the --help docs that makes it sound like it still does: http://bugs.python.org/issue28647

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, now that the PEP potentially also affects the way the standard streams are configured, I expanded on the test cases to also cover that: 4e6d502

@@ -15,6 +15,110 @@ wmain(int argc, wchar_t **argv)
}
#else

/* Helpers to better handle the legacy C locale
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's been some discussion about the "legacy C locale" nomenclature. What about "bare C locale"?

Copy link
Contributor Author

@ncoghlan ncoghlan Mar 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even glibc devs consider 7-bit ASCII a legacy encoding at this point, it's just an enormous ecosystem to try to migrate to a new default way of doing things.

My view would likely be different if we were planning to invest time and energy in getting Python 3 to "work properly" in the C locale, but the underlying assumptions are sufficiently incompatible that it isn't even clear what "working properly" would even mean. (PEP 540 gets closer than the status quo, but it's definition of "working properly" is "ignoring the declared locale encoding as almost certainly being wrong")

#ifdef PY_COERCE_C_LOCALE
static const char *_C_LOCALE_COERCION_WARNING =
"Python detected LC_CTYPE=C: %.20s coerced to %.20s (set another locale "
"or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to come up in every project I work on. Do we use UK or US spelling? :)

I think in Python we use US spellings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll typically use my native spellings when writing PEPs and email, but anything in the actual code or docs should be using the US spelling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ccfc83f

* For consistency with the corresponding check in Programs/python.c
* we ignore the Python -E and -I flags here.
*/
if (coerce_c_locale == NULL || strncmp(coerce_c_locale, "0", 2) != 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do this same check in two places. Would it make sense to refactor that into a call or macro so there's no chance in getting out of sync? (See my question at the top of the PR.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative would be to check a C global here, rather than checking the environment variable directly, and have a _Py_DisableCLocaleWarning() helper that the CLI calls when PYTHONCOERCECLOCALE=0 is set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside of that approach would be that it means embedding applications can't use PYTHONCOERCECLOCALE=0 to silence the warning without changing the configured locale, and I don't want to define a new public embedding configuration API just to avoid checking the environment variable twice.

So I'll unconditionally export a private helper function from pylifecycle.c (so the ABI doesn't depend on the configure flag setting), and re-use that from the CLI implementation.

- setting PYTHONIOENCODING has unintended side effects on Python 2
  instances run in a subprocess (since Python 2 has no
  `surrogateescape` error handler
- Py_SetStandardStreamEncoding enables surrogateescape for the
  current process without any side effects on subprocesses
Windows doesn't use setenv to set environment variables,
so set PYTHONIOENCODING from test_capi instead of
_testembed when running the forced_io_encoding test.
- move all required logic inside the shared library
- explicitly setting one of the coercion target locales
  now also automatically enables "surrogateescape" on
  sys.stdin and sys.stdout
Locale coercion no longer has any effect if LC_ALL is
explicitly set in the environment.

When locale coercion triggers, it sets either both
LC_CTYPE & LANG (for full locales) or only LC_CTYPE
(for partial locales).

This change also eliminated the need for a custom
test case for the locale coercion warning - instead,
the test suite is able to check for that just by
setting LC_ALL in the child process environment.
@ncoghlan ncoghlan changed the title bpo-28180: Reference implementation for PEP 538 bpo-28180: Implementation for PEP 538 May 28, 2017
@ncoghlan
Copy link
Contributor Author

@zooba @brettcannon Would one of you have some time to look at the Windows failure here? I suspect what's happened is that I've changed some code that I thought was *nix specific, and it needs a preprocessor guard to get it to retain the old behaviour on Windows.


* ``C.UTF-8`` (``LC_ALL``)
* ``C.utf8`` (``LC_ALL``)
* ``UTF-8`` (``LC_CTYPE``)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These LC_ALL should be updated.

@zooba
Copy link
Member

zooba commented May 31, 2017

@ncoghlan Is there some state involved in checking/configuring the locale override that must be run before Py_Initialize? On Windows, main.c is exactly one line long and you haven't updated it to call the new functions, so it may be related to that. I don't see any changes to common code that could be a problem, but maybe I missed it.

As an aside, I'm somewhat uncomfortable about the "let's ignore -E -S" parts, but if you're sure it's not going to enable loading arbitrary code (via an encoding) then I won't push back too hard.

if (ctype_loc != NULL) {
/* "surrogateescape" is the default in the legacy C locale */
if (strcmp(ctype_loc, "C") == 0) {
return "surrogateescape";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zooba This is the bit that I suspect may be unintentionally changing the interpreter startup behaviour on Windows and thus causing the Appveyor test failures: there's currently nothing restricting this particular code path to only run on non-Windows systems, so I suspect it may be triggering and activating surrogateescape by default on the Windows standard streams, even though Windows uses its own code page system, rather than the POSIX locale system. If I'm right, then the entire body of this function should be inside some kind of #ifdef, such that the Windows case degenerates to return NULL;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly. The only thing that is available for coercion on Windows is locale.getpreferredencoding(), and there are no results from setlocale() that you can reasonably interpret as "wrong".

When/if the "force UTF-8" option comes along, that should apply to Windows, but I don't think we need to determine the error handler here at all for Windows.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing this to the old code (as a result of the test_sys failure), it seems we were setting surrogateescape by default in the C locale, even on Windows. So I've had a second go at fixing this by keeping that, and only skipping the new checks for the coercion target locales.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turned out this wasn't the problem, but rather the earlier call to setlocale(LC_CTYPE, ""), where I had removed a HAVE_SETLOCALE guard as being redundant in 2017. It turns out that even though Windows does have setlocale available:

  1. Passing an empty string doesn't work the same way it does in glibc (presumably it syncs the active C locale with the active Windows code page rather than looking at environment variables)
  2. The Windows builds don't define HAVE_SETLOCALE, so that guard was previously serving as a cryptic #ifndef MS_WINDOWS check

So rather than restoring the original guard, I replaced it with an explicit #ifndef MS_WINDOWS check instead.

* to give end users a way to force even scripts that are otherwise
* isolated from their environment to use the legacy ASCII-centric C
* locale.
*/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per @zooba: I need to explain the (lack of) security implications here, since:

  • the envvar can only be used to turn off locale coercion, it doesn't do anything else (i.e. it's a boolean toggle, not input data)
  • if an attacker is in a position to corrupt the behaviour of the ascii encoding, they're likely to be able to attack the utf-8 encoding just as effectively

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

- avoid unintended side effects on Windows behaviour
- remove a single-use function that made the code harder to follow
- clarify the security considerations around ignoring -E and -I
  when checking PYTHONCOERCECLOCALE
The inline check for "Is this env var exactly zero?" is still more
self-explanatory than factoring out the helper function.
A HAVE_SETLOCALE guard was removed when adding a check for __ANDROID__,
and that may be affecting the default locale reported on Windows.
@ncoghlan
Copy link
Contributor Author

ncoghlan commented Jun 4, 2017

Huzzah, finally sorted out all the subtle inconsistencies with the behaviour on Windows :)

@methane, @zooba, @warsaw Do any of you want to take a further look at the code changes while I finish up with the What's New entry and adding some caveats around the current locale coercion warning?

@ncoghlan
Copy link
Contributor Author

ncoghlan commented Jun 4, 2017

The What's New entry has been added, so I consider this PR to be feature complete now - to reduce the chance of conflicts, I won't add the NEWS entry until we're just about to merge it.

@ncoghlan
Copy link
Contributor Author

AppVeyor seems to be MIA, but the last Windows run passed, and I've only made docs and metadata changes since then, so I'll be going ahead with the merge and marking the PEP as Final once the Travis run completes.

@ncoghlan ncoghlan merged commit 6ea4186 into python:master Jun 11, 2017
@@ -70,7 +103,11 @@ main(int argc, char **argv)

/* Force again malloc() allocator to release memory blocks allocated
before Py_Main() */
#ifdef Py_DEBUG
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be an unrelated change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that it's technically unrelated now, but an earlier iteration of the patch failed in debug mode without it - that iteration was using Py_SetStandardStreamEncoding, and when you do that, you allocate some memory that gets freed during interpreter startup after the default allocator has been configured. Without this change, that free attempt failed in debug mode due to the allocator mismatch.

So these guards mean that the "safe to call before Py_Initialize" APIs actually are safe to call from the CLI entry point.

@ncoghlan ncoghlan deleted the pep538-coerce-c-locale branch March 30, 2018 07:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants