Implementation of the PEP 538: coerce C locale to C.utf-8 #72367
Comments
Working with Docker I often end up with an environment where the locale isn't correctly set. In these cases it would be great if sys.getfilesystemencoding() could default to 'utf-8' instead of 'ascii', as it's the encoding of the future and ascii is a subset of it anyway. Related: http://bugs.python.org/issue19846 |
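A minimal sketch for checking which encodings an interpreter actually picked up inside a container. The `report` dictionary and its keys are illustrative names, not part of any API; the three calls are standard library functions.

```python
import locale
import sys

# Collect the encodings Python derived from the current locale settings.
report = {
    # Used for file names, command line arguments and environment variables.
    "filesystem": sys.getfilesystemencoding(),
    # What the C locale machinery reports as the preferred text encoding.
    "preferred": locale.getpreferredencoding(False),
    # Encoding of the standard output stream (may differ from the above).
    "stdout": getattr(sys.stdout, "encoding", None),
}

for name, value in report.items():
    print(f"{name}: {value}")
```

Running this once with the container's default environment and once with `LANG=C.UTF-8` set shows whether the locale is what drives the reported values on a given Python version.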
This is a duplicate of bpo-27781. |
Unfortunately no, as this would mean I'd have to change all the python invocations in my scripts, and it wouldn't work for executable files with #!/usr/bin/env python3, would it? |
I thought we "fixed" this by using surrogate escape when the locale was ASCII? We have certainly discussed changing the default on posix and so far have decided not to (someday that will change... is this someday already?) |
Not yet :-) |
Why not? |
I want a locale-free Python that behaves as if it were running in the C.UTF-8 locale. But Python 3.6 is already in feature freeze >_<;; |
I think we're genuinely getting to the point now where the majority of "LANG=C" cases are misconfigurations rather than intended behaviour. We're also to the point where:
So I think for Python 3.7 it makes sense to do the following on other *nix systems:
I do think we actually want to *change* the C level locale in the process though, as otherwise we can expect to see weird interactions where CPython and extension modules disagree about the default text encoding. |
Note also that if we say we're going to do this for 3.7, *and* go ahead and implement it, then distros may be more inclined to incorporate the same behavioural changes into distro-provided releases of 3.6, providing real world testing of the concept before we make it the default behaviour. |
Actually in a new Docker container, the LANG variable isn't set at all. Defaulting to UTF-8 in that case should be easier to reason about, shouldn't it? |
From CPython's point of view, glibc behaves the same way (i.e. reporting |
https://sourceware.org/glibc/wiki/Proposals/C.UTF-8#Defaults mentions that C.UTF-8 should be glibc's default. This bug report also mentions Python: https://sourceware.org/bugzilla/show_bug.cgi?id=17318 |
If we just restrict this to the file system encoding (and not the whole LANG setting), how about:
(*) I believe we discussed this at some point already, but don't remember the outcome. Regarding the questions of defaulting to LANG=C.UTF-8: I think this needs some more thought, since it would also affect many C locale aware functions. To make this work, Python would have to call setlocale() early on in the startup phase to adjust the C lib accordingly. |
Sorry for the confusion. I meant using UTF-8 as the default fsencoding and stdio encoding regardless of locale, |
The challenge that arises in being selective about this is that "sys.getfilesystemencoding()" is actually a misnomer, and some of the things we use it for (like decoding command line arguments and environment variables) necessarily happen *really* early in the interpreter bootstrapping process. The bugs that arise from being internally inconsistent are then even harder to debug than those that arise from believing the OS when it says the right encoding to use is ASCII - the latter at least don't tend to be subtle, and are amenable to being resolved via "LC_ALL=C.UTF-8" and "LANG=C.UTF-8". I believe Victor put quite a bit of time into trying to get more selective approaches to work reliably and eventually gave up. For Fedora 26, I'm going to explore the feasibility of patching our system 3.6 installation such that the python3 command itself (rather than the shared library) checks for "LC_CTYPE=C" as almost the first thing it does, and forcibly sets LANG and LC_ALL to C.UTF-8 if it gets an answer it doesn't like. If we're able to do that successfully in the more constrained environment of a specific recent Fedora release, then I think it will bode well for doing something similar by default in CPython 3.7 |
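The check described above can be modelled roughly as follows. This is only a sketch of the environment-level decision: `coerce_c_locale` is a hypothetical helper, the actual patch is C code running at interpreter startup, and the precedence rule (LC_ALL, then LC_CTYPE, then LANG, with empty values treated as unset) is the usual POSIX one.

```python
def coerce_c_locale(env):
    """Return a copy of env with C.UTF-8 forced when the effective LC_CTYPE is C.

    Hypothetical helper: models the decision only, it does not call setlocale().
    """
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    # Empty strings count as unset; with nothing set the locale is "C".
    effective = env.get("LC_ALL") or env.get("LC_CTYPE") or env.get("LANG") or "C"
    coerced = dict(env)
    if effective in ("C", "POSIX"):
        # Forcibly override both variables, as the comment above describes.
        coerced["LANG"] = "C.UTF-8"
        coerced["LC_ALL"] = "C.UTF-8"
    return coerced


print(coerce_c_locale({}))                       # nothing set: coerced
print(coerce_c_locale({"LANG": "en_US.UTF-8"}))  # already UTF-8: untouched
```

A pure function over a dictionary keeps the sketch testable without touching the real process environment.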
Downstream Fedora issue proposing the above idea for F26: https://bugzilla.redhat.com/show_bug.cgi?id=1404918 I've also attached the patch from that issue here. |
Victor>> I proposed to add "-X utf8" command line option for UNIX to force utf8 encoding. Would it work for you?
Jan Niklas Hasse> Unfortunately no, as this would mean I'll have to change all my python invocations in my scripts and it wouldn't work for executable files with "#!/usr/bin/env python3" would it?
Usually, when a new option is added to Python, we add a command line option (-X utf8) but also an environment variable: I propose PYTHONUTF8=1. Use your favorite method to define the env var "system wide" in your docker containers.
Note: Technically, I'm not sure that it's possible to support -E option with PYTHONUTF8, since -E comes from the command line, and we first need to decode command line arguments with an encoding to parse these options.... Chicken-and-egg issue ;-) |
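A small sketch of how the environment variable route looks from a calling script. It assumes a Python 3.7 or later interpreter, where `PYTHONUTF8` and `sys.flags.utf8_mode` exist; at the time of the comment above this was still only a proposal. `child_utf8_mode` is a hypothetical helper name.

```python
import os
import subprocess
import sys


def child_utf8_mode(extra_env):
    """Start a child interpreter with extra_env applied and report its UTF-8 mode flag."""
    env = dict(os.environ, **extra_env)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
        text=True,
    )
    return result.stdout.strip()


# No command line change is needed: the variable is inherited by the child,
# so it also applies to scripts started via "#!/usr/bin/env python3".
print(child_utf8_mode({"PYTHONUTF8": "1"}))
```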
Yeah, it just doesn't work to use more than one encoding per process. You should use the same encoding for the whole lifetime of a process. If you decode early data from an encoding A and later encode it back to encoding B, you get mojibake. The problem is simple. Using more than one encoding per process means starting to make assumptions on how data is used. For example, consider that environment variables use the encoding A, but filenames should use the encoding B. But what if an environment variable contains a filename? Similar issues arise for command line arguments, subprocess pipes, standard streams (sys.std*), etc. |
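A minimal illustration of the mojibake described above, assuming the data was written as UTF-8 and read back as Latin-1 (both codecs are chosen only for the example):

```python
# One component encodes the text as UTF-8 ...
original = "café"
raw = original.encode("utf-8")

# ... and another component decodes the same bytes assuming Latin-1.
garbled = raw.decode("latin-1")
print(garbled)  # cafÃ©

# Re-encoding with the same wrong codec still recovers the original bytes,
# but switching codecs again (writing the garbled text as UTF-8) bakes the
# damage into the output.
assert garbled.encode("latin-1") == raw
double_encoded = garbled.encode("utf-8")
assert double_encoded != raw
```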
We've been discussing this further downstream in the Fedora Python SIG, and we have a draft approach that we're pretty sure will work for us (based in turn on the approach Armin Ronacher came up with for click), and we think it should work for other distros as well (as long as they already ship the C.UTF-8 locale, and if they don't, they should fix that limitation anyway). So I'm assigning this to myself as I think the next step will be to write a PEP that both proposes the specific idea as the default behaviour in 3.7, and also encourages distros to opt-in to trialling it as a downstream patch for 3.6. |
Making an explicit note of this so I remember to mention it in the draft PEP: one of the biggest problems that arises in any attempt at a Python-only solution to overriding the locale is that we can end up disagreeing with C/C++ extensions, and this is *especially* a problem when sharing a process with GUI frameworks like Tcl/Tk, Qt, and GTK (since they tend to read the process-wide settings, rather than querying anything that CPython configures during normal operation). So the approach I'm proposing is to implement a C->C.UTF-8 locale override in the *actual python CLI executable*, and then in the dynamically linked library we only emit a warning if we detect the C locale, we don't actually do anything to change it. |
On 17.12.2016 08:56, Nick Coghlan wrote:
Another use case to consider is embedding the Python |
On 17 December 2016 at 20:15, Marc-Andre Lemburg <report@bugs.python.org>
Aye, that's the origin of the split proposal to only emit a warning in the The hard part of writing the PEP isn't really going to be explaining the |
This doesn't help me, as I already set LANG to C.utf-8. I'm rather thinking about new people trying out Python in Docker who don't know about this. Furthermore, I think that UTF-8 is the future and the use of ASCII should be discouraged. |
It seems like this change:
def test_forced_io_encoding(self):
# Checks forced configuration of embedded interpreter IO streams
- out, err = self.run_embedded_interpreter("forced_io_encoding")
- if support.verbose:
+ env = {"PYTHONIOENCODING": "utf-8:surrogateescape"}
+ out, err = self.run_embedded_interpreter("forced_io_encoding", env=env)
(...)
caused a failure on the "shared" buildbot (./configure --enable-shared): http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/877/steps/test/logs/stdio
======================================================================
Traceback (most recent call last):
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_capi.py", line 484, in test_forced_io_encoding
out, err = self.run_embedded_interpreter("forced_io_encoding", env=env)
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_capi.py", line 392, in run_embedded_interpreter
(p.returncode, err))
AssertionError: 127 != 0 : bad returncode 127, stderr is '/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Programs/_testembed: error while loading shared libraries: libpython3.7dm.so.1.0: cannot open shared object file: No such file or directory\n' |
I've added dependencies for PEP-538 induced testing problems that have been broken out into their own issues. I've also merged my attempt at fixing the tests on Mac OS X. Something that's included in that patch is an implicit skip of the "LANG=UTF-8" case when checking external locale configuration. I expected that to behave the same way as "LC_CTYPE=UTF-8", but instead it's behaving more like "LC_CTYPE=C". |
Ah, I finally understand Victor's comment on my initial attempt at fixing the tests on Mac OS X - the standard streams *don't* use the filesystem encoding, so they default to ASCII in the C locale, even on Mac OS X. |
Given that bpo-32002 and bpo-30672 track the known challenges in testing the expected locale coercion behaviour reliably, I'm going to go ahead and close this overall implementation issue (the feature is there, and works in a way we're happy with, we're just encountering some challenges clearly expressing those expectations as a regression test). |
I have tried to port this patch to Python 3.4 (still maintained by SUSE on SLE-12), but I am having the hardest time debugging it. All affected tests end with errors like this:
[ 493s] ======================================================================
[ 493s] Traceback (most recent call last):
[ 493s] File "/home/abuild/rpmbuild/BUILD/Python-3.4.10/Lib/test/test_c_locale_coercion.py", line 326, in _check_c_locale_coercion
[ 493s] coercion_expected)
[ 493s] File "/home/abuild/rpmbuild/BUILD/Python-3.4.10/Lib/test/test_c_locale_coercion.py", line 219, in _check_child_encoding_details
[ 493s] self.assertEqual(encoding_details, expected_details)
[ 493s] AssertionError: {'fse[79 chars]cii:strict', 'stderr_info': 'ascii:backslashre[45 chars]ict'} != {'fse[79 chars]cii:surrogateescape', 'stderr_info': 'ascii:ba[63 chars]ape'}
[ 493s] {'fsencoding': 'ascii',
[ 493s] 'lang': '',
[ 493s] 'lc_all': '',
[ 493s] 'lc_ctype': 'invalid.ascii',
[ 493s] 'stderr_info': 'ascii:backslashreplace',
[ 493s] - 'stdin_info': 'ascii:strict',
[ 493s] ? ^^ ^
[ 493s]
[ 493s] + 'stdin_info': 'ascii:surrogateescape',
[ 493s] ? ++++++ ^^^ ^^^
[ 493s]
[ 493s] - 'stdout_info': 'ascii:strict'}
[ 493s] ? ^^ ^
[ 493s]
[ 493s] + 'stdout_info': 'ascii:surrogateescape'}
[ 493s] ? ++++++ ^^^ ^^^
Yes, it is always a conflict between strict and surrogateescape. I probably don’t have time to finish debugging this, so I am just leaving this for posterity. |
Python 3.4 is no longer supported upstream. Python 3 got tons of Unicode fixes between Python 3.4 and Python 3.8. |
Of course, I know that, but I just didn’t want to throw all my effort away, when I spent some hours on making it. And I guess, there may be somebody else who cares for 3.4 (ehm, RHEL-7 has 3.3, doesn’t it?). |
The test cases for locale coercion *not* triggering still assume that bpo-19977, using surrogateescape on the standard streams in the POSIX locale, has been implemented (since that was implemented in Python 3.5). Hence the various test cases complaining that they found "ascii:strict" (Py 3.4 behaviour without bpo-19977) where they expected "ascii:surrogateescape" (the Py 3.5+ behaviour *with* bpo-19977). To get a PEP-538 backport to work as intended on 3.4, you'd need to backport that earlier IO stream error handling change as well. |
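The difference between the two error handlers named in those test failures can be seen directly on a byte string. This is a minimal sketch using the standard codecs only; the sample bytes are arbitrary UTF-8 data arriving on a stream configured as ASCII.

```python
data = b"caf\xc3\xa9"  # UTF-8 bytes, but the stream believes it is ASCII

# "ascii:strict": decoding fails on the first non-ASCII byte.
try:
    data.decode("ascii", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "ascii:surrogateescape": undecodable bytes become lone surrogates
# (0xC3 -> U+DCC3, 0xA9 -> U+DCA9) and are restored on encoding.
text = data.decode("ascii", errors="surrogateescape")
round_tripped = text.encode("ascii", errors="surrogateescape")

print(strict_failed, ascii(text), round_tripped == data)
```

With surrogateescape the bytes survive a decode and encode round trip unchanged, which is why the later test expectations depend on that handler being in place.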
Not sure whether this is the appropriate place to mention/ask... but it seems that with this change it's impossible to get the original environment Python was invoked with, or is there a way? |
You can set |
Yes, but that just shifts the problem. Then one wouldn't know whether |
What is the real use case for it? If it is really your problem, you can disable the coercion by default by building your Python with |
Take any program that shall be written in Python and which works just as a wrapper around some other program, with the specific intention that the environment is passed on exactly to that program. Just as an example, a wrapper-program that adds numbers to lines... or maybe counts characters in the stdout of the program. The program the wrapper calls may even be another Python program... at that point it should become quite clear that it's apparently not possible to get that behaviour. If one disables that behaviour with |
Oh and, as unlikely as it is, there is no guarantee for that at all. Changing the build options doesn't really help either, as no real-world Python will have that. The problem here really is that one seems to have no way to get or restore the true original locale (directly or indirectly). |
I understand the theoretical problem. What I am asking is how this issue affects users in the real world. Also, this issue is not a good place for this discussion. Please move it to discuss.python.org. |
Well, I wouldn't call it a theoretical-only problem. There is a clear use case described along with several examples (i.e. any program that wants to serve as a transparent wrapper)... which I think speaks for itself. However, I don't think any further discussion is really needed, at least not from my side... I've simply used C now, which is as good for me and does the job well enough. As said in the original post, I've just noted that this makes it impossible to get the original environment (except when using hacks)... and wanted to bring that to attention. Whether Python wants to support that is up to Python. A simple solution would perhaps be to provide something like a |
Wrapper program or no, setting 7-bit ASCII as the text encoding was deemed an unsupported system configuration error in 2018 when this PEP was accepted and implemented, and is still seen that way now. Given the availability of UTF-8 as an alternative, there's no good reason to run English-only systems any more. Hence the platform compatibility guidelines in PEP 11 being updated when these encoding related PEPs were accepted. That said, setting |
Note: any such feature request should be filed as a new issue, and while I'd be happy to review a PR for such an addition, I wouldn't write it myself. |