Implementation of the PEP 538: coerce C locale to C.utf-8 #72367

Closed
JanNiklasHasse mannequin opened this issue Sep 16, 2016 · 99 comments
Assignees
Labels
3.7 expert-unicode type-bug An unexpected behavior, bug, or error

Comments

@JanNiklasHasse
Mannequin

JanNiklasHasse mannequin commented Sep 16, 2016

BPO 28180
Nosy @malemburg, @warsaw, @ronaldoussoren, @ncoghlan, @vstinner, @ned-deily, @mcepl, @ezio-melotti, @bitdancer, @methane, @4kir4, @xdegaye, @yan12125, @Vgr255
PRs
  • bpo-28180: Implementation for PEP 538 #659
  • bpo-28180: assume UTF-8 for Mac OS X PEP 538 tests #2130
  • bpo-28180: Fix test_capi.test_forced_io_encoding() #2155
  • bpo-28180: Standard stream & FS encoding differ on Mac OS X #2208
  • bpo-28180: Fix the implementation of PEP 538 on Android #4334
Dependencies
  • bpo-30565: PEP 538: silence locale coercion and compatibility warnings by default?
  • bpo-30635: Leak in test_c_locale_coercion
  • bpo-30647: CODESET error on AMD64 FreeBSD 10.x Shared 3.x caused by the PEP 538
Files
  • fedora-cpython-force-c-utf-8.diff: Downstream patch currently proposed for Fedora 26
  • fedora-cpython-PYTHONALLOWCLOCALE.diff: Draft Fedora 26 patch as at 2016-12-18
  • pep538_coerce_legacy_c_locale.diff: Initial patch for PEP 538 (targeting 3.7)
  • pep538_coerce_legacy_c_locale_v2.diff: Add test cases for handling of unknown locales
  • pep538-check-click.sh: Utility script to check click's behaviour in a PEP 538 patched CPython
  • pep538_coerce_legacy_c_locale_v3.diff: Refactor PEP 538 test cases to cover no locale setting, C locale, POSIX locale and unknown locale
  • android_setlocale.patch
  • pep538_coerce_legacy_c_locale.patch: Unfinished attempt to port this patch to Python 3.4
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/ncoghlan'
    closed_at = <Date 2018-03-29.01:23:38.218>
    created_at = <Date 2016-09-16.11:17:02.590>
    labels = ['type-bug', '3.7', 'expert-unicode']
    title = 'Implementation of the PEP 538: coerce C locale to C.utf-8'
    updated_at = <Date 2020-03-22.16:25:00.648>
    user = 'https://bugs.python.org/JanNiklasHasse'

    bugs.python.org fields:

    activity = <Date 2020-03-22.16:25:00.648>
    actor = 'mcepl'
    assignee = 'ncoghlan'
    closed = True
    closed_date = <Date 2018-03-29.01:23:38.218>
    closer = 'ncoghlan'
    components = ['Unicode']
    creation = <Date 2016-09-16.11:17:02.590>
    creator = 'Jan Niklas Hasse'
    dependencies = ['30565', '30635', '30647']
    files = ['45907', '45951', '46059', '46121', '46190', '46205', '46329', '48991']
    hgrepos = []
    issue_num = 28180
    keywords = ['patch']
    message_count = 89.0
    messages = ['276693', '276694', '276707', '276709', '276722', '276729', '277273', '277274', '282964', '282965', '282970', '282971', '282972', '282977', '282978', '282984', '283244', '283408', '283409', '283469', '283471', '283482', '283495', '283515', '283543', '283732', '284150', '284170', '284176', '284537', '284605', '284620', '284621', '284631', '284641', '284647', '284697', '284716', '284718', '284719', '284720', '284722', '284725', '284729', '284736', '284742', '284747', '284764', '284782', '284794', '284795', '284799', '284882', '284884', '284886', '284887', '284900', '284908', '284943', '284952', '285735', '286001', '289002', '289534', '295121', '295683', '295688', '295698', '295710', '295713', '295722', '295871', '295872', '295875', '295885', '295913', '295914', '296064', '296075', '296077', '305850', '306108', '314627', '364740', '364760', '364767', '364770', '364804', '364810']
    nosy_count = 16.0
    nosy_names = ['lemburg', 'barry', 'ronaldoussoren', 'ncoghlan', 'vstinner', 'ned.deily', 'mcepl', 'ezio.melotti', 'r.david.murray', 'methane', 'akira', 'Sworddragon', 'xdegaye', 'yan12125', 'abarry', 'Jan Niklas Hasse']
    pr_nums = ['659', '2130', '2155', '2208', '4334']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue28180'
    versions = ['Python 3.7']

    @JanNiklasHasse
    Mannequin Author

    JanNiklasHasse mannequin commented Sep 16, 2016

    Working with Docker I often end up with an environment where the locale isn't correctly set. In these cases it would be great if sys.getfilesystemencoding() could default to 'utf-8' instead of 'ascii', as it's the encoding of the future and ascii is a subset of it anyway.

    Related: http://bugs.python.org/issue19846
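For illustration, the encodings the interpreter derived from the environment can be inspected with a few standard library calls. This is only a diagnostic sketch; the helper name `encoding_report` is made up here, and the values it reports depend on the platform, the locale variables, and the Python version.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python derived from the current environment."""
    return {
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
        # sys.stdout can be replaced or missing, so read it defensively.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(f"{name}: {value}")
```

Running this inside a container with no locale variables set, and again with LANG=C.UTF-8, is one way to compare the two configurations.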

    @JanNiklasHasse JanNiklasHasse mannequin added expert-unicode type-bug An unexpected behavior, bug, or error labels Sep 16, 2016
    @Vgr255
    Mannequin

    Vgr255 mannequin commented Sep 16, 2016

    This is a duplicate of bpo-27781.

    @Vgr255 Vgr255 mannequin closed this as completed Sep 16, 2016
    @vstinner
    Member

    vstinner commented Sep 16, 2016

    This is a duplicate of bpo-27781.

    bpo-27781 is specific to Windows. I'm not sure that it's the case for this issue, so I'm reopening it.

    @Jan Niklas Hasse: What is your OS?

    I proposed adding a "-X utf8" command line option for UNIX to force the utf8 encoding. Would it work for you?

    @vstinner vstinner reopened this Sep 16, 2016
    @JanNiklasHasse
    Mannequin Author

    JanNiklasHasse mannequin commented Sep 16, 2016

    Unfortunately no, as this would mean I'll have to change all my python invocations in my scripts and it wouldn't work for executable files with

    #!/usr/bin/env python3

    would it?

    @bitdancer
    Member

    bitdancer commented Sep 16, 2016

    I thought we "fixed" this by using surrogate escape when the locale was ASCII? We certainly have discussed changing the default on posix and so far have decided not to (someday that will change...is this someday already?)

    @bitdancer bitdancer added the 3.7 label Sep 16, 2016
    @vstinner
    Member

    vstinner commented Sep 16, 2016

    is this someday already?)

    Not yet :-)

    @JanNiklasHasse
    Mannequin Author

    JanNiklasHasse mannequin commented Sep 23, 2016

    Why not?

    @methane
    Member

    methane commented Sep 23, 2016

    I want locale free Python which behaves like on C.UTF-8 locale.
    (stdio encoding, preferred encoding, weekday in _strptime._strptime,
    and more maybe)

    But Python 3.6 is already in feature freeze >_<;;

    @ncoghlan
    Contributor

    ncoghlan commented Dec 12, 2016

    I think we're genuinely getting to the point now where the majority of "LANG=C" cases are misconfigurations rather than intended behaviour. We're also to the point where:

    • on Mac OS X, binary system interfaces have been handled as UTF-8 by default since 3.0
    • on Windows, as of 3.6, the OS native binary system interfaces are now bypassed entirely in favour of transcoding from UTF-8 to UTF-16-LE

    So I think for Python 3.7 it makes sense to do the following on other *nix systems:

    • very early in CPython startup (even before argument processing), if the detected locale is "C", force it to "C.UTF-8" if possible, and print a warning either way
    • add a PYTHONKEEPASCIILOCALE environment variable to turn that behaviour off

    I do think we actually want to *change* the C level locale in the process though, as otherwise we can expect to see weird interactions where CPython and extension modules disagree about the default text encoding.
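The "force it to C.UTF-8 if possible" step could be probed from Python roughly as follows. This is a sketch only: the real change would live in the C startup code, the helper name `first_available_locale` is made up for illustration, and the candidate list is an assumption about which UTF-8 locale names a platform might offer.

```python
import locale


def first_available_locale(candidates=("C.UTF-8", "C.utf8", "UTF-8")):
    """Return the first candidate the C library accepts for LC_CTYPE, or None."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    for name in candidates:
        try:
            locale.setlocale(locale.LC_CTYPE, name)
        except locale.Error:
            continue  # this platform does not provide the locale
        # Put the previous setting back so the probe has no lasting effect.
        locale.setlocale(locale.LC_CTYPE, saved)
        return name
    return None


if __name__ == "__main__":
    target = first_available_locale()
    if target is None:
        print("no UTF-8 based C locale available; leaving the locale alone")
    else:
        print(f"would coerce the C locale to {target}")
```

Because `setlocale()` changes process-wide state, a probe like this is only safe before other threads or extension modules start relying on the locale.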

    @ncoghlan
    Contributor

    ncoghlan commented Dec 12, 2016

    Note also that if we say we're going to do this for 3.7, *and* go ahead and implement it, then distros may be more inclined to incorporate the same behavioural changes into distro-provided releases of 3.6, providing real world testing of the concept before we make it the default behaviour.

    @JanNiklasHasse
    Mannequin Author

    JanNiklasHasse mannequin commented Dec 12, 2016

    Actually in a new Docker container, the LANG variable isn't set at all. Defaulting to UTF-8 in that case should be easier to reason about, shouldn't it?

    @ncoghlan
    Contributor

    ncoghlan commented Dec 12, 2016

    From CPython's point of view, glibc behaves the same way (i.e. reporting ascii as the preferred encoding for operating system interfaces) regardless of whether the cause is the locale not being set at all, or due to it being explicitly set to the legacy POSIX locale via LANG=C.
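One way to check this equivalence is to start child interpreters with the locale variables removed and with LANG=C, then compare what each reports. The sketch below assumes the child is the same interpreter as the parent. It also sets PYTHONCOERCECLOCALE=0 and PYTHONUTF8=0 so that interpreters with locale coercion or UTF-8 mode report what the C library itself says; older interpreters simply ignore those variables.

```python
import os
import subprocess
import sys

CHILD = "import locale; print(locale.getpreferredencoding(False))"
LOCALE_VARS = ("LC_ALL", "LC_CTYPE", "LANG")


def preferred_encoding(**overrides):
    """Run a child interpreter with the locale variables replaced by overrides."""
    env = {k: v for k, v in os.environ.items() if k not in LOCALE_VARS}
    # Disable locale coercion and UTF-8 mode where they exist.
    env.update(PYTHONCOERCECLOCALE="0", PYTHONUTF8="0")
    env.update(overrides)
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


unset = preferred_encoding()            # no locale variables at all
explicit_c = preferred_encoding(LANG="C")  # legacy locale requested explicitly

if __name__ == "__main__":
    print("unset:  ", unset)
    print("LANG=C: ", explicit_c)
```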

    @JanNiklasHasse
    Mannequin Author

    JanNiklasHasse mannequin commented Dec 12, 2016

    https://sourceware.org/glibc/wiki/Proposals/C.UTF-8#Defaults mentions that C.UTF-8 should be glibc's default.

    This bug report also mentions Python: https://sourceware.org/bugzilla/show_bug.cgi?id=17318
    It hasn't been fixed yet, though :/

    @malemburg
    Member

    malemburg commented Dec 12, 2016

    If we just restrict this to the file system encoding (and not the whole LANG setting), how about:

    • default the file system encoding to 'utf-8' and use the surrogate escape handler as default error handler
    • add a PYTHONFSENCODING env var to set the file system encoding to something else (*)

    (*) I believe we discussed this at some point already, but don't remember the outcome.
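As a reminder of what the surrogate escape handler buys here, the sketch below decodes a byte string that is not valid UTF-8 and encodes it back unchanged. The file name is an invented example.

```python
raw = b"caf\xe9.txt"  # Latin-1 encoded name, not valid UTF-8

# The undecodable byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", errors="surrogateescape")

try:
    name.encode("utf-8")  # the strict handler refuses lone surrogates
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The round trip is lossless, but the intermediate string cannot be encoded strictly, which is why the handler has to be applied consistently wherever such names are written back out.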

    Regarding the questions of defaulting to LANG=C.UTF-8: I think this needs some more thought, since it would also affect many C locale aware functions. To make this work, Python would have to call setlocale() early on in the startup phase to adjust the C lib accordingly.

    @methane
    Member

    methane commented Dec 12, 2016

    Sorry for the confusion.
    I didn't mean defaulting to LANG=C.UTF-8.

    I meant using UTF-8 as the default fsencoding and stdio encoding regardless of locale,
    with locale.getpreferredencoding() returning 'utf-8' when LC_CTYPE is ascii.

    @ncoghlan
    Contributor

    ncoghlan commented Dec 12, 2016

    The challenge that arises in being selective about this is that "sys.getfilesystemencoding()" is actually a misnomer, and some of the things we use it for (like decoding command line arguments and environment variables) necessarily happen *really* early in the interpreter bootstrapping process. The bugs that arise from being internally inconsistent are then even harder to debug than those that arise from believing the OS when it says the right encoding to use is ASCII - the latter at least don't tend to be subtle, and are amenable to being resolved via "LC_ALL=C.UTF-8" and "LANG=C.UTF-8".

    I believe Victor put quite a bit of time into trying to get more selective approaches to work reliably and eventually gave up.

    For Fedora 26, I'm going to explore the feasibility of patching our system 3.6 installation such that the python3 command itself (rather than the shared library) checks for "LC_CTYPE=C" as almost the first thing it does, and forcibly sets LANG and LC_ALL to C.UTF-8 if it gets an answer it doesn't like. If we're able to do that successfully in the more constrained environment of a specific recent Fedora release, then I think it will bode well for doing something similar by default in CPython 3.7

    @ncoghlan
    Contributor

    ncoghlan commented Dec 15, 2016

    Downstream Fedora issue proposing the above idea for F26: https://bugzilla.redhat.com/show_bug.cgi?id=1404918

    I've also attached the patch from that issue here.

    @vstinner
    Member

    vstinner commented Dec 16, 2016

    Victor>> I proposed to add "-X utf8" command line option for UNIX to force utf8 encoding. Would it work for you?

    Jan Niklas Hasse> Unfortunately no, as this would mean I'll have to change all my python invocations in my scripts and it wouldn't work for executable files with "#!/usr/bin/env python3" would it?

    Usually, when a new option is added to Python, we add a command line option (-X utf8) but also an environment variable: I propose PYTHONUTF8=1.

    Use your favorite method to define the env var "system wide" in your docker containers.

    Note: Technically, I'm not sure that it's possible to support -E option with PYTHONUTF8, since -E comes from the command line, and we first need to decode command line arguments with an encoding to parse these options.... Chicken-and-egg issue ;-)
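Assuming an interpreter that implements such a switch (CPython 3.7 and later expose it as `sys.flags.utf8_mode`), the effect of the environment variable can be observed from a child process. This is a sketch; the helper name `utf8_mode_in_child` is made up, and interpreters without the flag report "unsupported".

```python
import os
import subprocess
import sys

CHILD = "import sys; print(getattr(sys.flags, 'utf8_mode', 'unsupported'))"


def utf8_mode_in_child(value):
    """Start a child interpreter with PYTHONUTF8=value and report its UTF-8 mode flag."""
    env = dict(os.environ, PYTHONUTF8=value)
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("PYTHONUTF8=1 ->", utf8_mode_in_child("1"))
```

Because the variable is inherited through the environment, it also reaches scripts started via a `#!/usr/bin/env python3` line, which a command line option would not.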

    @vstinner
    Member

    vstinner commented Dec 16, 2016

    I believe Victor put quite a bit of time into trying to get more selective approaches to work reliably and eventually gave up.

    Yeah, it just doesn't work to use more than one encoding per process. You should use the same encoding for the whole lifetime of a process.

    If you decode early data from an encoding A and later encode it back to encoding B, you get mojibake. The problem is simple.

    Using more than one encoding per process means starting to make assumptions on how data is used. For example, consider that environment variables use the encoding A, but filenames should use the encoding B. But what if an environment variable contains a filename? Similar issues arise for command line arguments, subprocess pipes, standard streams (sys.std*), etc.
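The mojibake case is easy to reproduce. In this small sketch UTF-8 stands in for encoding A and Latin-1 for encoding B; the sample text is arbitrary.

```python
original = "café"

wire = original.encode("utf-8")    # bytes produced under encoding A
garbled = wire.decode("latin-1")   # the same bytes read under encoding B

# Writing the garbled text back out as UTF-8 does not restore the input.
re_encoded = garbled.encode("utf-8")
```

Here `garbled` is the familiar "cafÃ©", and `re_encoded` is longer than `wire`, so every further hop through the wrong encoding makes the damage worse.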

    @ncoghlan
    Contributor

    ncoghlan commented Dec 17, 2016

    We've been discussing this further downstream in the Fedora Python SIG, and we have a draft approach that we're pretty sure will work for us (based in turn on the approach Armin Ronacher came up with for click), and we think it should work for other distros as well (as long as they already ship the C.UTF-8 locale, and if they don't, they should fix that limitation anyway).

    So I'm assigning this to myself as I think the next step will be to write a PEP that both proposes the specific idea as the default behaviour in 3.7, and also encourages distros to opt-in to trialling it as a downstream patch for 3.6.

    @ncoghlan ncoghlan self-assigned this Dec 17, 2016
    @ncoghlan
    Contributor

    ncoghlan commented Dec 17, 2016

    Making an explicit note of this so I remember to mention it in the draft PEP: one of the biggest problems that arises in any attempt at a Python-only solution to overriding the locale is that we can end up disagreeing with C/C++ extensions, and this is *especially* a problem when sharing a process with GUI frameworks like Tcl/Tk, Qt, and GTK (since they tend to read the process-wide settings, rather than querying anything that CPython configures during normal operation).

    So the approach I'm proposing is to implement a C->C.UTF-8 locale override in the *actual python CLI executable*, and then in the dynamically linked library we only emit a warning if we detect the C locale, we don't actually do anything to change it.

    @malemburg
    Member

    malemburg commented Dec 17, 2016

    On 17.12.2016 08:56, Nick Coghlan wrote:

    Making an explicit note of this so I remember to mention it in the draft PEP: one of the biggest problems that arises in any attempt at a Python-only solution to overriding the locale is that we can end up disagreeing with C/C++ extensions, and this is *especially* a problem when sharing a process with GUI frameworks like Tcl/Tk, Qt, and GTK (since they tend to read the process-wide settings, rather than querying anything that CPython configures during normal operation).

    Another use case to consider is embedding the Python
    interpreter in another application. In such situations,
    the C locale will usually already be set by the main
    application and it may conflict with the LANG or other
    locale env var settings, since the user may have chosen
    to use a different locale in the context of the application.

    @ncoghlan
    Contributor

    ncoghlan commented Dec 17, 2016

    On 17 December 2016 at 20:15, Marc-Andre Lemburg <report@bugs.python.org>
    wrote:

    Another use case to consider is embedding the Python
    interpreter in another application. In such situations,
    the C locale will usually already be set by the main
    application and it may conflict with the LANG or other
    locale env var settings, since the user may have chosen
    to use a different locale in the context of the application.

    Aye, that's the origin of the split proposal to only emit a warning in the
    shared library (since CPython might only be a piece of a larger
    application), but implement actual locale coercion (by overriding LANG and
    LC_ALL in the process environment) in the command line app's main()
    function (as in that case we know CPython *is* the application).
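The decision the command line app would make can be modelled in a few lines of Python. This is an illustrative model of the proposal as described in this comment, not the C patch itself; the function names and the `LEGACY_LOCALES` set are invented for the sketch.

```python
LEGACY_LOCALES = {"", "C", "POSIX"}


def effective_lc_ctype(env):
    """Resolve LC_CTYPE the way POSIX does: LC_ALL, then LC_CTYPE, then LANG."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var, "")
        if value:  # empty values count as unset
            return value
    return ""


def coerce_environment(env, target="C.UTF-8"):
    """Return (new_env, coerced) without modifying the mapping passed in."""
    if effective_lc_ctype(env) not in LEGACY_LOCALES:
        return dict(env), False
    new_env = dict(env)
    # Override LANG and LC_ALL, as the comment above describes.
    new_env["LANG"] = target
    new_env["LC_ALL"] = target
    return new_env, True


if __name__ == "__main__":
    print(coerce_environment({"LANG": "C"}))
    print(coerce_environment({"LANG": "en_US.UTF-8"}))
```

An embedding application would skip this step entirely and at most see the warning, since it owns the process locale.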

    The hard part of writing the PEP isn't really going to be explaining the
    proposal itself (I expect it to be around a 20 line patch to the C code) -
    it's going to be explaining why all the other possibilities we've
    considered over the years don't work, and why we (as in the Fedora Python
    SIG) think this one actually stands a chance of working properly :)

    @JanNiklasHasse
    Mannequin Author

    JanNiklasHasse mannequin commented Dec 17, 2016

    Usually, when a new option is added to Python, we add a command line option (-X utf8) but also an environment variable: I propose PYTHONUTF8=1.

    Use your favorite method to define the env var "system wide" in your docker containers.

    This doesn't help me, as I already set LANG to C.utf-8.

    I'm rather thinking about new people trying out Python in Docker who don't know about this.

    Furthermore I think that UTF-8 is the future and the use of ASCII should be discouraged.

    @vstinner
    Member

    vstinner commented Jun 13, 2017

    It seems like this change:

         def test_forced_io_encoding(self):
             # Checks forced configuration of embedded interpreter IO streams
    -        out, err = self.run_embedded_interpreter("forced_io_encoding")
    -        if support.verbose:
    +        env = {"PYTHONIOENCODING": "utf-8:surrogateescape"}
    +        out, err = self.run_embedded_interpreter("forced_io_encoding", env=env)
    (...)

    Caused a failure on the "shared" buildbot (./configure --enable-shared):

    http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/877/steps/test/logs/stdio

    ======================================================================
    FAIL: test_forced_io_encoding (test.test_capi.EmbeddingTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_capi.py", line 484, in test_forced_io_encoding
        out, err = self.run_embedded_interpreter("forced_io_encoding", env=env)
      File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_capi.py", line 392, in run_embedded_interpreter
        (p.returncode, err))
    AssertionError: 127 != 0 : bad returncode 127, stderr is '/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Programs/_testembed: error while loading shared libraries: libpython3.7dm.so.1.0: cannot open shared object file: No such file or directory\n'

    @vstinner vstinner changed the title sys.getfilesystemencoding() should default to utf-8 Implementation of the PEP 538: coerce C locale to C.utf-8 Jun 13, 2017
    @vstinner
    Member

    vstinner commented Jun 13, 2017

    New changeset eb52ac8 by Victor Stinner in branch 'master':
    bpo-28180: Fix test_capi.test_forced_io_encoding() (GH-2155)
    eb52ac8

    @ncoghlan
    Contributor

    ncoghlan commented Jun 13, 2017

    New changeset 4563099 by Nick Coghlan in branch 'master':
    bpo-28180: assume UTF-8 for Mac OS X PEP-538 tests (GH-2130)
    4563099

    @ncoghlan
    Contributor

    ncoghlan commented Jun 13, 2017

    I've added dependencies for PEP-538 induced testing problems that have been broken out into their own issues.

    I've also merged my attempt at fixing the tests on Mac OS X.

    Something that's included in that patch is an implicit skip of the "LANG=UTF-8" case when checking external locale configuration. I expected that to behave the same way as "LC_CTYPE=UTF-8", but instead it's behaving more like "LC_CTYPE=C".

    @ncoghlan
    Contributor

    ncoghlan commented Jun 15, 2017

    Ah, I finally understand Victor's comment on my initial attempt at fixing the tests on Mac OS X - the standard streams *don't* use the filesystem encoding, so they default to ASCII in the C locale, even on Mac OS X.

    @ncoghlan
    Contributor

    ncoghlan commented Jun 15, 2017

    New changeset 7926516 by Nick Coghlan in branch 'master':
    bpo-28180: Standard stream & FS encoding differ on Mac OS X (GH-2208)
    7926516

    @ncoghlan
    Contributor

    ncoghlan commented Jun 15, 2017

    The latest commit should get the Mac OS X buildbot back to green, but I had to disable some test cases to do it - see bpo-30672 for details.

    bpo-30565 is the one that covers silencing the locale coercion and locale compatibility warnings by default.

    @xdegaye
    Mannequin

    xdegaye mannequin commented Nov 8, 2017

    PR 4334 added: fix the implementation of PEP-538 on Android.

    The current implementation of PEP-538 fixes bpo-28997 without the locale coercion for Android added by PR 4334, see msg305848.

    @xdegaye
    Mannequin

    xdegaye mannequin commented Nov 12, 2017

    New changeset 1588be6 by xdegaye in branch 'master':
    bpo-28180: Fix the implementation of PEP-538 on Android (GH-4334)
    1588be6

    @ncoghlan
    Contributor

    ncoghlan commented Mar 29, 2018

    Given that bpo-32002 and bpo-30672 track the known challenges in testing the expected locale coercion behaviour reliably, I'm going to go ahead and close this overall implementation issue (the feature is there, and works in a way we're happy with, we're just encountering some challenges clearly expressing those expectations as a regression test).

    @mcepl
    Mannequin

    mcepl mannequin commented Mar 21, 2020

    I have tried to port this patch to Python 3.4 (still maintained by SUSE on SLE-12), but I am having the hardest time debugging it. All affected tests end with errors like this:

    [ 493s] ======================================================================
    [ 493s] FAIL: test_test_PYTHONCOERCECLOCALE_not_set (test.test_c_locale_coercion.LocaleCoercionTests) (PYTHONCOERCECLOCALE=None, env_var='LC_CTYPE', nominal_locale='invalid.ascii')
    [ 493s] ----------------------------------------------------------------------

    [  493s] Traceback (most recent call last):
    [  493s]   File "/home/abuild/rpmbuild/BUILD/Python-3.4.10/Lib/test/test_c_locale_coercion.py", line 326, in _check_c_locale_coercion
    [  493s]     coercion_expected)
    [  493s]   File "/home/abuild/rpmbuild/BUILD/Python-3.4.10/Lib/test/test_c_locale_coercion.py", line 219, in _check_child_encoding_details
    [  493s]     self.assertEqual(encoding_details, expected_details)
    [  493s] AssertionError: {'fse[79 chars]cii:strict', 'stderr_info': 'ascii:backslashre[45 chars]ict'} != {'fse[79 chars]cii:surrogateescape', 'stderr_info': 'ascii:ba[63 chars]ape'}
    [  493s]   {'fsencoding': 'ascii',
    [  493s]    'lang': '',
    [  493s]    'lc_all': '',
    [  493s]    'lc_ctype': 'invalid.ascii',
    [  493s]    'stderr_info': 'ascii:backslashreplace',
    [  493s] -  'stdin_info': 'ascii:strict',
    [  493s] ?                         ^^ ^
    [  493s] 
    [  493s] +  'stdin_info': 'ascii:surrogateescape',
    [  493s] ?                        ++++++ ^^^ ^^^
    [  493s] 
    [  493s] -  'stdout_info': 'ascii:strict'}
    [  493s] ?                          ^^ ^
    [  493s] 
    [  493s] +  'stdout_info': 'ascii:surrogateescape'}
    [  493s] ?                         ++++++ ^^^ ^^^

    Yes, it is always a conflict between strict and surrogateescape. I probably don’t have time to finish debugging this, so I am just leaving this for posterity.

    @vstinner
    Member

    vstinner commented Mar 21, 2020

    Python 3.4 is no longer supported upstream. Python 3 got tons of Unicode fixes between Python 3.4 and Python 3.8.

    @mcepl
    Mannequin

    mcepl mannequin commented Mar 21, 2020

    Python 3.4 is no longer supported upstream. Python 3 got tons of Unicode fixes between Python 3.4 and Python 3.8.

    Of course, I know that, but I just didn’t want to throw all my effort away, when I spent some hours on making it. And I guess, there may be somebody else who cares for 3.4 (ehm, RHEL-7 has 3.3, doesn’t it?).

    @vstinner
    Member

    vstinner commented Mar 21, 2020

    RHEL 7.7 and RHEL 8 provide Python 3.6. PEP-538 was implemented in Python 3.7. The PEP-538 feature was backported to the Python 3.6 in RHEL 7.7 and RHEL 8.

    @ncoghlan
    Contributor

    ncoghlan commented Mar 22, 2020

    The test cases for locale coercion *not* triggering still assume that bpo-19977, using surrogateescape on the standard streams in the POSIX locale, has been implemented (since that was implemented in Python 3.5).

    Hence the various test cases complaining that they found "ascii:strict" (Py 3.4 behaviour without bpo-19977) where they expected "ascii:surrogateescape" (the Py 3.5+ behaviour *with* bpo-19977).

    To get a PEP-538 backport to work as intended on 3.4, you'd need to backport that earlier IO stream error handling change as well.
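The "encoding:errors" strings in the failing assertions can be reproduced for any text stream with a small helper. This is a sketch whose `stream_info` name is made up here; the in-memory stream only mimics a standard stream configured the way bpo-19977 configures stdin and stdout in the POSIX locale.

```python
import io
import sys


def stream_info(stream):
    """Describe a text stream as 'encoding:errors', like the test output above."""
    encoding = getattr(stream, "encoding", None)
    errors = getattr(stream, "errors", None)
    return f"{encoding}:{errors}"


# An in-memory stand-in for a standard stream with the 3.5+ error handler.
ascii_stream = io.TextIOWrapper(
    io.BytesIO(), encoding="ascii", errors="surrogateescape"
)

if __name__ == "__main__":
    print("example:", stream_info(ascii_stream))
    for name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, name)
        if stream is not None:
            print(f"{name}:", stream_info(stream))
```

Running it under LC_ALL=C on the interpreter being patched shows directly whether the stream error handler change is present.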

    @mcepl
    Mannequin

    mcepl mannequin commented Mar 22, 2020

    Thank you very much for the hint. Do I have to include the patch for bpo-19977 only (that would be easy), or also all twelve PRs for bpo-29240 (that would probably break my will to do it)?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @calestyo
    Contributor

    calestyo commented May 29, 2022

    Not sure whether this is the appropriate place to mention/ask... but it seems that with this change it's impossible to get the original environment Python was invoked with. Or is there a way?
    This may be important for e.g. wrapper programs, which would want to invoke other programs with the exact original locale they were invoked with themselves.

    @methane
    Member

    methane commented May 29, 2022

    You can set PYTHONCOERCECLOCALE=0 to disable coercion.
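The effect of that setting on the environment a child sees can be checked with a short experiment. The sketch below assumes the child is the same interpreter as the parent and uses a made-up helper name, `child_lc_ctype`; with coercion enabled, the reported value depends on which target locales the platform provides.

```python
import os
import subprocess
import sys

CHILD = "import os; print(os.environ.get('LC_CTYPE'))"
LOCALE_VARS = ("LC_ALL", "LC_CTYPE", "LANG")


def child_lc_ctype(coerce_setting):
    """Start a child with LC_CTYPE=C and report the LC_CTYPE it ends up seeing."""
    env = {k: v for k, v in os.environ.items() if k not in LOCALE_VARS}
    env["LC_CTYPE"] = "C"
    env["PYTHONCOERCECLOCALE"] = coerce_setting
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("coercion disabled:", child_lc_ctype("0"))
    print("coercion enabled: ", child_lc_ctype("1"))
```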

    @calestyo
    Contributor

    calestyo commented May 29, 2022

    Yes, but that just shifts the problem. Then one wouldn't know whether PYTHONCOERCECLOCALE was meant to be part of the real original environment, or whether it was just there to not mess around with that.

    @methane
    Member

    methane commented May 29, 2022

    What is the real use case for it?
    Locale affects many programs, but PYTHONCOERCECLOCALE affects only Python. I don't think it is a real problem for most users.

    If it is really a problem for you, you can disable the coercion by default by building your Python with the --without-c-locale-coercion option.

    @calestyo
    Contributor

    calestyo commented May 29, 2022

    Take any program that shall be written in Python and which works just as a wrapper around some other program, with the specific intention that the environment is passed on exactly to that program. Just as an example, a wrapper-program that adds numbers to lines... or maybe counts characters in the stdout of the program.

    The program the wrapper calls may even be another Python program... at that point it should become quite clear that it's apparently not possible to get that behaviour.
    Either one does nothing, or one explicitly wants to run in a C locale environment and runs env -i LC_CTYPE=C python3 wrapper.py command or similar; then Python does the coercion and command doesn't see the original environment.
    The same happens with env -i python3, in which case, as per POSIX, command should fall back to the “implementation-defined default locale”, but Python overrides this with C.UTF-8.

    If one disables that behaviour with PYTHONCOERCECLOCALE, one cannot know whether this was meant (just) for the wrapper script, in order to skip the coercion, or whether it was meant for command.
    Or take a case where command is executed via ssh: LC_CTYPE is often configured to be forwarded, so one may not wish to have it set by Python. On the other hand, one may execute a remote Python command and not want PYTHONCOERCECLOCALE in the environment either, unless it's really meant for the remote side.

    @calestyo
    Contributor

    calestyo commented May 29, 2022

    but PYTHONCOERCECLOCALE affects only Python

    Oh and, as unlikely as it is, there is no guarantee for that at all. Changing the build options doesn't really help either, as no real-world Python will have that.

    The problem here really is that one seems to have no way to get or restore the true original locale (directly or indirectly).

    @methane
    Member

    methane commented May 29, 2022

    I understand the theoretical problem. What I am asking is how this issue affects users in the real world.
    Please give us more concrete use cases.

    And this issue is not a good place for this discussion. Please move to discuss.python.org.

    @calestyo
    Contributor

    calestyo commented May 29, 2022

    Well, I wouldn't call it a (only) theoretical problem. There is a clear use case described along with several examples (i.e. any program that wants to serve as a transparent wrapper)... which I think speaks for itself.
    At least I wouldn't know how to "prove" the validity of the use case any further. That feels a bit as if division by 834756387465723672348394923 would fail and I'd need to prove why on earth anyone could ever want to divide by exactly that.

    However, I don't think any further discussion is really needed, at least not from my side,... I've simply used C now which is as good for me and does the job well enough.

    As said, in the original post, I've just noted that this makes it impossible to get the original environment (except when using hacks)... and wanted to bring that to attention.

    Whether Python wants to support that is up to the Python developers. A simple solution would perhaps be to provide something like an os.original_environment, which would solve the same problem should any other modifications be made in the future.

    @ncoghlan
    Contributor

    ncoghlan commented Jun 3, 2022

    Wrapper program or no, setting 7-bit ASCII as the text encoding was deemed an unsupported system configuration error in 2018 when this PEP was accepted and implemented, and is still seen that way now. Given the availability of UTF-8 as an alternative, there's no good reason to run English-only systems any more. Hence the platform compatibility guidelines in PEP 11 being updated when these encoding related PEPs were accepted.

    That said, setting PYTHON_PRECOERCION_LOCALE in the environment when the coercion triggers would be straightforward, so I'd personally be +0 on providing that as a debugging aid, even though I'd advise wrapper script authors to resist the temptation to revert the coercion (all the wrapper scripts in the standard library intentionally leave the coerced locale in place)
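If such a variable existed, a wrapper could use it roughly as sketched below. Everything here is hypothetical: PYTHON_PRECOERCION_LOCALE is only a suggestion from this comment, and the sketch assumes the interpreter would store the pre-coercion LC_CTYPE value in it (empty if LC_CTYPE was unset) and that LC_CTYPE is the only variable the coercion changes.

```python
MARKER = "PYTHON_PRECOERCION_LOCALE"  # hypothetical, suggested in this thread only


def restore_precoercion_env(env):
    """Return a copy of env with the assumed pre-coercion LC_CTYPE put back."""
    restored = dict(env)
    if MARKER not in restored:
        return restored  # no coercion was recorded, nothing to undo
    original = restored.pop(MARKER)
    if original:
        restored["LC_CTYPE"] = original
    else:
        # Assumed convention: an empty marker means LC_CTYPE was not set at all.
        restored.pop("LC_CTYPE", None)
    return restored


if __name__ == "__main__":
    import os

    child_env = restore_precoercion_env(os.environ)
    print("LC_CTYPE for the wrapped command:", child_env.get("LC_CTYPE"))
```

The resulting mapping could then be passed as the `env` argument when the wrapper starts the wrapped command.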

    @ncoghlan
    Contributor

    ncoghlan commented Jun 3, 2022

    Note: any such feature request should be filed as a new issue, and while I'd be happy to review a PR for such an addition, I wouldn't write it myself.
