locale.normalize() and getdefaultlocale() convert C.UTF-8 to en_US.UTF-8 #74940

Closed
mattheww mannequin opened this issue Jun 25, 2017 · 8 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@mattheww
Mannequin

mattheww mannequin commented Jun 25, 2017

BPO 30755
Nosy @malemburg, @ncoghlan, @mattheww, @benjaminp, @serhiy-storchaka, @hroncok, @gordonmessmer, @websurfer5
PRs
  • gh-74940: Allow fallback to UTF-8 encoding on systems with no locales installed. #14925
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2017-06-25.16:58:59.153>
    labels = ['3.7', '3.8']
    title = 'locale.normalize() and getdefaultlocale() convert C.UTF-8 to en_US.UTF-8'
    updated_at = <Date 2019-07-29.15:54:44.990>
    user = 'https://github.com/mattheww'

    bugs.python.org fields:

    activity = <Date 2019-07-29.15:54:44.990>
    actor = 'vstinner'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = []
    creation = <Date 2017-06-25.16:58:59.153>
    creator = 'mattheww'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 30755
    keywords = ['patch']
    message_count = 8.0
    messages = ['296828', '297342', '302981', '302982', '347520', '347521', '347528', '348367']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'ncoghlan', 'mattheww', 'benjamin.peterson', 'serhiy.storchaka', 'hroncok', 'gordonmessmer', 'Jeffrey.Kintscher']
    pr_nums = ['14925']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue30755'
    versions = ['Python 3.4', 'Python 3.5', 'Python 3.6', 'Python 3.7', 'Python 3.8']

    @mattheww
    Mannequin Author

    mattheww mannequin commented Jun 25, 2017

    I have a system where the default locale is C.UTF-8, and en_US.UTF-8 is
    not installed.

    But locale.normalize() unhelpfully converts "C.UTF-8" to "en_US.UTF-8".

    So the following crashes for me:

    python3.6 -c "import locale;locale.setlocale(locale.LC_ALL, ('C', 'UTF-8'))"

    Similarly getdefaultlocale() returns ('en_US', 'UTF-8'), so this crashes too:

    export LANG=C.UTF-8
    unset LC_CTYPE
    unset LC_ALL
    unset LANGUAGE
    python3.6 -c "import locale;locale.setlocale(locale.LC_ALL, locale.getdefaultlocale())"

    This behaviour is caused by a locale_alias entry in Lib/locale.py.

    https://bugs.python.org/issue20076 documents its addition but doesn't
    provide a rationale.

    I can see that it might be helpful to provide such a conversion if
    C.UTF-8 doesn't exist and en_US.UTF-8 does, but the current code is
    breaking modern correctly-configured systems for the benefit of old
    misconfigured ones (C.UTF-8 shouldn't really be in the environment if it
    isn't available on the system, after all).
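To check whether a given interpreter still carries the alias, one can compare a locale name with what locale.normalize() returns for it. This is a minimal sketch: the helper name `alias_report` is made up for illustration, and the result for "C.UTF-8" depends on the Python version in use, so no expected output is shown.

```python
import locale


def alias_report(name):
    """Return (normalized_name, was_rewritten) for a locale name.

    Hypothetical helper for illustration; not part of the locale module.
    """
    normalized = locale.normalize(name)
    return normalized, normalized != name


if __name__ == "__main__":
    for candidate in ("C", "C.UTF-8"):
        normalized, rewritten = alias_report(candidate)
        print(f"{candidate!r} -> {normalized!r} (rewritten: {rewritten})")
```

If the second line reports a rewrite to another language's locale, passing the tuple ('C', 'UTF-8') to setlocale() on that interpreter will request that other locale instead.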

    @ncoghlan
    Contributor

    I'm honestly not sure how our Python level locale handling really works (I've mainly worked on the lower level C locale manipulation), so I'm adding folks to the nosy list based on bpo-20076 and bpo-29571.

    I agree we shouldn't be aliasing C.UTF-8 to en_US.UTF-8 though - we took en_US.UTF-8 out of the locale coercion fallback list in PEP-538 because it wasn't really right.

    @mattheww
    Mannequin Author

    mattheww mannequin commented Sep 25, 2017

    I've investigated a bit more.

    First, I've tried with Python 3.7.0a1. As you'd expect, PEP-537 means
    this behaviour now also occurs when no locale environment variables at
    all are set.

    Second, I've looked through locale.py a bit. I believe what it calls the
    "aliasing engine" is applied for:

    • getlocale()
    • getdefaultlocale()
    • setlocale() when passed a tuple, but not when passed a string

    This leads to some rather odd results.

    With 3.7.0a1 and no locale environment variables:

      >>> import locale
      >>> locale.getlocale()
      ('en_US', 'UTF-8')
    
      # getlocale() is lying: the effective locale is really C.UTF-8
      >>> sorted("abcABC", key=locale.strxfrm)
      ['A', 'B', 'C', 'a', 'b', 'c']

    Third, I've checked on a system which does have en_US.UTF-8 installed,
    and (as you'd expect) instead of crashing it gives wrong results:

      >>> import locale
      >>> locale.setlocale(locale.LC_ALL, ('C', 'UTF-8'))
      'en_US.UTF-8'
      >>> locale.getlocale()
      ('en_US', 'UTF-8')
    
      # now getlocale() is telling the truth, and the user isn't getting the
      # collation they requested
      >>> sorted("abcABC", key=locale.strxfrm)
      ['a', 'A', 'b', 'B', 'c', 'C']
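
Since the name reported by getlocale() cannot be trusted here, the collation that is actually in effect can be probed directly, using the same sorted()/strxfrm comparison as in the sessions above. This is a rough sketch; `collates_by_code_point` is a made-up helper name, and the probe only looks at one short ASCII sample.

```python
import locale


def collates_by_code_point(sample="abcABC"):
    """Report whether LC_COLLATE currently orders `sample` by code point.

    Hypothetical helper. True matches the upper-case-first ordering shown
    for the C.UTF-8 session; False suggests a language-specific collation
    such as the en_US one is in effect.
    """
    return sorted(sample, key=locale.strxfrm) == sorted(sample)


if __name__ == "__main__":
    print("reported by getlocale():", locale.getlocale())
    print("collates by code point:", collates_by_code_point())
```

A mismatch between the reported name and the probe result is the symptom described in the second point.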

    @mattheww mattheww mannequin added the 3.7 (EOL) end of life label Sep 25, 2017
    @mattheww
    Mannequin Author

    mattheww mannequin commented Sep 25, 2017

    (For PEP-537 please read PEP-538, sorry)

    @gordonmessmer
    Mannequin

    gordonmessmer mannequin commented Jul 9, 2019

    > I can see that it might be helpful to provide such a conversion if
    > C.UTF-8 doesn't exist and en_US.UTF-8 does

    That can't happen. The "C" locale describes the behavior defined in the ISO C standard. It's built into glibc (and should be for all other libc implementations). All other locales require external support (i.e. /usr/lib/locale/<locale>).

    https://www.gnu.org/software/libc/manual/html_node/Standard-Locales.html#Standard-Locales

    @gordonmessmer
    Mannequin

    gordonmessmer mannequin commented Jul 9, 2019

    > I agree we shouldn't be aliasing C.UTF-8 to en_US.UTF-8 though

    What can we do about reverting that change? Python's current behavior causes unexpected exceptions, especially in containers.

    I'm currently debugging test failures in a Python application that occur in Fedora rawhide containers. Those containers don't have any locales installed. The test software saves its current locale, changes the locale in order to run a test, and then restores the original. Because Python is incorrectly reporting the original locale as "en_US", restoring the original fails.
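
One possible way to make a save-and-restore pattern independent of the alias is to save the string returned by setlocale(category) rather than the tuple from getlocale(), since (per the earlier comment) string arguments to setlocale() do not go through the aliasing engine. The following is only a sketch under that assumption: it is not the code of the application being debugged, and `temporary_locale` is a hypothetical name.

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_locale(name, category=locale.LC_ALL):
    """Switch `category` to `name`, then restore the previous setting.

    Hypothetical helper. The previous setting is read with
    setlocale(category), which returns the C library's own string,
    and that string is handed back unchanged on exit.
    """
    saved = locale.setlocale(category)  # query only, changes nothing
    locale.setlocale(category, name)
    try:
        yield
    finally:
        locale.setlocale(category, saved)


if __name__ == "__main__":
    with temporary_locale("C", locale.LC_COLLATE):
        print("inside:", locale.setlocale(locale.LC_COLLATE))
    print("restored:", locale.setlocale(locale.LC_COLLATE))
```

Because the saved value never passes through normalize(), restoring it does not depend on en_US being installed.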

    @hroncok
    Mannequin

    hroncok mannequin commented Jul 9, 2019

    > > C.UTF-8 doesn't exist and en_US.UTF-8 does
    > That can't happen

    It certainly can. Take for example RHEL 7 or 6.

    @hroncok hroncok mannequin added the 3.8 only security fixes label Jul 9, 2019
    @gordonmessmer
    Mannequin

    gordonmessmer mannequin commented Jul 24, 2019

    As an example, let's consider dnf's i18n setup:

    try:
        dnf.pycomp.setlocale(locale.LC_ALL, '')
    except locale.Error:
        # default to C.UTF-8 or C locale if we got a failure.
        try:
            dnf.pycomp.setlocale(locale.LC_ALL, 'C.UTF-8')
            os.environ['LC_ALL'] = 'C.UTF-8'
        except locale.Error:
            dnf.pycomp.setlocale(locale.LC_ALL, 'C')
            os.environ['LC_ALL'] = 'C'
    

    If setting the environment-specified locale fails, dnf will attempt to set the locale
    to C.UTF-8, and if that fails it will set the locale to C. This seems like an ideal
    process. If the expected locale is missing, dnf will attempt to at least use UTF-8,
    before falling back to the C locale.

    Unfortunately, because of the alias, this process will be unable to set the 'C.UTF-8'
    locale on systems which do not have the 'en_US' locale installed. This renders
    system support for 'C.UTF-8' unusable when no locales are installed.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @arhadthedev arhadthedev added stdlib Python modules in the Lib dir and removed 3.8 only security fixes 3.7 (EOL) end of life labels Apr 26, 2023
    methane pushed a commit that referenced this issue Apr 26, 2023
    … installed (GH-14925)
    
    This change removes the alias of the 'C' locale to 'en_US'. Because of
    this alias, it is currently impossible for an application to use
    setlocale() to specify a UTF-8 locale on a system that has no locales
    installed, but which supports the C.UTF-8 locale/encoding.
    @arhadthedev arhadthedev added the type-feature A feature request or enhancement label Apr 26, 2023
    itamaro pushed a commit to itamaro/cpython that referenced this issue Apr 26, 2023
    …ocales installed (pythonGH-14925)
    
    This change removes the alias of the 'C' locale to 'en_US'. Because of
    this alias, it is currently impossible for an application to use
    setlocale() to specify a UTF-8 locale on a system that has no locales
    installed, but which supports the C.UTF-8 locale/encoding.
    nijel added a commit to WeblateOrg/docker that referenced this issue Oct 30, 2023
    The C locale no longer does what we need in Python 3.12, see
    python/cpython#74940