locale.normalize() and getdefaultlocale() convert C.UTF-8 to en_US.UTF-8 #74940

mattheww · 2017-06-25T16:58:59Z

BPO	30755
Nosy	@malemburg, @ncoghlan, @mattheww, @benjaminp, @serhiy-storchaka, @hroncok, @gordonmessmer, @websurfer5
PRs	gh-74940: Allow fallback to UTF-8 encoding on systems with no locales installed. #14925

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2017-06-25.16:58:59.153>
labels = ['3.7', '3.8']
title = 'locale.normalize() and getdefaultlocale() convert C.UTF-8 to en_US.UTF-8'
updated_at = <Date 2019-07-29.15:54:44.990>
user = 'https://github.com/mattheww'

bugs.python.org fields:

activity = <Date 2019-07-29.15:54:44.990>
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = []
creation = <Date 2017-06-25.16:58:59.153>
creator = 'mattheww'
dependencies = []
files = []
hgrepos = []
issue_num = 30755
keywords = ['patch']
message_count = 8.0
messages = ['296828', '297342', '302981', '302982', '347520', '347521', '347528', '348367']
nosy_count = 8.0
nosy_names = ['lemburg', 'ncoghlan', 'mattheww', 'benjamin.peterson', 'serhiy.storchaka', 'hroncok', 'gordonmessmer', 'Jeffrey.Kintscher']
pr_nums = ['14925']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue30755'
versions = ['Python 3.4', 'Python 3.5', 'Python 3.6', 'Python 3.7', 'Python 3.8']

mattheww · 2017-06-25T16:58:59Z

I have a system where the default locale is C.UTF-8, and en_US.UTF-8 is
not installed.

But locale.normalize() unhelpfully converts "C.UTF-8" to "en_US.UTF-8".

So the following fails for me, because setlocale() ends up requesting en_US.UTF-8 and raises locale.Error:

python3.6 -c "import locale;locale.setlocale(locale.LC_ALL, ('C', 'UTF-8'))"

Similarly getdefaultlocale() returns ('en_US', 'UTF-8'), so this crashes too:

export LANG=C.UTF-8
unset LC_CTYPE
unset LC_ALL
unset LANGUAGE
python3.6 -c "import locale;locale.setlocale(locale.LC_ALL, locale.getdefaultlocale())"

This behaviour is caused by a locale_alias entry in Lib/locale.py.

https://bugs.python.org/issue20076 documents its addition but doesn't
provide a rationale.

I can see that it might be helpful to provide such a conversion if
C.UTF-8 doesn't exist and en_US.UTF-8 does, but the current code is
breaking modern correctly-configured systems for the benefit of old
misconfigured ones (C.UTF-8 shouldn't really be in the environment if it
isn't available on the system, after all).
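
A quick way to see whether a given interpreter carries this alias is to ask locale.normalize() directly. The snippet below is only a sketch: the helper name c_utf8_is_aliased is made up for illustration, and the result depends on the Python version in use.

```python
import locale


def c_utf8_is_aliased() -> bool:
    """Return True if locale.normalize() rewrites 'C.UTF-8' to another name.

    Hypothetical helper, not part of the locale module.
    """
    return locale.normalize("C.UTF-8") != "C.UTF-8"


if __name__ == "__main__":
    # Prints whatever name this interpreter's alias table maps C.UTF-8 to.
    print("normalize('C.UTF-8') ->", locale.normalize("C.UTF-8"))
    print("aliased:", c_utf8_is_aliased())
```

If this reports True, passing ('C', 'UTF-8') as a tuple to setlocale() will request the aliased name rather than C.UTF-8.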

ncoghlan · 2017-06-30T02:20:13Z

I'm honestly not sure how our Python level locale handling really works (I've mainly worked on the lower level C locale manipulation), so adding folks to the nosy list based on bpo-20076 and bpo-29571.

I agree we shouldn't be aliasing C.UTF-8 to en_US.UTF-8 though - we took en_US.UTF-8 out of the locale coercion fallback list in PEP-538 because it wasn't really right.

mattheww · 2017-09-25T22:37:17Z

I've investigated a bit more.

First, I've tried with Python 3.7.0a1. As you'd expect, PEP-537 means
this behaviour now also occurs when no locale environment variables at
all are set.

Second, I've looked through locale.py a bit. I believe what it calls the
"aliasing engine" is applied for:

getlocale()
getdefaultlocale()
setlocale() when passed a tuple, but not when passed a string

This leads to some rather odd results.

With 3.7.0a1 and no locale environment variables:

  >>> import locale
  >>> locale.getlocale()
  ('en_US', 'UTF-8')

  # getlocale() is lying: the effective locale is really C.UTF-8
  >>> sorted("abcABC", key=locale.strxfrm)
  ['A', 'B', 'C', 'a', 'b', 'c']

Third, I've checked on a system which does have en_US.UTF-8 installed,
and (as you'd expect) instead of crashing it gives wrong results:

  >>> import locale
  >>> locale.setlocale(locale.LC_ALL, ('C', 'UTF-8'))
  'en_US.UTF-8'
  >>> locale.getlocale()
  ('en_US', 'UTF-8')

  # now getlocale() is telling the truth, and the user isn't getting the
  # collation they requested
  >>> sorted("abcABC", key=locale.strxfrm)
  ['a', 'A', 'b', 'B', 'c', 'C']
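
The mismatch described above can be observed by comparing the two ways of querying the locale. This is a minimal sketch, assuming that setlocale() called with only a category returns the C library's own string, while getlocale() parses that string through the alias table; the function name ctype_views is hypothetical.

```python
import locale


def ctype_views():
    """Return (raw, parsed) views of the current LC_CTYPE setting.

    Hypothetical helper for illustration only.
    """
    # With no second argument, setlocale() only queries and returns the
    # string reported by the C library.
    raw = locale.setlocale(locale.LC_CTYPE)
    # getlocale() parses that string, which involves locale.normalize().
    parsed = locale.getlocale(locale.LC_CTYPE)
    return raw, parsed


if __name__ == "__main__":
    print(ctype_views())
```

When the two views disagree, the raw string is the one that describes the locale actually in effect.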

mattheww · 2017-09-25T22:39:15Z

(For PEP-537 please read PEP-538, sorry)

gordonmessmer · 2019-07-09T05:44:39Z

> I can see that it might be helpful to provide such a conversion if
> C.UTF-8 doesn't exist and en_US.UTF-8 does

That can't happen. The "C" locale describes the behavior defined in the ISO C standard. It's built into glibc (and should be for all other libc implementations). All other locales require external support (i.e. /usr/lib/locale/<locale>).

https://www.gnu.org/software/libc/manual/html_node/Standard-Locales.html#Standard-Locales

gordonmessmer · 2019-07-09T06:10:15Z

> I agree we shouldn't be aliasing C.UTF-8 to en_US.UTF-8 though

What can we do about reverting that change? Python's current behavior causes unexpected exceptions, especially in containers.

I'm currently debugging test failures in a Python application that occur in Fedora rawhide containers. Those containers don't have any locales installed. The test software saves its current locale, changes the locale in order to run a test, and then restores the original. Because Python is incorrectly reporting the original locale as "en_US", restoring the original fails.
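
One way to sidestep this failure in a save/restore pattern is to save the string that setlocale() itself reports instead of the tuple from getlocale(). The following is a sketch, not the code of the application in question; the context manager name temporary_locale is an assumption made for illustration.

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_locale(category, name):
    """Switch `category` to `name` for the duration of a with-block.

    Hypothetical helper. The saved value is the string returned by
    setlocale(category), so it is handed back to the C library unchanged
    and never passes through the alias table.
    """
    saved = locale.setlocale(category)  # query only, no change
    try:
        yield locale.setlocale(category, name)
    finally:
        locale.setlocale(category, saved)


if __name__ == "__main__":
    with temporary_locale(locale.LC_CTYPE, "C") as active:
        print("inside:", active)
    print("restored:", locale.setlocale(locale.LC_CTYPE))
```

Because strings passed to setlocale() are not normalized (as noted earlier in this thread), restoring from the saved string should not depend on en_US being installed.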

hroncok · 2019-07-09T08:42:49Z

> C.UTF-8 doesn't exist and en_US.UTF-8 does
> That can't happen

It certainly can. Take for example RHEL 7 or 6.

gordonmessmer · 2019-07-24T04:24:56Z

As an example, let's consider dnf's i18n setup:

try:
    dnf.pycomp.setlocale(locale.LC_ALL, '')
except locale.Error:
    # default to C.UTF-8 or C locale if we got a failure.
    try:
        dnf.pycomp.setlocale(locale.LC_ALL, 'C.UTF-8')
        os.environ['LC_ALL'] = 'C.UTF-8'
    except locale.Error:
        dnf.pycomp.setlocale(locale.LC_ALL, 'C')
        os.environ['LC_ALL'] = 'C'

If setting the environment-specified locale fails, dnf will attempt to set the locale
to C.UTF-8, and if that fails it will set the locale to C. This seems like an ideal
process. If the expected locale is missing, dnf will attempt to at least use UTF-8,
before falling back to the C locale.

Unfortunately, because of the alias, this process will be unable to set the 'C.UTF-8'
locale on systems which do not have the 'en_US' locale installed. This renders
system support for 'C.UTF-8' unusable when no locales are installed.
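
The fallback chain described above can be sketched with the standard locale module alone. This is not dnf's code: the function name set_first_available is hypothetical, it does not update os.environ the way the dnf snippet does, and it relies on the observation made earlier in this thread that locale names passed as plain strings are not run through the alias table.

```python
import locale


def set_first_available(category, candidates):
    """Try each locale name in order; return the first one that is accepted.

    Hypothetical helper. Names are passed as plain strings. Returns None
    if every candidate is rejected or the list is empty.
    """
    for name in candidates:
        try:
            return locale.setlocale(category, name)
        except locale.Error:
            continue
    return None


if __name__ == "__main__":
    # "" means "use the environment"; then C.UTF-8; then plain C.
    print(set_first_available(locale.LC_ALL, ["", "C.UTF-8", "C"]))
```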

… installed (GH-14925) This change removes the alias of the 'C' locale to 'en_US'. Because of this alias, it is currently impossible for an application to use setlocale() to specify a UTF-8 locale on a system that has no locales installed, but which supports the C.UTF-8 locale/encoding.

The C locale no longer does what we need in Python 3.12, see python/cpython#74940

mattheww mannequin added the 3.7 (EOL) end of life label Sep 25, 2017

hroncok mannequin added the 3.8 only security fixes label Jul 9, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

bedevere-bot mentioned this issue Apr 26, 2023

gh-74940: Allow fallback to UTF-8 encoding on systems with no locales installed. #14925

Merged

arhadthedev added stdlib Python modules in the Lib dir and removed 3.8 only security fixes 3.7 (EOL) end of life labels Apr 26, 2023

arhadthedev added the type-feature A feature request or enhancement label Apr 26, 2023

arhadthedev closed this as completed Apr 26, 2023

nijel added a commit to WeblateOrg/docker that referenced this issue Oct 30, 2023

change docker container locales

4981f3d

The C locale no longer does what we need in Python 3.12, see python/cpython#74940

nijel mentioned this issue Nov 2, 2023

chore(deps): update python docker tag to v3.12.0 WeblateOrg/docker#1990

Closed

1 task
