bpo-29240: PEP 540: Add a new UTF-8 Mode (#855)
* Add -X utf8 command line option, PYTHONUTF8 environment variable
  and a new sys.flags.utf8_mode flag.
* If the LC_CTYPE locale is "C" at startup, automatically enable the
  UTF-8 mode.
* Add _winapi.GetACP(). encodings._alias_mbcs() now calls
  _winapi.GetACP() to get the ANSI code page.
* locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8
  mode. As a side effect, open() now uses the UTF-8 encoding by
  default in this mode.
* Py_DecodeLocale() and Py_EncodeLocale() now use the UTF-8 encoding
  in the UTF-8 Mode.
* Update subprocess._args_from_interpreter_flags() to handle -X utf8.
* Skip some tests relying on the current locale if the UTF-8 mode is
  enabled.
* Add test_utf8mode.py.
* _Py_DecodeUTF8_surrogateescape() gets a new optional parameter to
  also return the length (number of wide characters).
* pymain_get_global_config() and pymain_set_global_config() now
  always copy flag values, rather than only copying if the new value
  is greater than the old value.
vstinner committed Dec 13, 2017
1 parent c3e070f commit 91106cd
Showing 27 changed files with 593 additions and 178 deletions.
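The user-visible effect described in the commit message can be checked from Python 3.7 or later with a small probe. This is only a sketch: `CHILD`, `probe`, and `is_utf8` are hypothetical names introduced here, not part of the commit. It starts a fresh interpreter with the given options and reports `sys.flags.utf8_mode`, `locale.getpreferredencoding(False)`, and `sys.getfilesystemencoding()`. Codec names are compared through `codecs.lookup()` because the exact spelling (`UTF-8` or `utf-8`) may differ between Python versions.

```python
import codecs
import subprocess
import sys

# Child script: report the UTF-8 mode flag and two encodings that depend on it.
CHILD = (
    "import locale, sys;"
    "print(sys.flags.utf8_mode);"
    "print(locale.getpreferredencoding(False));"
    "print(sys.getfilesystemencoding())"
)


def probe(*interpreter_args):
    """Run CHILD in a fresh interpreter and return its three output tokens."""
    cmd = [sys.executable, *interpreter_args, "-c", CHILD]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return out.split()


def is_utf8(name):
    """Return True if the codec name is an alias of UTF-8, whatever its case."""
    return codecs.lookup(name).name == "utf-8"


flag, preferred, fs = probe("-X", "utf8")
print(flag, preferred, fs)
```

Calling `probe("-X", "utf8=0")` instead should report the flag as `0`, since that form explicitly disables the mode.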
13 changes: 11 additions & 2 deletions Doc/c-api/sys.rst
@@ -127,6 +127,9 @@ Operating System Utilities
.. versionadded:: 3.5
.. versionchanged:: 3.7
The function now uses the UTF-8 encoding in the UTF-8 mode.
.. c:function:: char* Py_EncodeLocale(const wchar_t *text, size_t *error_pos)
@@ -138,19 +141,25 @@ Operating System Utilities
to free the memory. Return ``NULL`` on encoding error or memory allocation
error
-If error_pos is not ``NULL``, ``*error_pos`` is set to the index of the
-invalid character on encoding error, or set to ``(size_t)-1`` otherwise.
+If error_pos is not ``NULL``, ``*error_pos`` is set to ``(size_t)-1`` on
+success, or set to the index of the invalid character on encoding error.
Use the :c:func:`Py_DecodeLocale` function to decode the bytes string back
to a wide character string.
.. versionchanged:: 3.7
The function now uses the UTF-8 encoding in the UTF-8 mode.
.. seealso::
The :c:func:`PyUnicode_EncodeFSDefault` and
:c:func:`PyUnicode_EncodeLocale` functions.
.. versionadded:: 3.5
.. versionchanged:: 3.7
The function now supports the UTF-8 mode.
.. _systemfunctions:
7 changes: 7 additions & 0 deletions Doc/library/locale.rst
@@ -316,6 +316,13 @@ The :mod:`locale` module defines the following exception and functions:
preferences, so this function is not thread-safe. If invoking setlocale is not
necessary or desired, *do_setlocale* should be set to ``False``.

On Android or in the UTF-8 mode (:option:`-X` ``utf8`` option), always
return ``'UTF-8'``; the locale and the *do_setlocale* argument are ignored.

.. versionchanged:: 3.7
The function now always returns ``UTF-8`` on Android or if the UTF-8 mode
is enabled.


.. function:: normalize(localename)

13 changes: 12 additions & 1 deletion Doc/library/sys.rst
@@ -313,6 +313,9 @@ always available.
has caught :exc:`SystemExit` (such as an error flushing buffered data
in the standard streams), the exit status is changed to 120.

.. versionchanged:: 3.7
Added ``utf8_mode`` attribute for the new :option:`-X` ``utf8`` flag.


.. data:: flags

@@ -335,6 +338,7 @@ always available.
:const:`quiet` :option:`-q`
:const:`hash_randomization` :option:`-R`
:const:`dev_mode` :option:`-X` ``dev``
:const:`utf8_mode` :option:`-X` ``utf8``
============================= =============================

.. versionchanged:: 3.2
@@ -347,7 +351,8 @@ always available.
Removed obsolete ``division_warning`` attribute.

.. versionchanged:: 3.7
-Added ``dev_mode`` attribute for the new :option:`-X` ``dev`` flag.
+Added ``dev_mode`` attribute for the new :option:`-X` ``dev`` flag
+and ``utf8_mode`` attribute for the new :option:`-X` ``utf8`` flag.


.. data:: float_info
@@ -492,6 +497,8 @@ always available.
:func:`os.fsencode` and :func:`os.fsdecode` should be used to ensure that
the correct encoding and errors mode are used.

* In the UTF-8 mode, the encoding is ``utf-8`` on any platform.

* On Mac OS X, the encoding is ``'utf-8'``.

* On Unix, the encoding is the locale encoding.
@@ -506,6 +513,10 @@ always available.
Windows is no longer guaranteed to return ``'mbcs'``. See :pep:`529`
and :func:`_enablelegacywindowsfsencoding` for more information.

.. versionchanged:: 3.7
Return 'utf-8' in the UTF-8 mode.


.. function:: getfilesystemencodeerrors()

Return the name of the error mode used to convert between Unicode filenames
13 changes: 12 additions & 1 deletion Doc/using/cmdline.rst
@@ -439,6 +439,9 @@ Miscellaneous options
* Set the :attr:`~sys.flags.dev_mode` attribute of :attr:`sys.flags` to
``True``

* ``-X utf8`` enables the UTF-8 mode, whereas ``-X utf8=0`` disables the
UTF-8 mode.

It also allows passing arbitrary values and retrieving them through the
:data:`sys._xoptions` dictionary.

@@ -455,7 +458,7 @@ Miscellaneous options
The ``-X showalloccount`` option.

.. versionadded:: 3.7
-The ``-X importtime`` and ``-X dev`` options.
+The ``-X importtime``, ``-X dev`` and ``-X utf8`` options.


Options you shouldn't use
@@ -816,6 +819,14 @@ conflict.

.. versionadded:: 3.7

.. envvar:: PYTHONUTF8

If set to ``1``, enable the UTF-8 mode. If set to ``0``, disable the UTF-8
mode. Any other non-empty string causes an error.

.. versionadded:: 3.7


Debug-mode variables
~~~~~~~~~~~~~~~~~~~~

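The three cases documented for `PYTHONUTF8` above (`1`, `0`, and any other non-empty string) can be exercised with a short script. This is a sketch rather than part of the commit: the helper name `utf8_mode_with_env` is hypothetical, and it assumes Python 3.7 or later. It returns the child's exit status along with the printed flag so that the error case is visible.

```python
import os
import subprocess
import sys


def utf8_mode_with_env(value):
    """Start a child interpreter with PYTHONUTF8=value.

    Returns (returncode, stdout) so that both the resulting flag and a
    startup failure can be observed by the caller.
    """
    env = dict(os.environ, PYTHONUTF8=value)
    proc = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()


for value in ("1", "0", "yes"):
    print(value, utf8_mode_with_env(value))
```

The invalid value is expected to make the interpreter exit with a non-zero status before running any code, so its captured output is empty.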
21 changes: 21 additions & 0 deletions Doc/whatsnew/3.7.rst
@@ -185,6 +185,23 @@ resolution on Linux and Windows.
PEP written and implemented by Victor Stinner


PEP 540: Add a new UTF-8 mode
-----------------------------

Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and change
:data:`sys.stdin` and :data:`sys.stdout` error handlers to ``surrogateescape``.
This mode is enabled by default in the POSIX locale, but otherwise disabled by
default.

The new :option:`-X` ``utf8`` command line option and :envvar:`PYTHONUTF8`
environment variable are added to control the UTF-8 mode.

.. seealso::

:pep:`540` -- Add a new UTF-8 mode
PEP written and implemented by Victor Stinner


New Development Mode: -X dev
----------------------------

@@ -353,6 +370,10 @@ Added another argument *monetary* in :meth:`format_string` of :mod:`locale`.
If *monetary* is true, the conversion uses monetary thousands separator and
grouping strings. (Contributed by Garvit in :issue:`10379`.)

The :func:`locale.getpreferredencoding` function now always returns ``'UTF-8'``
on Android or in the UTF-8 mode (:option:`-X` ``utf8`` option); the locale and
the *do_setlocale* argument are ignored.

math
----

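The What's New entry above says the UTF-8 mode switches `sys.stdin` and `sys.stdout` to the `surrogateescape` error handler. What that handler does can be shown in isolation with plain `bytes.decode()` and `str.encode()`. The sample bytes below are an arbitrary illustration, not taken from the commit.

```python
raw = b"caf\xe9.txt"  # Latin-1 encoded name; the 0xE9 byte is not valid UTF-8 here

# Strict decoding rejects the stray byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")
print(roundtrip == raw)
```

The lossless round trip is what lets undecodable input pass through the standard streams unchanged instead of raising an error.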
4 changes: 4 additions & 0 deletions Include/fileobject.h
@@ -28,6 +28,10 @@ PyAPI_DATA(const char *) Py_FileSystemDefaultEncodeErrors;
#endif
PyAPI_DATA(int) Py_HasFileSystemDefaultEncoding;

#if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x03070000
PyAPI_DATA(int) Py_UTF8Mode;
#endif

/* Internal API
The std printer acts as a preliminary sys.stderr until the new io
1 change: 1 addition & 0 deletions Include/pystate.h
@@ -38,6 +38,7 @@ typedef struct {
     int show_alloc_count;   /* -X showalloccount */
     int dump_refs;          /* PYTHONDUMPREFS */
     int malloc_stats;       /* PYTHONMALLOCSTATS */
+    int utf8_mode;          /* -X utf8 or PYTHONUTF8 environment variable */
 } _PyCoreConfig;

#define _PyCoreConfig_INIT (_PyCoreConfig){.use_hash_seed = -1}
6 changes: 6 additions & 0 deletions Lib/_bootlocale.py
@@ -9,6 +9,8 @@

 if sys.platform.startswith("win"):
     def getpreferredencoding(do_setlocale=True):
+        if sys.flags.utf8_mode:
+            return 'UTF-8'
         return _locale._getdefaultlocale()[1]
 else:
     try:
@@ -21,13 +23,17 @@ def getpreferredencoding(do_setlocale=True):
                 return 'UTF-8'
         else:
             def getpreferredencoding(do_setlocale=True):
+                if sys.flags.utf8_mode:
+                    return 'UTF-8'
                 # This path for legacy systems needs the more complex
                 # getdefaultlocale() function, import the full locale module.
                 import locale
                 return locale.getpreferredencoding(do_setlocale)
     else:
         def getpreferredencoding(do_setlocale=True):
             assert not do_setlocale
+            if sys.flags.utf8_mode:
+                return 'UTF-8'
             result = _locale.nl_langinfo(_locale.CODESET)
             if not result and sys.platform == 'darwin':
                 # nl_langinfo can return an empty string
5 changes: 3 additions & 2 deletions Lib/encodings/__init__.py
@@ -158,8 +158,9 @@ def search_function(encoding):
 if sys.platform == 'win32':
     def _alias_mbcs(encoding):
         try:
-            import _bootlocale
-            if encoding == _bootlocale.getpreferredencoding(False):
+            import _winapi
+            ansi_code_page = "cp%s" % _winapi.GetACP()
+            if encoding == ansi_code_page:
                 import encodings.mbcs
                 return encodings.mbcs.getregentry()
         except ImportError:
6 changes: 6 additions & 0 deletions Lib/locale.py
@@ -617,6 +617,8 @@ def resetlocale(category=LC_ALL):
     # On Win32, this will return the ANSI code page
     def getpreferredencoding(do_setlocale = True):
         """Return the charset that the user is likely using."""
+        if sys.flags.utf8_mode:
+            return 'UTF-8'
         import _bootlocale
         return _bootlocale.getpreferredencoding(False)
 else:
@@ -634,6 +636,8 @@ def getpreferredencoding(do_setlocale = True):
             def getpreferredencoding(do_setlocale = True):
                 """Return the charset that the user is likely using,
                 by looking at environment variables."""
+                if sys.flags.utf8_mode:
+                    return 'UTF-8'
                 res = getdefaultlocale()[1]
                 if res is None:
                     # LANG not set, default conservatively to ASCII
@@ -643,6 +647,8 @@ def getpreferredencoding(do_setlocale = True):
         def getpreferredencoding(do_setlocale = True):
             """Return the charset that the user is likely using,
             according to the system configuration."""
+            if sys.flags.utf8_mode:
+                return 'UTF-8'
             import _bootlocale
             if do_setlocale:
                 oldloc = setlocale(LC_CTYPE)
2 changes: 1 addition & 1 deletion Lib/subprocess.py
@@ -280,7 +280,7 @@ def _args_from_interpreter_flags():
     if dev_mode:
         args.extend(('-X', 'dev'))
     for opt in ('faulthandler', 'tracemalloc', 'importtime',
-                'showalloccount', 'showrefcount'):
+                'showalloccount', 'showrefcount', 'utf8'):
         if opt in xoptions:
             value = xoptions[opt]
             if value is True:
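The hunk above is cut off at `if value is True:`, so the rest of the loop is not visible here. A standalone sketch of how a parsed `-X` option mapping could be turned back into command-line arguments follows. The function name `xoptions_to_args` and the `name=value` formatting of non-`True` values are assumptions made for illustration, not code quoted from the commit.

```python
# Option names forwarded to child interpreters, as listed in the hunk above.
FORWARDED = ("faulthandler", "tracemalloc", "importtime",
             "showalloccount", "showrefcount", "utf8")


def xoptions_to_args(xoptions, forwarded=FORWARDED):
    """Rebuild '-X' arguments from a mapping shaped like sys._xoptions.

    A bare option such as '-X utf8' is stored with the value True, while
    '-X utf8=0' is stored with the string '0'. Formatting the latter as
    'name=value' is an assumption of this sketch.
    """
    args = []
    for opt in forwarded:
        if opt in xoptions:
            value = xoptions[opt]
            arg = opt if value is True else "%s=%s" % (opt, value)
            args.extend(("-X", arg))
    return args


print(xoptions_to_args({"utf8": True}))
print(xoptions_to_args({"utf8": "0", "tracemalloc": "5"}))
```

Options outside the forwarded list are dropped, and the output order follows the list rather than the mapping.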
1 change: 1 addition & 0 deletions Lib/test/test_builtin.py
@@ -1022,6 +1022,7 @@ def test_open(self):
         self.assertRaises(ValueError, open, 'a\x00b')
         self.assertRaises(ValueError, open, b'a\x00b')

+    @unittest.skipIf(sys.flags.utf8_mode, "utf-8 mode is enabled")
     def test_open_default_encoding(self):
         old_environ = dict(os.environ)
         try:
2 changes: 1 addition & 1 deletion Lib/test/test_c_locale_coercion.py
@@ -130,7 +130,7 @@ def get_child_details(cls, env_vars):
         that.
         """
         result, py_cmd = run_python_until_end(
-            "-c", cls.CHILD_PROCESS_SCRIPT,
+            "-X", "utf8=0", "-c", cls.CHILD_PROCESS_SCRIPT,
             __isolated=True,
             **env_vars
         )
10 changes: 2 additions & 8 deletions Lib/test/test_codecs.py
@@ -5,6 +5,7 @@
 import sys
 import unittest
 import encodings
+from unittest import mock

 from test import support

@@ -3180,16 +3181,9 @@ def test_incremental(self):
     def test_mbcs_alias(self):
         # Check that looking up our 'default' codepage will return
         # mbcs when we don't have a more specific one available
-        import _bootlocale
-        def _get_fake_codepage(*a):
-            return 'cp123'
-        old_getpreferredencoding = _bootlocale.getpreferredencoding
-        _bootlocale.getpreferredencoding = _get_fake_codepage
-        try:
+        with mock.patch('_winapi.GetACP', return_value=123):
             codec = codecs.lookup('cp123')
             self.assertEqual(codec.name, 'mbcs')
-        finally:
-            _bootlocale.getpreferredencoding = old_getpreferredencoding


class ASCIITest(unittest.TestCase):
2 changes: 2 additions & 0 deletions Lib/test/test_io.py
@@ -2580,6 +2580,7 @@ def test_reconfigure_line_buffering(self):
         t.reconfigure(line_buffering=None)
         self.assertEqual(t.line_buffering, True)

+    @unittest.skipIf(sys.flags.utf8_mode, "utf-8 mode is enabled")
     def test_default_encoding(self):
         old_environ = dict(os.environ)
         try:
@@ -2599,6 +2600,7 @@ def test_default_encoding(self):
             os.environ.update(old_environ)

     @support.cpython_only
+    @unittest.skipIf(sys.flags.utf8_mode, "utf-8 mode is enabled")
     def test_device_encoding(self):
         # Issue 15989
         import _testcapi
8 changes: 5 additions & 3 deletions Lib/test/test_sys.py
@@ -527,14 +527,16 @@ def test_sys_flags(self):
                  "inspect", "interactive", "optimize", "dont_write_bytecode",
                  "no_user_site", "no_site", "ignore_environment", "verbose",
                  "bytes_warning", "quiet", "hash_randomization", "isolated",
-                 "dev_mode")
+                 "dev_mode", "utf8_mode")
         for attr in attrs:
             self.assertTrue(hasattr(sys.flags, attr), attr)
             attr_type = bool if attr == "dev_mode" else int
             self.assertEqual(type(getattr(sys.flags, attr)), attr_type, attr)
         self.assertTrue(repr(sys.flags))
         self.assertEqual(len(sys.flags), len(attrs))

+        self.assertIn(sys.flags.utf8_mode, {0, 1, 2})
+
     def assert_raise_on_new_sys_type(self, sys_attr):
         # Users are intentionally prevented from creating new instances of
         # sys.flags, sys.version_info, and sys.getwindowsversion.
@@ -710,8 +712,8 @@ def test_c_locale_surrogateescape(self):
         # have no any effect
         out = self.c_locale_get_error_handler(encoding=':')
         self.assertEqual(out,
-                         'stdin: surrogateescape\n'
-                         'stdout: surrogateescape\n'
+                         'stdin: strict\n'
+                         'stdout: strict\n'
                          'stderr: backslashreplace\n')
         out = self.c_locale_get_error_handler(encoding='')
         self.assertEqual(out,