bpo-29240: PEP 540: Add a new UTF-8 mode #855

vstinner · 2017-03-27T22:03:04Z

* Add the -X utf8 command line option.
* Add the PYTHONUTF8 environment variable.
* Add sys.flags.utf8_mode.
* If the LC_CTYPE locale is "C" at startup, automatically enable the UTF-8 mode.
* In UTF-8 mode, open() now uses the UTF-8 encoding by default.
* subprocess._args_from_interpreter_flags(): inherit the UTF-8 mode using the -X utf8 or -X utf8=strict command line options.
* Skip some tests relying on the current locale if the UTF-8 mode is enabled.
* Add test_utf8mode.py.
* _Py_DecodeUTF8_surrogateescape() gets a new optional parameter to also return the length (number of wide characters).
* Update Py_DecodeLocale() and Py_EncodeLocale() to support the UTF-8 mode.
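
As a rough illustration of the switches listed above, the following sketch starts a child interpreter and reads back sys.flags.utf8_mode. It assumes Python 3.7 or later; the helper name utf8_mode_in_child is hypothetical and not part of this PR.

```python
import os
import subprocess
import sys


def utf8_mode_in_child(args=(), env_value=None):
    """Run a child interpreter and return its sys.flags.utf8_mode as an int.

    args: extra interpreter options, e.g. ["-X", "utf8"].
    env_value: value for PYTHONUTF8 in the child, or None to leave it unset.
    """
    env = dict(os.environ)
    # Start from a known state: drop any inherited PYTHONUTF8 setting.
    env.pop("PYTHONUTF8", None)
    if env_value is not None:
        env["PYTHONUTF8"] = env_value
    cmd = [sys.executable, *args, "-c", "import sys; print(sys.flags.utf8_mode)"]
    out = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True).stdout
    return int(out.decode("ascii").strip())


if __name__ == "__main__":
    print("with -X utf8:     ", utf8_mode_in_child(["-X", "utf8"]))
    print("with PYTHONUTF8=1:", utf8_mode_in_child(env_value="1"))
```

The value reported without either switch depends on the locale and the Python version, so the sketch does not assume a default.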

https://bugs.python.org/issue29240

mention-bot · 2017-03-27T22:03:16Z

@Haypo, thanks for your PR! By analyzing the history of the files in this pull request, we identified @zooba, @benjaminp and @serhiy-storchaka to be potential reviewers.

zooba

Would like to see some improved documentation before this is merged, but I think the rest of the change looks okay.

zooba · 2017-03-27T23:19:53Z

Doc/library/sys.rst

@@ -463,6 +468,10 @@ always available.
      Windows is no longer guaranteed to return ``'mbcs'``. See :pep:`529`
      and :func:`_enablelegacywindowsfsencoding` for more information.

+   .. versionchanged:: 3.7
+      The UTF-8 mode can now changes the encoding.


Either: "UTF-8 mode can now change the encoding" or "UTF-8 mode now changes the encoding", neither of which are very good at explaining what the impact is.

Is the point that specifying both UTF-8 mode and legacy encoding is not supported? Or specifying UTF-8 mode will override the legacy encoding setting?

zooba · 2017-03-27T23:20:06Z

Doc/using/cmdline.rst

@@ -423,6 +424,9 @@ Miscellaneous options
   .. versionadded:: 3.6
      The ``-X showalloccount`` option.

+   .. versionchanged:: 3.7


"versionadded"

zooba · 2017-03-27T23:21:08Z

Include/fileobject.h

@@ -28,6 +28,10 @@ PyAPI_DATA(const char *) Py_FileSystemDefaultEncodeErrors;
 #endif
 PyAPI_DATA(int) Py_HasFileSystemDefaultEncoding;

+#if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x03070000
+PyAPI_DATA(int) Py_UTF8Mode;


Is this an important piece of information? Is it so important that it has to be in the limited API? What would I use this for?

zooba · 2017-03-27T23:23:28Z

Modules/main.c

-            break;
+        }
+        else if (c == 'X') {
+            if (wcscmp(_PyOS_optarg, L"utf8") == 0) {


wcsicmp? Any need to be case-sensitive here?

zooba · 2017-03-27T23:27:03Z

Ping @ncoghlan - more changes to startup sequence :)

vstinner · 2017-03-27T23:34:38Z

FYI I didn't look again at my code. I only recreated this pull request so as not to lose the code, since my old PR was on the old CPython repository.

I still have to update the PEP and post it to python-dev.

Questions like "wcsicmp? Any need to be case-sensitive here?" should be asked on the PEP directly, not on the implementation.

You should wait until the PEP is accepted before reviewing the implementation. The PEP may still change.

zooba · 2017-03-28T15:57:14Z

@Haypo Ah fair enough :) I haven't been paying attention to all the email recently, so I figured it was accepted already.

vstinner · 2017-12-05T22:15:11Z

The implementation is not complete: the command line arguments and environment variables are not re-decoded from UTF-8 once Py_Main() detects that the user requested the UTF-8 mode (-X utf8 or PYTHONUTF8). But thanks to my refactoring work on Modules/main.c ( bugs.python.org/issue32030 ), fixing this issue should now be much simpler.

* Add -X utf8 command line option, PYTHONUTF8 environment variable and a new sys.flags.utf8_mode flag.
* If the LC_CTYPE locale is "C" at startup: enable automatically the UTF-8 mode.
* Add _winapi.GetACP(). encodings._alias_mbcs() now calls _winapi.GetACP() to get the ANSI code page.
* locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8 mode. As a side effect, open() now uses the UTF-8 encoding by default in this mode.
* Py_DecodeLocale() and Py_EncodeLocale() now use the UTF-8 encoding in the UTF-8 mode.
* Update subprocess._args_from_interpreter_flags() to handle -X utf8.
* Skip some tests relying on the current locale if the UTF-8 mode is enabled.
* Add test_utf8mode.py.
* _Py_DecodeUTF8_surrogateescape() gets a new optional parameter to return also the length (number of wide characters).
* pymain_get_global_config() and pymain_set_global_config() now always copy flag values, rather than only copying if the new value is greater than the old value.
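
The side effect on open() mentioned in the commit message can be checked with a small sketch like the one below. It assumes Python 3.7 or later; the function name write_in_utf8_mode and the CHILD script are made up for illustration. The child runs with -X utf8, writes a non-ASCII string without passing an encoding, and the parent inspects the raw bytes.

```python
import os
import subprocess
import sys
import tempfile

# Script executed by the child interpreter. It is pure ASCII; the
# "\u00e9" escape is interpreted by the child, not by the parent.
CHILD = r"""
import locale, sys
with open(sys.argv[1], "w") as f:   # no explicit encoding on purpose
    f.write("caf\u00e9")
print(locale.getpreferredencoding())
"""


def write_in_utf8_mode():
    """Return (preferred encoding name, bytes written) from a -X utf8 child."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.txt")
        proc = subprocess.run(
            [sys.executable, "-X", "utf8", "-c", CHILD, path],
            stdout=subprocess.PIPE,
            check=True,
        )
        with open(path, "rb") as f:
            data = f.read()
    return proc.stdout.decode("ascii").strip(), data


if __name__ == "__main__":
    encoding, data = write_in_utf8_mode()
    print(encoding, data)
```

The exact spelling of the encoding name ('UTF-8' or 'utf-8') varies between Python versions, so compare it case-insensitively.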

vstinner · 2017-12-08T15:15:26Z

I updated the PR for the 4th version of the PEP 540: getpreferredencoding() now returns UTF-8 in the UTF-8 Mode.

methane · 2017-12-10T17:24:25Z

Programs/python.c

+            Py_UTF8Mode = 1;
+        }
+        setlocale(LC_CTYPE, ctype);
+        PyMem_RawFree(ctype);


What is this setlocale() for?

vstinner · 2017-12-10

Line 64 changes the locale to the user's LC_CTYPE locale. This line restores the locale to its previous value.

methane · 2017-12-10

Shouldn't the previous value be obtained with setlocale(LC_CTYPE, NULL) instead of setlocale(LC_CTYPE, "")?

vstinner · 2017-12-10

Oh, you're right. I fixed it in a new commit.
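
The save-and-restore pattern discussed in this exchange can be sketched in Python with the locale module, where passing None queries the current locale and passing "" switches to the user's environment locale. This is only an analogy for the C code under review; the helper name probe_user_ctype is hypothetical.

```python
import locale


def probe_user_ctype():
    """Return the user's LC_CTYPE locale name, leaving the process locale unchanged.

    Returns None if the locale configured in the environment is not supported.
    """
    # Query only: a None argument returns the current setting without changing it.
    saved = locale.setlocale(locale.LC_CTYPE, None)
    try:
        # An empty string selects the locale configured in the environment.
        user = locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        user = None
    finally:
        # Restore the value captured before the switch.
        locale.setlocale(locale.LC_CTYPE, saved)
    return user


if __name__ == "__main__":
    print("user LC_CTYPE:", probe_user_ctype())
```

Saving the value with "" instead of None would change the locale first and then "restore" the already-changed value, which is the bug pointed out above.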

vstinner · 2017-12-12T10:05:44Z

@methane: You approved my PEP. This is the implementation; all tests pass on Linux and Windows. While the implementation is not perfect, I propose to merge it right now to get more time to test it before the Python 3.7 final release.

Remaining issue: when the UTF-8 mode is enabled manually (-X utf8), the command line arguments and environment variables are decoded from the locale encoding instead of being decoded from UTF-8. The fix should be easy, but I didn't have time yet to implement it.

Another less important issue: the code to check if the LC_CTYPE locale is the POSIX locale is not currently shared between locale coercion (PEP 538) and UTF-8 Mode (PEP 540).

methane · 2017-12-12T13:56:26Z

Programs/python.c

+    /* UTF-8 mode */
+    char *old_ctype = setlocale(LC_CTYPE, NULL);
+    if (old_ctype != NULL) {
+        old_ctype = _PyMem_RawStrdup(old_ctype);


Maybe it can be:

    if (_Py_LegacyLocaleDetected()) {
        Py_UTF8Mode = 1;
        _Py_CoerceLegacyLocale();
    }

See here:

cpython/Programs/python.c, lines 75 to 77 in 4ae06c5:

    if (_Py_LegacyLocaleDetected()) {
        _Py_CoerceLegacyLocale();
    }

vstinner · 2017-12-13

"if (_Py_LegacyLocaleDetected()) { Py_UTF8Mode = 1;"

Right. It works as expected, I updated my PR.

methane · 2017-12-12T15:41:27Z

I feel that adding more code to Programs/python.c is a bad smell.
Py_Main() is a very high-level API, but it is very hard for Unix users to use correctly.
How about adding Py_UnixMain() and moving all the code in main() into it?

And when the -X utf8 option is found, we can decode from char **argv again.
Since mbstowcs() doesn't guarantee round-tripping, that is better than re-encoding
wchar_t **argv.
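
To make the round-tripping concern concrete, here is a small sketch using Python's surrogateescape error handler, which Py_DecodeLocale() is documented to use for undecodable bytes. It models the Python-level behaviour only, not the C mbstowcs() function, and the helper name redecode is hypothetical.

```python
def redecode(text, old_encoding, new_encoding="utf-8"):
    """Recover the original bytes from text and decode them with another encoding.

    This only works if the first decoding step was lossless, which is what
    the surrogateescape error handler provides.
    """
    raw = text.encode(old_encoding, "surrogateescape")
    return raw.decode(new_encoding, "surrogateescape")


# Bytes as they would arrive in char **argv for the argument "café".
raw = "caf\u00e9".encode("utf-8")

# Lossless first pass: each undecodable byte becomes a lone surrogate.
first = raw.decode("ascii", "surrogateescape")

# A lossy first pass, for comparison: the original bytes cannot be recovered.
lossy = raw.decode("ascii", "replace")

if __name__ == "__main__":
    print(ascii(first))
    print(ascii(redecode(first, "ascii")))
    print(ascii(lossy))
```

If the first decoding step is not guaranteed to be lossless, re-encoding the wide-character string may not give back the original bytes, which is the argument for decoding from the original char **argv instead.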

vstinner · 2017-12-13T01:35:14Z

> I feel that adding more code to Programs/python.c is a bad smell.

It is.

> Py_Main() is a very high-level API, but it is very hard for Unix users to use correctly. How about adding Py_UnixMain() and moving all the code in main() into it?

main() and Py_Main() are very complex. With PEP 432, Nick Coghlan, Eric Snow and I are working on making this code better. See for example https://bugs.python.org/issue32030

Currently, Py_Main() (Modules/main.c) and _PyPathConfig_Calculate() (Modules/getpath.c and PC/getpathp.c) are fully implemented with wchar_t*.

Parsing the command line options using char* on Unix but wchar_t* on Windows would make the code more complex. I expect that many lines of code would have to be duplicated: one version for char*, one version for wchar_t*.

For all these reasons, I propose to merge this incomplete PR and write a separate PR for the most complex part: re-encoding the wchar_t* command line arguments, implementing Py_UnixMain(), or another even better option.

vstinner · 2017-12-13T10:56:44Z

Oops, I forgot to push my main.c change. It's now done.

the-knights-who-say-ni added the CLA signed label Mar 27, 2017

vstinner changed the title ~~bpo-: 29240: Implement the PEP 540~~ bpo-29240: Implement the PEP 540 Mar 27, 2017

brettcannon added the type-feature A feature request or enhancement label Mar 27, 2017

zooba reviewed Mar 27, 2017

vstinner changed the title ~~bpo-29240: Implement the PEP 540~~ [WIP don't review yet!] bpo-29240: Implement the PEP 540 Mar 27, 2017

vstinner changed the title ~~[WIP don't review yet!] bpo-29240: Implement the PEP 540~~ [WIP] bpo-29240: Implement the PEP 540 Aug 10, 2017

Mariatta added needs rebase and removed needs rebase labels Oct 9, 2017

vstinner requested a review from gpshead as a code owner December 5, 2017 22:09

vstinner changed the title ~~[WIP] bpo-29240: Implement the PEP 540~~ bpo-29240: Implement the PEP 540 Dec 5, 2017

vstinner changed the title ~~bpo-29240: Implement the PEP 540~~ bpo-29240: PEP 540: Add a new UTF-8 mode Dec 5, 2017

methane reviewed Dec 10, 2017

vstinner added 3 commits December 10, 2017 18:59

Merge branch 'master' into pep540

700abba

Fix conflict in Modules/main.c: pymain_set_global_config() and _Py_CheckHashBasedPycsMode.

Fix test_codecs.test_mbcs_alias()

53c03f7

Mock _winapi.GetACP(), not _bootlocale.getpreferredencoding().

Fix main(): copy the current LC_CTYPE

24fc140

methane reviewed Dec 12, 2017

Merge branch 'master' into pep540

359930d

main.c: reuse _Py_LegacyLocaleDetected()

fa8ed15

vstinner merged commit 91106cd into python:master Dec 13, 2017

vstinner deleted the pep540 branch December 13, 2017 11:29
