Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-29240: PEP 540: Add a new UTF-8 mode #855

Merged
merged 6 commits into from
Dec 13, 2017
Merged

bpo-29240: PEP 540: Add a new UTF-8 mode #855

merged 6 commits into from
Dec 13, 2017

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Mar 27, 2017

  • Add -X utf8 command line option
  • Add PYTHONUTF8 environment variable
  • Add sys.flags.utf8_mode
  • If the LC_CTYPE is "C" at startup: enables automatically the UTF-8
    mode
  • In UTF-8 mode, open() now uses UTF-8 encoding by default
  • subprocess._args_from_interpreter_flags(): inherit the UTF-8 mode
    using -X utf8 or -X utf8=strict command line options.
  • Skip some tests relying on the current locale if the UTF-8 mode is
    enabled.
  • Add test_utf8mode.py
  • _Py_DecodeUTF8_surrogateescape() gets a new optional parameter to
    return also the length (number of wide characters).
  • Update Py_DecodeLocale() and Py_EncodeLocale() to support the UTF-8
    mode

https://bugs.python.org/issue29240

@mention-bot
Copy link

@Haypo, thanks for your PR! By analyzing the history of the files in this pull request, we identified @zooba, @benjaminp and @serhiy-storchaka to be potential reviewers.

@vstinner vstinner changed the title bpo-: 29240: Implement the PEP 540 bpo-29240: Implement the PEP 540 Mar 27, 2017
@brettcannon brettcannon added the type-feature A feature request or enhancement label Mar 27, 2017
Copy link
Member

@zooba zooba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like to see some improved documentation before this is merged, but I think the rest of the change looks okay.

@@ -463,6 +468,10 @@ always available.
Windows is no longer guaranteed to return ``'mbcs'``. See :pep:`529`
and :func:`_enablelegacywindowsfsencoding` for more information.

.. versionchanged:: 3.7
The UTF-8 mode can now changes the encoding.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either: "UTF-8 mode can now change the encoding" or "UTF-8 mode now changes the encoding", neither of which are very good at explaining what the impact is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the point that specifying both UTF-8 mode and legacy encoding is not supported? Or specifying UTF-8 mode will override the legacy encoding setting?

@@ -423,6 +424,9 @@ Miscellaneous options
.. versionadded:: 3.6
The ``-X showalloccount`` option.

.. versionchanged:: 3.7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"versionadded"

@@ -28,6 +28,10 @@ PyAPI_DATA(const char *) Py_FileSystemDefaultEncodeErrors;
#endif
PyAPI_DATA(int) Py_HasFileSystemDefaultEncoding;

#if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x03070000
PyAPI_DATA(int) Py_UTF8Mode;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an important piece of information? Is it so important that it has to be in the limited API? What would I use this for?

Modules/main.c Outdated
break;
}
else if (c == 'X') {
if (wcscmp(_PyOS_optarg, L"utf8") == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wcsicmp? Any need to be case-sensitive here?

@zooba
Copy link
Member

zooba commented Mar 27, 2017

Ping @ncoghlan - more changes to startup sequence :)

@vstinner
Copy link
Member Author

FYI I didn't look again at my code. I only recreated this pull request to not loose the code, since my old PR was on the old CPython repository.

I still have to update the PEP and post it to python-dev.

Questions like "wcsicmp? Any need to be case-sensitive here?" should be asked on the PEP directly, not on the implementation.

You should wait until the PEP is accepted before reviewing the implementation. The PEP may still change.

@vstinner vstinner changed the title bpo-29240: Implement the PEP 540 [WIP don't review yet!] bpo-29240: Implement the PEP 540 Mar 27, 2017
@zooba
Copy link
Member

zooba commented Mar 28, 2017

@Haypo Ah fair enough :) I haven't been paying attention to all the email recently, so I figured it was accepted already.

@vstinner vstinner changed the title [WIP don't review yet!] bpo-29240: Implement the PEP 540 [WIP] bpo-29240: Implement the PEP 540 Aug 10, 2017
@vstinner vstinner changed the title [WIP] bpo-29240: Implement the PEP 540 bpo-29240: Implement the PEP 540 Dec 5, 2017
@vstinner vstinner changed the title bpo-29240: Implement the PEP 540 bpo-29240: PEP 540: Add a new UTF-8 mode Dec 5, 2017
@vstinner
Copy link
Member Author

vstinner commented Dec 5, 2017

The implementation is not complete: the command line arguments and environment variables are not re-decoded from UTF-8 once Py_Main() detects that the user requested the UTF-8 mode (-X utf8 or PYTHONUTF8). But thanks to my refactoring work on Modules/main.c ( bugs.python.org/issue32030 ), fixing this issue should now be much simpler.

* Add -X utf8 command line option, PYTHONUTF8 environment variable
  and a new sys.flags.utf8_mode flag.
* If the LC_CTYPE locale is "C" at startup: enable automatically the
  UTF-8 mode.
* Add _winapi.GetACP(). encodings._alias_mbcs() now calls
  _winapi.GetACP() to get the ANSI code page
* locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8
  mode. As a side effect, open() now uses the UTF-8 encoding by
  default in this mode.
* Py_DecodeLocale() and Py_EncodeLocale() now use the UTF-8 encoding
  in the UTF-8 mode.
* Update subprocess._args_from_interpreter_flags() to handle -X utf8
* Skip some tests relying on the current locale if the UTF-8 mode is
  enabled.
* Add test_utf8mode.py.
* _Py_DecodeUTF8_surrogateescape() gets a new optional parameter to
  return also the length (number of wide characters).
* pymain_get_global_config() and pymain_set_global_config() now
  always copy flag values, rather than only copying if the new value
  is greater than the old value.
@vstinner
Copy link
Member Author

vstinner commented Dec 8, 2017

I updated the PR for the 4th version of the PEP 540: getpreferredencoding() now returns UTF-8 in the UTF-8 Mode.

Py_UTF8Mode = 1;
}
setlocale(LC_CTYPE, ctype);
PyMem_RawFree(ctype);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this setlocale() for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line 64 changes the locale to the user LC_CTYPE locale. This line restores the locale to its previous value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't previous value be got by setlocale(LC_CTYPE, NULL) instead of setlocale(LC_CTYPE, "")?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you're right. I fixed it in a new commit.

Fix conflict in Modules/main.c: pymain_set_global_config() and
_Py_CheckHashBasedPycsMode.
Mock _winapi.GetACP(), not _bootlocale.getpreferredencoding().
@vstinner
Copy link
Member Author

@methane: You approved my PEP. This is implementation, all tests pass on Linux and Windows. While the implementation is not perfect, I propose to merge it right now to be get more time before the Python 3.7 final to test it.

Remaining issue: when the UTF-8 mode is enabled manually (-X utf8), the command line arguments and environment variables are decoded from the locale encoding instead of being decoded from UTF-8. The fix should be easy, but I didn't have time yet to implement it.

Another less important issue: the code to check if the LC_CTYPE locale is the POSIX locale is not currently shared between locale coercion (PEP 538) and UTF-8 Mode (PEP 540).

/* UTF-8 mode */
char *old_ctype = setlocale(LC_CTYPE, NULL);
if (old_ctype != NULL) {
old_ctype = _PyMem_RawStrdup(old_ctype);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, it can be:

    if (_Py_LegacyLocaleDetected()) {
        Py_UTF8Mode = 1;
        _Py_CoerceLegacyLocale();
    }

See here

if (_Py_LegacyLocaleDetected()) {
_Py_CoerceLegacyLocale();
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"if (_Py_LegacyLocaleDetected()) { Py_UTF8Mode = 1;"

Right. It works as expected, I updated my PR.

@methane
Copy link
Member

methane commented Dec 12, 2017

I feel adding more code to Program/python.c is bad smell.
Py_Main() is very high level API, but very hard to use in right way for Unix users.
How about adding Py_UnixMain() and move all code in main() to it?

And when -X utf8 option is found, we can decode from char **argv again.
Since mbstowcs() doesn't guarantee round tripping, it is better than re-encode
wchar_t **argv.

@vstinner
Copy link
Member Author

I feel adding more code to Program/python.c is bad smell.

It is.

Py_Main() is very high level API, but very hard to use in right way for Unix users. How about adding Py_UnixMain() and move all code in main() to it?

main() and Py_Main() are very complex. With the PEP 432, Nick Coghlan, Eric Snow and me are working on making this code better. See for example https://bugs.python.org/issue32030

Currently, Py_Main() (Modules/main.c) and _PyPathConfig_Calculate() (Modules/getpath.c and PC/getpathp.c) are fully implemented with wchar_t*.

Parsing the command line options using char* on Unix but wchar_t* would make the code more complex. I expect that many lines of code would have to be duplicate, one version for char*, one version for wchar_t*.

For all these reasons, I propose to merge this uncomplete PR and write a different PR for the most complex part, re-encode wchar_t* command line arguments, implement Py_UnixMain() or another even better option?

@vstinner
Copy link
Member Author

Oops, I forgot to push my main.c change. It's now done.

@vstinner vstinner merged commit 91106cd into python:master Dec 13, 2017
@vstinner vstinner deleted the pep540 branch December 13, 2017 11:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-feature A feature request or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants