PEP 540: Add a new UTF-8 mode #73426
This issue tracks the implementation of PEP 540. The attached pep540_cli.py script can be used to experiment with it. |
PEP-540.patch: first draft Changes:
Allowed options:
Priorities (highest to lowest):
TODO:
|
Examples with pep540_cli.py.
Python 3.5:
$ python3 pep540_cli.py
sys.argv: ['pep540_cli.py']
stdin: UTF-8/strict
stdout: UTF-8/strict
stderr: UTF-8/backslashreplace
open(): UTF-8/strict
$ LC_ALL=C python3 pep540_cli.py
sys.argv: ['pep540_cli.py']
stdin: ANSI_X3.4-1968/surrogateescape
stdout: ANSI_X3.4-1968/surrogateescape
stderr: ANSI_X3.4-1968/backslashreplace
open(): ANSI_X3.4-1968/strict
Patched Python 3.7:
$ ./python pep540_cli.py
UTF-8 mode: 0
sys.argv: ['pep540_cli.py']
stdin: UTF-8/strict
stdout: UTF-8/strict
stderr: UTF-8/backslashreplace
open(): UTF-8/strict
$ LC_ALL=C ./python pep540_cli.py
UTF-8 mode: 1
sys.argv: ['pep540_cli.py']
stdin: utf-8/surrogateescape
stdout: utf-8/surrogateescape
stderr: utf-8/backslashreplace
open(): utf-8/surrogateescape
$ ./python -X utf8 pep540_cli.py
UTF-8 mode: 1
sys.argv: ['pep540_cli.py']
stdin: utf-8/surrogateescape
stdout: utf-8/surrogateescape
stderr: utf-8/backslashreplace
open(): utf-8/surrogateescape
$ ./python -X utf8=strict pep540_cli.py
UTF-8 mode: 2
sys.argv: ['pep540_cli.py']
stdin: utf-8/strict
stdout: utf-8/strict
stderr: utf-8/backslashreplace
open(): utf-8/strict |
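The attached pep540_cli.py is not reproduced in this thread. As a rough idea of how such output can be produced, here is a minimal sketch; the function name `describe_streams` and the use of a temporary file to probe open() are assumptions for illustration, not the contents of the attached script. It relies on `sys.flags.utf8_mode`, which only exists on Python 3.7 and later, so it falls back to 0 elsewhere.

```python
import os
import sys
import tempfile


def describe_streams():
    """Return (label, 'encoding/errors') pairs for the std streams and open().

    Hypothetical helper: it mimics the output format shown above, but it is
    not the code of the attached pep540_cli.py.
    """
    rows = []
    # sys.flags.utf8_mode exists on Python 3.7+; fall back to 0 on older versions.
    rows.append(("UTF-8 mode", str(getattr(sys.flags, "utf8_mode", 0))))
    for name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, name)
        # A stream may be None or replaced by an object without these attributes.
        encoding = getattr(stream, "encoding", None)
        errors = getattr(stream, "errors", None)
        rows.append((name, f"{encoding}/{errors}"))
    # Open a scratch file in text mode without an explicit encoding to see
    # which encoding and error handler open() picks by default.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path) as fp:
            rows.append(("open()", f"{fp.encoding}/{fp.errors}"))
    finally:
        os.unlink(path)
    return rows


if __name__ == "__main__":
    print("sys.argv:", sys.argv)
    for label, value in describe_streams():
        print(f"{label}: {value}")
```

Running it once with and once without `-X utf8` (or with `LC_ALL=C`) makes the differences between the modes visible in the same way as the examples above.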
PEP-540-2.patch: Patch version 2, updated to the latest version of the PEP-540. It has no more FIXME/TODO and has more unit tests. The main change is that the strict mode doesn't use strict anymore for OS data, but keeps surrogateescape. See the PEP for the rationale (especially the "Use the strict error handler for operating system data" alternative). |
Oops, I introduced an obvious bug in my latest refactoring. It's now fixed in the patch version 3: PEP-540-3.patch. |
Hum, PEP-540-3.patch doesn't work if the locale encoding is neither ASCII nor UTF-8: argv must be re-encoded.
$ LC_ALL=fr_FR ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\xff']
The result should not depend on the locale; it should be the same as:
$ LC_ALL=fr_FR.utf8 ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\udcff']
$ LC_ALL=C ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\udcff'] |
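The difference between the two results can be reproduced in pure Python. This is only an illustration of the codecs involved, not the C code path used for argv; it assumes that the `fr_FR` locale uses ISO-8859-1, which is typical but not guaranteed, and the helper names are hypothetical.

```python
def decode_utf8_mode(raw):
    """Decode OS bytes the way UTF-8 mode is expected to: UTF-8/surrogateescape."""
    return raw.decode("utf-8", "surrogateescape")


def decode_latin1_locale(raw):
    """Decode OS bytes as an ISO-8859-1 locale would (assumed encoding of fr_FR)."""
    return raw.decode("iso8859-1")


raw = b"\xff"

# Byte 0xff is invalid UTF-8, so surrogateescape maps it to the lone surrogate U+DCFF.
escaped = decode_utf8_mode(raw)
print(ascii(escaped))

# In ISO-8859-1 the same byte is the character U+00FF, which ascii() shows as '\xff'.
print(ascii(decode_latin1_locale(raw)))

# surrogateescape is lossless: encoding back yields the original byte.
assert escaped.encode("utf-8", "surrogateescape") == raw
```

The round trip at the end is the property that matters for argv: whatever bytes the OS passed in can be recovered exactly.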
I only tested the PEP-540 implementation on Linux. The PEP and its implementation should be adjusted for Windows, especially for Windows-only env vars like PYTHONLEGACYWINDOWSFSENCODING. Changes may also be needed for Mac OS X and Android, which always use UTF-8. Currently, the locale encoding is still used on these platforms (ex: by open()). Is it possible to have a locale encoding other than UTF-8 on Android, for example? |
I want to skip re-encoding. I don't trust wcstombs/mbstowcs: they may not guarantee round-tripping of arbitrary bytes. Can the -X utf8 option be processed before Py_Main()? |
I'm trying to implement that, but it's hard to factor out the common code. I will probably have to duplicate the code handling -E, -X utf8, PYTHONMALLOC and PYTHONUTF8 for wchar_t* (UCS4 or UTF-16) and char* (bytes). |
Hum, test_utf8mode lacks a unit test for the -E command line option: |
Patch version 4:
Note: This patch still has the sys.argv encoding bug with locale encodings other than ASCII and UTF-8. |
How about having locale.getpreferredencoding() return 'utf-8' in UTF-8 mode? |
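On interpreters that ship the final UTF-8 mode (Python 3.7 and later), this suggestion can be checked from the outside by starting a child interpreter with `-X utf8`. The sketch below is an assumption-laden check, not part of any patch in this issue; the helper name is hypothetical, and the spelling of the result differs between versions, so it is lowercased before comparing.

```python
import subprocess
import sys


def preferred_encoding_in_utf8_mode():
    """Ask a child interpreter started with -X utf8 for its preferred encoding.

    Hypothetical helper for illustration; requires Python 3.7+ for -X utf8.
    """
    code = "import locale; print(locale.getpreferredencoding(False))"
    result = subprocess.run(
        [sys.executable, "-X", "utf8", "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Lowercase the name because its spelling differs between Python versions.
    print(preferred_encoding_in_utf8_mode().lower())
```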
Oh, I just noticed that os.environ uses the hardcoded error handler "surrogateescape": it should be replaced with sys.getfilesystemencodeerrors() to support UTF-8 Strict mode. |
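As a minimal sketch of the idea, in Python rather than in the actual os module code: the encoder for environment values would take its error handler from `sys.getfilesystemencodeerrors()` instead of hardcoding "surrogateescape". The function name `encode_environ_value` is hypothetical.

```python
import os
import sys


def encode_environ_value(value):
    """Encode an environment string with the filesystem encoding and its
    configured error handler, instead of a hardcoded 'surrogateescape'.

    Hypothetical helper; this is what os.fsencode() already does for str input.
    """
    return value.encode(sys.getfilesystemencoding(),
                        sys.getfilesystemencodeerrors())


if __name__ == "__main__":
    print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
    # For str input, os.fsencode() uses the same encoding and error handler.
    print(encode_environ_value("HOME") == os.fsencode("HOME"))
```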
I did that in the patch for bpo-28188. The focus of the patch is to add bytes support on Windows for os.putenv and os.environb, but I also tried to maximize consistency (at least parallel structure) between the POSIX and Windows implementations. |
I rebased my PR on master. |
I removed old patches in favor of the now up to date PR 855. |
Oh, the PYTHONCOERCECLOCALE env var is read very early in main() by _Py_CoerceLegacyLocale(), so it ignores the -E command line option.
|
test_readline failed. It seems to be related to my commit: http://buildbot.python.org/all/#/builders/87/builds/360
======================================================================
Traceback (most recent call last):
File "/usr/home/buildbot/python/3.x.koobs-freebsd10/build/Lib/test/test_readline.py", line 219, in test_nonascii
self.assertIn(b"text 't\\xeb'\r\n", output)
AssertionError: b"text 't\\xeb'\r\n" not found in bytearray(b"^A^B^B^B^B^B^B^B\t\tx\t\r\n[\\303\\257nserted]|t\x07\x08\x08\x08\x08\x08\x08\x08\x07\x07xrted]|t\x08\x08\x08\x08\x08\x08\x08\x07\r\nresult \'[\\xefnsexrted]|t\'\r\nhistory \'[\\xefnsexrted]|t\'\r\n") |
IMHO test_readline should be fixed by ignoring the UTF-8 mode in Py_EncodeLocale/Py_DecodeLocale, but only when they are called from the Python readline module. We may need new functions, something like: Py_EncodeCurrentLocale/Py_DecodeCurrentLocale. I will work on a patch when I am back from holiday. In the meantime, I skipped the test to repair the FreeBSD 3.x buildbots. |
The attached test_all_locales.py is a test suite for locale functions: os.strerror(), locale.localeconv(), time.strftime(). I tested it on Linux Fedora 27, FreeBSD 11.0 and macOS 10.13.2. The test should always pass on Python 2.7. On Python 3.6 and the master branch with PR 5170, 2 tests on numeric localeconv() fail because Python uses the wrong encoding: see bpo-31900. master with PR 5170 now has fewer encoding bugs than Python 3.6. |
Oh, this change broke test_nonascii() of test_readline on FreeBSD. Previously, readline used the ASCII/surrogateescape encoding for the POSIX locale. Now, mbstowcs() / wcstombs() is called directly, with the surrogateescape error handler. |
test_readline passes again on all buildbots, especially on the FreeBSD 3.6 and 3.x buildbots. There are no more known issues, the implementation of the PEP-540 (UTF-8 Mode) is now complete! |
I partially reverted the commit 7ed7aea: on Android, UTF-8 is now always used, again. Paul Peny (aka pmpp) confirmed to me that my commit broke Python on Android, at least with API 19 (locales don't work properly before API 21). |
Oh, this change broke the mbcs alias on Windows, and the test_codecs and test_site tests (2 tests!) missed the bug :-(
I fixed it in: New changeset 04dd60e by Victor Stinner in branch 'main':