-
-
Notifications
You must be signed in to change notification settings - Fork 31.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite import machinery to work with unicode paths #53671
Comments
Python (2 and 3) is unable to load a module installed in a directory containing characters not encodable to the locale encoding. And Python doesn't work if it's installed in non-ASCII directory on Windows or with a locale encoding different than UTF-8. On Windows, the locale encoding is "mbcs", which is a small charset, unable to mix different languages, whereas the file system is fully unicode compatible (it uses UTF-16). Python should work with unicode strings (wchar_t*, Py_UNICODE* or PyUnicodeObject) instead of byte strings (char* or PyBytesObject), especially while loading a Python module. It's not an easy task because it requires to change a lot of code, especially in Python/import.c. I am working on this topic since some months and I have now a working patch. It's now possible to run Python from the source tree containing a non-ASCII character in C locale (ASCII encoding). Except just a minor bug in test_gdb, all tests of the test suite pass. I posted the whole patch on Rietveld for a review: The patch is huge because it fixes different things: a) import machinery (import.c, getpath.c, importdl.c, ...) (b), (c) and (d) can be fixed before/without (a). But (a) requires other parts to work correctly. If it's not possible to review the patch, I can try to split it in smaller parts. -- Related issues: bpo-3080: Full unicode import system -- See also my email sent to python-dev for more information: |
Oh, I forgot to say that I created an svn branch including my work: import_unicode. You can try it if you prefer svn to an huge patch. I created a branch so you can follow my work commit by commit using svn history. -- The patch is not completly done. There are still remaining FIXMEs. Some FIXME are not bugs, but improvments. The most important FIXME is to restore the support of bytes path in sys.path. I removed it temporary, because it was easier for me. |
I wrote a few minor comments on codereview. |
Which kind of test? Run the test suite in a non-ASCII directory with encoding different than utf-8 is enough. If the patch is accepted, the solution is maybe a specific buildbot. |
Another important TODO: use weak references for the code objects list. -- I tested my patch on Windows. I fixes bpo-8988 because non-ASCII characters are now correctly decoded with mbcs and not UTF-8. But it doesn't work with characters not encodable to mbcs. It looks like there are some remaining code using byte string. I fixed some of them in import_unicode branch, but it's not enough. It is not easy to investigate because Visual Studio refuse to compile the project if the project directory contains a character not encodable to mbcs. And it is unable to debug python if the project directory is renamed after the compilation. I will maybe retry with Cygwin or with the old school "printf" method. It looks like few Windows applications support characters not encodable to mbcs (locale encoding): MinGW and WinSCP do neither support such characters. |
After some tests on Windows, I realized that my patch is not enough to be fully unicode compliant (on Windows). Some functions are still using PyUnicode_DecodeFSDefault() or PyUnicode_EncodeFSDefault(). Until all functions are patched to use unicode strings, Python3 will not be fully unicode compliant *on Windows*. The problem is specific to Windows, because Python uses mbcs codec which doesn't support surrogateescape error handler. I think that this patch is already huge and complex, and it will be difficult to fix all issues at the same time. This patch does improve the situation: with the patch, Python is fully unicode compliant (except on Windows), and it fixes at least one issue on Windows (bpo-8988, it now uses the right encoding). |
The patch is too huge to be commited at once. I will split it again into smaller parts. First related commit: r83778 fixes tests for not encodable filenames. |
r83779 creates run_command(), it's just a refactorization. |
_Py_wchar2char.patch: create _Py_wchar2char() private function, and _wstat() and _wfopen() use it. _Py_wchar2char() function has been improved since the previous version posted to Rietveld: it now computes the exact length of the output buffer, instead of using wcslen(text)*10+1. Alone, this patch isn't really useful, but it prepares the code for next patches. |
r83783 creates run_file() subfunction. |
pyerr_warnformat.patch: create PyErr_WarnFormat() function, and use it in PyType_Ready() and PyUnicode_AsEncodedString(). The patch fixes also setup_context(): work on the unicode filename, not the encoded (bytes) filename. It does fix a bug because len is a number of characters, not a number of bytes: the number of bytes is bigger than the number of characters if the filename contains a non-ASCII character. Advantages of PyErr_WarnFormat() over PyOS_snprintf() + PyErr_WarnEx():
Differences with Rietveld's version: rename PyErr_WarnUnicode() to warn_unicode() (it's now a static function) and document PyErr_WarnFormat(). |
nullimporter_unicode.patch: patch NullImporter_init():
|
It looks like you are a fixing a bug in setup_context() at the same time as you introduce PyErr_WarnFormat(). Both changes should probably go in separately. The PyErr_WarnFormat() doc needs a "versionadded" tag. |
pitrou> It looks like you are a fixing a bug in setup_context() Right. r83860 fixes the bug, and I attached a new version of the patch (with :versionadded:). |
gutworth's comment about r83860: "Test?" |
Py_UNICODE_strrchr.patch: Create Py_UNICODE_strrchr() function. It will be used for zipimport to work on unicode paths instead of bytes paths. Antoine noticed that the input string is const whereas the output string is not const, which is unusual. I copy/pasted Py_UNICODE_strchr() prototype. I suppose that const input and non const input is required to be able to use the function on const strings. The GNU libc uses the same strchr() prototype in its C version of string.h. In the C++ version of the header, it defines the strchr() twice: once with const input and output, once with non const input and ouput. The right solution is the C++ way, but C doesn't support polymophism. |
I created a separated issue, bpo-9542, to add the new function PyUnicode_FSDecoder(). |
_Py_stat.patch: create _Py_stat() function. It will be used in import.c and zipimport.c. I added the function to import.c because, initially, I only used it there. But it's maybe not the best place for such function. posixmodule.c doesn't fit because it is not part of the bootstrap process. I created this function to get full unicode support on Windows (don't fallback to bytes using the evil mbcs encoding). In import.c and zipimport.c, it is used to check if the path is a regular file or if the path is a directory. That's why _Py_stat() only fills st_mode attribute (it's just enough). A better API would be maybe functions checking directly these properties? Maybe Py_isdir() (as os.path.isdir()) and Py_isreg()? Or if you prefer longer names: Py_is_directory() ad Py_is_regular_file()? Such functions can be implemented differently, eg. use GetFileAttributesW on Windows. I say that because of a comment found in NullImporter_init():
|
r83870 creates load_builtin() subfunction in import.c to prepare and simplify the big patch. |
I commited Py_UNICODE_strrchr.patch as r83933 after removing the useless start variable. |
_PyFile_FromFdUnicode.patch: create _PyFile_FromFdUnicode() function. It will be used in import.c to open a file using an unicode filename. For _PyFile_FromFd(), I kept the previous behaviour: clear the exception on PyUnicode_DecodeFSDefault() error. For fileobject.h: I used the same style than unicodeobject.h, one argument per line with their name. I prefer to write the argument name because the header can be used as a quick documentation. As _PyFile_FromFd(), name is optional (can be NULL) for _PyFile_FromFdUnicode(). |
Actually, I'm not sure there's much point since the "name" attribute is currently read-only: >>> f = open(1, "wb")
>>> f.name = "foo"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: attribute 'name' of '_io.BufferedWriter' objects is not writable
>>>
>>> g = open(1, "w")
>>> g.name = "bar"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: attribute 'name' of '_io.TextIOWrapper' objects is not writable |
(About PyFile_FromFd) Oh, it remembers me bpo-4762. I closed this issue with the message "The last problem occurs with imp.find_module(). But imp.find_module() also returns a "filename" argument, so I don't think that the issue really matters. Let's close it ;-)". Even if it would be possible to set f(.buffer).raw.name, the solution is maybe just to ignore the argument (don't set any name attribute). Can we change such public function? |
r83971 enables test.support.TESTFN_UNDECODEABLE on non-Windows OSes. |
I commited nullimporter_unicode.patch with an unit test as r83972. |
r83973 ignores the name argument of PyFile_FromFd() because it was already ignored (it did always produce an error) and it avoids my complex _PyFile_FromFdUnicode.patch. Thanks Antoine to having notice that name was ignored. |
Note about _Py_wchar2char(): it is possible to convert character by character (instead of working on substrings) because the input string doesn't contain surrogate pairs. _Py_char2wchar() ensures the the output string doens't contain surrogate pairs: if a byte sequence produces a surrogate pairs, the byte sequence is encoded using the surrogateescape error handler (U+DC00..U+DCFF range). I should add this note in _Py_wchar2char() comment. |
r83981 closes bpo-9560: avoid the filename in _syscmd_file() to fix a bug with non encodable filenames in platform.architecture(). |
I know this is not introduced by your patch, just moved, but couldn’t |
I wasn't sure that it was a typo, so I kept it unchanged. It's now fixed by |
r83989 creates _Py_wchar2char() function (_Py_wchar2char-2.patch). |
r83990 closes bpo-9542 by creating the PyUnicode_FSDecoder() PyArg_ParseTuple parser. |
r83976 adds PyErr_WarnFormat() (pyerr_warnformat-2.patch). |
I created bpo-9599: Add PySys_FormatStdout and PySys_FormatStderr functions. |
r84012 creates _Py_stat(). It is a little bit different than the attached patch (_Py_stat.patch): it doesn't clear Python exception on unicode conversion error. |
r84012 patchs zipimporter_init() to use the new PyUnicode_FSDecoder() and use Py_UNICODE* (unicode) strings instead of char* (byte) strings. |
r84030 creates _Py_fopen() for PyUnicodeObject path. |
zipimport_read_directory.patch: patch for read_directory() function of the zipimport module to support unencodable filenames. This patch requires bpo-9599 (PySys_FormatStderr). The patch changes the encoding of the name: decode name byte string using the file system encoding (and the PEP-383 on POSIX) instead of the utf-8 in strict mode. |
r83972 breaks OS X buildbots: support.TESTFN_UNENCODABLE is not defined if sys.platform == 'darwin'. File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_imp.py", line 309, in <module> |
It breaks test_unicode_file on OS X, too: File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_unicode_file.py", line 8, in <module> |
I tried to fix Mac OS X (TESTFN_UNENCODABLE) with r84035, but I don't have access to Mac OS X to test and my patch was not correct. It should now be ok with r84080. |
zipimport_read_directory.patch commited as r84095. |
Py_UNICODE_strncmp.patch: create Py_UNICODE_strncmp() function. |
Py_UNICODE_strncmp.patch was wrong for n=0. New version based on libiberty/strncmp.c source code. |
Py_UNICODE_strncmp-2.patch commited as r84111. |
r84120: get_data() function of zipimport uses an unicode path. |
r84121: repr() method zipimporter object uses unicode. |
r84122 saves/restores the exception around "filename = _PyUnicode_AsString(co->co_filename);" because it raises an unicode error on unencodable filename. |
r84168 creates PyModule_GetFilenameObject(). I created a separated issue for the patch reencoding all filenames when setting the filesystem encoding: bpo-9630. |
See also bpo-1552880. |
Yes. in bpo-1552880 I tried to make as minimal a change as possible. This particular patch is still in use in EVE Online, which is installed in various strange and exotic paths in the orient.. The trick I employed there was to encode everything to utf-8 at the earliest oppertunity (current working directory, any unicode members in sys.path, etc.) and let the import.c machinery crunch that utf-8 code. This works because path separators and other such stuff doesn't change under the utf-8 encoding. As a final step, the utf8 encoded working string is converted back to unicode and native unicode API calls (on windows) are used to stat() and open() files. A similar trick could be used on unix by converting from utf-8 to whatever native encoding the stat() and open() calls expect. My patch never got accepted because I didn't have the time to put in the extra effort to make it cross platform. |
oops, it's r84013 (not r84012) |
Py_UNICODE_strcat.patch: create Py_UNICODE_strcat() function. Py_UNICODE_strdup.patch: create Py_UNICODE_strdup() function. |
r84429 creates Py_UNICODE_strcat() (change with the patch: return the right value). r84430 creates PyUnicode_strdup() (change with the patch: rename the function from Py_UNICODE_strdup() to PyUnicode_strdup() and mangle the function name). |
r85115 closes bpo-9630: an important patch for bpo-9425, redecode all filenames when setting the filesystem encoding. Next tasks (maybe not in this order):
|
Starting at r85691, the full test suite of Python 3.2 pass with ASCII, ISO-8859-1 and UTF-8 locale encodings in a non-ascii directory. The work on this issue is done. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: