New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unicode support for os.listdir() #37949
Comments
The attached patch makes os.listdir() return unicode strings, on plaforms that have Py_FileSystemDefaultEncoding defined as non-NULL. I'm by no means sure this is the right thing to do; it does seem right on OSX where Py_FileSystemDefaultEncoding is (or rather: will be real soon, I'm waiting for Jack's approval) utf-8. I'd be happy to add the code in an OSX-specific switch. A more subtle variant could perhaps only return unicode strings if the file name is not ASCII. |
Logged In: YES At the very least, I'd like it to return Unicode only when |
Logged In: YES The code which uses unicode APIs should probably be wrapped #ifdef Py_USING_UNICODE
/* code */
#endif |
Logged In: YES Applied both suggestions. However, I'm not sure if my ASCII test does the right thing, or at least I don't think it does if Py_FileSystemDefaultEncoding is not a superset of ASCII. |
Logged In: YES Your test will probably catch most cases, but it could fail The only true test would be to first convert to Unicode and then |
Logged In: YES I don't see hot UTF-16 could be a valid value for Py_FileSystemDefaultEncoding, as for most platforms the file name can't contain null bytes. My looking at the NAMELEN() spaghetti, it seems platforms without HAVE_DIRENT_H might still support embedded null bytes. Any wisdom on this? |
Logged In: YES The file system does not need to support embedded \0 chars I'm not sure whether other variable length encodings can result There's also the possibility of the If you absolutely want to use the simple test, I'd at least Note that isalnum() can be locale dependent on some |
Logged In: YES Ok, I went for your original suggestion: always convert to unicode and then try to convert to ascii. See new patch. Or should this use the default encoding? Hm. |
Logged In: YES Good question. The default encoding would better fit Instead of PyUnicode_AsASCIIString(v) you'd |
Logged In: YES On the other hand, if it's not ASCII, wouldn't a unicode string be more appropriate to begin with? If it's encodable with the default encoding, this will happen as soon as the string is used in a piece of unicode-unaware code, right? |
Logged In: YES Right, except that injecting Unicode into Unicode-unaware code E.g. if someone sets the default encoding to Latin-1 he wouldn't This may be a problem in general for the change to os.listdir(). |
Logged In: YES Here's an argument for ASCII and against the default encoding: if the default encoding is different from Py_FileSystemDefaultEncoding, things go wrong: an 8-bit string passed to file() will be interpreted as Py_FileSystemDefaultEncoding (more precisely: will not be interpreted at all), not the default encoding... |
Logged In: YES Ok, let's look at it from a different That said, I think you're right about the ASCII approach What I worry about is that if os.listdir() gives back I really don't know what to do here... |
Logged In: YES I'm pretty sure os.path deals just fine with unicode strings (it's all pure string manipulations, isn't it?) Worries: well, apparently on Windows os.listdir() has been returning unicode for some time, so it's not like we're breaking completely new grounds here. If anything breaks it's probably good this happens, as it gives an opportunity to fix things... I just found several example of potential breakage: _bsddb.c parses a filename arg with the "z" format specifier. gdbmmodule.c uses "s". bsddbmodule.c and dbmmodule.c as well. I'm not sure the above modules work on Windows with non-ascii filenames at all, but it doesn't look like it. Besides Windows (for which my patch is not relevant), only OSX sets Py_FileSystemDefaultEncoding, so any new breakage won't reach a mass market right away <wink>. |
Logged In: YES Having missed 2.3a2, I'd like to get this in way ahead of 2.3b1. Any objections? |
Logged In: YES OK, check it in, just be prepared for contingencies. I |
Logged In: YES Checked in as rev. 2.287 of Modules/posixmodule.c. Leaving this item open for now, in case MvL has comments when he gets back. |
Logged In: YES I think this patch does more bad than good. A practical problem is that os.path.walk doesn't work anymore if there are My preferred solution would be to do the unicode trick everywhere. Second best would be to retract the whole thing and think about it a bit more for Python 2.4. |
Logged In: YES I dislike this change, as it introduces inconsistency across I'm also unsure about the exception handling. If there is a |
Logged In: YES Jack, as noted on #bug 696261, the bug is that os.listdir() doesn't do the right thing with a Unicode string argument (it should use Py_FileSystemDefaultEncoding but it doesn't; I'm working on it. Martin: I now see that PEP-277 says "Under this proposal, [os.listdir] will return a list of Unicode strings when its path argument is Unicode". I don't like this much (I really think we should push Unicode a little harder onto the users), but I'll look into changing the unix end of os.listdir() to do the same. I'll also review your exception comment. |
Logged In: YES I've attached a patch that fixes the bug as well as addresses the unicode arg vs. return value inconsistency that Martin noted. The exception behavior has not yet been changed. |
Logged In: YES Looks good, but incomplete: If the argument is Unicode, |
Logged In: YES Ok, done, including a minor patch to Doc/lib/libos.tex. I also adapted the Misc/NEWS items. I'm not sure how to change the os.listdir() doco to better reflect the actual situation without mentioning Py_FileSystemDefaultEncoding... |
Logged In: YES I see. The right thing, IMO, is to always return Unicode There might be a function already to fall back to the system There should be a documentation section on Unicode file |
Logged In: YES I think this could be achieved by removing the "Py_FileSystemDefaultEncoding != NULL" part of the condition on line 1805, as indeed passing NULL as the encoding to PyUnicode_FromEncodedObject causes the default encoding to be used. Shall I check it in like that? I'm not quite happy with the fact that exceptions are silently dropped: should a warning be issued instead? Especially when using the default encoding, exceptions are not unlikely I suppose. |
Logged In: YES Clearing the error is bad, I agree. I see two options: With these changes, the patch is fine with me. |
Logged In: YES Applied to CVS as: Unicode errors are propagated as in the original version of the patch, libos.tex mentions Win NT/2k/XP and Unix. |
Logged In: YES Martin, assigning this item to you. Please close it if you deem the changes in CVS correct. |
Logged In: YES The current code looks fine to me. Closing this patch. |
Logged In: YES I haven't seen the code, but I have a complaint. On Linux, when I have a file named '\xff' (i.e. its name is Is that really progress? |
Logged In: YES Would you prefer the error be silenced and a byte string be used instead? If so, should there be a warning? |
Logged In: YES Guido's scenario was precisely the reason why Unix was left However, it is better than it sounds: There is a good chance It might be useful to set the file system encoding on Unix |
Logged In: YES It would seem that even with a user's locale there's a chance os.listdir() fails when passed a unicode argument. I'm not sure it's reasonable for os.listdir() to fail at all (if the directory to be listed exists and we the right permissions). If it's all too difficult to get right, I'm happy to put the listdir unicode support in a MacOSX switch. I know nothing about locales so I'm really not in a position to straighten this out. All I know is that if Py_FileSystemDefaultEncoding is known to be utf-8, it's just dumb _not_ to return unicode. You guys figure out the rest. |
Logged In: YES The setlocale call indeed works. I think I'd be happier if this was set by default, but I |
Logged In: YES Maybe the filesystem default encoding should be set to |
Logged In: YES I think it would be better to simply return byte strings if the file system encoding isn't know. (This btw. was what my original patch did.) |
Logged In: YES I disagree with the last assertion: In *particular* if the OS X seems to make a guarantee to always return UTF-8 from I see the following options:
|
Logged In: YES Setting the file system encoding on startup should be fine, |
Logged In: YES I just did a test (created 254 files with all bytes except / and null in their names on a linux server, mounted the partition over NFS on MacOSX) and indeed MacOSX tries to interpret the bytes as UTF-8 and fails. I know that conversion works for HFS and HFS+ volumes (which carry a filename encoding with them, or you have to specify it when mounting). I assume it works for AFP and SMB (which also carries encoding info, IIRC) but I can't test this. I haven't a clue about webdav and such. Something to keep in mind is that we are really trying to solve someone else's problem: the inability of NFS and most unixen to handle file system encodings. If I'm on a latin-1 machine and I nfs-mount your latin-2 partition I will see garbage filenames. |
Logged In: YES Here's a note about file system encodings on OSX, including a few words about NFS: http://developer.apple.com/qa/qa2001/qa1173.html. I propose to fall back to a byte string if conversion to unicode fails. |
Logged In: YES I only partially agree that this is somebody else's problem: Therefore, I'm in favour of jvr's latest proposal (use byte |
Logged In: YES On the one hand a user who isn't interested in encodings FWIW, I like Just's "fall back to bytestrings" aproach. |
Logged In: YES I've committed the "fallback-to-byte-strings" behavior. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: