unicode support for os.listdir() #37949

Closed
jvr mannequin opened this issue Feb 9, 2003 · 43 comments

@jvr
Copy link
Mannequin

jvr mannequin commented Feb 9, 2003

BPO 683592
Nosy @malemburg, @gvanrossum, @loewis, @jackjansen
Files
  • listdir_unicode.patch: unicode support for os.listdir, take 3
  • listdir_unicode_arg.patch: only return unicode if the argument was unicode, + doc
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/loewis'
    closed_at = <Date 2003-03-04.19:43:47.000>
    created_at = <Date 2003-02-09.21:43:09.000>
    labels = ['library']
    title = 'unicode support for os.listdir()'
    updated_at = <Date 2003-03-04.19:43:47.000>
    user = 'https://bugs.python.org/jvr'

    bugs.python.org fields:

    activity = <Date 2003-03-04.19:43:47.000>
    actor = 'jvr'
    assignee = 'loewis'
    closed = True
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2003-02-09.21:43:09.000>
    creator = 'jvr'
    dependencies = []
    files = ['5013', '5014']
    hgrepos = []
    issue_num = 683592
    keywords = ['patch']
    message_count = 43.0
    messages = ['42747', '42748', '42749', '42750', '42751', '42752', '42753', '42754', '42755', '42756', '42757', '42758', '42759', '42760', '42761', '42762', '42763', '42764', '42765', '42766', '42767', '42768', '42769', '42770', '42771', '42772', '42773', '42774', '42775', '42776', '42777', '42778', '42779', '42780', '42781', '42782', '42783', '42784', '42785', '42786', '42787', '42788', '42789']
    nosy_count = 6.0
    nosy_names = ['lemburg', 'gvanrossum', 'loewis', 'nnorwitz', 'jackjansen', 'jvr']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue683592'
    versions = []

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Feb 9, 2003

    The attached patch makes os.listdir() return unicode strings on platforms that have Py_FileSystemDefaultEncoding defined as non-NULL.

    I'm by no means sure this is the right thing to do; it does seem right on OSX where Py_FileSystemDefaultEncoding is (or rather: will be real soon, I'm waiting for Jack's approval) utf-8. I'd be happy to add the code in an OSX-specific switch.

    A more subtle variant could perhaps only return unicode strings if the file name is not ASCII.

    @jvr jvr mannequin closed this as completed Feb 9, 2003
    @jvr jvr mannequin assigned loewis Feb 9, 2003
    @jvr jvr mannequin added the stdlib Python modules in the Lib dir label Feb 9, 2003
    @gvanrossum
    Copy link
    Member

    Logged In: YES
    user_id=6380

    At the very least, I'd like it to return Unicode only when
    the original string isn't just ASCII.

    @nnorwitz
    Copy link
    Mannequin

    nnorwitz mannequin commented Feb 10, 2003

    Logged In: YES
    user_id=33168

    The code which uses unicode APIs should probably be wrapped
    with:

    #ifdef Py_USING_UNICODE
     /* code */
    #endif

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Feb 10, 2003

    Logged In: YES
    user_id=92689

    Applied both suggestions.

    However, I'm not sure if my ASCII test does the right thing, or at least I don't think it does if Py_FileSystemDefaultEncoding is not a superset of ASCII.

    @malemburg
    Copy link
    Member

    Logged In: YES
    user_id=38388

    Your test will probably catch most cases, but it could fail
    for e.g. UTF-16.

    The only true test would be to first convert to Unicode and then
    try to convert back to ASCII. If you get an error you can be
    sure that the text is not ASCII compatible. Given that .listdir()
    involves lots of IO I think the added performance hit wouldn't
    be noticeable.
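
The round-trip test described above can be sketched in Python 3 terms, with `bytes` standing in for the 8-bit string and `str` for the Unicode object. The function names are hypothetical and are not taken from the patch; the sketch only shows why a byte-wise check can be fooled by an encoding such as UTF-16.

```python
def looks_ascii_bytewise(raw: bytes) -> bool:
    """Naive test: every byte is below 128."""
    return all(b < 128 for b in raw)


def is_ascii_roundtrip(raw: bytes, fs_encoding: str) -> bool:
    """Decode with the file system encoding, then try to encode as ASCII."""
    text = raw.decode(fs_encoding)  # raises UnicodeDecodeError on invalid input
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# U+0101 (LATIN SMALL LETTER A WITH MACRON) is the two bytes 0x01 0x01 in UTF-16-LE.
raw_name = "\u0101".encode("utf-16-le")
print(looks_ascii_bytewise(raw_name))             # True: both bytes are < 128
print(is_ascii_roundtrip(raw_name, "utf-16-le"))  # False: the character is not ASCII
```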

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Feb 10, 2003

    Logged In: YES
    user_id=92689

    I don't see how UTF-16 could be a valid value for Py_FileSystemDefaultEncoding, as for most platforms the file name can't contain null bytes. From looking at the NAMELEN() spaghetti, it seems platforms without HAVE_DIRENT_H might still support embedded null bytes. Any wisdom on this?

    @malemburg
    Copy link
    Member

    Logged In: YES
    user_id=38388

    The file system does not need to support embedded \0 chars
    even if it supports UTF-16. It only happens that your test
    assumes a one-byte-per-character encoding, which may not
    always be true. With UTF-16 your test will see lots of \0 bytes
    but not necessarily ones which are ord(x)>=128.

    I'm not sure whether other variable length encodings can result
    in \0 bytes, e.g. the Asian ones.

    There's also the possibility of the encoding mapping the ASCII
    range to other non-ASCII characters, e.g. ShiftJIS does this for
    the Yen sign.

    If you absolutely want to use the simple test, I'd at least
    restrict the test to an ASCII isalnum(x) test and then try the
    encode/decode method I described if this test fails.

    Note that isalnum() can be locale dependent on some platforms,
    so you have to hard-code it.

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Feb 10, 2003

    Logged In: YES
    user_id=92689

    Ok, I went for your original suggestion: always convert to unicode and then try to convert to ascii. See new patch. Or should this use the default encoding? Hm.

    @malemburg
    Copy link
    Member

    Logged In: YES
    user_id=38388

    Good question. The default encoding would better fit
    into the concept, I guess.

    Instead of PyUnicode_AsASCIIString(v) you'd
    have to use PyUnicode_AsEncodedString(v, NULL, "strict").

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Feb 10, 2003

    Logged In: YES
    user_id=92689

    On the other hand, if it's not ASCII, wouldn't a unicode string be more appropriate to begin with? If it's encodable with the default encoding, this will happen as soon as the string is used in a piece of unicode-unaware code, right?

    @malemburg
    Copy link
    Member

    Logged In: YES
    user_id=38388

    Right, except that injecting Unicode into Unicode-unaware code
    can be dangerous (e.g. some code might require a string object
    to work on).

    E.g. if someone sets the default encoding to Latin-1 he wouldn't
    expect os.listdir() to suddenly return Unicode for him.

    This may be a problem in general for the change to os.listdir().
    We'll just have to see what happens during the alpha and beta
    phases.

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Feb 10, 2003

    Logged In: YES
    user_id=92689

    Here's an argument for ASCII and against the default encoding: if the default encoding is different from Py_FileSystemDefaultEncoding, things go wrong: an 8-bit string passed to file() will be interpreted as Py_FileSystemDefaultEncoding (more precisely: will not be interpreted at all), not the default encoding...

    @malemburg
    Copy link
    Member

    Logged In: YES
    user_id=38388

    Ok, let's look at it from a different angle: things that you
    get from os.listdir() should be compatible with (at least) all
    the os.path tools and os itself. Converting to Unicode has the
    advantage that slicing and indexing into the path names will
    not break the paths (unlike UTF-8 encoded 8-bit strings which
    tend to break when you slice them).

    That said, I think you're right about the ASCII approach
    provided that the os, os.path tools can actually properly cope
    with Unicode.

    What I worry about is that if os.listdir() gives back Unicode
    for e.g. Latin-1 filenames and the application then passes the
    Unicode names to a C API using "s", perfectly working code will
    break... then again the C code should really use "es" for
    decoding to the Py_FileSystemDefaultEncoding as is done in e.g.
    fileobject.c.

    I really don't know what to do here...
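
The slicing point made in this comment can be illustrated with a small Python 3 sketch, where `str` plays the role of the Unicode string and `bytes` the UTF-8 encoded 8-bit string. The file name is made up for the example.

```python
name = "h\u00e9llo.txt"          # 'héllo.txt'
raw = name.encode("utf-8")       # é occupies two bytes: 0xC3 0xA9

text_prefix = name[:2]           # slices between characters
byte_prefix = raw[:2]            # slices between bytes, splitting é in half

print(text_prefix)               # hé
try:
    byte_prefix.decode("utf-8")
except UnicodeDecodeError as exc:
    print("broken prefix:", exc.reason)
```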

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Feb 10, 2003

    Logged In: YES
    user_id=92689

    I'm pretty sure os.path deals just fine with unicode strings (it's all pure string manipulations, isn't it?)

    Worries: well, apparently on Windows os.listdir() has been returning unicode for some time, so it's not like we're breaking completely new ground here.

    If anything breaks it's probably good this happens, as it gives an opportunity to fix things... I just found several examples of potential breakage: _bsddb.c parses a filename arg with the "z" format specifier. gdbmmodule.c uses "s". bsddbmodule.c and dbmmodule.c as well.

    I'm not sure the above modules work on Windows with non-ascii filenames at all, but it doesn't look like it. Besides Windows (for which my patch is not relevant), only OSX sets Py_FileSystemDefaultEncoding, so any new breakage won't reach a mass market right away <wink>.

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Feb 25, 2003

    Logged In: YES
    user_id=92689

    Having missed 2.3a2, I'd like to get this in way ahead of 2.3b1. Any objections?

    @gvanrossum
    Copy link
    Member

    Logged In: YES
    user_id=6380

    OK, check it in, just be prepared for contingencies. I
    really cannot judge whether this is right on all platforms.

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Feb 25, 2003

    Logged In: YES
    user_id=92689

    Checked in as rev. 2.287 of Modules/posixmodule.c. Leaving this item open for now, in case MvL has comments when he gets back.

    @jackjansen
    Copy link
    Member

    Logged In: YES
    user_id=45365

    I think this patch does more bad than good.

    A practical problem is that os.path.walk doesn't work anymore if there are
    non-ascii directories in the directory tree (os.listdir will return these as unicode names, but doesn't accept unicode on input). See bug bpo-696261. An additional problem is that various other methods in posix don't do the unicode conversion, so for instance os.getcwd() will return 8-bit strings in Py_FileSystemDefaultEncoding which are incompatible with the unicode returned by listdir.

    My preferred solution would be to do the unicode trick everywhere. Second best would be to retract the whole thing and think about it a bit more for Python 2.4.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 3, 2003

    Logged In: YES
    user_id=21627

    I dislike this change, as it introduces inconsistency across
    platforms. On Win32, as a result of PEP-277, Unicode file
    names are only returned for Unicode directory names. There
    was an explicit discussion about this aspect of PEP-277, and
    this interface was accepted as The Right Thing. So I think
    Unix should follow here: return byte string file names for
    byte string directory names, and Unicode file names for
    Unicode directory names. Support for Unicode directory names
    should also invoke the file system encoding for the
    directory name.
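
For comparison, current Python 3 applies the same argument-type rule, with `str` and `bytes` in place of Unicode and byte strings. The following is a minimal sketch of that behaviour, not of the 2003 patch; the temporary directory and the file name `example.txt` exist only for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so the listing is not empty.
    open(os.path.join(tmp, "example.txt"), "w").close()

    text_names = os.listdir(tmp)                # str path in, str names out
    byte_names = os.listdir(os.fsencode(tmp))   # bytes path in, bytes names out

print(text_names)   # ['example.txt']
print(byte_names)   # [b'example.txt']
```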

    I'm also unsure about the exception handling. If there is a
    file name that doesn't decode according to the file system
    encoding, it raises the Unicode error. This means that all
    other file names are lost. This might be acceptable if the
    Unicode-in-Unicode-out strategy is used; in its current
    form, the change can and will break existing applications
    (which find all kinds of funny byte sequences on disk that
    don't work with the user's file system encoding).

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Mar 3, 2003

    Logged In: YES
    user_id=92689

    Jack, as noted on bug #696261, the bug is that os.listdir() doesn't do the right thing with a Unicode string argument (it should use Py_FileSystemDefaultEncoding but it doesn't); I'm working on it.

    Martin: I now see that PEP-277 says "Under this proposal, [os.listdir] will return a list of Unicode strings when its path argument is Unicode". I don't like this much (I really think we should push Unicode a little harder onto the users), but I'll look into changing the unix end of os.listdir() to do the same. I'll also review your exception comment.

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Mar 3, 2003

    Logged In: YES
    user_id=92689

    I've attached a patch that fixes the bug as well as addresses the unicode arg vs. return value inconsistency that Martin noted. The exception behavior has not yet been changed.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 3, 2003

    Logged In: YES
    user_id=21627

    Looks good, but incomplete: If the argument is Unicode,
    *all* results should be Unicode. There should also be
    documentation changes.

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Mar 3, 2003

    Logged In: YES
    user_id=92689

    Ok, done, including a minor patch to Doc/lib/libos.tex. I also adapted the Misc/NEWS items. I'm not sure how to change the os.listdir() doco to better reflect the actual situation without mentioning Py_FileSystemDefaultEncoding...

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 3, 2003

    Logged In: YES
    user_id=21627

    I see. The right thing, IMO, is to always return Unicode
    objects for Unicode arguments, just the same way the "et"
    parser works: if the file system encoding is NULL, fall back
    to the system default encoding. Then, you can generalize the
    docs to [NT and Unix] (with OS X being a flavour of Unix),
    or drop the OS reference completely (in which case the other
    os modules are effectively buggy).

    There might be a function already to fall back to the system
    default encoding; perhaps just passing NULL works.

    There should be a documentation section on Unicode file
    names; I volunteer to write it (Summary: NT+ uses Unicode
    natively, W9x uses "mbcs", OS X uses UTF-8, which equates to
    "Unicode natively", Unices with nl_langinfo(CODESET) use
    that, all others use the system default encoding).

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Mar 3, 2003

    Logged In: YES
    user_id=92689

    I think this could be achieved by removing the "Py_FileSystemDefaultEncoding != NULL" part of the condition on line 1805, as indeed passing NULL as the encoding to PyUnicode_FromEncodedObject causes the default encoding to be used. Shall I check it in like that?

    I'm not quite happy with the fact that exceptions are silently dropped: should a warning be issued instead? Especially when using the default encoding, exceptions are not unlikely I suppose.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 3, 2003

    Logged In: YES
    user_id=21627

    Clearing the error is bad, I agree. I see two options:
    reraise the exception, deleting the result obtained so far
    (i.e. as the code did that the latest patch removes), OR add
    a byte string instead of the Unicode string into the result.
    Even though I have proposed the latter in the past, I could
    also accept the former; applications that anticipate that
    exception then just need to re-invoke listdir with a byte
    string, and deal with the result themselves.

    With these changes, the patch is fine with me.

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Mar 3, 2003

    Logged In: YES
    user_id=92689

    Applied to CVS as:
    Modules/posixmodule.c: 2.288
    Doc/lib/libos.tex: 1.115
    Misc/NEWS: 1.687

    Unicode errors are propagated as in the original version of the patch, libos.tex mentions Win NT/2k/XP and Unix.

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Mar 3, 2003

    Logged In: YES
    user_id=92689

    Martin, assigning this item to you. Please close it if you deem the changes in CVS correct.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 4, 2003

    Logged In: YES
    user_id=21627

    The current code looks fine to me. Closing this patch.

    @gvanrossum
    Copy link
    Member

    Logged In: YES
    user_id=6380

    I haven't seen the code, but I have a complaint.

    On Linux, when I have a file named '\xff' (i.e. its name is
    the single byte with value 255), os.listdir(u'.') gives me a
    UnicodeDecodeError.

    Is that really progress?

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Mar 4, 2003

    Logged In: YES
    user_id=92689

    Would you prefer the error be silenced and a byte string be used instead? If so, should there be a warning?

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 4, 2003

    Logged In: YES
    user_id=21627

    Guido's scenario was precisely the reason why Unix was left
    out from consideration for PEP-277.

    However, it is better than it sounds: There is a good chance
    that invoking locale.setlocale(locale.LC_CTYPE, "") prior to
    invoking listdir will overcome the problem, as the setlocale
    call will set the file system encoding to the user's
    preference. If \xff is a valid file name in the user's
    preferred encoding, then listdir will succeed in converting
    this file name to a Unicode string.

    It might be useful to set the file system encoding on Unix
    to the user's preferred encoding unconditionally (i.e. not
    as a side effect of invoking setlocale). It might also be
    useful to expose the file system encoding read-only for
    inspection.
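
For readers on current Python, the value can be inspected with `sys.getfilesystemencoding()`. The sketch below assumes Python 3; to my understanding the file system encoding there is chosen at interpreter startup, so the `setlocale` call only mirrors the call discussed above and is not expected to change the reported value.

```python
import locale
import sys

try:
    # Adopt the user's preferred locale for character handling.
    locale.setlocale(locale.LC_CTYPE, "")
except locale.Error:
    pass  # the environment names a locale that is not installed

fs_encoding = sys.getfilesystemencoding()
print("file system encoding:", fs_encoding)

# nl_langinfo is only available on Unix-like platforms.
if hasattr(locale, "nl_langinfo"):
    print("locale codeset:", locale.nl_langinfo(locale.CODESET))
```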

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Mar 4, 2003

    Logged In: YES
    user_id=92689

    It would seem that even with a user's locale there's a chance os.listdir() fails when passed a unicode argument. I'm not sure it's reasonable for os.listdir() to fail at all (if the directory to be listed exists and we have the right permissions).

    If it's all too difficult to get right, I'm happy to put the listdir unicode support in a MacOSX switch. I know nothing about locales so I'm really not in a position to straighten this out. All I know is that if Py_FileSystemDefaultEncoding is known to be utf-8, it's just dumb _not_ to return unicode. You guys figure out the rest.

    @gvanrossum
    Copy link
    Member

    Logged In: YES
    user_id=6380

    The setlocale call indeed works.

    I think I'd be happier if this was set by default, but I
    don't know what other consequences there would be.

    @gvanrossum
    Copy link
    Member

    Logged In: YES
    user_id=6380

    Maybe the filesystem default encoding should be set to
    Latin-1 by default (when nothing better is known about it)?
    Then it's hard to imagine how the conversion could fail,
    since every Latin-1 byte maps 1-1 to the corresponding
    Unicode code point.
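
The one-to-one mapping mentioned here is easy to check. This is a small Python 3 sketch of the property only; it says nothing about whether Latin-1 is the right default.

```python
all_bytes = bytes(range(256))

# Latin-1 decoding maps byte value n to code point n, so it cannot fail.
decoded = all_bytes.decode("latin-1")
print([ord(ch) for ch in decoded] == list(range(256)))   # True

# The single-byte file name from the earlier complaint decodes to U+00FF.
print(hex(ord(b"\xff".decode("latin-1"))))               # 0xff
```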

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Mar 4, 2003

    Logged In: YES
    user_id=92689

    I think it would be better to simply return byte strings if the file system encoding isn't known. (This btw. was what my original patch did.)

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 4, 2003

    Logged In: YES
    user_id=21627

    I disagree with the last assertion: In *particular* if the
    file system encoding is UTF-8, there is a good chance that
    decoding will fail (unlike if it is latin-1; decoding will
    then never fail - it may just produce mojibake).
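
A short Python 3 sketch of the contrast drawn in this paragraph: decoding with the wrong codec either succeeds and yields mojibake (Latin-1) or fails outright (UTF-8). The sample name is made up.

```python
original = "caf\u00e9"                  # 'café'
on_disk = original.encode("utf-8")      # b'caf\xc3\xa9'

wrong = on_disk.decode("latin-1")       # succeeds, but the text is wrong
print(wrong)                            # cafÃ©

# Going the other way, Latin-1 bytes are usually not valid UTF-8.
try:
    original.encode("latin-1").decode("utf-8")
except UnicodeDecodeError:
    print("Latin-1 bytes rejected by the UTF-8 decoder")
```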

    OS X seems to make a guarantee to always return UTF-8 from
    its low-level API, but I distrust this guarantee until I see
    it with my own eyes :-) E.g. what happens if you mount an
    NFS tree, and the NFS server gives file names in some other
    encoding?

    I see the following options:

    • only enable the code for OS X. I dislike this option, as
      it essentially freezes the Unix status to non-Unicode (we
      won't get further insights, the de jure status won't change,
      de facto, all files will be encoded in the locale's encoding).

    • leave the code as-is, documenting the possibility of
      exceptions.

    • add byte strings instead of Unicode strings into the
      result for non-decodable strings. This gives a mixed-type
      result, which is fine if you only pass the resulting file
      names to stat() or open(), and will likely break the
      application if it tries to display the file names somehow.
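
The second and third options can be contrasted in a short Python 3 sketch. The helper names are hypothetical and the code is an illustration of the two strategies, not the C code in posixmodule.c.

```python
from typing import List, Union


def decode_names_strict(raw_names: List[bytes], fs_encoding: str) -> List[str]:
    """Leave the code as-is: the first undecodable name raises UnicodeDecodeError."""
    return [raw.decode(fs_encoding) for raw in raw_names]


def decode_names_with_fallback(
    raw_names: List[bytes], fs_encoding: str
) -> List[Union[str, bytes]]:
    """Mixed-type result: undecodable names are kept as byte strings."""
    result: List[Union[str, bytes]] = []
    for raw in raw_names:
        try:
            result.append(raw.decode(fs_encoding))
        except UnicodeDecodeError:
            result.append(raw)
    return result


names = [b"readme.txt", b"\xff", "caf\u00e9".encode("utf-8")]
mixed = decode_names_with_fallback(names, "utf-8")
print(mixed)   # ['readme.txt', b'\xff', 'café']
```

With the strict variant the whole listing is lost as soon as one name fails to decode; with the fallback the caller has to be prepared for both types in the same list.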

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 4, 2003

    Logged In: YES
    user_id=21627

    Setting the file system encoding on startup should be fine,
    except that we need another setlocale/query/restore locale
    sequence. This is, in principle, bad, as there is no
    guarantee that the restore locale operation really produces
    the original state, and may cause problems if other threads
    are already running. In practice, it appears to work out
    just fine, as we use such sequences already (e.g. to undo
    the readline initialization).

    @jackjansen
    Copy link
    Member

    Logged In: YES
    user_id=45365

    I just did a test (created 254 files with all bytes except / and null in their names on a linux server, mounted the partition over NFS on MacOSX) and indeed MacOSX tries to interpret the bytes as UTF-8 and fails.

    I know that conversion works for HFS and HFS+ volumes (which carry a filename encoding with them, or you have to specify it when mounting). I assume it works for AFP and SMB (which also carries encoding info, IIRC) but I can't test this. I haven't a clue about webdav and such.

    Something to keep in mind is that we are really trying to solve someone else's problem: the inability of NFS and most unixen to handle file system encodings. If I'm on a latin-1 machine and I nfs-mount your latin-2 partition I will see garbage filenames.

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Mar 4, 2003

    Logged In: YES
    user_id=92689

    Here's a note about file system encodings on OSX, including a few words about NFS: http://developer.apple.com/qa/qa2001/qa1173.html.

    I propose to fall back to a byte string if conversion to unicode fails.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 4, 2003

    Logged In: YES
    user_id=21627

    I only partially agree that this is somebody else's problem:
    On Unix, it is always considered application responsibility
    to interpret file names as characters if they need to -
    hence the lack of a system-provided encoding strategy. So it
    is the problem of Python or the Python application, and I
    think we should try to shield the application from these
    issues as well as we can.

    Therefore, I'm in favour of jvr's latest proposal (use byte
    strings as the last resort), hoping that the error case will
    be infrequent.

    @gvanrossum
    Copy link
    Member

    Logged In: YES
    user_id=6380

    On the one hand a user who isn't interested in encodings
    shouldn't be passing a Unicode argument. On the other hand,
    Unicode strings have a way of sneaking into your application
    when you least suspect them. E.g. Tkinter returns them, so
    does IDLE, and I see them used more and more in Zope 3.

    FWIW, I like Just's "fall back to bytestrings" approach.

    @jvr
    Copy link
    Mannequin Author

    jvr mannequin commented Mar 4, 2003

    Logged In: YES
    user_id=92689

    I've committed the "fallback-to-byte-strings" behavior.
    It's in posixmodule.c rev. 2.290.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022