incorrect os.path.supports_unicode_filenames #38817

Closed
jvr mannequin opened this issue Jul 8, 2003 · 30 comments
Labels
stdlib (Python modules in the Lib dir), tests (Tests in the Lib/test dir), type-bug (An unexpected behavior, bug, or error)

Comments

@jvr
Mannequin

jvr mannequin commented Jul 8, 2003

BPO 767645
Nosy @loewis, @ronaldoussoren, @vstinner, @ned-deily, @ezio-melotti, @bitdancer, @florentx
Files
  • test_supports_unicode_filenames.patch: Tests supports_unicode_filenames against its documented value - fails on Linux
  • posixpath_darwin.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-09-17.23:37:39.272>
    created_at = <Date 2003-07-08.09:42:15.000>
    labels = ['tests', 'type-bug', 'library']
    title = 'incorrect os.path.supports_unicode_filenames'
    updated_at = <Date 2010-09-17.23:37:39.270>
    user = 'https://bugs.python.org/jvr'

    bugs.python.org fields:

    activity = <Date 2010-09-17.23:37:39.270>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-09-17.23:37:39.272>
    closer = 'vstinner'
    components = ['Library (Lib)', 'Tests']
    creation = <Date 2003-07-08.09:42:15.000>
    creator = 'jvr'
    dependencies = []
    files = ['15843', '18879']
    hgrepos = []
    issue_num = 767645
    keywords = ['patch']
    message_count = 30.0
    messages = ['16955', '16956', '16957', '16958', '16959', '16960', '16961', '16962', '16963', '16964', '16965', '97652', '97655', '97658', '97660', '101132', '114252', '116064', '116065', '116068', '116069', '116214', '116215', '116347', '116348', '116354', '116366', '116386', '116429', '116740']
    nosy_count = 10.0
    nosy_names = ['loewis', 'jvr', 'ronaldoussoren', 'vstinner', 'ned.deily', 'ezio.melotti', 'r.david.murray', 'joe.amenta', 'flox', 'BreamoreBoy']
    pr_nums = []
    priority = 'low'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue767645'
    versions = ['Python 2.6', 'Python 3.1', 'Python 2.7', 'Python 3.2']

    @jvr
    Mannequin Author

    jvr mannequin commented Jul 8, 2003

    At least on OSX, unicode file names are pretty much fully
    supported, yet os.path.supports_unicode_filenames is False
    (it comes from posixpath.py, which hard codes it). What
    would be a proper way to detect unicode filename support
    for posix platforms?

    @jvr (mannequin) added the stdlib label (Python modules in the Lib dir) on Jul 8, 2003
    @brettcannon
    Member

    Logged In: YES
    user_id=357491

    What happens if you try to create a file using Unicode names?
    Could a test get the temp directory for the platform, write a file
    with Unicode in it, and then check for an error? Or if it always
    succeeds, write it, and then see if the results match?

    In other words, does writing Unicode to an ASCII file system ever
    lead to a mangling of the name?
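
A minimal sketch of the probe suggested above, assuming modern Python 3 (`tempfile.TemporaryDirectory` did not exist when this was written). The helper name `probe_unicode_filename` and its four result labels are hypothetical, chosen only for illustration:

```python
import os
import tempfile
import unicodedata


def probe_unicode_filename(name="f\u00f2\u00f2b\u00e0r"):
    """Create `name` in a fresh temp directory and classify the outcome.

    Returns "rejected" if the file cannot be created, "exact" if
    os.listdir() reports the same string, "normalized" if it reports a
    canonically equivalent string, and "mangled" otherwise.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, name)
        try:
            with open(path, "w"):
                pass
        except (UnicodeEncodeError, OSError):
            return "rejected"
        listed = os.listdir(tmpdir)
        if listed == [name]:
            return "exact"
        # Compare after normalization to detect a rewritten but equivalent name.
        wanted = unicodedata.normalize("NFC", name)
        if [unicodedata.normalize("NFC", entry) for entry in listed] == [wanted]:
            return "normalized"
        return "mangled"


print(probe_unicode_filename())
```

The result describes only the temp directory's file system under the current locale, so it cannot stand in for a platform-wide constant.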

    @loewis
    Mannequin

    loewis mannequin commented Jul 10, 2003

    Logged In: YES
    user_id=21627

    On POSIX platforms in general, detecting Unicode file name
    support is not possible. POSIX uses open(2), and only
    open(2) (along with creat(2), stat(2), etc.) to access files.
    There is no open_w, or open_utf8, or the like. So file names
    are byte strings on POSIX, and it will stay that way forever.
    (There is actually also fopen, but that doesn't change the
    situation at all).

    On OSX, the situation is somewhat different from POSIX, as
    you have additional functions to open files (which Python
    apparently does not use, though), and because OSX specifies
    that the byte strings have to be NFD UTF-8 (which Python
    violates AFAICT).

    The documentation for supports_unicode_filenames says

    True if arbitrary Unicode strings can be used as file names
    (within limitations imposed by the file system), and if
    \function{os.listdir()} returns Unicode strings for a Unicode
    argument.

    While the first part is true for OSX, I don't think the
    second part is. If that ever gets corrected (or verified),
    no further detection is necessary - just set
    macpath.supports_unicode_filenames for darwin (assuming you
    use macpath.py on that system).

    @loewis
    Mannequin

    loewis mannequin commented Jul 10, 2003

    Logged In: YES
    user_id=21627

    Brett: As for "writing Unicode to an ASCII file system":
    there is no such thing. POSIX file systems accept arbitrary
    bytes, and don't interpret them except by looking at the
    path separator (in ASCII).

    So you can put Latin-1, KOI8-r, EUC-JP, UTF-8, gb2312, etc
    all on a single file system, and people actually do that.
    The convention is that bytes in file names are interpreted
    according to the locale's encoding. This is just a
    convention, and it has some significant flaws. Python
    follows that convention, meaning that you can use arbitrary
    Unicode strings in open(), as long as they are supported in
    the locale's encoding.
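
To illustrate the point that the bytes carry no encoding of their own, here is a small sketch in modern Python 3. The helper `decode_name` and the sample name are hypothetical; the sketch uses plain codecs rather than Python's actual file system encoding machinery:

```python
def decode_name(raw, encoding):
    """Decode raw file-name bytes under `encoding`; None if they do not decode."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


name = "\u00e9t\u00e9"               # the same name, written under two conventions
as_latin1 = name.encode("latin-1")   # b'\xe9t\xe9'
as_utf8 = name.encode("utf-8")       # b'\xc3\xa9t\xc3\xa9'

# Both byte strings are acceptable to a POSIX file system. Only the reader's
# convention decides whether they come back as the intended name.
print(ascii(decode_name(as_utf8, "utf-8")))     # the intended name
print(ascii(decode_name(as_utf8, "latin-1")))   # decodes, but to the wrong text
print(ascii(decode_name(as_latin1, "utf-8")))   # does not decode at all
```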

    @jvr
    Mannequin Author

    jvr mannequin commented Jul 10, 2003

    Logged In: YES
    user_id=92689

    > On OSX, the situation is somewhat different from POSIX, as
    > you have additional functions to open files (which Python
    > apparently does not use, though), and because OSX specifies
    > that the byte strings have to be NFD UTF-8 (which Python
    > violates AFAICT).

    (I'm not 100% sure, but I think the OS corrects that)

    > True if arbitrary Unicode strings can be used as file names
    > (within limitations imposed by the file system), and if
    > \function{os.listdir()} returns Unicode strings for a Unicode
    > argument.
    >
    > While the first part is true for OSX, I don't think the
    > second part is.

    It is, we had a long discussion about that back when I
    implemented that ;-)

    > If that ever gets corrected (or verified),
    > no further detection is necessary - just set
    > macpath.supports_unicode_filenames for darwin (assuming you
    > use macpath.py on that system).

    Darwin is a posix platform, so I'll have to add a switch to
    posixpath.py. Unless you object to that, I will do that.

    @loewis
    Mannequin

    loewis mannequin commented Jul 10, 2003

    Logged In: YES
    user_id=21627

    > I'm not 100% sure, but I think the OS corrects that

    I'm relatively sure that the OS doesn't. The OS won't
    complain if you pass a file name that isn't UTF-8 at all -
    Finder will then fail to display the file correctly. There
    are CoreFoundationsBasicServicesSomething functions that you
    are supposed to call to correct that; Python does not use them.

    If you think setting the flag for darwin is fine in
    posixpath, just go ahead.

    @jvr
    Mannequin Author

    jvr mannequin commented Jul 11, 2003

    Logged In: YES
    user_id=92689

    Done in rev. 1.61 of posixpath.py.

    (Actually, OSX does complain when you feed open() an invalid
    utf-8 string (albeit with a misleading error message). The OS also
    makes sure the name is converted to its preferred form, e.g. if I
    create a file named u'\xc7', I can also open it as u'C\u0327', and
    os.listdir() will always show the latter, no matter how you created
    the file.)
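
The two spellings in that example are the composed and decomposed forms of the same character, which `unicodedata` can show directly. This is a sketch in modern Python 3; the helper `same_filename` is hypothetical, and it uses standard NFD, which may differ in detail from the decomposition the Mac file system applies:

```python
import unicodedata

composed = "\u00c7"      # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "C\u0327"   # "C" followed by COMBINING CEDILLA


def same_filename(a, b):
    """Compare two names the way a normalizing file system might."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(composed == decomposed)               # the raw strings differ
print(same_filename(composed, decomposed))  # they agree once normalized
```

Code that compares a name it created against `os.listdir()` output on such a system would therefore normalize both sides first.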

    @jvr
    Mannequin Author

    jvr mannequin commented Jul 17, 2003

    Logged In: YES
    user_id=92689

    Reopening, as the fix I checked in caused problems in
    test_pep277.py. Postponing work on this until after 2.3 is released.

    @jvr
    Mannequin Author

    jvr mannequin commented Jul 17, 2003

    Logged In: YES
    user_id=92689

    (forgot to mention: my checkin was backed out)

    @jvr
    Mannequin Author

    jvr mannequin commented Jun 28, 2005

    Logged In: YES
    user_id=92689

    Hmm, two years later and this still hasn't been resolved. Is anyone
    interested in taking a stab at it? It would be nice if it could be fixed for 2.5.

    (Btw. the only code using os.path.supports_unicode_filenames that I'm
    aware of is Jason Orendorff's path module.)

    @loewis
    Mannequin

    loewis mannequin commented Jun 28, 2005

    Logged In: YES
    user_id=21627

    I don't care about this issue, as I think
    supports_unicode_filenames is a pretty useless property
    these days. If somebody changes the current value from False
    to True, just make sure that the testsuite still passes.

    @ezio-melotti
    Member

    Maybe os.path.supports_unicode_filenames should be deprecated.
    The doc currently says:
    "True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if os.listdir() returns Unicode strings for a Unicode argument."

    On Linux both of these work, even though os.path.supports_unicode_filenames is still False:
    >>> import os
    >>> os.path.supports_unicode_filenames
    False
    >>> open(u'fòòbàr', 'w')
    <open file u'f\xf2\xf2b\xe0r', mode 'w' at 0x9470778>
    >>> os.listdir(u'.')
    [u'f\xf2\xf2b\xe0r', ...]
    >>> open(u'fòòbàr')
    <open file u'f\xf2\xf2b\xe0r', mode 'r' at 0x9470778>

    @bitdancer
    Member

    In addition, whether or not true unicode filenames are supported really depends, at least on Linux, on the *filesystem*, not on the OS (for some definition of support). In other words, I think os.path.supports_unicode_filenames is an API design that is broken and should probably be dropped.

    @florentx
    Mannequin

    florentx mannequin commented Jan 12, 2010

    Additionally it filters out test_pep277 on some platforms.

    But seemingly, it is not needed anymore with this patch.

    @joeamenta
    Mannequin

    joeamenta mannequin commented Jan 12, 2010

    If it is decided to keep supports_unicode_filenames, here is a patch for test_os.py that verifies the value of supports_unicode_filenames against the following line from the documentation:
    "True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if os.listdir() returns Unicode strings for a Unicode argument."

    @florentx (mannequin) added the tests label (Tests in the Lib/test dir) on Jan 28, 2010
    @florentx
    Mannequin

    florentx mannequin commented Mar 15, 2010

    With r78594, test_pep277 is active on all platforms having Unicode-friendly filesystem encoding.

    @florentx (mannequin) added the type-bug label (An unexpected behavior, bug, or error) on Mar 15, 2010
    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Aug 18, 2010

    There are at least three messages stating that os.path.supports_unicode_filenames should go, so can someone please provide a definitive statement regarding its future?

    @vstinner
    Member

    test_pep277.patch removes the usage of os.path.supports_unicode_filenames from test_pep277: the test still passes on Debian Sid (Linux). Can someone test the patch on Mac OS X, FreeBSD and Solaris (and maybe other POSIX/UNIX OSes)?

    About Windows: supports_unicode_filenames is False if sys.getwindowsversion().platform < 2, that is, on win32s (0) or Windows 9x/ME (1). I don't know win32s, but I know that Windows 9x/ME is no longer supported.

    @vstinner
    Member

    Oops, forget test_pep277.patch: I misunderstood r81149 (new way to detect if the filesystem supports unicode or not). test_pep277 fails with my patch on Linux with LC_CTYPE=C.

    @vstinner
    Member

    r84701 fixes supports_unicode_filenames's definition in Python 3.2 (and r84702 in Python 3.1): os.listdir(str) now always returns unicode filenames (including non-ascii characters).

    @vstinner
    Member

    > Maybe os.path.supports_unicode_filenames should be deprecated.
    > The doc currently says:
    > "True if arbitrary Unicode strings can be used as file names
    > (within limitations imposed by the file system), and if os.listdir()
    > returns Unicode strings for a Unicode argument."
    >
    > On Linux both of these work, even though
    > os.path.supports_unicode_filenames is still False:
    > (...)

    It depends on the locale encoding:

    $ LC_CTYPE=C ./python
    Python 3.2a2+ (py3k, Sep 11 2010, 01:48:43) 
    >>> import sys; sys.getfilesystemencoding()
    'ascii'
    >>> open('\xe9', 'w').close()
    ...
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

    With utf-8, surrogates are forbidden. Eg.

    $ ./python
    Python 3.2a2+ (py3k, Sep 11 2010, 01:48:43) 
    >>> import sys; sys.getfilesystemencoding()
    'utf-8'
    >>> open('\uDC00', 'w').close()
    ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed

    On Windows, Python uses the unicode API and so the unicode support doesn't depend on the locale encoding (on the ansi code page). Surrogates are accepted on Windows: '\uDC00' is a valid filename.

    I think that supports_unicode_filenames is still useful to check if the filesystem API uses bytes (Linux, FreeBSD, Solaris, ...) or characters (Mac OS X, Windows). Mac OS X is a special case because the C API uses char* (byte string), but the filesystem encoding is fixed to utf-8 and it doesn't accept invalid utf-8 filenames. So I would like to say that supports_unicode_filenames should be True on Mac OS X (which was the initial request).
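
The two failures shown above can be reproduced without touching the file system, since they come from the codec alone. This sketch uses a hypothetical helper `can_encode` with strict error handling, which is an assumption; Python's real file system encoder may use a different error handler:

```python
def can_encode(name, encoding):
    """Return True if `name` can be encoded with `encoding` (strict errors)."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(can_encode("\xe9", "ascii"))    # non-ASCII character, ASCII locale
print(can_encode("\xe9", "utf-8"))    # same character, UTF-8 locale
print(can_encode("\udc00", "utf-8"))  # lone surrogate, UTF-8 locale
```

On a bytes-based platform the answer therefore depends on the locale in effect at run time.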

    @loewis
    Mannequin

    loewis mannequin commented Sep 12, 2010

    > About Windows: supports_unicode_filenames is False if
    > sys.getwindowsversion().platform < 2: win32s (0) or Windows 9x/ME
    > (1). I don't know win32s, but I know that Windows 9x/ME is no longer
    > supported.

    Win32s is long gone. It was an emulation layer to support Win32 on
    Windows 3.1.

    @loewis
    Mannequin

    loewis mannequin commented Sep 12, 2010

    > I think that supports_unicode_filenames is still useful to check if
    > the filesystem API uses bytes (Linux, FreeBSD, Solaris, ...) or
    > characters (Mac OS X, Windows). Mac OS X is a special case because
    > the C API uses char* (byte string), but the filesystem encoding is
    > fixed to utf-8 and it doesn't accept invalid utf-8 filenames. So I
    > would like to say that supports_unicode_filenames should be True on
    > Mac OS X (which was the initial request).

    Sounds reasonable.

    @vstinner
    Member

    r84784 sets os.path.supports_unicode_filenames to True on Mac OS X (macpath module).

    About test_supports_unicode_filenames.patch: test_unicode_listdir() is wrong because os.listdir(str) always returns str (see r84701). The "verify that the new file's name is equal to the name we tried" check in test_unicode_filename() is also wrong: newfile.name is always equal to fname, regardless of supports_unicode_filenames. Since the test is wrong, I don't want to commit it. test_pep277 is enough to test the creation of files with unicode names.

    I don't see anything else to do now, so I close this issue. Reopen it if I forgot something, or open a new issue.

    @vstinner
    Member

    I backported r84701 and r84784 to Python 2.7 (r84787).

    @ned-deily
    Member

    There seems to be some confusion about the macpath.py module. I'm not sure why it even exists in Python 3. Note it has to do with obsolete Classic MacOS-style paths (colon-separated paths) which are available on Mac OS X only through deprecated Carbon interfaces. I'm not even sure that those style paths do support unicode. More importantly, the underlying Carbon interfaces that macpath.py uses were removed for Python 3. AFAIK, virtually nothing on OS X uses these style paths anymore and, with the removal of all the old Mac Carbon support in Python 3, I don't think there is any Python module that can use these paths other than macpath. I think this module should be marked for deprecation and removed. There is no reason to modify it nor add a NEWS note, even for 2.7.

    @ned-deily
    Member

    (I've opened bpo-9850 to document the brokenness of macpath and suggest its deprecation and removal.)

    @vstinner
    Member

    > There seems to be some confusion about the macpath.py module. (...)

    Oops. I thought that Mac OS X used macpath, but in fact it uses posixpath. Can you try my new patch posixpath_darwin.patch? I'm reopening the issue because I patched the wrong module. I suppose that Python 2.7 has the same issue: posixpath should be patched, not macpath.

    My patch leaves macpath with supports_unicode_filenames=True. If I understood correctly: macpath should be removed (bpo-9850).

    @vstinner vstinner reopened this Sep 14, 2010
    @ned-deily
    Member

    No problems noted with a quick test of posixpath_darwin.patch on 10.6 so looks good. It will get regression tested on more configurations sometime later.

    @vstinner
    Member

    > No problems noted with a quick test of posixpath_darwin.patch
    > on 10.6 so looks good.

    Ok thanks. Fix committed to 3.2 (r84866) and 2.7 (r84868). I kept my patch on macpath (supports_unicode_filenames=True) because it is still valid (even if it is not used). Or is it wrong that Mac OS 9 speaks unicode?
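
For readers who want to see the outcome, here is a sketch of the platform switch this thread converged on. The helper `expected_posix_flag` is hypothetical and only restates the request (True on Mac OS X, False on other POSIX systems); it is not a copy of the committed patch:

```python
import posixpath
import sys


def expected_posix_flag(platform=sys.platform):
    """Value the thread asked for: True on darwin, False on other POSIX systems."""
    return platform == "darwin"


# Compare the request with what the running interpreter actually reports.
print(sys.platform, posixpath.supports_unicode_filenames, expected_posix_flag())
```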

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022