Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change sys.getfilesystemencoding() on Windows to UTF-8 #71968

Closed
zooba opened this issue Aug 17, 2016 · 24 comments
Closed

Change sys.getfilesystemencoding() on Windows to UTF-8 #71968

zooba opened this issue Aug 17, 2016 · 24 comments
Assignees
Labels
3.7 OS-windows release-blocker type-bug An unexpected behavior, bug, or error

Comments

@zooba
Copy link
Member

zooba commented Aug 17, 2016

BPO 27781
Nosy @brettcannon, @pfmoore, @mdickinson, @ncoghlan, @vstinner, @tjguk, @jkloth, @ned-deily, @zware, @zooba, @yan12125, @AraHaan
Files
  • fsencoding.diff
  • fsencoding.diff
  • test_cmd_line_unicode.py: Possible new test case for command line Unicode handling
  • 27781_1.patch
  • osx_failed_compile.txt
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/zooba'
    closed_at = <Date 2016-11-07.03:36:05.023>
    created_at = <Date 2016-08-17.03:49:40.936>
    labels = ['type-bug', '3.7', 'OS-windows', 'release-blocker']
    title = 'Change sys.getfilesystemencoding() on Windows to UTF-8'
    updated_at = <Date 2016-11-07.03:36:05.021>
    user = 'https://github.com/zooba'

    bugs.python.org fields:

    activity = <Date 2016-11-07.03:36:05.021>
    actor = 'python-dev'
    assignee = 'steve.dower'
    closed = True
    closed_date = <Date 2016-11-07.03:36:05.023>
    closer = 'python-dev'
    components = ['Windows']
    creation = <Date 2016-08-17.03:49:40.936>
    creator = 'steve.dower'
    dependencies = []
    files = ['44130', '44131', '44369', '44414', '44491']
    hgrepos = []
    issue_num = 27781
    keywords = ['patch']
    message_count = 24.0
    messages = ['272899', '272900', '272916', '272917', '272935', '272949', '272950', '272961', '272962', '272963', '274392', '274691', '274887', '274910', '275063', '275075', '275078', '275095', '275289', '275290', '275326', '275331', '279847', '280186']
    nosy_count = 13.0
    nosy_names = ['brett.cannon', 'paul.moore', 'mark.dickinson', 'ncoghlan', 'vstinner', 'tim.golden', 'jkloth', 'ned.deily', 'python-dev', 'zach.ware', 'steve.dower', 'yan12125', 'Decorater']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue27781'
    versions = ['Python 3.6', 'Python 3.7']

    @zooba
    Copy link
    Member Author

    zooba commented Aug 17, 2016

    I've attached my first pass at a patch to change the file system encoding on Windows to UTF-8 and remove use of the *A APIs.

    It would be trivial to change the encoding from UTF-8 back to CP_ACP and change the error mode if that's what we decide is better, but my vote is strongly for an encoding that never drops characters when converted from UTF-16.

    Discussion is still ongoing on python-ideas, so let's argue about yes/no and utf-8/mbcs there and just discuss the patch here.

    @zooba zooba self-assigned this Aug 17, 2016
    @zooba zooba added OS-windows type-bug An unexpected behavior, bug, or error labels Aug 17, 2016
    @AraHaan
    Copy link
    Mannequin

    AraHaan mannequin commented Aug 17, 2016

    I personally hate ansi myself so +1 to UTF-8/UTF-16.

    @vstinner
    Copy link
    Member

    vstinner commented Aug 17, 2016

    Would it be acceptable for you to add a new option to switch to UTF-8 in Python 3.6, and discuss later if it's ok to enable it by default?

    In the python-ideas threed, you wrote that Windows allow surrogate characters in filenames, but not the UTF-8/strict Python codec. Would it make sense to use UTF-8/surrogatepass codec to avoid any unicode error?

    @vstinner
    Copy link
    Member

    vstinner commented Aug 17, 2016

    Steve Dower: Please don't use git format for diff, or the bug tracker is unable to create reviews. I regenerated the patch.

    By the way, you introduced a bug in posix_do_stat(): you added a new "else" in the !MS_WINDOWS path which leads to a compilation error. I fixed it.

    @zooba
    Copy link
    Member Author

    zooba commented Aug 17, 2016

    Thanks for the regen. I don't think git format is the problem as most of my patches are fine, it's probably because it was in a patch queue and so the parent isn't actually a known commit. I haven't tested whether this works without my other console patches but I think it should.

    Is there a surrogatepass option? If so, I'll definitely use that, as that'll fix the one remaining edge case.

    I suspect we'll have to go to Guido to get a ruling on the default, but I'll add an environment variable to switch.

    @vstinner
    Copy link
    Member

    vstinner commented Aug 17, 2016

    Is there a surrogatepass option?

    I'm talking about error handlers of Python codecs: text.encode('utf8',
    'surrogatepass')

    @vstinner
    Copy link
    Member

    vstinner commented Aug 17, 2016

    I suspect we'll have to go to Guido to get a ruling on the default, but I'll add an environment variable to switch.

    If you go in this direction, I would like to follow you for the
    UNIX/BSD side to make the switch portable. I was thinking about "-X
    utf8" which avoids to change the command line parser.

    If we agree on a plan, I would like to write it down as a PEP since I
    expect a lot of complains and questions which I would prefer to only
    answer once (see for example the length of your thread on python-ideas
    where each people repeated the same things multiple times ;-))

    @zooba
    Copy link
    Member Author

    zooba commented Aug 17, 2016

    By portable, do you mean not using an environment variable?

    Command line parsing is potentially affected by this on Windows - I'd have to look deeper - as command lines are provided as UTF-16. But we may not ever expose them as bytes.

    I don't even know that this matters on the UNIX/BSD side as the file system encoding provided there is correct, no? It's just Windows where the file system encoding used for bytes doesn't match what the file system actually uses.

    I was afraid a PEP would be necessary out of this, but I want to see how the python-dev discussion goes first.

    @vstinner
    Copy link
    Member

    vstinner commented Aug 17, 2016

    Steve Dower added the comment:

    By portable, do you mean not using an environment variable?

    I mean that "python3 -X utf8" should force sys.getfilesystemencoding()
    to UTF-8 on UNIX/BSD, it would ignore the current locale setting.

    @zooba
    Copy link
    Member Author

    zooba commented Aug 17, 2016

    Ah I see, if we end up sticking with MBCS and offering a switch to enable UTF-8. In that case, we'll definitely ensure the flag is the same (but I'm hopeful we will just make the reliable behavior on Windows the default, so it won't matter).

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Sep 5, 2016

    I belatedly remembered I've had this new test case hanging around for a while, and never got around to getting it into shape for inclusion in the standard library.

    With the prospect of reasonable cross-platform consistency in this area, it could be a good thing to add as part of this PEP.

    @zooba
    Copy link
    Member Author

    zooba commented Sep 7, 2016

    Also see PEP-529 for the latest updates there.

    This is likely to be accepted as experimental for 3.6.0b1-3, and we'll commit to either the new default or a compatible default for b4.

    @zooba
    Copy link
    Member Author

    zooba commented Sep 7, 2016

    PEP-529 has been accepted, so this really needs a review now. But since it's experimental and all the tests pass, I'll be committing it shortly anyway and will be tidying up issues during beta.

    @zooba
    Copy link
    Member Author

    zooba commented Sep 7, 2016

    One minor change - I removed the unused definition of Py_FileSystemDefaultDecodeErrors.

    @zooba
    Copy link
    Member Author

    zooba commented Sep 8, 2016

    Thanks for that review, Eryk, but I'm going to defer those to other issues (specifically bpo-27998 for scandir and we should file a new issue for the symlink concerns).

    I've got some more doc updates to do though, and then I'll check in if there are no other concerns.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 8, 2016

    New changeset e20c7d8a8187 by Steve Dower in branch 'default':
    Issue bpo-27781: Change file system encoding on Windows to UTF-8 (PEP-529)
    https://hg.python.org/cpython/rev/e20c7d8a8187

    @zooba
    Copy link
    Member Author

    zooba commented Sep 8, 2016

    This is pushed now - let the bug fixing begin :)

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 8, 2016

    New changeset faca0730270b by Steve Dower in branch 'default':
    Fixes tests broken by issue bpo-27781.
    https://hg.python.org/cpython/rev/faca0730270b

    @mdickinson
    Copy link
    Member

    mdickinson commented Sep 9, 2016

    It looks as though this change might have broken the compile on OS X. On my OS X 10.9 machine, building from a clean Git checkout of the master branch fails; the tail of the failed build looks like this:

    ./python.exe -E -S -m sysconfig --generate-posix-vars ;\
    	if test $? -ne 0 ; then \
    		echo "generate-posix-vars failed" ; \
    		rm -f ./pybuilddir.txt ; \
    		exit 1 ; \
    	fi
    Fatal Python error: Py_Initialize: unable to load the file system codec
    Traceback (most recent call last):
      File "<frozen importlib._bootstrap>", line 962, in _find_and_load
      File "<frozen importlib._bootstrap>", line 951, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 656, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 668, in exec_module
      File "<frozen importlib._bootstrap_external>", line 782, in get_code
      File "<frozen importlib._bootstrap_external>", line 842, in _cache_bytecode
      File "<frozen importlib._bootstrap_external>", line 867, in set_data
      File "<frozen importlib._bootstrap_external>", line 117, in _write_atomic
    ValueError: negative file descriptor
    /bin/sh: line 1: 35829 Abort trap: 6           ./python.exe -E -S -m sysconfig --generate-posix-vars
    generate-posix-vars failed
    make: *** [pybuilddir.txt] Error 1

    Full build output attached.

    @mdickinson
    Copy link
    Member

    mdickinson commented Sep 9, 2016

    It looks as though this change in posixmodule.c is the cause:

     #ifdef MS_WINDOWS
    -        if (path->wide)
    -            fd = _wopen(path->wide, flags, mode);
    -        else
    +        fd = _wopen(path->wide, flags, mode);
     #endif
     #ifdef HAVE_OPENAT
             if (dir_fd != DEFAULT_DIR_FD)
                 fd = openat(dir_fd, path->narrow, flags, mode);
             else
    -#endif
                 fd = open(path->narrow, flags, mode);
    +#endif

    The move of the final #endif means that fd is not defined on OS X. If I move the #endif back again, the compile succeeds.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 9, 2016

    New changeset 801634d3c105 by Steve Dower in branch 'default':
    Issue bpo-27781: Fixes uninitialized fd when !MS_WINDOWS and !HAVE_OPENAT
    https://hg.python.org/cpython/rev/801634d3c105

    @mdickinson
    Copy link
    Member

    mdickinson commented Sep 9, 2016

    That seems to have done the trick. Thanks!

    @zooba
    Copy link
    Member Author

    zooba commented Nov 1, 2016

    Before 3.6.0 beta 4 I need to make this change permanent. From memory, it's just an exception message that needs changing (and PEP-529 becomes final), but I'll review the changeset to be sure.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 7, 2016

    New changeset b26c8104e54f by Steve Dower in branch '3.6':
    Closes bpo-27781: Removes special cases for the experimental aspect of PEP-529
    https://hg.python.org/cpython/rev/b26c8104e54f

    New changeset b8233c779ff7 by Steve Dower in branch 'default':
    Closes bpo-27781: Removes special cases for the experimental aspect of PEP-529
    https://hg.python.org/cpython/rev/b8233c779ff7

    @python-dev python-dev mannequin closed this as completed Nov 7, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 OS-windows release-blocker type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants