try fix encode error #4251 #4280

robinxb · 2017-02-11T03:46:50Z

mmyjona · 2017-02-25T18:47:06Z

Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, all discussions resolved.

Comments from Reviewable

dstufft · 2017-04-01T03:36:29Z

Hi! You need to update your master branch and merge it into this PR. You'll also need to write a news file for this PR.

robinxb · 2017-04-01T03:59:12Z

@dstufft It's done. Thanks

dstufft · 2017-04-01T04:07:39Z

Thanks, I don't know enough about Windows to meaningfully review the efficacy of this PR. Hopefully @pfmoore has some spare time and knowledge about this bit. I also see there is #4310 which appears to also be trying to fix this issue.

Worst case scenario I'll try and do some research to figure it out.

robinxb · 2017-04-01T04:16:41Z

@dstufft I see. For my part, I couldn't tell which way is better. Using repr certainly also avoids the error, but I don't know whether its result is correct in most situations.

pfmoore · 2017-04-01T08:53:27Z

pip/compat.py

+            if WINDOWS:
+                try:
+                    from ctypes import cdll
+                    return s.decode("cp" + str(cdll.kernel32.GetACP()))


Should we be using the ANSI codepage here? As we're typically running a CLI subprocess, should it not be the OEM codepage that it will use? It's also worth noting that under Python 3.6, sys.__stdout__.encoding will be UTF-8 (as Python 3.6 changed to use Unicode output on the console) but this will not necessarily match what a process run via subprocess.call would see.

This whole area is something of a gross hack anyway, as there's really no way of knowing what encoding a subprocess might have chosen to use for its output. The best we can do is guess.

Maybe what would be better would be:

1. Try sys.__stdout__.encoding.

2. Try UTF-8 (just because we already do that; I don't really know the justification, maybe it's for Unix).

3. Use ASCII, with a suitable error handler (backslashreplace for Python 3.5+, replace for earlier versions).

Use the first result that doesn't give UnicodeDecodeError (option 3 never should).

As far as I can see, the only way we use this function is to display the output of the setup.py invocation, so a little bit of information loss or unreadability (the replace/backslashreplace option) is OK - certainly better than a decoding error.
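The fallback order proposed above can be sketched roughly as follows. This is only an illustration, not pip's code: the function name decode_for_display and its candidates parameter are hypothetical, and the sketch assumes Python 3.

```python
import sys


def decode_for_display(data, candidates=None):
    """Decode subprocess output for display without raising UnicodeDecodeError.

    Hypothetical helper: tries each candidate encoding in order and falls
    back to ASCII with a lossy error handler.
    """
    if candidates is None:
        # sys.__stdout__ can be None (e.g. no console), so use getattr.
        stdout_encoding = getattr(sys.__stdout__, "encoding", None)
        candidates = [stdout_encoding, "utf-8"]
    for encoding in candidates:
        if not encoding:
            continue
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: this step cannot raise a decoding error.
    # backslashreplace is only usable for decoding on Python 3.5+.
    handler = "backslashreplace" if sys.version_info >= (3, 5) else "replace"
    return data.decode("ascii", handler)
```

The first candidate that decodes cleanly wins, so a wrong-but-decodable guess still produces garbled text; the sketch only guarantees that no exception reaches the user.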

You're right. But I wonder whether, before we fall back to ASCII, we should at least try the OEM codepage.
The information loss might make the output unreadable altogether.

Yeah, I have no problem with that. You might get the wrong characters, but it's not likely to do any harm.

The only possible issue (and this occurs if we try UTF-8 and it works, as well) is that the output contains some characters which are decodable using whatever encoding we end up succeeding with, but which aren't encodable in sys.stdout.encoding. But that's likely to be very rare in practice (and will go away in Python 3.6, which uses Unicode for console output, and so can print anything).

According to the docs, the backslashreplace error handler is only for encoding.

So I think maybe we could just change the code from
return s.decode("cp" + str(cdll.kernel32.GetACP()))
to
return s.decode("cp" + str(cdll.kernel32.GetACP()), "replace")
to prevent the error.

backslashreplace is usable for decode in Python 3.5+ (from the docs) hence my comment about only using it there. replace is OK, though - it just loses a bit more information, but that's OK.
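As a small illustration of the two error handlers discussed in this thread, here is how they differ when decoding a byte string that is not valid UTF-8. The sample bytes are made up for the example, and the backslashreplace line assumes Python 3.5 or later.

```python
# b"caf\xe9" is "café" in Latin-1, which is not valid UTF-8.
raw = b"caf\xe9"

# "replace" substitutes U+FFFD for the undecodable byte (the byte value is lost).
replaced = raw.decode("utf-8", "replace")
print(repr(replaced))  # 'caf\ufffd'

# "backslashreplace" keeps the byte value visible as an escape sequence.
escaped = raw.decode("utf-8", "backslashreplace")
print(repr(escaped))  # 'caf\\xe9'
```

Neither handler raises, which is the property that matters for output shown to the user; backslashreplace simply preserves a little more information.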

xavfernandez · 2017-04-02T17:27:15Z

pip/compat.py

+                try:
+                    from ctypes import cdll
+                    return s.decode("cp" + str(cdll.kernel32.GetACP()))
+                except ImportError:


What could cause an ImportError here ?

Good catch. As this code is Windows-only, ctypes should always be present (I think it can be omitted in certain types of Unix build, but it never will be on Windows).

I'm not familiar with IronPython, so I don't know whether the ctypes module can be imported properly there.
So I just added the try/except to prevent further errors.
If it's unnecessary, I'm glad to remove it.

Hmm, I hadn't thought about IronPython (or jython for that matter). Quite possibly ctypes isn't available there.

Well, ctypes doesn't exist in IronPython until version 2.6, so this try/except should stay rather than be removed.

robinxb

@pfmoore @xavfernandez Thanks for your suggestions. I've made some changes to the code. It should be better than the original.

pfmoore · 2017-04-05T08:00:42Z

pip/compat.py

-            return s.decode('utf_8')
+            if WINDOWS:
+                from ctypes import cdll
+                return s.decode("cp" + str(cdll.kernel32.GetACP()), 'replace')


I still think OEMCP is more likely to be correct here.

I've checked, and subprocess uses locale.getpreferredencoding() as its default encoding.
So I followed the code to here, and here, and it seems Python uses GetACP rather than GetOEMCP on Windows.
Maybe a better way to solve this problem is to change the console_to_str function directly to

return s.decode(locale.getpreferredencoding(), 'replace') no matter whether it's Windows or not?

Well, that's for Python programs, but what do other programs (like the C compiler) use? I thought that console programs were supposed to use the OEM code page.

I guess it doesn't matter that much either way, though. If we guess wrong (and it's never better than a guess) then the worst that happens is that the output is a bit garbled. But that's still better than what currently happens.

Yeah, I don't think it matters that much either.
After such a long discussion, I think we have ended up with a few possible solutions:

1. If Windows, return s.decode("cp" + str(cdll.kernel32.GetACP()), 'replace'), with a try/except for an ImportError from ctypes.

2. If Windows, return s.decode("cp" + str(cdll.kernel32.GetOEMCP()), 'replace'), with a try/except for an ImportError from ctypes.

3. Regardless of whether it is Windows or not, use s.decode(locale.getpreferredencoding(), 'replace').

I prefer the third choice for now. It's cleaner and there's no need to handle ImportError.
If you think there's a better way, feel free to let me know; otherwise I will change it to the third solution.
Is that OK with you?

I'm OK with the third option. Like you say, it seems cleaner. Thanks for sticking with this!

You are welcome, I've learned a lot from our discussion.
Thanks for your patience

sakurai-youhei · 2017-05-10T06:58:15Z

Just FYI: In #4430, a new test is being added for the case where locale.getpreferredencoding() may return None.

AraHaan

I think this is a bit better than how #4310 does it.

pfmoore · 2017-05-15T12:08:13Z

(Note: This is mostly a brain dump of my thoughts on the whole encoding situation here. It's not specific to this PR, but does include some points that we should consider as part of reviewing this PR. I'll come back and do a review of the specific changes proposed here when I have a little more time to think things through).

Ultimately, console_to_str is used in only one place in pip - to take the output from a subprocess call (call_subprocess in pip/utils/__init__.py) and make a string out of it, for the sole purpose of displaying that output to the user. The subprocess output is a byte string, so to do that we need an encoding. The problem is that what that encoding should be is difficult to establish, platform dependent, and theoretically impossible to get (because technically programs emit byte streams on stdout, and it's purely a matter of convention what encoding they use).

This is also compounded by the fact that on Python 2, console_to_str is a no-op, which is really dangerous - it's working on the "strings are bytes" confusion that is why Python 2 encoding handling is so hard, and it makes it impossible to cleanly document the data flow (in terms of when objects are logically considered to be encoded as bytes and when they are considered to be abstract strings). Unfortunately, our string model on Python 2 is not clean, so using unicode as the output here would likely cause more problems than it solves. It's likely the best we can do is either ignore Python 2, or re-encode the input to a known encoding.

So we have some choices to make:

locale.getpreferredencoding() should be a good default, but there seems to be a lot of weirdness here, particularly on Unix (hence I don't really understand it). On Windows, from my limited experiments, it's the ANSI codepage, but console programs typically output in the OEM codepage (because that's what the C runtime does), which is different. There's also a locale.getpreferredencoding(False) vs locale.getpreferredencoding() distinction, which seems to only make a difference on Unix (?)
sys.stdout.encoding works on the basis of "if that's how Python writes to the console, then it's probably how other console programs work, too". But Python 3.6 on Windows breaks this rule, as it uses utf-8 (because we don't use the C runtime to output strings, unlike most "other programs"). Also, the PYTHONIOENCODING variable affects this, even though it won't affect external non-Python programs.

My instinct is that we should be saying:

On Unix, we always use locale.getpreferredencoding(False), as that matches Python's own behaviour. (I'm willing to be overridden on this by a Unix expert, as long as we clearly document what we did choose, and why. For example this PR seems to imply that None is a valid return value for getpreferredencoding and that utf-8 is a sensible default in that case - where's the evidence for that? The documentation doesn't mention a None return value as far as I can see...).
On Windows we use the OEM code page. I'm not sure Python offers a way to get this prior to 3.6 (the "oem" encoding is available in Python 3.6) so we may have to write some OS-specific code for this. I'd be willing to do this.
On Python 2, we stick with the current behaviour, and if we get bug reports for encoding issues on Python 2, we simply close them as "won't fix - upgrade to Python 3 if you want cleaner Unicode handling".

We should always use the "replace" error handler, because this is for display to the user, and replacing invalid characters is better than an error. Unfortunately, from what I recall, none of the "better" error handlers like "backslashreplace" are available in all the Python versions pip supports, but we should switch to a better one as soon as we can.

I don't particularly like the approach of trying an encoding, and falling back to a different one if we get an error. That seems to me to be perpetuating the "guess and hope" approach for little benefit over using errors="replace".

Also, once we have decided what encoding we expect build tools to produce their output in, we should clearly document it once and for all and we should make the definition part of the "pluggable build tools" spec - this doesn't make the problem go away, but it makes it clearly the responsibility of build tools going forward to deal with the vagaries of C compilers, etc, in how they encode their output. We should also document in the interim that if build programs like compilers don't follow the documented encoding rules, then pip will potentially display mojibake or replace incorrectly encoded characters.

As a helper for people reporting encoding bugs, we could write a script that reports the encoding pip expects to be using, plus the various other encoding settings in the user's system. That might help us understand the edge cases where people are hitting problems (typically, we can't expect them to have a detailed understanding of encodings, so it's often hard to get a clear description of the issue - "My system is set to use Swedish" is generally as much as the user knows, but the pip devs don't know what "set to use Swedish" implies).

Any thoughts? @dstufft, @xavfernandez? I'm willing to write a PR implementing the above, but only if it's an agreed long-term resolution. I don't want to spend the time doing so if we get another "let's try it this way instead" PR the next time we get a bug report - let's fix it properly once and for all.

(All of the above comments would affect #4310 as well)

dstufft · 2017-05-15T13:20:24Z

I think that sounds entirely reasonable, the only things I'd have to add are:

On Unix, we always use locale.getpreferredencoding(False), as that matches Python's own behaviour. (I'm willing to be overridden on this by a Unix expert, as long as we clearly document what we did choose, and why. For example this PR seems to imply that None is a valid return value for getpreferredencoding and that utf-8 is a sensible default in that case - where's the evidence for that? The documentation doesn't mention a None return value as far as I can see...).

I don't know how the False parameter changes it, but I have definitely seen None as a return value in the wild for locale.getpreferredencoding(). I have no idea what situation arises that makes that the case though.

As a helper for people reporting encoding bugs, we could write a script that reports the encoding pip expects to be using, plus the various other encoding settings in the user's system. That might help us understand the edge cases where people are hitting problems (typically, we can't expect them to have a detailed understanding of encodings, so it's often hard to get a clear description of the issue - "My system is set to use Swedish" is generally as much as the user knows, but the pip devs don't know what "set to use Swedish" implies).

Seems like instead of a script, we could make a separate command, something like pip debuginfo that just outputs a bunch of information about the state of their system. We could add to it over time and instead of asking people to list out things like Python version, pip version, etc. in our tickets, we could change the instructions to "Run pip debuginfo and include the results".
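A reporting helper of the kind suggested here might look something like the sketch below. It is not an existing pip command: the function name encoding_report and the dictionary keys are hypothetical, and it only uses standard-library calls.

```python
import locale
import os
import sys


def encoding_report():
    """Collect encoding-related settings for a bug report (illustrative only)."""
    return {
        "python_version": sys.version.split()[0],
        "platform": sys.platform,
        # May differ from what external, non-Python tools actually emit.
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "default_encoding": sys.getdefaultencoding(),
        # sys.__stdout__ can be None when there is no console.
        "stdout_encoding": getattr(sys.__stdout__, "encoding", None),
        # Affects Python's own streams, but not external programs.
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


if __name__ == "__main__":
    for key, value in sorted(encoding_report().items()):
        print("{}: {}".format(key, value))
```

A user could paste the printed lines into an issue, which would give maintainers more to work with than a description of the system language alone.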

pfmoore · 2017-05-15T13:44:26Z

I don't know how the False parameter changes it, but I have definitely seen None as a return value in the wild for locale.getpreferredencoding(). I have no idea what situation arises that makes that the case though.

OK. I can do some digging to see if I can find a bit more explanation. It'd be nice to capture more of the details in the Python docs. I guess that if None is returned, then falling back to UTF8 (as this PR does) would be OK?

As for the False parameter, the docs say "on some systems, it is necessary to invoke setlocale() to obtain the user preferences, so this function is not thread-safe. If invoking setlocale is not necessary or desired, do_setlocale should be set to False." I have no way of knowing whether the "some systems" referred to are ones we care about, and if they are, whether skipping setlocale is a good idea for pip. I guess we don't have thread issues to worry about at this point, so maybe we can skip the False (it may cause issues for the people who insist on using pip in-process, but that's not supported of course...)

Overall, the getpreferredencoding docs say "this function only returns a guess" so I guess we're never going to be perfect. I've no idea if there's any better option in Unix, or if the cases where it guesses wrong are a problem to our users in practice. From what I recall, most of the encoding issues for pip are from Windows users, and so are likely to be related to the ANSI/OEM codepage issue.

Seems like instead of a script, we could make a separate command, something like pip debuginfo that just outputs a bunch of information about the state of their system

That might be nice. I wasn't sure people would be OK with building this into pip, but if you're OK with it in principle, I definitely agree it'd make getting better issue reports easier :-)

pfmoore · 2017-05-15T14:06:56Z

How does this code look?

if sys.version_info >= (3,):
    def subprocess_encoding():
        if WINDOWS:
            if sys.version_info >= (3, 6):
                return "oem"
            # Prior to Python 3.6, sys.__stdout__ is opened
            # with the OEM code page (at least in the console
            # interpreter, which is what pip uses). This changed
            # in Python 3.6 to be UTF8, but we don't care as we
            # use the explicit "oem" encoding in that case via
            # the code above
            return sys.__stdout__.encoding
        else:
            import locale
            # Note that the use of getpreferredencoding here calls
            # setlocale, which isn't thread safe. This is OK, as we
            # never call this code in a multi-threaded context.
            # If thread safety was important, we could call
            # getpreferredencoding(False), but there are apparently
            # some systems where that will not give the correct answer.
            #
            # We fall back to UTF8 if locale can't provide an answer,
            # as UTF8 is the most common encoding used nowadays.
            return locale.getpreferredencoding() or "utf-8"

    def console_to_str(s):
        return s.decode(subprocess_encoding(), errors="replace")

Usual provisos have to apply here - I don't have the means to test this on any sort of non-English environment (at least, not without going to more effort than I can afford at the moment).

If it looks OK, I'll add some docs explaining that this is how we expect build systems to work (as described above) and turn it into a proper PR.

pfmoore · 2017-05-16T11:30:49Z

There's a problem with my above code. On Windows, if the user redirects stdout, then sys.__stdout__.encoding is the ANSI codepage, not the OEM one.

Actually, the problem is (far) worse than that. Because we use subprocess.Popen when we run setup.py, the Python run in the subprocess will output using the ANSI codepage. But any command line tools run by setup.py (e.g., the C compiler) will do their own thing, and I don't believe setuptools (or distutils) does anything to deal with that. In particular, tools written to Windows standards (such as Visual C) will probably use the OEM code page, but tools written with a Unix view of the world (such as GNU tools - I only tested echo.exe but I suspect the mingw would do the same) use the ANSI codepage.

I could claim this is a build tool issue, as setuptools is returning data in a mixed encoding. But that's just passing the buck, and doesn't help our users.

What I'm going to do, I think, is submit a PR as above, that partially fixes the issue. (I'll even ignore the problem around the user redirecting pip's stdout for now). The main point of it is to get errors="replace" into the codebase, so we (hopefully) stop getting Unicode errors. Whether we can reasonably deal with the mixed-encoding problem is something we can debate in the longer term, depending on what sort of bug reports we receive about corrupted output.
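The mixed-encoding problem described above can be seen with a short example. It assumes a Russian-language Windows setup, where the OEM code page is typically cp866 and the ANSI code page is cp1251; the sample text is made up for the illustration.

```python
text = u"Привет"

# Bytes as a tool writing in the OEM code page might emit them.
oem_bytes = text.encode("cp866")

# Decoding with the matching code page round-trips the text.
right = oem_bytes.decode("cp866", "replace")

# Decoding the same bytes as ANSI succeeds without an error,
# but yields different (garbled) characters.
wrong = oem_bytes.decode("cp1251", "replace")

print(right == text)  # True
print(wrong == text)  # False
```

With errors="replace" neither decode raises, so the user sees mojibake rather than a traceback when the guess is wrong, which is the trade-off accepted in this thread.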

pfmoore · 2017-05-27T21:08:12Z

#4486 has now been merged, which should address this issue. I believe this PR can now be closed.

BrownTruck · 2017-05-28T08:16:21Z

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

xavfernandez · 2017-05-29T22:42:11Z

#4486 has now been merged, which should address this issue. I believe this PR can now be closed.

Agreed. Thanks for the PR though 👍

robinxb force-pushed the patch-1 branch from 196f840 to f037461 Compare February 11, 2017 04:09

try fix encode error pypa#4251

6cbee88

robinxb force-pushed the patch-1 branch from f037461 to 6cbee88 Compare February 11, 2017 04:13

xavfernandez added C: encoding Related to text encoding and likely, UnicodeErrors OS: windows Windows specific labels Feb 11, 2017

dstufft requested a review from pfmoore April 1, 2017 03:35

dstufft added the needs rebase or merge PR has conflicts with current master label Apr 1, 2017

robinxb added 2 commits April 1, 2017 11:51

Merge branch 'master' into patch-1

2fd6951

add new entry

0d6c235

pfmoore reviewed Apr 1, 2017

dstufft removed the needs rebase or merge PR has conflicts with current master label Apr 1, 2017

xavfernandez reviewed Apr 2, 2017

xavfernandez mentioned this pull request Apr 2, 2017

Add one more fail-safe when decoding console to str in Py3 #4310

Closed

xavfernandez added this to the 10.0 milestone Apr 2, 2017

add replace error handler to decode, remove catch ImportError on windows

d1de4fc

robinxb commented Apr 5, 2017

pfmoore reviewed Apr 5, 2017

use getpreferredencoding for console_to_str

c7d43df

xavfernandez mentioned this pull request Apr 23, 2017

pip issues UnicodeDecodeError on Windows 10 for Russian language #4251

Closed

AraHaan approved these changes May 14, 2017

prevent exception while local.getpreferredencoding return None

e7db742

pfmoore mentioned this pull request May 16, 2017

Another attempt to fix encoding issues #4486

Merged

BrownTruck added the needs rebase or merge PR has conflicts with current master label May 28, 2017

xavfernandez closed this May 29, 2017

lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 3, 2019

lock bot locked as resolved and limited conversation to collaborators Jun 3, 2019
