-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
try fix encode error #4251 #4280
Conversation
Reviewed 1 of 1 files at r1. Comments from Reviewable |
Hi! You need to update your |
@dstufft It's done. Thanks |
Thanks, I don't know enough about Windows to meaningfully review the efficacy of this PR. Hopefully @pfmoore has some spare time and knowledge about this bit. I also see there is #4310 which appears to also be trying to fix this issue. Worst case scenario I'll try and do some research to figure it out. |
@dstufft I see. For me, I couldn't tell which way is better. |
pip/compat.py
Outdated
if WINDOWS: | ||
try: | ||
from ctypes import cdll | ||
return s.decode("cp" + str(cdll.kernel32.GetACP())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be using the ANSI codepage here? As we're typically running a CLI subprocess, should it not be the OEM codepage that it will use? It's also worth noting that under Python 3.6, sys.__stdout__.encoding
will be UTF-8 (as Python 3.6 changed to use Unicode output on the console) but this will not necessarily match what a process run via subprocess.call
would see.
This whole area is something of a gross hack anyway, as there's really no way of knowing what encoding a subprocess might have chosen to use for its output. The best we can do is guess.
Maybe what would be better would be:
- Try
sys.__stdout__.encoding
. - Try UTF-8 (just because we already do that, I don't really know the justification, maybe it's for Unix)
- Use ASCII, with a suitable error handler (backslashreplace for Python 3.5+, replace for earlier versions)
Use the first result that doesn't give UnicodeDecodeError
(option 3 never should).
As far as I can see, the only way we use this function is to display the output of the setup.py
invocation, so a little bit of information loss or unreadability (the replace/backslashreplace option) is OK - certainly better than a decoding error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. But I wonder if before we try to use ASCII, maybe we should try OEM codepage at least.
The information loss may cause it's unable to read at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I have no problem with that. You might get the wrong characters, but it's not likely to do any harm.
The only possible issue (and this occurs if we try UTF-8 and it works, as well) is that the output contains some characters which are decodable using whatever encoding we end up succeeding with, but which aren't encodable in sys.stdout.encoding
. But that's likely to be very rare in practice (and will go away in Python 3.6, which uses Unicode for console output, and so can print anything).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to docs the backslashreplace error handling is only for encoding,
So I think maybe we could just change the code from
return s.decode("cp" + str(cdll.kernel32.GetACP()))
to
return s.decode("cp" + str(cdll.kernel32.GetACP()), "replace")
to prevent error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
backslashreplace is usable for decode in Python 3.5+ (from the docs) hence my comment about only using it there. replace
is OK, though - it just loses a bit more information, but that's OK.
pip/compat.py
Outdated
try: | ||
from ctypes import cdll | ||
return s.decode("cp" + str(cdll.kernel32.GetACP())) | ||
except ImportError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What could cause an ImportError
here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. As this code is Windows-only, ctypes should always be present (I think it can be omitted in certain types of Unix build, but it never will be on Windows).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with IronPython, so I don't know if the ctypes module could be imported properly.
So I just add the try-catch to prevent more error.
If it's unnecessary, I'm glad to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I hadn't thought about IronPython (or jython for that matter). Quite possibly ctypes isn't available there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, ctypes doesn't exist in IronPython until version 2.6. This try-catch should be there instead of removing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pfmoore @xavfernandez Thanks for your suggestions, I've made some changes to the codes. It should be better than origin.
pip/compat.py
Outdated
return s.decode('utf_8') | ||
if WINDOWS: | ||
from ctypes import cdll | ||
return s.decode("cp" + str(cdll.kernel32.GetACP()), 'replace') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think OEMCP is more likely to be correct here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've check the subprocess use locale.getpreferredencoding() as its default encoding.
So I follow the code to here, and here, it seems python use GetACP instead of GetOEMCP on Windows.
Maybe better way to solve this problem, is try to change the console_to_str
function directly to
return s.decode(locale.getpreferredencoding(), 'replace')
no matter if it's windows or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that's for Python programs, but what do other programs (like the C compiler) use? I thought that console programs were supposed to use the OEM code page.
I guess it doesn't matter that much either way, though. If we guess wrong (and it's never better than a guess) then the worst that happens is that the output is a bit garbled. But that's still better than what currently happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think it doesn't matter that much too.
After such a long discussion, I think we ended up with some solutions together.
- if windows,
return s.decode("cp" + str(cdll.kernel32.GetACP()), 'replace')
with try-catch forImportError
of ctypes. - if windows,
return s.decode("cp" + str(cdll.kernel32.GetOEMCP()), 'replace')
with try-catch forImportError
of ctypes. - Regardless of whether it is windows or not, use
s.decode(locale.getpreferredencoding(), 'replace')
I prefer the third choice for now. It's more clean and no need to handle ImportError
If you think there's better way just feel free to let me know, or I will try to change it to the third solution.
I wonder if is OK with you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with the third option. Like you say, it seems cleaner. Thanks for sticking with this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are welcome, I've learned a lot from our discussion.
Thanks for your patience
Just FYI: In #4430, new test is being added, for one case that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bit better than how #4310 does it.
(Note: This is mostly a brain dump of my thoughts on the whole encoding situation here. It's not specific to this PR, but does include some points that we should consider as part of reviewing this PR. I'll come back and do a review of the specific changes proposed here when I have a little more time to think things through). Ultimately, This is also compounded by the fact that on Python 2, So we have some choices to make:
My instinct is that we should be saying:
We should always use the "replace" error handler, because this is for display to the user, and replacing invalid characters is better than an error. Unfortunately, from what I recall, none of the "better" error handlers like "backslashreplace" are available in all the Python versions pip supports, but we should switch to a better one as soon as we can. I don't particularly like the approach of trying an encoding, and falling back to a different one if we get an error. That seems to me to be perpetuating the "guess and hope" approach for little benefit over using Also, once we have decided what encoding we expect build tools to produce their output in, we should clearly document it once and for all and we should make the definition part of the "pluggable build tools" spec - this doesn't make the problem go away, but it makes it clearly the responsibility of build tools going forward to deal with the vagaries of C compilers, etc, in how they encode their output. We should also document in the interim that if build programs like compilers don't follow the documented encoding rules, then pip will potentially display mojibake or replace incorrectly encoded characters. As a helper for people reporting encoding bugs, we could write a script that reports the encoding pip expects to be using, plus the various other encoding settings in the user's system. That might help us understand the edge cases where people are hitting problems (typically, we can't expect them to have a detailed understanding of encodings, so it's often hard to get a clear description of the issue - "My system is set to use Swedish" is generally as much as the user knows, but the pip devs don't know what "set to use Swedish" implies). Any thoughts? @dstufft, @xavfernandez? I'm willing to write a PR implementing the above, but only if it's an agreed long-term resolution. I don't want to spend the time doing so if we get another "let's try it this way instead" PR the next time we get a bug report - let's fix it properly once and for all. (All of the above comments would affect #4310 as well) |
I think that sounds entirely reasonable, the only things I'd have to add are:
I don't know how the
Seems like instead of a script, we could make a separate command, something like |
OK. I can do some digging to see if I can find a bit more explanation. It'd be nice to capture more of the details in the Python docs. I guess that if As for the Overall, the
That might be nice. I wasn't sure people would be OK with building this into pip, but if you're OK with it in principle, I definitely agree it'd make getting better issue reports easier :-) |
How does this code look? if sys.version_info >= (3,):
def subprocess_encoding():
if WINDOWS:
if sys.version_info >= (3, 6):
return "oem"
# Prior to Python 3.6, sys.__stdout__ is opened
# with the OEM code page (at least in the console
# interpreter, which is what pip uses). This changed
# in Python 3.6 to be UTF8, but we don't care as we
# use the explicit "oem" encoding in that case via
# the code above
return sys.__stdout__.encoding
else:
import locale
# Note that the use of getpreferredencoding here calls
# setlocale, which isn't thread safe. This is OK, as we
# never call this code in a multi-threaded context.
# If thread safety was important, we could call
# getpreferredencoding(False), but there are apparently
# some systems where that will not give the correct answer.
#
# We fall back to UTF8 if locale can't provide an answer,
# as UTF8 is the most common encoding used nowadays.
return locale.getpreferredencoding() or "utf-8"
def console_to_str(s):
return s.decode(subprocess_encoding(), errors="replace") Usual provisos have to apply here - I don't have the means to test this on any sort of non-English environment (at least, not without going to more effort than I can afford at the moment). If it looks OK, I'll add some docs explaining that this is how we expect build systems to work (as described above) and turn it into a proper PR. |
There's a problem with my above code. On Windows, if the user redirects stdout, then Actually, the problem is (far) worse than that. Because we use I could claim this is a build tool issue, as setuptools is returning data in a mixed encoding. But that's just passing the buck, and doesn't help our users. What I'm going to do, I think, is submit a PR as above, that partially fixes the issue. (I'll even ignore the problem around the user redirecting pip's stdout for now). The main point of it is to get |
#4486 has now been merged, which should address this issue. I believe this PR can now be closed. |
Hello! I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the |
Agreed. Thanks for the PR though 👍 |
Fixes #4110