Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

try fix encode error #4251 #4280

Closed
wants to merge 6 commits into from
Closed

try fix encode error #4251 #4280

wants to merge 6 commits into from

Conversation

robinxb
Copy link

@robinxb robinxb commented Feb 11, 2017

Fixes #4110

@xavfernandez xavfernandez added C: encoding Related to text encoding and likely, UnicodeErrors OS: windows Windows specific labels Feb 11, 2017
@mmyjona
Copy link

mmyjona commented Feb 25, 2017

Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@dstufft dstufft requested a review from pfmoore April 1, 2017 03:35
@dstufft
Copy link
Member

dstufft commented Apr 1, 2017

Hi! You need to update your master branch and merge it into this PR. You'll also need to write a new files for this PR.

@dstufft dstufft added the needs rebase or merge PR has conflicts with current master label Apr 1, 2017
@robinxb
Copy link
Author

robinxb commented Apr 1, 2017

@dstufft It's done. Thanks

@dstufft
Copy link
Member

dstufft commented Apr 1, 2017

Thanks, I don't know enough about Windows to meaningfully review the efficacy of this PR. Hopefully @pfmoore has some spare time and knowledge about this bit. I also see there is #4310 which appears to also be trying to fix this issue.

Worst case scenario I'll try and do some research to figure it out.

@robinxb
Copy link
Author

robinxb commented Apr 1, 2017

@dstufft I see. For me, I couldn't tell which way is better. repr surely also works not generate error, but I don't know if its result is correct on most situations.

pip/compat.py Outdated
if WINDOWS:
try:
from ctypes import cdll
return s.decode("cp" + str(cdll.kernel32.GetACP()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using the ANSI codepage here? As we're typically running a CLI subprocess, should it not be the OEM codepage that it will use? It's also worth noting that under Python 3.6, sys.__stdout__.encoding will be UTF-8 (as Python 3.6 changed to use Unicode output on the console) but this will not necessarily match what a process run via subprocess.call would see.

This whole area is something of a gross hack anyway, as there's really no way of knowing what encoding a subprocess might have chosen to use for its output. The best we can do is guess.

Maybe what would be better would be:

  1. Try sys.__stdout__.encoding.
  2. Try UTF-8 (just because we already do that, I don't really know the justification, maybe it's for Unix)
  3. Use ASCII, with a suitable error handler (backslashreplace for Python 3.5+, replace for earlier versions)

Use the first result that doesn't give UnicodeDecodeError (option 3 never should).

As far as I can see, the only way we use this function is to display the output of the setup.py invocation, so a little bit of information loss or unreadability (the replace/backslashreplace option) is OK - certainly better than a decoding error.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. But I wonder if before we try to use ASCII, maybe we should try OEM codepage at least.
The information loss may cause it's unable to read at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I have no problem with that. You might get the wrong characters, but it's not likely to do any harm.

The only possible issue (and this occurs if we try UTF-8 and it works, as well) is that the output contains some characters which are decodable using whatever encoding we end up succeeding with, but which aren't encodable in sys.stdout.encoding. But that's likely to be very rare in practice (and will go away in Python 3.6, which uses Unicode for console output, and so can print anything).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to docs the backslashreplace error handling is only for encoding,

So I think maybe we could just change the code from
return s.decode("cp" + str(cdll.kernel32.GetACP()))
to
return s.decode("cp" + str(cdll.kernel32.GetACP()), "replace")
to prevent error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backslashreplace is usable for decode in Python 3.5+ (from the docs) hence my comment about only using it there. replace is OK, though - it just loses a bit more information, but that's OK.

@dstufft dstufft removed the needs rebase or merge PR has conflicts with current master label Apr 1, 2017
pip/compat.py Outdated
try:
from ctypes import cdll
return s.decode("cp" + str(cdll.kernel32.GetACP()))
except ImportError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What could cause an ImportError here ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. As this code is Windows-only, ctypes should always be present (I think it can be omitted in certain types of Unix build, but it never will be on Windows).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with IronPython, so I don't know if the ctypes module could be imported properly.
So I just add the try-catch to prevent more error.
If it's unnecessary, I'm glad to remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I hadn't thought about IronPython (or jython for that matter). Quite possibly ctypes isn't available there.

Copy link
Author

@robinxb robinxb Apr 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, ctypes doesn't exist in IronPython until version 2.6. This try-catch should be there instead of removing it.

Copy link
Author

@robinxb robinxb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pfmoore @xavfernandez Thanks for your suggestions, I've made some changes to the codes. It should be better than origin.

pip/compat.py Outdated
return s.decode('utf_8')
if WINDOWS:
from ctypes import cdll
return s.decode("cp" + str(cdll.kernel32.GetACP()), 'replace')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think OEMCP is more likely to be correct here.

Copy link
Author

@robinxb robinxb Apr 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've check the subprocess use locale.getpreferredencoding() as its default encoding.
So I follow the code to here, and here, it seems python use GetACP instead of GetOEMCP on Windows.
Maybe better way to solve this problem, is try to change the console_to_str function directly to

return s.decode(locale.getpreferredencoding(), 'replace') no matter if it's windows or not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that's for Python programs, but what do other programs (like the C compiler) use? I thought that console programs were supposed to use the OEM code page.

I guess it doesn't matter that much either way, though. If we guess wrong (and it's never better than a guess) then the worst that happens is that the output is a bit garbled. But that's still better than what currently happens.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think it doesn't matter that much too.
After such a long discussion, I think we ended up with some solutions together.

  1. if windows, return s.decode("cp" + str(cdll.kernel32.GetACP()), 'replace') with try-catch for ImportError of ctypes.
  2. if windows, return s.decode("cp" + str(cdll.kernel32.GetOEMCP()), 'replace') with try-catch for ImportError of ctypes.
  3. Regardless of whether it is windows or not, use s.decode(locale.getpreferredencoding(), 'replace')

I prefer the third choice for now. It's more clean and no need to handle ImportError
If you think there's better way just feel free to let me know, or I will try to change it to the third solution.
I wonder if is OK with you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the third option. Like you say, it seems cleaner. Thanks for sticking with this!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are welcome, I've learned a lot from our discussion.
Thanks for your patience

@sakurai-youhei
Copy link

Just FYI: In #4430, new test is being added, for one case that locale.getpreferredencoding() may return None.

Copy link
Contributor

@AraHaan AraHaan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit better than how #4310 does it.

@pfmoore
Copy link
Member

pfmoore commented May 15, 2017

(Note: This is mostly a brain dump of my thoughts on the whole encoding situation here. It's not specific to this PR, but does include some points that we should consider as part of reviewing this PR. I'll come back and do a review of the specific changes proposed here when I have a little more time to think things through).

Ultimately, console_to_str is used in only one place in pip - to take the output from a subprocess call (call_subprocess in pip/utils/__init__.py) and make a string out of it, for the sole purpose of displaying that output to the user. The subprocess output is a byte string, so to do that we need an encoding. The problem is that what that encoding should be is difficult to establish, platform dependent, and theoretically impossible to get (because technically programs emit byte streams on stdout, and it's purely a matter of convention what encoding they use).

This is also compounded by the fact that on Python 2, console_to_str is a no-op, which is really dangerous - it's working on the "strings are bytes" confusion that is why Python 2 enconding handling is so hard, and it makes it impossible to cleanly document the data flow (in terms of when objects are logically considered to be encoded as bytes and when they are considered to be abstract strings). Unfortunately, our string model on Python 2 is not clean, so using unicode as the output here would likely cause more problems than it solves. It's likely the best we can do is either ignore Python 2, or re-encode the input to a known encoding.

So we have some choices to make:

  1. locale.getpreferredencoding() should be a good default, but there seems to be a lot of weirdness here. particularly on Unix (hence I don't really understand it). On Windows, from my limited experiments, it's the ANSI codepage, but console programs typically output in the OEM codepage (because that's what the C runtime does), which is different. There's also a locale.getpreferredencoding(False) vs locale.getpreferredencoding() distinction, which seems to only make a difference on Unix (?)
  2. sys.stdout.encoding works on the basis of "if that's how Python writes to the console, then it's probably how other console programs work, too". But Python 3.6 on Windows breaks this rule, as it uses utf-8 (because we don't use the C runtime to output strings, unlike most "other programs"). Also, the PYTHONIOENCODING variable affects this, even though it won't affect external non-Python programs.

My instinct is that we should be saying:

  • On Unix, we always use locale.getpreferredencoding(False), as that matches Python's own behaviour. (I'm willing to be overridden on this by a Unix expert, as long as we clearly document what we did choose, and why. For example this PR seems to imply that None is a valid return value for getpreferredencoding and that utf-8 is a sensible default in that case - where's the evidence for that? The documentation doesn't mention a None return value as far as I can see...).
  • On Windows we use the OEM code page. I'm not sure Python offers a way to get this prior to 3.6 (the "oem" encoding is available in Python 3.6) so we may have to write some OS-specific code for this. I'd be willing to do this.
  • On Python 2, we stick with the current behaviour, and if we get bug reports for encoding issues on Python 2, we simply close them as "won't fix - upgrade to Python 3 if you want cleaner Unicode handling".

We should always use the "replace" error handler, because this is for display to the user, and replacing invalid characters is better than an error. Unfortunately, from what I recall, none of the "better" error handlers like "backslashreplace" are available in all the Python versions pip supports, but we should switch to a better one as soon as we can.

I don't particularly like the approach of trying an encoding, and falling back to a different one if we get an error. That seems to me to be perpetuating the "guess and hope" approach for little benefit over using errors="replace".

Also, once we have decided what encoding we expect build tools to produce their output in, we should clearly document it once and for all and we should make the definition part of the "pluggable build tools" spec - this doesn't make the problem go away, but it makes it clearly the responsibility of build tools going forward to deal with the vagaries of C compilers, etc, in how they encode their output. We should also document in the interim that if build programs like compilers don't follow the documented encoding rules, then pip will potentially display mojibake or replace incorrectly encoded characters.

As a helper for people reporting encoding bugs, we could write a script that reports the encoding pip expects to be using, plus the various other encoding settings in the user's system. That might help us understand the edge cases where people are hitting problems (typically, we can't expect them to have a detailed understanding of encodings, so it's often hard to get a clear description of the issue - "My system is set to use Swedish" is generally as much as the user knows, but the pip devs don't know what "set to use Swedish" implies).

Any thoughts? @dstufft, @xavfernandez? I'm willing to write a PR implementing the above, but only if it's an agreed long-term resolution. I don't want to spend the time doing so if we get another "let's try it this way instead" PR the next time we get a bug report - let's fix it properly once and for all.

(All of the above comments would affect #4310 as well)

@dstufft
Copy link
Member

dstufft commented May 15, 2017

I think that sounds entirely reasonable, the only things I'd have to add are:

On Unix, we always use locale.getpreferredencoding(False), as that matches Python's own behaviour. (I'm willing to be overridden on this by a Unix expert, as long as we clearly document what we did choose, and why. For example this PR seems to imply that None is a valid return value for getpreferredencoding and that utf-8 is a sensible default in that case - where's the evidence for that? The documentation doesn't mention a None return value as far as I can see...).

I don't know how the False parameter changes it, but I have definitely seen None as a return value in the wild for locale.getpreferredencoding(). I have no idea what situation arises that makes that the case though.

As a helper for people reporting encoding bugs, we could write a script that reports the encoding pip expects to be using, plus the various other encoding settings in the user's system. That might help us understand the edge cases where people are hitting problems (typically, we can't expect them to have a detailed understanding of encodings, so it's often hard to get a clear description of the issue - "My system is set to use Swedish" is generally as much as the user knows, but the pip devs don't know what "set to use Swedish" implies).

Seems like instead of a script, we could make a separate command, something like pip debuginfo that just outputs a bunch of information about the state of their system. We could add to it overtime and instead of asking people to list out things like Python version, pip version, etc etc in our tickets, we could change the instructions to "Run pip debuginfo and include the results".

@pfmoore
Copy link
Member

pfmoore commented May 15, 2017

I don't know how the False parameter changes it, but I have definitely seen None as a return value in the wild for locale.getpreferredencoding(). I have no idea what situation arises that makes that the case though.

OK. I can do some digging to see if I can find a bit more explanation. It'd be nice to capture more of the details in the Python docs. I guess that if None is returned, then falling back to UTF8 (as this PR does) would be OK?

As for the False parameter, the docs say "on some systems, it is necessary to invoke setlocale() to obtain the user preferences, so this function is not thread-safe. If invoking setlocale is not necessary or desired, do_setlocale should be set to False." I have no way of knowing whether the "some systems" referred to are ones we care about, and if they are, whether skipping setlocale is a good idea for pip. I guess we don't have thread issues to worry about at this point, so maybe we can skip the False (it may cause issues for the people who insist on using pip in-process, but that's not supported of course...)

Overall, the getpreferredencoding docs say "this function only returns a guess" so I guess we're never going to be perfect. I've no idea if there's any better option in Unix, or if the cases where it guesses wrong are a problem to our users in practice. From what I recall, most of the encoding issues for pip are from Windows users, and so are likely to be related to the ANSI/OEM codepage issue.

Seems like instead of a script, we could make a separate command, something like pip debuginfo that just outputs a bunch of information about the state of their system

That might be nice. I wasn't sure people would be OK with building this into pip, but if you're OK with it in principle, I definitely agree it'd make getting better issue reports easier :-)

@pfmoore
Copy link
Member

pfmoore commented May 15, 2017

How does this code look?

if sys.version_info >= (3,):
    def subprocess_encoding():
        if WINDOWS:
            if sys.version_info >= (3, 6):
                return "oem"
            # Prior to Python 3.6, sys.__stdout__ is opened
            # with the OEM code page (at least in the console
            # interpreter, which is what pip uses). This changed
            # in Python 3.6 to be UTF8, but we don't care as we
            # use the explicit "oem" encoding in that case via
            # the code above
            return sys.__stdout__.encoding
        else:
            import locale
            # Note that the use of getpreferredencoding here calls
            # setlocale, which isn't thread safe. This is OK, as we
            # never call this code in a multi-threaded context.
            # If thread safety was important, we could call
            # getpreferredencoding(False), but there are apparently
            # some systems where that will not give the correct answer.
            #
            # We fall back to UTF8 if locale can't provide an answer,
            # as UTF8 is the most common encoding used nowadays.
            return locale.getpreferredencoding() or "utf-8"

    def console_to_str(s):
        return s.decode(subprocess_encoding(), errors="replace")

Usual provisos have to apply here - I don't have the means to test this on any sort of non-English environment (at least, not without going to more effort than I can afford at the moment).

If it looks OK, I'll add some docs explaining that this is how we expect build systems to work (as described above) and turn it into a proper PR.

@pfmoore
Copy link
Member

pfmoore commented May 16, 2017

There's a problem with my above code. On Windows, if the user redirects stdout, then sys.__stdout__.encoding is the ANSI codepage, not the OEM one.

Actually, the problem is (far) worse than that. Because we use subprocess.Popen when we run setup.py, the Python run in the subprocess will output using the ANSI codepage. But any command line tools run by setup.py (e.g., the C compiler) will do their own thing, and I don't believe setuptools (or distutils) does anything to deal with that. In particular, tools written to Windows standards (such as Visual C) will probably use the OEM code page, but tools written with a Unix view of the world (such as GNU tools - I only tested echo.exe but I suspect the mingw would do the same) use the ANSI codepage.

I could claim this is a build tool issue, as setuptools is returning data in a mixed encoding. But that's just passing the buck, and doesn't help our users.

What I'm going to do, I think, is submit a PR as above, that partially fixes the issue. (I'll even ignore the problem around the user redirecting pip's stdout for now). The main point of it is to get errors="replace" into the codebase, so we (hopefully) stop getting Unicode errors. Whether we can reasonably deal with the mixed-encoding problem is something we can debate in the longer term, depending on what sort of bug reports we receive about corrupted output.

@pfmoore
Copy link
Member

pfmoore commented May 27, 2017

#4486 has now been merged, which should address this issue. I believe this PR can now be closed.

@BrownTruck
Copy link
Contributor

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will eligible for code review and hopefully merging!

@BrownTruck BrownTruck added the needs rebase or merge PR has conflicts with current master label May 28, 2017
@xavfernandez
Copy link
Member

#4486 has now been merged, which should address this issue. I believe this PR can now be closed.

Agreed. Thanks for the PR though 👍

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 3, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 3, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation C: encoding Related to text encoding and likely, UnicodeErrors needs rebase or merge PR has conflicts with current master OS: windows Windows specific
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pip install codepage error (sqlalchemy,windows)
8 participants