More Windows encoding woes #241

Closed
Grk0 opened this Issue May 2, 2015 · 5 comments

@Grk0
Grk0 commented May 2, 2015

I'm running into problems with the latest (post #201) encoding handling. A "tl;dr" is at the bottom.

Using invoke to run msbuild (the Visual Studio version of make), I get:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Program Files (x86)\Python34\lib\threading.py", line 920, in _bootstrap_inner
    self.run()
  File "C:\Program Files (x86)\Python34\lib\threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "C:\...\venv\lib\site-packages\invoke\runner.py", line 211, in display
    dst.write(data)
  File "C:\Program Files (x86)\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u201d' in position 101: character maps to <undefined>

The problem is the default value of encoding in invoke.runners.Local.run_direct(). Like @pfmoore, I also have a system where locale.getpreferredencoding() and sys.stdout.encoding differ (cp1252 vs cp850). In #201, the default encoding was chosen to be locale.getpreferredencoding(False), because Python uses it when its stdout is redirected and it can't discern the file descriptor locale via os.device_encoding(sys.stdout.fileno()).

C:\>chcp
Aktive Codepage: 850.

C:\>type encodingdemo.py
#!/usr/bin/python3
import sys
import locale
print('locale.getpreferredencoding:', locale.getpreferredencoding(), file=sys.stderr)
print('sys.stdout.encoding:', sys.stdout.encoding, file=sys.stderr)

C:\>py -3.4 encodingdemo.py
locale.getpreferredencoding: cp1252
sys.stdout.encoding: cp850

C:\>py -3.4 encodingdemo.py > some_file
locale.getpreferredencoding: cp1252
sys.stdout.encoding: cp1252

Note how the stdout encoding changes when redirecting. Since invoke does indeed redirect its child's I/O, it's correct to use locale.getpreferredencoding() to get the appropriate encoding.

Unfortunately, msbuild, or more generally, .NET's Console class (documentation, source code) doesn't work this way. When it opens stdout, it uses the console output codepage, even when not writing directly to the console (see InitializeStdOutError in the source code link above):

int codePage = (int) Win32Native.GetConsoleOutputCP(); 
Encoding encoding = Encoding.GetEncoding(codePage);

In other words, it recovers the console encoding (sys.stdout.encoding in Python) even when redirected, whereas Python falls back to locale.getpreferredencoding().
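The traceback at the top can be reproduced without msbuild. This is a minimal sketch, assuming the child writes CP850 bytes, the reader decodes them as cp1252, and the result is echoed to a CP850 console; the variable names are only for illustration.

```python
# Byte that a CP850 producer (e.g. a .NET console app) emits for 'ö'.
raw = 'ö'.encode('cp850')

# Decoding with the locale encoding (cp1252 on this system) does not fail,
# it just yields a different character: a right double quotation mark.
misread = raw.decode('cp1252')

# Echoing the misread text to a CP850 console is what finally raises.
try:
    misread.encode('cp850')
    echo_failed = False
except UnicodeEncodeError:
    echo_failed = True

print(repr(raw), repr(misread), echo_failed)
```

The error only surfaces at the final write, which is why the traceback points at dst.write(data) rather than at the place where the bytes were decoded.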

C:\>chcp
Aktive Codepage: 850.
C:\>msbuild 2>&1 > msbuild.cp850    # German with CP850 chars

C:\>chcp 1250
Aktive Codepage: 1250.
C:\>msbuild 2>&1 > msbuild.cp1250   # English, ASCII

C:\>chcp 65001
Aktive Codepage: 65001.
C:\>msbuild 2>&1 > msbuild.cp65001  # German with UTF-8 chars

(Chcp changes the active codepage but doesn't alter locale.getpreferredencoding()).

This means there is no way to correctly handle both Python apps (which output with locale.getpreferredencoding() encoding when redirected) and .NET apps (which output with the console encoding when redirected). :-(

Possible solutions:

  • Replace dst.write(data) with dst.buffer.write(data.encode(dst.encoding, errors='replace')) to avoid the UnicodeEncodeError when trying to write incorrectly read input to the console. This should definitely be done, irrespective of trying to get the encoding right.
  • We could try to decode the input data with both possible encodings on Windows and see what works, but this is super hackish.
  • We could set the codepage to locale.getpreferredencoding(), but that's a per-console-window setting (not per-process-tree), so we'd need to reset it to the previous value on exit (clean or via exception). Also, it's unclear what implications that will have for child processes. As a user, I wouldn't expect invoke to mess with my active codepage.
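The first option could look roughly like the sketch below. The helper name write_lossy and the in-memory stand-in for a CP850 console are assumptions made for illustration, not invoke code.

```python
import io


def write_lossy(dst, data):
    """Write text to a text stream, replacing characters it cannot encode."""
    encoding = dst.encoding or 'ascii'
    # Flush pending text first so output order is preserved when we
    # bypass the text layer and write to the underlying byte buffer.
    dst.flush()
    dst.buffer.write(data.encode(encoding, errors='replace'))
    dst.buffer.flush()


# Stand-in for a console whose encoding is cp850 (hypothetical setup).
sink = io.BytesIO()
console = io.TextIOWrapper(sink, encoding='cp850')

# U+201D has no CP850 representation, so it is written as '?'.
write_lossy(console, 'say \u201dhi\u201d')
print(sink.getvalue())
```

This only affects what is echoed; the lossy replacement never touches the original string.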

tl;dr: Python and .NET output with different encodings when run from invoke. Invoke can't force an encoding on them and it can't know which one a called program will use. We're hosed.

@pfmoore
Contributor
pfmoore commented May 3, 2015

Can you not set the correct encoding using the encoding argument of the run() command?

Fundamentally, the problem is that there is no way of determining the encoding of any given stream of bytes without out-of-band information, which simply isn't available.

Regarding the proposed solutions:

  1. errors='replace' may be worth doing, if the data being written is purely for human consumption. Generally I don't like that setting for any other purpose as it loses data. But IIRC (I haven't checked the code) this also affects the captured stdout, and I'm -1 on losing data for that without the user's explicit confirmation.
  2. Trying to guess the encoding is a recipe for disaster (typically both encodings will "work", it's just that one will be mojibake). I don't think this would help, and the cost would be non-trivial.
  3. I'm strongly -1 on changing the code page, for the reasons you give.
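Point 2 can be checked with a few lines. This is a sketch with a made-up sample string; decodes_cleanly is a hypothetical helper, not something invoke provides.

```python
def decodes_cleanly(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# German sample text as a CP850 producer would emit it.
raw = 'Größe'.encode('cp850')

# Both candidate encodings accept the bytes; only one result is readable.
candidates = {enc: decodes_cleanly(raw, enc) for enc in ('cp850', 'cp1252')}
for enc, text in candidates.items():
    print(enc, repr(text))
```

Since neither decode raises, "try both and see what works" has nothing to go on without inspecting the text itself.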

My preference is to simply document the encoding that is used, and advise the user to use the encoding argument if the actual encoding is different.

@Grk0
Grk0 commented May 3, 2015

Yes, using the encoding argument is possible; it's just pretty annoying to use it constantly. If you care about getting the correct encoding, it's the best pyinvoke can offer, though.
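One way to cut down on the repetition is to bind the argument once. A minimal sketch, assuming a run-like callable that accepts an encoding keyword as discussed in this thread; fake_run is a stub used here so the example is self-contained, and with_default_encoding is a hypothetical helper.

```python
import functools


def with_default_encoding(run_func, encoding):
    """Return a callable that passes `encoding` unless the caller overrides it."""
    return functools.partial(run_func, encoding=encoding)


# Stub standing in for a run() function that takes an `encoding` keyword.
def fake_run(command, encoding=None):
    return (command, encoding)


run_cp850 = with_default_encoding(fake_run, 'cp850')

print(run_cp850('msbuild'))                 # uses the bound default
print(run_cp850('tree', encoding='utf-8'))  # per-call override still works
```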

Re (1), it's only for the output to stdout/stderr; the stored data is untouched (source link, note that dst is hardcoded to sys.stdout or sys.stderr on line 251). If we don't manually encode here, we get an exception, the display() thread dies, and the command being run hangs once it has filled its output buffer. This is a pretty suboptimal situation.

@bitprophet
Member

I've only skimmed this so far, but wanted to note that I've been doing a massive reorg of the run() internals which should be merging soon; as part of that, I seem to remember adding a config-level option for encoding which would solve the bulk of the problem, right? (You'd be able to set it at the config file, env var, or context level, and then have it apply to all run calls.)

Re: encoding/decoding applying to the stored data as well as the printed, it feels cleaner to preserve the raw bytes as seen from the subprocess so that they can be introspected and decoded or encoded as necessary for e.g. debugging purposes. Whether we should extend this to store both copies (raw and decoded) for convenience, I am unsure of.
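Keeping the raw bytes and decoding on demand might look something like this. CapturedOutput is a hypothetical container for illustration only, not the structure used in the reorg.

```python
class CapturedOutput:
    """Hold subprocess output as raw bytes and decode only when asked."""

    def __init__(self, raw):
        self.raw = raw  # bytes exactly as read from the child process

    def text(self, encoding, errors='strict'):
        # Decoding is repeatable with a different encoding because the
        # raw bytes are never modified.
        return self.raw.decode(encoding, errors)


out = CapturedOutput(b'Gr\x94\xe1e')
print(repr(out.text('cp850')))   # decoded with the console codepage
print(repr(out.text('cp1252')))  # same bytes, different guess
```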

@lengtche

FWIW

The only command I encountered this issue with was run('tree -I "__pycache__|.env|_build" -h -C --dirsfirst'). Adding parameter encoding='utf-8' fixed it for me.

I'm using v0.10.1, which is currently the latest version on PyPI.

@bitprophet
Member

In my environment, the work on #350 has fixed the issues with tree output, and at a glance I think it will have solved the originally-reported issue here too. Going to optimistically close; please reopen if the problem persists once #350 is merged & released (soon).

@bitprophet bitprophet closed this Jun 2, 2016