I'm running into problems with the latest (post #201) encoding handling. A "tl;dr" is at the bottom.
Using invoke to run msbuild (the Visual Studio version of make), I get:
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Program Files (x86)\Python34\lib\threading.py", line 920, in _bootstrap_inner
File "C:\Program Files (x86)\Python34\lib\threading.py", line 868, in run
File "C:\...\venv\lib\site-packages\invoke\runner.py", line 211, in display
File "C:\Program Files (x86)\Python34\lib\encodings\cp850.py", line 19, in encode
UnicodeEncodeError: 'charmap' codec can't encode character '\u201d' in position 101: character maps to <undefined>
The problem is the default value of encoding in invoke.runners.Local.run_direct(). Like @pfmoore, I also have a system where locale.getpreferredencoding() and sys.stdout.encoding differ (cp1252 vs cp850). In #201, the default encoding was chosen to be locale.getpreferredencoding(False), because Python uses it when its stdout is redirected and it can't discern the file descriptor locale via os.device_encoding(sys.stdout.fileno()).
Aktive Codepage: 850.
import locale
import sys
print('getpreferredencoding:', locale.getpreferredencoding(), file=sys.stderr)
print('sys.stdout.encoding:', sys.stdout.encoding, file=sys.stderr)
C:\>py -3.4 encodingdemo.py
C:\>py -3.4 encodingdemo.py > some_file
Note how the stdout encoding changes when redirecting. Since invoke does indeed redirect its child's I/O, it's correct to use locale.getpreferredencoding() to get the appropriate encoding.
Unfortunately, msbuild, or more generally, .NET's Console class (documentation, source code) doesn't work this way. When it opens stdout, it uses the console output codepage, even when not writing directly to the console (see InitializeStdOutError in the source code link above):
int codePage = (int) Win32Native.GetConsoleOutputCP();
Encoding encoding = Encoding.GetEncoding(codePage);
In other words, it recovers the console encoding (sys.stdout.encoding in Python) even when redirected, whereas Python falls back to locale.getpreferredencoding().
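To compare the two values from Python, here is a minimal sketch that asks Windows for the console output code page the same way the .NET snippet above does. The helper name console_output_encoding is made up for illustration and is not part of invoke. It returns None off Windows or when no console is attached.

```python
import ctypes
import locale
import sys


def console_output_encoding():
    """Return the console output code page as a codec name, or None.

    Hypothetical helper: only meaningful on Windows with a console attached.
    """
    if sys.platform != "win32":
        return None
    code_page = ctypes.windll.kernel32.GetConsoleOutputCP()
    # GetConsoleOutputCP returns 0 on failure (e.g. no console attached).
    return "cp{}".format(code_page) if code_page else None


if __name__ == "__main__":
    print("console output:", console_output_encoding(), file=sys.stderr)
    print("preferred:", locale.getpreferredencoding(False), file=sys.stderr)
```

On the system described above, I'd expect this to report cp850 for the console and cp1252 for the preferred encoding, redirected or not.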
Aktive Codepage: 850.
C:\>msbuild 2>&1 > msbuild.cp850 # German with CP850 chars
Aktive Codepage: 1250.
C:\>msbuild 2>&1 > msbuild.cp1250 # English, ASCII
Aktive Codepage: 65001.
C:\>msbuild 2>&1 > msbuild.cp65001 # German with UTF-8 chars
(Chcp changes the active codepage but doesn't alter locale.getpreferredencoding()).
This means there is no way to correctly handle both Python apps (which output with locale.getpreferredencoding() encoding when redirected) and .NET apps (which output with the console encoding when redirected). :-(
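A small sketch of one plausible way the traceback at the top comes about, assuming the child wrote a German "ö" in cp850 and the reader decoded it as cp1252. The concrete character is my assumption; the two code pages are the ones from this report.

```python
# Assumed scenario: the child writes text using the console code page (cp850),
# while the reader decodes it with the preferred encoding (cp1252).
child_bytes = "\u00f6".encode("cp850")    # 'ö' becomes b'\x94'
misread = child_bytes.decode("cp1252")    # b'\x94' is U+201D in cp1252

try:
    # Echoing the misread text to a cp850 stdout fails: cp850 has no U+201D.
    misread.encode("cp850")
except UnicodeEncodeError as exc:
    error = exc
else:
    error = None
```

The same byte is a valid character in both code pages, so the decode step can't detect the mismatch. It only surfaces later, when the wrong character has to be re-encoded for the console.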
tl;dr: Python and .NET output with different encodings when run from invoke. Invoke can't force an encoding on them and it can't know which one a called program will use. We're hosed.
Can you not set the correct encoding using the encoding argument of the run() command?
Fundamentally, the problem is that there is no way of determining the encoding of any given stream of bytes without out-of-band information, which simply isn't available.
Regarding the proposed solutions:
My preference is to simply document the encoding that is used, and advise the user to use the encoding argument if the actual encoding is different.
Yes, using the encoding argument is possible; it's just pretty annoying to have to pass it constantly. If you care about getting the correct encoding, though, it's the best pyinvoke can offer.
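Until there's a way to set it globally, one way to cut down the repetition is to preset the argument once. This is only a sketch: with_default_encoding is a hypothetical helper, and it assumes nothing about run() beyond it accepting the encoding keyword mentioned above.

```python
import functools


def with_default_encoding(run, encoding):
    """Return a callable like `run` with `encoding` preset (hypothetical helper)."""
    # functools.partial still lets a caller override the keyword per call.
    return functools.partial(run, encoding=encoding)
```

In a tasks file this could be used as run = with_default_encoding(invoke.run, 'cp850'), and a single call can still pass encoding='utf-8' for a command that is known to differ.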
Re (1), it's only for the output to stdout/stderr, the stored data is untouched (source link, note that dst is hardcoded to sys.stdout or sys.stderr on line 251). If we don't manually encode here, we get an exception, the display() thread dies, and the command being run hangs once it has filled its output buffer. This is a pretty suboptimal situation.
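For illustration, the kind of manual encoding meant here could look like the following. It's a sketch of the idea, not the code in invoke; write_lossy is a made-up name. Characters the destination can't represent are replaced with "?" instead of raising, so the display thread survives.

```python
def write_lossy(dst, text):
    """Write text to dst, replacing characters dst's encoding can't represent.

    Hypothetical helper; dst is assumed to be a text stream such as sys.stdout.
    """
    # Streams without an encoding attribute (or with None) fall back to ASCII.
    encoding = getattr(dst, "encoding", None) or "ascii"
    safe = text.encode(encoding, errors="replace").decode(encoding)
    dst.write(safe)
```

This trades fidelity on the console for not hanging the command; the stored output isn't affected because only the echoed copy goes through the helper.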
I've only skimmed this so far, but wanted to note that I've been doing a massive reorg of the run() internals which should be merging soon; as part of that, I seem to remember adding a config-level option for encoding which would solve the bulk of the problem, right? (You'd be able to set it at the config file, env var, or context level, and then have it apply to all run calls.)
Re: encoding/decoding applying to the stored data as well as the printed output, it feels cleaner to preserve the raw bytes as seen from the subprocess so that they can be introspected and decoded or encoded as necessary, e.g. for debugging purposes. Whether we should extend this to store both copies (raw and decoded) for convenience, I'm not sure.
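A rough sketch of what keeping the raw bytes and decoding on demand could look like. The class and method names are invented for this example and don't reflect the actual result object.

```python
class CapturedOutput:
    """Hypothetical container: keeps the raw bytes and decodes on request."""

    def __init__(self, raw, encoding):
        self.raw = raw            # bytes exactly as read from the subprocess
        self.encoding = encoding  # the encoding assumed at capture time

    def text(self, encoding=None, errors="replace"):
        # Re-decode with a different encoding if the assumed one was wrong.
        return self.raw.decode(encoding or self.encoding, errors)
```

With this shape, a wrong guess at capture time isn't fatal: calling text('cp850') on the same object gives a second opinion without rerunning the command.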
The only command I encountered this issue with was run('tree -I "__pycache__|.env|_build" -h -C --dirsfirst'). Adding parameter encoding='utf-8' fixed it for me.
I'm using v0.10.1, which is currently the latest version on PyPI.
In my environment, the work on #350 has fixed the issues with tree output, and at a glance I think it will have solved the originally-reported issue here too. Going to optimistically close, please reopen once #350 is merged & released (soon).