Prevent UnicodeErrors when forwarding child stdout to the console by ChrisAichinger · Pull Request #242 · pyinvoke/invoke

ChrisAichinger · 2015-05-03T11:09:49Z

In invoke.runners.Local, the stdout/stderr of child processes is read
and forwarded to the console (i.e. invoke's stdout/stderr). The incoming
data is read with locale.getdefaultencoding(), while output happens with
sys.stdout.encoding.

If these two encodings differ, it may not be possible to write the
received data to stdout, e.g. if an UTF-8 Yen character is received, but
sys.stdout.encoding is 'latin1' or 'ascii'. If that happens, the
resulting exception causes the thread to end, and the child process
hangs due to the child filling its output buffer.

This is mostly a concern on Windows, where the preferred encoding and
the console I/O encoding often differ. In my testing, this fixes #241. The
console output has replacement characters, but invoke doesn't crash.

In invoke.runners.Local, the stdout/stderr of child processes is read and forwarded to the console (i.e. invoke's stdout/stderr). The incoming data is read with locale.getdefaultencoding(), while output happens with sys.stdout.encoding. If these two encodings differ, it may not be possible to write the received data to stdout, e.g. if an UTF-8 Yen character is received, but sys.stdout.encoding is 'latin1' or 'ascii'. If that happens, the resulting exception causes the thread to end, and the child process hangs due to the child filling its output buffer. This is mostly a concern on Windows, where the preferred encoding and the console I/O encoding often differ. Cf. #241

pfmoore · 2015-05-03T15:41:02Z

This doesn't work because the test stdio streams don't support writing to the underlying bytes buffer. And it's wrong anyway, as by writing raw bytes to (the buffer under) a text stream you can end up with a stream with mixed encodings.

What you're trying to do is replace unencodable bytes. So you should do exactly that:

    clean_data = data.encode(dst.encoding, errors='replace').decode(dst.encoding)
    dst.write(clean_data)

That should work for any text stream with an encoding.

Avoid accessing sys.stdout.buffer to write raw bytes. Manually decode the bytes and write them as string instead. This approach avoids problems during testing, when sys.stdout can be mocked.

ChrisAichinger · 2015-05-03T19:43:30Z

Thanks for the suggestions, @pfmoore. I updated the PR.

The tests are failing because spec.trap.CarbonCopy has no encoding property. @bitprophet: Should I work around that here or should I send a PR against spec that adds an encoding? The latter could allow us to actually test the issue I'm running into (think tests.runners.Run.funky_characters_in_stdout, only without the hide argument).

bitprophet · 2015-05-05T01:31:36Z

Hey, as per my note in the other ticket, you should wait a couple days (hopefully no more than that...) for me to finish and merge the big run() reorganization, it will at least partly obviate your PR and you'll want to update it again for that.

I am +1 on allowing Spec to learn about encoding, it's been privileged to remain ignorant of such things prior to now, but just because I personally haven't needed it.

pfmoore · 2015-05-05T11:27:40Z

The following would probably be a good utility function for "write some text without the possibility of errors (in human readable form)"

def safe_write(data, file=sys.stdout, errors='backslashreplace'):
    if hasattr(file, 'encoding') and getattr(file, 'errors') == 'strict':
        # If the file object is encoding text in strict mode, make sure
        # no unencodable characters are in the data being written, by
        # encoding with a more permissive error handling method, and then
        # decoding back to text
        # Python 2.6 compatibility - don't use a keyword argument for the
        # 'errors' parameter to encode.
        data = data.encode(file.encoding, errors).decode(file.encoding)

    # Data should now only contain characters that can be safely written
    file.write(data)

There should probably be some specific unit tests for this function, as the way the test suite captures output means that the existing tests won't cover how a "real" stdout (strict mode encoding) would behave.

Assuming the incoming changes don't remove the need for this, we could use this when writing subprocess output to sys.stdout - I would not recommend using it to write to any other file descriptor, as such a fd might need the written data to be unchanged (and hence an error for unencodable characters would be the right thing to do).

The assumption here is that subprocess output is only written to sys.stdout for a human to read (nobody pipes the output of invoke into something else as part of a shell pipeline, for example).

bitprophet · 2015-05-05T19:34:25Z

@pfmoore Unfortunately the genesis of the changes to merge, was to add output stream selection support! I.e. let folks drop in some arbitrary file-like object to be written to, defaulting to sys.stdout/stderr. (Also, I can guarantee some users will be wanting to pipe invoke output elsewhere, either via that feature or via their shell. Not the default use case, but a not-uncommon one.)

So we may need to make the encoding process more selectable on top of that, perhaps? (I.e. have one or more methods sort of like this, which can be swapped out depending on stream or user needs/desires).

Let's revisit once I've merged that stuff and we'll see how it shakes out.

pfmoore · 2015-05-05T19:39:12Z

@bitprophet Ah, the irony :-) Agreed, let's leave it for now. When we get to it, there's probably a good argument for treating sys.stdout as a special case (maybe with an isatty test added as well). I'm not actually convinced that Python's choice to use the strict error handler for console output is such a good idea in the first place...

bitprophet · 2015-05-06T20:20:39Z

That big ol' ticket has been merged to Invoke master now, btw. I'm context switching to Fabric 2 dev to make use of the new class structure & functionality, hopefully I won't find more changes that need doing - but if I do they'll be much smaller in scale :)

bitprophet · 2015-07-14T23:42:56Z

#260 has a reproducible case for this too. On my system, at least, the issue is sort of the inverse of the OP's scenario: we're taking what I assume is a Latin-1-derived string, and trying to print it to a stdout that thinks of itself as UTF-8.

What seems to happen here is that the (Python 2 - Python 3 ends up replacing the characters with junk instead of exploding) codecs.iterdecode() call is turning a str into a unicode with identical contents/bytes (which is not valid Unicode - e.g. under Python 2, unicode("└──") has the same explosion).

That's why the print <funky string> works, but the mirroring of that echoed output via the 2nd 'layer' of Invoke, breaks.

Also, this has come up before, but I really hate how the Python core implementation of unicode() or whatnot, has that implicit fallback to ascii somewhere inside it. It makes for frustrating bug reports as folks assume the ascii is us and not Python. 😭

fgimian · 2015-07-15T10:24:16Z

Hey guys, I'll test out this pull request on my system in a moment since I experienced this bug. Yeah, unicode behaves strangely in Python 2.x, I guess that's what Python 3 was attempting to overcome.

A fallback to latin-1 would have been more logical, but whatcha gonna do 😄

I'll report back soon with my findings
Fotis

fgimian · 2015-07-15T10:29:13Z

I did a dodgy manual merge, so it's possible I made a booboo, but my system doesn't like the errors kwarg in the decode function:

Traceback (most recent call last):

  File "/root/lib/invoke/invoke/runners.py", line 82, in run
    super(_IOThread, self).run()

  File "/usr/lib64/python2.6/threading.py", line 484, in run
    self.__target(*self.__args, **self.__kwargs)

  File "/root/lib/invoke/invoke/runners.py", line 330, in io
    encoded_data = data.encode(output.encoding, errors='replace')

TypeError: encode() takes no keyword arguments

bitprophet · 2015-07-15T19:21:14Z

@fgimian I'd check to see if that is a Python 2 vs Python 3 ism, smells like one offhand.

fgimian · 2015-07-22T23:26:28Z

Apologies for not looking at this again sooner mate, I'm going to try spend some time on it this week as I'd like to incorporate invoke into all my current project's automation and this is causing me some drama 😄

fgimian · 2015-07-23T11:42:37Z

OK, so this problem is very easy to repro. The problem is that sys.stdout and sys.stderr which are set as the default output streams when the user doesn't specify their own, don't seem to be happy to output unicode with Python 2.6, but work fine with Python 2.7.

e.g.

OS X 10.10

Python 2.6

Python 2.6.9 (unknown, Sep  9 2014, 15:05:12) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.write(u'└──')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Python 2.7

Python 2.7.10 (default, Jul  9 2015, 13:34:07) 
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.write(u'└──\n')
└──

RHEL 6.6

Python 2.6.6 (r266:84292, Nov 21 2013, 10:50:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.write(u'└──\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Ubuntu 14.04

Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.write(u'└──\n')
└──

More to follow 😄
Fotis

fgimian · 2015-07-23T12:08:20Z

Ahh, this is a bug in Python 2.6 specifically. I have an idea for a fix but it's a little ugly. Let me think this through further for a moment 😄

fgimian · 2015-07-23T13:28:19Z

Please find my pull request at #262 😄

bitprophet · 2015-07-30T17:27:11Z

@fgimian Nice catch re: the Python 2.6 bug, that's frustrating. Pity 2.6 is still so widespread or I'd start considering dropping it for final release =/

I'm going to reply to #262 here just so we're keeping the conversation in one thread, also because I'd like to compare that approach to the one here in #242.

Specifically:

The two are similar in that both rely on explicitly encoding the data to avoid encoding-related explosions due to implicit (or just wrong) encoding settings on the output stream.
Prevent UnicodeErrors when forwarding child stdout to the console #242's original solution looks similar to Worked around a Python 2.6 bug whereby file streams don't respect encoding #262's, i.e. just encode, then write, no decode step. Presumably @pfmoore's objection to that original commit, applies to Worked around a Python 2.6 bug whereby file streams don't respect encoding #262?
If one presumes the second, decode, step is a Good Thing To Do Here, it means the two PRs end up basically identical, except Worked around a Python 2.6 bug whereby file streams don't respect encoding #262 limits the fix to Python 2.6 only.
- It's unclear to me whether More Windows encoding woes #241/Prevent UnicodeErrors when forwarding child stdout to the console #242 (cc @Grk0) were exposed under 2.6 or another interpreter version.
- If that core Python 2.6 bug is the main cause of all these symptoms (?) then it seems like we'd want to merge these two PRs such that we a) encode AND decode, and b) only do so under Python 2.6, hoping 2.7+ behaves more correctly for these situations.
Side note, I seem to have broken things for appveyor/Windows in recent commit shenanigans, making it hard to tell how these changes affect that platform. I need to un-break that if possible.
Other side note, I also still need to fix up spec so its mock stream does/can behave like a real one, re: the encoding/errors attributes.

pfmoore · 2015-07-30T17:41:45Z

@bitprophet I object in principle to encoding to bytes and then writing, as conceptually the stream is a text stream, and doing so is incorrect. But that's a Python 3 POV, which I apply even to Python 2 because it ensures that the logical separation between bytes and strings is retained.

In this instance, practicality may beat purtity - as long as the abuse is isolated to Python 2.X. But I personally have little interest in 2.X (except to advocate that code avoids bad assumptions that may transalate badly to 3.x) and specifically I don't care at all about 2.6.

The .encode(errors='replace').decode() dance used in the current patch here is the only way I know of spelling "replace unencodable characters with the standard replacement character" and as such is perfectly OK (but deserves a comment, which the patch includes). Of course, someone may say "why did you just replace the bad char with ?, why not use a hex escape?" but that's a bigger question...

I've only superficially looked at the small question you asked, though. I haven't looked at why the tests are failing or anything like that, I'm afraid - sorry.

bitprophet · 2015-07-30T18:21:12Z

Thanks @pfmoore - and I am generally +1 on the "round trip to scrub bad characters" idea, I'd rather have garbled output than explosions. Users will need educating about these hilarious edge cases regardless, so.

(also, if/when I merge some form of these fixes, I would definitely put some comments in :))

pfmoore · 2015-07-30T18:55:51Z

The bit about a comment wasn't a dig :-) What's in the patch looks about right. I just think it's a bit annoying that there's no Python builtin function for doing the errors= handling a codec does when it encodes, and so I'd typically add a comment to the effect that there's no better way do do this.

Actually, maybe a small helper would be useful, for documentation purposes:

def sanitise(s, encoding, errors='replace'):
    """Make sure string s doesn't contain any characters that can't be handled by encoding.
    Use the specified codec error handler to determine how to handle unencodable characters."""

    return s.encode(encoding, errors).decode(encoding)

fgimian · 2015-07-30T22:08:07Z

Hey there guys, firstly thanks for all your help 😄

I'd like to highlight a few points if I may:

This pull request does not appear to address the exact same issue as my pull request, in fact, it still results in an exception with the test script I provided, but something a little worse in Python 2.6 (it seems that the encode function doesn't allow for the errors kwarg in Python 2.6)

> invoke breakit
Traceback (most recent call last):
  File "/Users/fots/invoke-unicode/bin/invoke", line 9, in <module>
    load_entry_point('invoke==0.10.1', 'console_scripts', 'invoke')()
  File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/cli.py", line 438, in main
    dispatch(sys.argv)
  File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/cli.py", line 431, in dispatch
    return executor.execute(*tasks)
  File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/executor.py", line 100, in execute
    task=task, name=name, args=args, kwargs=kwargs, config=config
  File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/executor.py", line 153, in _execute
    result = task(*args, **kwargs)
  File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/tasks.py", line 111, in __call__
    result = self.body(*args, **kwargs)
  File "/Users/fots/testing-invoke/tasks.py", line 10, in breakit
    run('invoke output_strange_stuff', encoding='utf-8')
  File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/__init__.py", line 23, in run
    return Context().run(command, **kwargs)
  File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/context.py", line 53, in run
    return runner_class(context=self).run(command, **kwargs)
  File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/runners.py", line 262, in run
    raise ThreadException(exceptions)
invoke.exceptions.ThreadException: 
Saw 1 exceptions within threads (TypeError):


Thread args: {'args': [<functools.partial object at 0x1018011b0>,
          <open file '<stdout>', mode 'w' at 0x101132150>,
          [],
          False],
 'target': <bound method Local.io of <invoke.runners.Local object at 0x1017ff190>>}

Traceback (most recent call last):

  File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/runners.py", line 82, in run
    super(_IOThread, self).run()

  File "/usr/local/var/pyenv/versions/2.6.6/lib/python2.6/threading.py", line 484, in run
    self.__target(*self.__args, **self.__kwargs)

  File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/runners.py", line 330, in io
    encoded_data = data.encode(output.encoding, errors='replace')

TypeError: encode() takes no keyword arguments

On the other hand, the fix I provided does work as expected:

> invoke breakit
└──

The whole problem here is that the Python 2.6 file object doesn't respect the encoding used in a unicode object that's passed in. Ultimately, my pull request is just performing the encoding for it so that it receives bytes which it knows how to handle. I personally don't see this as a huge deal at all, after all, everything must be encoded back to bytes to be displayed on a terminal. The only difference is that we're doing it instead of the C code under the hood (please see the patch that fixed the problem in the Python source code).
I spent several hours attempting to add a unit test for the test script I provided in my pull request but honestly had no luck getting it all working. I believe that it's a good idea that we have that unit test in place and confirmed failing in Python 2.6 before a decision is made.
My pull request is very heavily commented and is very targetted to only hit Python 2.6 and only affect file streams because this issue simply does not affect Python 2.7+ which correctly respects the encoding of the unicode string provided to a file stream.
Please don't drop Python 2.6.x support guys. It's still the standard Python installed on RHEL 6 systems which are widely used (including in our organisation). If this library were to drop Python 2.6, I wouldn't be able to use it for my current project for example. With this exception, Python 2.6 is generally easy to support along with Python 2.7 as long as you avoid the 2 - 3 nice additions in 2.7 (dict comprehensions, {} substitution in format function and {...} notation for sets).

Keep in mind that most people looking to use your library likely started with a Makefile which works everywhere 😄 Ideally this wonderful library would also work in as many systems as possible to make it easily to switch from GNU make.

All the best and thanks a lot!
Fotis

bitprophet · 2015-08-20T18:46:30Z

Don't have time to reply to all of that right now (but thanks!!) but I can note that the 2.6 thing was, for me, a "I wish I could drop support" - meaning, I can't, for exactly the reasons you state ;) I am a sysadmin by trade so I place more emphasis on "old version support" when balancing tradeoffs, than many devs do.

fgimian · 2015-08-23T02:13:17Z

Thanks @bitprophet, really appreciate your time and help with this 😄

frol · 2015-09-11T19:43:30Z

I bumped into this issue today, but I haven't noticed this PR, so I created my PR #274, which seems to be cleaner, passes tests for Python 2.x (it seems that 3.x tests are broken not because of my PR), and it works for me like expected.

bitprophet · 2015-12-22T01:57:41Z

Note to self, I believe this problem & its potential solution(s) still apply despite the IO stuff getting a moderate rework recently (for #289). Bumping to next milestone for now but it's still something I want to work out pre 1.0.

frol · 2015-12-22T02:05:04Z

@bitprophet Just a note, I have been heavily using my patched (PR #274, PR #275) PyInvoke on Linux, Windows, and OS X since the time I have fixed it. It works perfectly well.

frol · 2015-12-25T11:09:04Z

@bitprophet I want to point out that this fix doesn't solve the Unicode errors at all. This PR should be closed in favor of #274.

svetlyak40wt · 2016-01-28T09:17:34Z

Hey guys, I found this problem too. Currently I have either to patch this line to make .encode('utf-8') before outputting to stdout, or use this workaround in tasks.py:

from invoke import run as orig_run, task
from functools import partial


run = partial(
    orig_run,
    encoding='utf-8',
    out_stream=codecs.getwriter('utf8')(sys.stdout))

This way, all output from child process is decoded from utf-8 (thanks to encoding argument) and encoded back to the 'utf-8' when printing to stdout (thanks to the codecs writer).

frol · 2016-01-28T09:26:04Z

@svetlyak40wt Please, check out PR #274. It has a bullet-proof fix for this issue. When I started, I also thought, that the fix is simple, but unicode turned out to be a tricky thing especially when Python 2/3 support is required.

bitprophet · 2016-06-03T22:15:11Z

This should be obviated by #350 at this point.

Don't rely on sys.stdout.buffer being available

741f3c9

Avoid accessing sys.stdout.buffer to write raw bytes. Manually decode the bytes and write them as string instead. This approach avoids problems during testing, when sys.stdout can be mocked.

bitprophet mentioned this pull request Jul 14, 2015

Strange drama with the run command when running certain commands #260

Closed

bitprophet added Core CLI behavior Bug labels Jul 30, 2015

bitprophet added this to the 0.11 milestone Jul 30, 2015

bitprophet mentioned this pull request Sep 18, 2015

Fixed a number of encoding issues (in run context) #274

Closed

bitprophet modified the milestones: 0.13, 0.12 Dec 22, 2015

bitprophet mentioned this pull request Apr 10, 2016

Hanging when ssh'ing into a VM w/o pty=True #338

Closed

bitprophet mentioned this pull request May 26, 2016

Catchall 'encoding needs to get un-crap-ified' ticket #350

Closed

11 tasks

bitprophet closed this Jun 3, 2016

Conversation

ChrisAichinger commented May 3, 2015

Uh oh!

pfmoore commented May 3, 2015

Uh oh!

ChrisAichinger commented May 3, 2015

Uh oh!

bitprophet commented May 5, 2015

Uh oh!

pfmoore commented May 5, 2015

Uh oh!

bitprophet commented May 5, 2015

Uh oh!

pfmoore commented May 5, 2015

Uh oh!

bitprophet commented May 6, 2015

Uh oh!

bitprophet commented Jul 14, 2015

Uh oh!

fgimian commented Jul 15, 2015

Uh oh!

fgimian commented Jul 15, 2015

Uh oh!

bitprophet commented Jul 15, 2015

Uh oh!

fgimian commented Jul 22, 2015

Uh oh!

fgimian commented Jul 23, 2015

OS X 10.10

RHEL 6.6

Ubuntu 14.04

Uh oh!

fgimian commented Jul 23, 2015

Uh oh!

fgimian commented Jul 23, 2015

Uh oh!

bitprophet commented Jul 30, 2015

Uh oh!

pfmoore commented Jul 30, 2015

Uh oh!

bitprophet commented Jul 30, 2015

Uh oh!

pfmoore commented Jul 30, 2015

Uh oh!

fgimian commented Jul 30, 2015

Uh oh!

bitprophet commented Aug 20, 2015

Uh oh!

fgimian commented Aug 23, 2015

Uh oh!

frol commented Sep 11, 2015

Uh oh!

bitprophet commented Dec 22, 2015

Uh oh!

frol commented Dec 22, 2015

Uh oh!

frol commented Dec 25, 2015

Uh oh!

svetlyak40wt commented Jan 28, 2016

Uh oh!

frol commented Jan 28, 2016

Uh oh!

bitprophet commented Jun 3, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants