Prevent UnicodeErrors when forwarding child stdout to the console #242

Closed
wants to merge 2 commits into
from

Projects

None yet

6 participants

@Grk0
Grk0 commented May 3, 2015

In invoke.runners.Local, the stdout/stderr of child processes is read
and forwarded to the console (i.e. invoke's stdout/stderr). The incoming
data is read with locale.getdefaultencoding(), while output happens with
sys.stdout.encoding.

If these two encodings differ, it may not be possible to write the
received data to stdout, e.g. if an UTF-8 Yen character is received, but
sys.stdout.encoding is 'latin1' or 'ascii'. If that happens, the
resulting exception causes the thread to end, and the child process
hangs due to the child filling its output buffer.

This is mostly a concern on Windows, where the preferred encoding and
the console I/O encoding often differ. In my testing, this fixes #241. The
console output has replacement characters, but invoke doesn't crash.

@Grk0 Grk0 Prevent UnicodeErrors when forwarding child stdout
In invoke.runners.Local, the stdout/stderr of child processes is read
and forwarded to the console (i.e. invoke's stdout/stderr). The incoming
data is read with locale.getdefaultencoding(), while output happens with
sys.stdout.encoding.

If these two encodings differ, it may not be possible to write the
received data to stdout, e.g. if an UTF-8 Yen character is received, but
sys.stdout.encoding is 'latin1' or 'ascii'. If that happens, the
resulting exception causes the thread to end, and the child process
hangs due to the child filling its output buffer.

This is mostly a concern on Windows, where the preferred encoding and
the console I/O encoding often differ.
Cf. #241
3c6e215
@pfmoore
Contributor
pfmoore commented May 3, 2015

This doesn't work because the test stdio streams don't support writing to the underlying bytes buffer. And it's wrong anyway, as by writing raw bytes to (the buffer under) a text stream you can end up with a stream with mixed encodings.

What you're trying to do is replace unencodable bytes. So you should do exactly that:

    clean_data = data.encode(dst.encoding, errors='replace').decode(dst.encoding)
    dst.write(clean_data)

That should work for any text stream with an encoding.

@Grk0 Grk0 Don't rely on sys.stdout.buffer being available
Avoid accessing sys.stdout.buffer to write raw bytes. Manually decode
the bytes and write them as string instead.

This approach avoids problems during testing, when sys.stdout can be
mocked.
741f3c9
@Grk0
Grk0 commented May 3, 2015

Thanks for the suggestions, @pfmoore. I updated the PR.

The tests are failing because spec.trap.CarbonCopy has no encoding property. @bitprophet: Should I work around that here or should I send a PR against spec that adds an encoding? The latter could allow us to actually test the issue I'm running into (think tests.runners.Run.funky_characters_in_stdout, only without the hide argument).

@bitprophet
Member

Hey, as per my note in the other ticket, you should wait a couple days (hopefully no more than that...) for me to finish and merge the big run() reorganization, it will at least partly obviate your PR and you'll want to update it again for that.

I am +1 on allowing Spec to learn about encoding, it's been privileged to remain ignorant of such things prior to now, but just because I personally haven't needed it.

@pfmoore
Contributor
pfmoore commented May 5, 2015

The following would probably be a good utility function for "write some text without the possibility of errors (in human readable form)"

def safe_write(data, file=sys.stdout, errors='backslashreplace'):
    if hasattr(file, 'encoding') and getattr(file, 'errors') == 'strict':
        # If the file object is encoding text in strict mode, make sure
        # no unencodable characters are in the data being written, by
        # encoding with a more permissive error handling method, and then
        # decoding back to text
        # Python 2.6 compatibility - don't use a keyword argument for the
        # 'errors' parameter to encode.
        data = data.encode(file.encoding, errors).decode(file.encoding)

    # Data should now only contain characters that can be safely written
    file.write(data)

There should probably be some specific unit tests for this function, as the way the test suite captures output means that the existing tests won't cover how a "real" stdout (strict mode encoding) would behave.

Assuming the incoming changes don't remove the need for this, we could use this when writing subprocess output to sys.stdout - I would not recommend using it to write to any other file descriptor, as such a fd might need the written data to be unchanged (and hence an error for unencodable characters would be the right thing to do).

The assumption here is that subprocess output is only written to sys.stdout for a human to read (nobody pipes the output of invoke into something else as part of a shell pipeline, for example).

@bitprophet
Member

@pfmoore Unfortunately the genesis of the changes to merge, was to add output stream selection support! I.e. let folks drop in some arbitrary file-like object to be written to, defaulting to sys.stdout/stderr. (Also, I can guarantee some users will be wanting to pipe invoke output elsewhere, either via that feature or via their shell. Not the default use case, but a not-uncommon one.)

So we may need to make the encoding process more selectable on top of that, perhaps? (I.e. have one or more methods sort of like this, which can be swapped out depending on stream or user needs/desires).

Let's revisit once I've merged that stuff and we'll see how it shakes out.

@pfmoore
Contributor
pfmoore commented May 5, 2015

@bitprophet Ah, the irony :-) Agreed, let's leave it for now. When we get to it, there's probably a good argument for treating sys.stdout as a special case (maybe with an isatty test added as well). I'm not actually convinced that Python's choice to use the strict error handler for console output is such a good idea in the first place...

@bitprophet
Member

That big ol' ticket has been merged to Invoke master now, btw. I'm context switching to Fabric 2 dev to make use of the new class structure & functionality, hopefully I won't find more changes that need doing - but if I do they'll be much smaller in scale :)

@bitprophet
Member

#260 has a reproducible case for this too. On my system, at least, the issue is sort of the inverse of the OP's scenario: we're taking what I assume is a Latin-1-derived string, and trying to print it to a stdout that thinks of itself as UTF-8.

What seems to happen here is that the (Python 2 - Python 3 ends up replacing the characters with junk instead of exploding) codecs.iterdecode() call is turning a str into a unicode with identical contents/bytes (which is not valid Unicode - e.g. under Python 2, unicode("└──") has the same explosion).

That's why the print <funky string> works, but the mirroring of that echoed output via the 2nd 'layer' of Invoke, breaks.


Also, this has come up before, but I really hate how the Python core implementation of unicode() or whatnot, has that implicit fallback to ascii somewhere inside it. It makes for frustrating bug reports as folks assume the ascii is us and not Python. 😭

@fgimian
fgimian commented Jul 15, 2015

Hey guys, I'll test out this pull request on my system in a moment since I experienced this bug. Yeah, unicode behaves strangely in Python 2.x, I guess that's what Python 3 was attempting to overcome.

A fallback to latin-1 would have been more logical, but whatcha gonna do 😄

I'll report back soon with my findings
Fotis

@fgimian
fgimian commented Jul 15, 2015

I did a dodgy manual merge, so it's possible I made a booboo, but my system doesn't like the errors kwarg in the decode function:

Traceback (most recent call last):

  File "/root/lib/invoke/invoke/runners.py", line 82, in run
    super(_IOThread, self).run()

  File "/usr/lib64/python2.6/threading.py", line 484, in run
    self.__target(*self.__args, **self.__kwargs)

  File "/root/lib/invoke/invoke/runners.py", line 330, in io
    encoded_data = data.encode(output.encoding, errors='replace')

TypeError: encode() takes no keyword arguments
@bitprophet
Member

@fgimian I'd check to see if that is a Python 2 vs Python 3 ism, smells like one offhand.

@fgimian
fgimian commented Jul 22, 2015

Apologies for not looking at this again sooner mate, I'm going to try spend some time on it this week as I'd like to incorporate invoke into all my current project's automation and this is causing me some drama 😄

@fgimian
fgimian commented Jul 23, 2015

OK, so this problem is very easy to repro. The problem is that sys.stdout and sys.stderr which are set as the default output streams when the user doesn't specify their own, don't seem to be happy to output unicode with Python 2.6, but work fine with Python 2.7.

e.g.

OS X 10.10

Python 2.6

Python 2.6.9 (unknown, Sep  9 2014, 15:05:12) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.write(u'└──')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Python 2.7

Python 2.7.10 (default, Jul  9 2015, 13:34:07) 
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.write(u'└──\n')
└──

RHEL 6.6

Python 2.6.6 (r266:84292, Nov 21 2013, 10:50:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.write(u'└──\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Ubuntu 14.04

Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.write(u'└──\n')
└──

More to follow 😄
Fotis

@fgimian
fgimian commented Jul 23, 2015

Ahh, this is a bug in Python 2.6 specifically. I have an idea for a fix but it's a little ugly. Let me think this through further for a moment 😄

@fgimian
fgimian commented Jul 23, 2015

Please find my pull request at #262 😄

@bitprophet
Member

@fgimian Nice catch re: the Python 2.6 bug, that's frustrating. Pity 2.6 is still so widespread or I'd start considering dropping it for final release =/

I'm going to reply to #262 here just so we're keeping the conversation in one thread, also because I'd like to compare that approach to the one here in #242.

Specifically:

  • The two are similar in that both rely on explicitly encoding the data to avoid encoding-related explosions due to implicit (or just wrong) encoding settings on the output stream.
  • #242's original solution looks similar to #262's, i.e. just encode, then write, no decode step. Presumably @pfmoore's objection to that original commit, applies to #262?
  • If one presumes the second, decode, step is a Good Thing To Do Here, it means the two PRs end up basically identical, except #262 limits the fix to Python 2.6 only.
    • It's unclear to me whether #241/#242 (cc @Grk0) were exposed under 2.6 or another interpreter version.
    • If that core Python 2.6 bug is the main cause of all these symptoms (?) then it seems like we'd want to merge these two PRs such that we a) encode AND decode, and b) only do so under Python 2.6, hoping 2.7+ behaves more correctly for these situations.
  • Side note, I seem to have broken things for appveyor/Windows in recent commit shenanigans, making it hard to tell how these changes affect that platform. I need to un-break that if possible.
  • Other side note, I also still need to fix up spec so its mock stream does/can behave like a real one, re: the encoding/errors attributes.
@bitprophet bitprophet added this to the 0.11 milestone Jul 30, 2015
@pfmoore
Contributor
pfmoore commented Jul 30, 2015

@bitprophet I object in principle to encoding to bytes and then writing, as conceptually the stream is a text stream, and doing so is incorrect. But that's a Python 3 POV, which I apply even to Python 2 because it ensures that the logical separation between bytes and strings is retained.

In this instance, practicality may beat purtity - as long as the abuse is isolated to Python 2.X. But I personally have little interest in 2.X (except to advocate that code avoids bad assumptions that may transalate badly to 3.x) and specifically I don't care at all about 2.6.

The .encode(errors='replace').decode() dance used in the current patch here is the only way I know of spelling "replace unencodable characters with the standard replacement character" and as such is perfectly OK (but deserves a comment, which the patch includes). Of course, someone may say "why did you just replace the bad char with ?, why not use a hex escape?" but that's a bigger question...

I've only superficially looked at the small question you asked, though. I haven't looked at why the tests are failing or anything like that, I'm afraid - sorry.

@bitprophet
Member

Thanks @pfmoore - and I am generally +1 on the "round trip to scrub bad characters" idea, I'd rather have garbled output than explosions. Users will need educating about these hilarious edge cases regardless, so.

(also, if/when I merge some form of these fixes, I would definitely put some comments in :))

@pfmoore
Contributor
pfmoore commented Jul 30, 2015

The bit about a comment wasn't a dig :-) What's in the patch looks about right. I just think it's a bit annoying that there's no Python builtin function for doing the errors= handling a codec does when it encodes, and so I'd typically add a comment to the effect that there's no better way do do this.

Actually, maybe a small helper would be useful, for documentation purposes:

def sanitise(s, encoding, errors='replace'):
    """Make sure string s doesn't contain any characters that can't be handled by encoding.
    Use the specified codec error handler to determine how to handle unencodable characters."""

    return s.encode(encoding, errors).decode(encoding)
@fgimian
fgimian commented Jul 30, 2015

Hey there guys, firstly thanks for all your help 😄

I'd like to highlight a few points if I may:

  • This pull request does not appear to address the exact same issue as my pull request, in fact, it still results in an exception with the test script I provided, but something a little worse in Python 2.6 (it seems that the encode function doesn't allow for the errors kwarg in Python 2.6)

    > invoke breakit
    Traceback (most recent call last):
      File "/Users/fots/invoke-unicode/bin/invoke", line 9, in <module>
        load_entry_point('invoke==0.10.1', 'console_scripts', 'invoke')()
      File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/cli.py", line 438, in main
        dispatch(sys.argv)
      File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/cli.py", line 431, in dispatch
        return executor.execute(*tasks)
      File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/executor.py", line 100, in execute
        task=task, name=name, args=args, kwargs=kwargs, config=config
      File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/executor.py", line 153, in _execute
        result = task(*args, **kwargs)
      File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/tasks.py", line 111, in __call__
        result = self.body(*args, **kwargs)
      File "/Users/fots/testing-invoke/tasks.py", line 10, in breakit
        run('invoke output_strange_stuff', encoding='utf-8')
      File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/__init__.py", line 23, in run
        return Context().run(command, **kwargs)
      File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/context.py", line 53, in run
        return runner_class(context=self).run(command, **kwargs)
      File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/runners.py", line 262, in run
        raise ThreadException(exceptions)
    invoke.exceptions.ThreadException: 
    Saw 1 exceptions within threads (TypeError):
    
    
    Thread args: {'args': [<functools.partial object at 0x1018011b0>,
              <open file '<stdout>', mode 'w' at 0x101132150>,
              [],
              False],
     'target': <bound method Local.io of <invoke.runners.Local object at 0x1017ff190>>}
    
    Traceback (most recent call last):
    
      File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/runners.py", line 82, in run
        super(_IOThread, self).run()
    
      File "/usr/local/var/pyenv/versions/2.6.6/lib/python2.6/threading.py", line 484, in run
        self.__target(*self.__args, **self.__kwargs)
    
      File "/Users/fots/invoke-unicode/lib/python2.6/site-packages/invoke/runners.py", line 330, in io
        encoded_data = data.encode(output.encoding, errors='replace')
    
    TypeError: encode() takes no keyword arguments

    On the other hand, the fix I provided does work as expected:

    > invoke breakit
    └──
  • The whole problem here is that the Python 2.6 file object doesn't respect the encoding used in a unicode object that's passed in. Ultimately, my pull request is just performing the encoding for it so that it receives bytes which it knows how to handle. I personally don't see this as a huge deal at all, after all, everything must be encoded back to bytes to be displayed on a terminal. The only difference is that we're doing it instead of the C code under the hood (please see the patch that fixed the problem in the Python source code).

  • I spent several hours attempting to add a unit test for the test script I provided in my pull request but honestly had no luck getting it all working. I believe that it's a good idea that we have that unit test in place and confirmed failing in Python 2.6 before a decision is made.

  • My pull request is very heavily commented and is very targetted to only hit Python 2.6 and only affect file streams because this issue simply does not affect Python 2.7+ which correctly respects the encoding of the unicode string provided to a file stream.

  • Please don't drop Python 2.6.x support guys. It's still the standard Python installed on RHEL 6 systems which are widely used (including in our organisation). If this library were to drop Python 2.6, I wouldn't be able to use it for my current project for example. With this exception, Python 2.6 is generally easy to support along with Python 2.7 as long as you avoid the 2 - 3 nice additions in 2.7 (dict comprehensions, {} substitution in format function and {...} notation for sets).

Keep in mind that most people looking to use your library likely started with a Makefile which works everywhere 😄 Ideally this wonderful library would also work in as many systems as possible to make it easily to switch from GNU make.

All the best and thanks a lot!
Fotis

@bitprophet
Member

Don't have time to reply to all of that right now (but thanks!!) but I can note that the 2.6 thing was, for me, a "I wish I could drop support" - meaning, I can't, for exactly the reasons you state ;) I am a sysadmin by trade so I place more emphasis on "old version support" when balancing tradeoffs, than many devs do.

@fgimian
fgimian commented Aug 23, 2015

Thanks @bitprophet, really appreciate your time and help with this 😄

@frol
frol commented Sep 11, 2015

I bumped into this issue today, but I haven't noticed this PR, so I created my PR #274, which seems to be cleaner, passes tests for Python 2.x (it seems that 3.x tests are broken not because of my PR), and it works for me like expected.

@bitprophet
Member

Note to self, I believe this problem & its potential solution(s) still apply despite the IO stuff getting a moderate rework recently (for #289). Bumping to next milestone for now but it's still something I want to work out pre 1.0.

@bitprophet bitprophet modified the milestone: 0.13, 0.12 Dec 22, 2015
@frol
frol commented Dec 22, 2015

@bitprophet Just a note, I have been heavily using my patched (PR #274, PR #275) PyInvoke on Linux, Windows, and OS X since the time I have fixed it. It works perfectly well.

@frol
frol commented Dec 25, 2015

@bitprophet I want to point out that this fix doesn't solve the Unicode errors at all. This PR should be closed in favor of #274.

@svetlyak40wt

Hey guys, I found this problem too. Currently I have either to patch this line to make .encode('utf-8') before outputting to stdout, or use this workaround in tasks.py:

from invoke import run as orig_run, task
from functools import partial


run = partial(
    orig_run,
    encoding='utf-8',
    out_stream=codecs.getwriter('utf8')(sys.stdout))

This way, all output from child process is decoded from utf-8 (thanks to encoding argument) and encoded back to the 'utf-8' when printing to stdout (thanks to the codecs writer).

@frol
frol commented Jan 28, 2016

@svetlyak40wt Please, check out PR #274. It has a bullet-proof fix for this issue. When I started, I also thought, that the fix is simple, but unicode turned out to be a tricky thing especially when Python 2/3 support is required.

@bitprophet
Member

This should be obviated by #350 at this point.

@bitprophet bitprophet closed this Jun 3, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment