Catchall 'encoding needs to get un-crap-ified' ticket #350

Closed
bitprophet opened this Issue May 26, 2016 · 35 comments

3 participants

@bitprophet
Member
bitprophet commented May 26, 2016 edited

Background

This is a follow-on from #289 and also encompasses some or all of #241, #274, #262, #332, #242, #321, and #338. My goal is to make this general story better, ASAP, while leaving remaining work clearly defined (regardless of who ends up taking a swing at it).

I need to doublecheck some of those linked issues that I've not spent a lot of time on, but the bulk of discussion is in #289 and #274, much of it driven by @pfmoore who (as a Python 3 on Windows user) serves as a very competent & useful counterweight to myself (a default-Python-2 on Unix developer). Any solution that works for both of us is likely one that serves most users.

The basics that fall out of what's been said there so far:

  • Where & how bytes/strings are decoded/encoded, needs to be better defined/documented/contracted;
  • Encoding/decoding should be pushed as far out as possible, to the "edges" of the code, to limit how often it happens & becomes an issue;
    • But also centralized further - the code handling it needs to be clearly labeled & no faffing about should happen outside that one area.
  • The default behavior should encompass as many use cases as is reasonable to do, and should degrade as gracefully as it can otherwise (re: exceptions, vs replacements, vs ???);
    • But users should also be given the ability to work around that default cleanly & fully whenever it doesn't function for their use case.
    • Ideally, we'll balance these two things well instead of using "the user can twiddle the knobs we provided" as an excuse to let the default behavior suffer. But this is hard and time is short, so...

TODO

  • Triple check existing locations where we explicitly or implicitly encode/decode, and move them all somewhere centralized & obvious (without otherwise altering them).
    • Right now the only explicit mentions of (encode|decode)\( are a bare handful in runner.py, so there are likely others that are implicit instead.
    • Not sure what the "most obvious" place is, but given most of this is directly tied to run() and friends, it may still just live in the Runner class or runner.py generally.
  • Identify & implement additional tests (integration or regular) for use cases not represented by the funky_characters_in_stdout integration tests
    • Those tests should have caught some of the basic e.g. cat tree -> kaboom bugs that snuck in during 0.11.x, but did not.
    • I identified the basic reason why & have updated it so they fail again, in a branch.
    • But they only represent a single slice of the problem - look at the discussion in #274, as well as the other top level issues above, and see what other tests can be added that currently fail.
    • Ideally at least some of these would go in the regular, not integration, suite.
  • Get those tests passing, either based off of #274 or with something new
  • If not implied in above: probably makes sense to add more encoding config toggles (maybe up to "one for each stream") so users can override them granularly; the single encoding is nice and all, but won't work in cases where e.g. one's invoke-invoking terminal is Encoding A but the environment the subprocess is operating in is using Encoding B.
  • Handle, or surface into this or other tickets, any TODOs added in the branch left undone
  • See if @pfmoore has time for immediate feedback to those changes (sadly I wasn't able to give him a thumbs up for his offer of making more PRs on his end, back in Feb) just in case.
  • Re: comments below, update read_proc_output so we always expect bytes:
    • Enhance docstring(s) to make it clear that any file-like objects given explicitly, need to yield bytes
    • Remove the "if" statement
    • Update our test suites to use BytesIO instead of StringIO (& any related string literal changes, or maybe a wrapper if lazy).
  • Update changelog entry for any fully-closed tickets & their authors
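As a sketch of what the tightened read_proc_output contract from the bullets above might look like (name taken from the TODO; the exact signature in the branch may differ, and the free-function shape here is purely for illustration):

```python
import io

def read_proc_output(reader, encoding="utf-8"):
    """Yield decoded chunks from a reader that must return bytes.

    Per the TODO above: no isinstance "if" -- file-like objects handed
    in explicitly are required to behave like binary streams
    (io.BytesIO, open(..., 'rb')), never text ones (io.StringIO).
    """
    while True:
        data = reader()
        if not data:
            break
        yield data.decode(encoding)

# e.g. a test fixture switched from StringIO to BytesIO:
fixture = io.BytesIO("caf\u00e9\n".encode("utf-8"))
result = "".join(read_proc_output(lambda: fixture.read(1000)))
```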
@bitprophet bitprophet added the Bug label May 26, 2016
@pfmoore
Contributor
pfmoore commented May 26, 2016

Just to tick off one of your TODO items, I should be able to provide feedback on changes when you need it.

@bitprophet
Member
bitprophet commented May 26, 2016 edited

Catalog of encoding/decoding/etc:

  • The bulk of the machinery is currently in Runner._handle_output(), which uses self.encode as well as codecs.iterdecode on read stdout/err data, before writing it to out_stream;
  • Another call to self.encode is in handle_stdin (when writing read stdin to the subprocess);
    • Also notable is how handle_stdin only encodes when writing mirrored stdin into the subprocess, but not when echoing back to local stdout. This is one example of the confusion & inconsistency around round-tripping decode/encode vs writing raw read bytes.
  • And the last is in respond, because it too writes to the subprocess' stdin;

And a catalog of where we read or write data (i.e. our actual external-facing bits):

  • write_stdin uses os.write to write various things (depending on where it is called - right now is either mirrored local stdin, or auto-responses) to the subprocess' stdin fd;
  • _handle_output (called by handle_stdout/err) uses <output obj>.write to write the read subprocess' stdout/err to our configured stdout/err stream objects;
  • handle_stdin uses <output obj>.write to write (echoing our locally read stdin data) to our configured stdout stream object;
  • handle_stdin uses platform.read_byte (which currently uses <input obj>.read but has had some history; if it remains simply .read it can be moved back out of platform probably) to read our own stdin;
  • read_stdout/read_stderr use os.read to read our subprocess' stdout/err;
@bitprophet
Member
bitprophet commented May 26, 2016 edited

Spitballing how to reorganize that without going crazy overboard (e.g. ripping out into a mixin class or whatever):

  • Probably best to rename all of the methods regarding WHAT EXACTLY they read/write, e.g. read_proc_stdin, write_own_stdout (or write_local_stdout or write_our_stdout or ???)
  • handle_* is a tiny bit harder because they're responsible for more than just reading/writing; but no real reason not to split out the act of reading into its own tiny method?
  • Once we have the read/write actions 100% corralled in their own methods, it should presumably be easy to migrate the encoding bits there and only there.
    • Though not really, because in some places (like _handle_output) reading+encoding involves multiple iterators
    • Regardless, even if we say "ok, read_proc_stdout is actually an iterator", still helpful to break things up this way.
  • Then, hopefully, "fixing" things (now or in future) would involve modifying those specific bits and no more.
@bitprophet
Member

As I go through the reorg (partly renaming methods to have proc or our inserted; partly creating new such methods where none existed previously), there's one more angle that I recall noticing before, and it's still annoying now:

  • reading from the subprocess' out/err naturally has 1 step of indirection due to how it's refactored (out/err behave the same besides the literal read step), and this means there's a nice separation of "do actual read" (implemented in subclasses) and "do other things to the data before letting other things touch it", like decoding bytes->str/unicode (implemented in base class).
  • writing to the subprocess' stdin does not have this indirection, but wants it, because otherwise every subclass has to remember to decode, which feels like unacceptable foot-shooting potential.

This isn't really a big deal (besides coming up with non stupid names for the two methods) but I wanted to jot it down anyway 😛

@bitprophet
Member
bitprophet commented May 27, 2016 edited

Think things are better organized now in my branch, certainly it seems that way - it's made the whole "decode on ingress, encode on egress, it's Just Strings™ in between" angle clearer for sure. Including exposing how silly Runner.encode is at present (encode! but only sometimes! and it's actually bytes->strings, which is decoding, not encoding! [I think!]).

Ditto some tests - by splitting out some read/write bits, it means some tests can now focus solely on the overall behavior ("I expect stdout to be emitted when X is true, but not when Y is true") and not care as hard about forcing the test fixtures to be bytes-or-strings. (Though most of the tests right now are one way or the other & will need updating when I start actually changing behavior.)

FTR the methods as they now stand are, as expected, representing a simple read/write vs our own pipes / the subprocess' pipes, matrix:

  • read_proc_output reads process stdout or err - parameterized as to which stream, via...
  • read_proc_stdout/read_proc_stderr, the actual process stdout/err readers
  • write_proc_stdin
  • write_our_output (parameterized as to which stream, tho not via methods because no need to subclass for this end of things)
  • read_our_stdin
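Stubbed out, that matrix looks something like this (shapes only - the real bodies and signatures live in the branch):

```python
class Runner:
    # Subprocess-facing reads: one generic loop, parameterized by
    # which concrete per-stream reader gets passed in.
    def read_proc_output(self, reader):
        """Iteratively read subprocess output via the given reader."""

    def read_proc_stdout(self, num_bytes):
        """Actually read subprocess stdout (implemented by subclasses)."""

    def read_proc_stderr(self, num_bytes):
        """Actually read subprocess stderr (implemented by subclasses)."""

    # Subprocess-facing write.
    def write_proc_stdin(self, data):
        """Write to the subprocess' stdin (implemented by subclasses)."""

    # Our own pipes - no subclassing needed on this end.
    def write_our_output(self, stream, string):
        """Write to a configured local stdout/err stream object."""

    def read_our_stdin(self, input_):
        """Read from our own (possibly overridden) stdin."""
```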
@bitprophet
Member
bitprophet commented May 27, 2016 edited

Pushed what I have so far: https://github.com/pyinvoke/invoke/compare/oh-boy-encoding - not done poking yet (haven't made new tests, only got most existing ones passing again). Basic contents:

  • Abovementioned reorg of methods-dealing-with-streams.
  • Where we read bytes (from our own stdin, or from subprocess' stdout/err), it is .decode()d, unless already a string.
    • This is also true even when "our stdin" is a user-specified stream-like object; because of the isinstance check, everything is "normalized" to string-like regardless of what's being served up.
  • Where we write strings to the subprocess' stdin, it is .encode()d, always;
  • Where we write strings to our own stdout/err, it is not encoded, because system streams are io.TextIOWrappers and those want strings, not bytes, so they can do their own encoding and whatnot.
    • Similar to above, when "our stdout/err" are stream-like objects such as StringIO, such things also want strings-not-bytes, so again, differentiating isn't useful.
    • This is one spot where my mental model going in was wrong, I was assuming we'd always be encoding on output. But no, only on the subprocess-facing side, not locally. As it turns out, this is nicely symmetric with how we read, so I think it's good...
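Distilled into code, that symmetry reads something like this (free functions for illustration; in the branch this logic lives on Runner):

```python
def decode_on_ingress(data, encoding):
    # Bytes read from the subprocess' stdout/err, or from our own
    # stdin, are turned into text at the edge (unless a user-supplied
    # stream already produced text).
    return data if isinstance(data, str) else data.decode(encoding)

def encode_on_egress(string, encoding):
    # Only the subprocess-facing side encodes: its stdin fd wants raw
    # bytes. Our own stdout/err are TextIOWrapper-ish and do their own
    # encoding, so they get the string untouched.
    return string.encode(encoding)
```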

Things of special note:

  • I have, for now, removed the iterdecode that @pfmoore added back in #211 or so, replacing it with a hopefully-equivalent .decode.
    • At the time it was added, it looks like that was done to massage the result of overly enthusiastic encoding my earlier code did beforehand (so things were being encoded, then decoded...). But I didn't find a super clear answer when I looked. If we reinstate it I deffo want to attach a clear comment nearby...
    • Now that things are a bit more straightened out, it didn't seem necessary (& I confess I never quite grokked why we had to iterdecode and not just decode - again, wish I had put in a comment), thus the 'side-grade' to .decode.
  • A few tests used to check if "iterdecode is being given the expected encoding parameter", which now breaks because we're not using iterdecode - and it's harder to test method calls on arbitrary bytes/chars than a mocked imported module.
    • However I think I can modify the test framework to more tightly control/mock that side of things (right now it's simply a StringIO standing in for real streams) so we'll see.
  • The integration tests using tree output still fail under Python 2 (this hasn't changed, in other words. Grump!)
@bitprophet
Member

@pfmoore as noted I've still got work I can do on my end, but if you have early feedback on current state of the branch, let me know. Betting I've made things worse for some use cases so far because I haven't banged the tires hard enough yet. But hoping that at least the reorganization is useful.

@bitprophet
Member
bitprophet commented May 28, 2016 edited

Also really wondering if there's existing libs out there that solve this ("I have something that's probably raw bytes, gimme a string, I don't really care how") in an easily reusable way. Betting most people just re-solve this time and time again & most solutions are, if clean, still buried in unrelated codebases. Meh.

That said, it's better for me personally that all this gets sorted out, understood, & documented so it can be moved past.

@bitprophet
Member
bitprophet commented May 28, 2016 edited

Fixed the tests that were caring about the encoding settings - good enough for now at least.

The integration tests are still failing & clearly also the crux of some of the real world issues. Basic stab:

  • Part of the problem may be the previously identified problems with locale.getpreferredencoding(False) turning up the wrong encoding.

    • E.g. I'm running it in a terminal with LANG=en_US.UTF-8, and sys.stdout.encoding (on Python 2) is also correctly reporting UTF-8.
    • But locale.getpreferredencoding(False) returns US-ASCII, which is wrong; changing to no param or True gets UTF-8. So that's one thing we need to figure out (gotta double check what's already been said about this in #274 and elsewhere)
  • Another is the discrepancy between print() and sys.stdout.write() under Python 2:

    • Taking a Unicode string with non-ASCII escapes in it and print()ing it works OK, but the exact same string sent to sys.stdout.write() yields UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128).
    • However, running encode on that variable and then handing it to sys.stdout.write, works OK (and the same works OK for print too). Trivial example of this (\u251c is the first non-ASCII-friendly character in the data we read from the subprocess running cat tree.out):
    >>> x = u"\u251c"
    >>> x
    u'\u251c'
    >>> print(x)
    ├
    >>> x.encode('utf-8')
    '\xe2\x94\x9c'
    >>> print(x.encode('utf-8'))
    ├
    >>> import sys
    >>> sys.stdout.write(x)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u251c' in position 0: ordinal not in range(128)
    >>> sys.stdout.write(x.encode('utf-8'))
    ├
    • This - the apparent need to encode before sys.stdout.write-ing - seems to contradict what I found just previously re: sys.stdout.write getting pissy if you give it a bytes. But that appears to only be Python 3 - so that's the crux there. (Why do I feel like this is all a huge fricken retread...)
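For the record, the getpreferredencoding discrepancy from the first bullet is easy to reproduce (results depend on your LANG/locale settings, so no output shown):

```python
import locale

# With do_setlocale=False the answer reflects whatever locale is
# currently set in-process; if setlocale() was never called, that is
# the default "C" locale, hence US-ASCII on many systems. With the
# default True, setlocale(LC_CTYPE, "") runs first, so the LANG/LC_*
# environment variables are consulted and a UTF-8 terminal reports
# UTF-8.
print(locale.getpreferredencoding(False))
print(locale.getpreferredencoding())
```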
@pfmoore
Contributor
pfmoore commented May 28, 2016

@bitprophet I'll try to make some time over this weekend to review all this.

The problem with sys.stdout.write and Unicode is likely because under Python 2, sys.stdout writes str (== bytes) objects, not Unicode, and so needs to encode them. But if you don't specify an encoding in Python 2, the default for unicode->bytes conversion is ASCII. On Python 3, sys.stdout is still a str stream, but this now means Unicode, not bytes...

I suspect the best solution is to have a couple of Python 2/3 compatibility functions, write_bytes_to_stdout and write_unicode_to_stdout and use them exclusively, rather than ever using sys.stdout.write directly. (You may prefer to make the functions take a general file object, rather than hard coding stdout but the idea remains the same).
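A minimal sketch of those helpers, generalized to any file object per the parenthetical (the exact names and the encoding default are assumptions):

```python
import sys

PY2 = sys.version_info[0] == 2

def write_unicode(stream, text, encoding="utf-8"):
    # Python 2 file streams expect str (== bytes) and implicitly
    # ASCII-encode unicode, so encode explicitly there; Python 3 text
    # streams take str (== unicode) as-is.
    stream.write(text.encode(encoding) if PY2 else text)

def write_bytes(stream, data, encoding="utf-8"):
    # The mirror image: Python 3 text streams refuse bytes, so decode
    # before writing; Python 2 takes the bytes directly.
    stream.write(data if PY2 else data.decode(encoding))
```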

That's just off the top of my head though - I'll give more detailed comments once I've looked at the code.

@pfmoore
Contributor
pfmoore commented May 28, 2016

BTW, the point about iterdecode was that we were getting bytes from the get() function. And because there was no guarantee that I could find that get() wouldn't split a multi-byte sequence part way through, we need to use iterdecode to "hold onto" the partly-complete sequence ready for the next call.

Consider

>>> 'abc£def'.encode('utf-8')[:4].decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 3: unexpected end of data

By splitting the bytes at byte 4, we cut the 2-byte sequence for "£" in half, and decode fails. That's precisely the situation iterdecode is designed to deal with.

It's also an annoying issue, because it's really unlikely to come up in practice, so the bug could lie dormant for ages. Might be worth explicitly constructing a test to make sure this is handled.
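That dormant-bug test could be as small as:

```python
import codecs

# Split the UTF-8 bytes of 'abc£def' at byte 4, cutting the 2-byte
# sequence for '£' in half -- exactly the failure shown above.
raw = "abc\u00a3def".encode("utf-8")
chunks = [raw[:4], raw[4:]]

# A plain .decode() per chunk would raise UnicodeDecodeError on the
# first chunk; iterdecode buffers the dangling lead byte until its
# continuation byte arrives in the next chunk.
decoded = "".join(codecs.iterdecode(iter(chunks), "utf-8"))
assert decoded == "abc\u00a3def"
```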

@bitprophet
Member

Re: sys.stdout.write: yea I assumed we'd need some intermediate code handling this, glad to hear we're on the same page there, thanks.

Re: iterdecode: I actually had thoughts about what you mention, but regarding the spot where we read from our local stdin one byte at a time - there's a TODO there worrying about multibyte characters. Totally acknowledge that it's the same problem with reading stdout, just much less likely to happen there (currently/by default, at 1000 bytes/read instead of 1...).

About the stdout bit which used to use iterdecode, I don't recall whether the get iterator was necessary beforehand, but there isn't one now. Feels like we may need to just roll back to what was there before, except not having that earlier/apparently-unnecessary .encode. I'll have to poke some.

@bitprophet
Member

Note to self, another good real world ish encoding-is-fucked test is Debian's /usr/share/dict/words (as seen over in #350) - on Linux (but not on Darwin) it is UTF-8 text with a handful of words containing é at the end, which - when catted inside an Invoke run - turn up as ye olde question mark.

@pfmoore
Contributor
pfmoore commented May 30, 2016

Hmm, annoyingly, that "compare branches" view on github doesn't seem to allow for line comments, which is a real PITA. I'll add some comments here for now.

As a general point, there's a couple of places where you're still optionally decoding/encoding (or toying with the idea in a TODO). That's where "strict separation of Unicode/bytes" is an important idea - you should never do that, any case where you get a type you're not expecting is a bug in your caller and should be fixed there. If you say "I expect Unicode from this reader function", and you don't get it - fix the reader that was passed, don't allow for it in your code. It's a hard lesson to learn, in my experience (because "be lenient in what you accept" is advice you often hear) but it's important to stick to.

  • read_proc_output - Here's the first example of "only decode if the data seems binary". If a reader didn't return bytes, fix the reader. Have you actually seen a case where the reader returned Unicode unexpectedly? If you want to be cautious, add an assert here so that incorrect readers fail early. I seem to recall that you had trouble like this because your test harness (which AIUI mocks out the stdin/stdout/stderr streams) was returning the wrong type of data. But that means you need to fix your test harness, not your code...
  • write_our_output re the TODO. Another one. You should never encode here, if there's a need to, then the caller has incorrectly supplied a stream that expects bytes. That's a bug in the caller, and should be fixed there, not here.
  • _handle_output using a dummy output stream to mean "hide". Maybe, but (a) you then need to handle cross-platform issues (on Windows, the null stream is nul not /dev/null, for example) and (b) writing to /dev/null is still an unneeded syscall; it's probably cheaper to just do it in code. (If you mean that the user should specify /dev/null then I'm -1, as Windows users are much less familiar with the idea of a null stream, and may not realise they can do that.)
  • read_our_stdin yes, reading one byte at a time will break multibyte encodings. The idea of reading 1 byte at a time is described in a comment as "for interactivity's sake". I don't really understand what that means, so I'm finding it hard to comment here, but this seems to be the big problem area - if it weren't for this single area, everything else would be much easier as we could convert to Unicode much sooner. (I could explain this much easier in code terms, but it would mean creating a whole other "see what I mean" branch, and I'm not sure I have the time for that at the moment). But getting back to reading input - is what you really want to do not simply "read a block of bytes from the stdin, but on an interactive device only block long enough to get some data, don't wait for all of what was requested"? Hmm, is that as simple as modifying read_byte to return input_.read(SOME_BIG_NUMBER)? After all, RawIOBase.read() is documented as only ever doing one syscall as long as the requested size isn't -1.

I'll carry on this review later - no more time right now. But I think that getting that "read some bytes without blocking (for interactivity)" call right would be a key issue - so I'd be interested on your comments on that.
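On the dummy-output-stream point, doing the "hide" in code as option (b) suggests needs nothing more than a trivial writer (a sketch, not anything in the branch):

```python
class NullWriter:
    """Hypothetical 'hide' stream: swallows writes in-process, so no
    /dev/null-vs-nul portability issue and no extra syscalls."""

    def write(self, data):
        pass  # discard the data entirely

    def flush(self):
        pass  # nothing buffered, nothing to do
```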

@bitprophet bitprophet added a commit that referenced this issue May 31, 2016
@bitprophet bitprophet First stab at fixing #350 for one case.
Updates test to account, fixes PTY case well enough apparently
05e8fea
@bitprophet bitprophet added a commit that referenced this issue Jun 1, 2016
@bitprophet bitprophet Changelog closes #350 0022838
@bitprophet bitprophet added a commit that referenced this issue Jun 1, 2016
@bitprophet bitprophet Fix bug in stdin threading re #351/#350
Needed stronger differentiation between 'not ready for reading' and 'read an empty byte'.
d9590d2
@bitprophet bitprophet added a commit that closed this issue Jun 1, 2016
@bitprophet bitprophet Changelog closes #350 0022838
@bitprophet bitprophet closed this in 0022838 Jun 1, 2016
@bitprophet
Member

Grump - I apparently fat-fingered the issue # at some point; this is not closed yet.

@bitprophet bitprophet reopened this Jun 2, 2016
@bitprophet
Member

Thanks for the additional comments @pfmoore, replying below.

read_proc_output - If a reader didn't return bytes, fix the reader. [...] you need to fix your test harness, not your code...

So, yes, one of these "returns strings not bytes" instances is the test harness, but the idea is that users themselves might equally want to use arbitrary file-like objects in this situation - including io.StringIO as we use in the test suite. (Or something returned from open() in text mode, or etc etc).

If you're saying we should tighten the API so it says "you may hand in arbitrary file-like objects which act like they are in binary mode, not text mode, such as io.BytesIO or open(..., 'rb'), and not io.StringIO or open(..., 'rt')" - that seems doable. Especially since in my brief testing, putting "string" literals (regular in Py2; b-prefixed in Py3) into BytesIO is still easy.

I am curious exactly what problems crop up from the 'normalization' approach currently being used, but either way, probably happy to make the above doc+impl change.

write_our_output re the TODO. Another one. You should never encode here, if there's a need to, then the caller has incorrectly supplied a stream that expects bytes. That's a bug in the caller, and should be fixed there, not here.

Yup, agree - not sure which version of the TODO you saw but the one there now is very "uh...is that ever a thing? I don't think it's ever a thing" since I did discover how the IO wrapper classes do their own encoding; so I'm happy to just nuke the comment and leave it as-is (i.e. never encoding).

_handle_output using a dummy output stream to mean "hide". Maybe, but (a) you then need to handle cross-platform issues

A) To be clear, that commentary predated all this work so I consider it orthogonal to this ticket; B) It didn't mean an OS level null stream like you describe, but simply "let the user give an IO class whose write is literally just pass" as a potentially more elegant way of implementing the current hardcoded hide option.

read_our_stdin yes, reading one byte at a time will break multibyte encodings. The idea of reading 1 byte at a time is described in a comment as "for interactivity's sake". I don't really understand what that means [...] is what you really want to do not simply "read a block of bytes from the stdin, but on an interactive device only block long enough to get some data, don't wait for all of what was requested"?

Pretty much this, yes. It's been a long time since that stuff was first worked in, but IIRC the problem was doing a stdin.read(N) or similar was in many situations blocking until N bytes were read, meaning we were unable to mirror anything less than that at a time to the subprocess. Kinda makes any humans typing in response to, say, prompts (or interactive shells or etc) difficult.

As I said, it's been a while, so I'm wondering if the contortions to the IO APIs we use changed enough in the interim that the issue is no longer present - such as what you said next:

Hmm, is that as simple as modifying read_byte to return input_.read(SOME_BIG_NUMBER)? After all, RawIOBase.read() is documented as only ever doing one syscall as long as the requested size isn't -1.

If that isn't a Python 3 specific behavior, that might work, I'll have to experiment.

If it's somehow not possible, I reckon the other approach might be to make this configurable, so users can at least work around it when they have multibyte stdin & don't need the minimal-read behavior for whichever reason (e.g. piped stdin instead of actual human-at-keyboard).

@bitprophet
Member
bitprophet commented Jun 2, 2016 edited

In practice it seems like my original fears were founded:

  • Confirmed that fully interactive things like run('python', pty=True) operate pretty much as if I was running the Python REPL by hand - typing works, each character is mirrored, control sequences like ^D function as expected, etc.
  • Confirmed that input_ is, by default, a _io.TextIOWrapper instance (because it's sys.stdin)
  • Changed to input_.read(1000) from (1)
  • Under both Python 3 and Python 2, typing has no apparent effect any longer - no echoing, Python prompt isn't seeing what I type, ^D doesn't work. Only ^C works and that's because it generates exceptions and we trap/forward those explicitly in the read loops.
  • Further, changing it to e.g. read(2) has exactly the expected effect - typing 1st char acts as above, typing 2nd char causes both to appear simultaneously (i.e. only after 2 chars are typed to stdin does the read call return/unblock).

This is all on my OS X workstation; identical behavior on Debian 8.

Gonna poke around briefly to see if there's alternatives I missed before; flags, other classes/APIs, etc. Otherwise, though, I'll just punt on that particular angle until it comes up for real - making sure stdout/err are correctly en/decoded is a lot more visible/important, and then I need to move on.
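One more avenue worth poking at (purely a sketch, under the assumption we have a real file descriptor, i.e. sys.stdin on a tty - this cannot work for user-supplied StringIO-style objects): drop below the buffered text layer and pair select with os.read, which returns as soon as any bytes are available rather than blocking for a full count.

```python
import os
import select

def ready_bytes(fd, timeout=0.0):
    # Wait (up to `timeout` seconds) for the fd to become readable...
    reads, _, _ = select.select([fd], [], [], timeout)
    if not reads:
        return b""  # nothing yet; the caller's loop comes back around
    # ...then take whatever a single read returns: one keypress
    # unblocks this immediately, even mid-multibyte-sequence.
    return os.read(fd, 1000)
```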

@bitprophet
Member

OK, so yea, APIs are a thing. @pfmoore mentioned RawIOBase, but sys.stdin defaults to a TextIOWrapper over a BufferedIOBase subclass which, well, buffers.

I vaguely recall touching its buffer object at some point, but that approach also had issues - either cross-platform ones, or "works with sys.stdin, but not with user-supplied file-like objects" (which seems likely, since the deeper you get, the farther you are from compatibility with the file interface.)

Given we were recently using iterdecode I find myself wondering, can we somehow just fit the stdin firehose into iterdecode and have it Just Magically Work Somehow? (Like seen here.)

Though there may be other issues with that - the abovementioned "it's no longer compatible with things like BytesIO"; the need to control the looping so we can detect EOF or the program_finished threading flag firing; I've no idea if the linked approach is compatible with how we force the terminal to byte-buffered mode on Unix; etc.

Continuing to poke.

@bitprophet
Member
bitprophet commented Jun 2, 2016 edited

Yea, sadly that approach appears to re-enable line buffering (or otherwise skip past the enforced byte buffering). I think I'm going to have to mark this sub-issue as "pending" - spun it off into #352.

The "don't normalize, force all stream objects in play to be in byte, not text, mode" stuff still seems feasible, so if @pfmoore gives me a thumbs up on my response to him above, I will still do that.

And I still have to do the rest of the checklist...that's next. Though I am leaning towards "I've timed out on this issue" if it takes long - as long as those failing-on-2 integration tests get fixed, maybe Good Enough.

@pfmoore
Contributor
pfmoore commented Jun 2, 2016

If you're saying we should tighten the API so it says "you may hand in arbitrary file-like objects which act like they are in binary mode, not text mode, such as io.BytesIO or open(..., 'rb'), and not io.StringIO or open(..., 'rt')" - that seems doable. Especially since in my brief testing, putting "string" literals (regular in Py2; b-prefixed in Py3) into BytesIO is still easy.

I think we're confusing each other a little here, but essentially yes. I would word it as "you may hand in arbitrary file-like objects but they must produce bytes, not Unicode (on Python 2, of course, bytes = strings)". But that's maybe just a Python 3 perspective on it.

On Python 2, IIRC, "rb" vs "rt" affects line ending translation on Windows, but otherwise "rt" is still fine - both "rt" and "rb" return string (=bytes) objects, not Unicode.

But this is nitpicking - I think we both agree on the principle (even if we can't explain it very well :-))

I am curious exactly what problems crop up from the 'normalization' approach currently being used, but either way, probably happy to make the above doc+impl change.

Probably nothing. The real problems tend to occur the other way round, when we expect to get Unicode, but want to handle bytes (because that way, we have to guess the encoding, and that's always a recipe for bugs).

I guess I don't really mind if this stays as it is - I can't actually point to issues. My big problem is that I'm working purely from theoretical concerns. You have a much better feel for what users are actually doing with this interface - and making the API comfortable for users is much more important than theoretical purity.

So I'm happy with whichever approach you decide on.

Re: read(1). Yeah, buffering. I forgot about that. I guess we leave it at a 1-byte read for now, and either wait for an actual bug to surface before worrying, or push the data through iterdecode, which will buffer any partial multibyte sequences until they are complete (but only buffer the minimum chunk needed, and that should match "a single keypress" and hence be fine). Again, let's not get too bogged down in theoretical problems.

(Let me know if I missed any questions you had - I think I covered what you asked, but I may have skipped something)

@bitprophet
Member

Yea I think we're on the same page, thanks! (Re wording, yes, your "give us a file object yielding bytes" is the more correct/specific phrasing so that's what I'll go for.)

@bitprophet
Member
bitprophet commented Jun 2, 2016 edited

Poking at the still failing tests around tree output and looking at #274:

  • Reading the tree output into a Unicode string is definitely funky in my environment unless I manually give encoding='UTF-8', because it's defaulting to US-ASCII, I think per #274.
    • So one change I think is to revert to using getpreferredencoding() without False, after I doublecheck why we are using False to begin with.
  • Even in that case, however, trying to print that Unicode string back out to my terminal - or simply passing it through str - results in ascii complaints.
    • even with encoding='UTF-8' - tho that's expected as we're not using self.encoding on egress. That probably needs to change, I need to make sure the expected API we're calling always takes encoding= and then pass it in.
    • Also feels like we should allow different encoding settings for the two ends (subproc and local) but that may wait til later, ideally for most folks it's the same.
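
For reference, the do_setlocale distinction being discussed boils down to this:

```python
import locale

# getpreferredencoding() (do_setlocale=True) temporarily calls
# setlocale(LC_CTYPE, "") so LANG/LC_* are consulted; passing False skips
# that and reports whatever the current - often never-set, i.e. C/POSIX -
# locale says, which is how US-ASCII can sneak in as the default.
print(locale.getpreferredencoding(False))
print(locale.getpreferredencoding())
```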
@bitprophet
Member
bitprophet commented Jun 2, 2016 edited

Right, default sys.stdout type objects have no encoding parameters whatsoever for .write().

So this is the familiar problem (listed far above) where Python 2 is still somehow arriving at the 'ascii' codec for encoding, despite being given a unicode object and sys.stdout.encoding being 'UTF-8' (and print() working great). Paul's original off-the-cuff diagnosis was around writing bytestrings to sys.stdout.write, but we're definitely no longer doing that because of the cleanup - yet the problem remains.

sys.getdefaultencoding() (per http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python) is definitely still saying ascii, for example. Still unclear if this is a legitimate bug in Python. The various solutions bandied about in that and other locations, all sound like they have tradeoffs.

This earlier comment from #274 also lists some of the same things (wrapping sys.stdout, doing an explicit encode/decode round trip). But it's still unclear to me why we need to do either of those when this "should" work and does work fine with print.

Manually doing write(data.encode(self.encoding)) works great (because of how Python 2's IO here accepts either unicode or bytestrings), so wondering if the least-shite approach is to do an if-python2 check and explicitly encode only under Python 2 (since Python 3 already does the Right Thing here by default). Great? No. Possibly least disruptive? Maybe!
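
A hedged sketch of that "encode only under Python 2" idea (write_our_output here is illustrative shorthand, not invoke's exact method signature):

```python
import io
import sys

def write_our_output(stream, string, encoding="utf-8"):
    # On Python 2, file objects implicitly encode unicode with 'ascii'
    # inside write(); pre-encoding with the real target encoding
    # sidesteps that. Python 3 text streams already encode themselves
    # using their configured encoding, so we leave them alone.
    if sys.version_info[0] == 2:
        string = string.encode(encoding)
    stream.write(string)

# On Python 3 this just writes text straight through:
buf = io.StringIO()
write_our_output(buf, "naïve")
```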

@bitprophet
Member

Actually testing this is also challenging, since the behavior under test is, well, this poorly behaved real life sys.stdout. I guess an appropriate mock - since we're currently not going for the solutions that fux with sys.stdout itself - is simply any writer object that performs the "I automatically encode all non-unicode string objects with ascii" behavior...and only on Python 2. Will see if I can whip that up.
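
One way that mock might look (a guess at the shape, not the suite's actual fixture):

```python
class ASCIIStdout:
    """Test double mimicking Python 2's real sys.stdout: any text handed
    to write() is implicitly encoded with the 'ascii' codec, so non-ASCII
    input reproduces the UnicodeEncodeError from the bug."""

    def __init__(self):
        self.written = []

    def write(self, data):
        if isinstance(data, str):
            data = data.encode("ascii")  # UnicodeEncodeError on non-ASCII
        self.written.append(data)

fake = ASCIIStdout()
fake.write("plain")     # fine
try:
    fake.write("café")  # blows up like the real-life stream
except UnicodeEncodeError:
    pass
```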

@bitprophet
Member
bitprophet commented Jun 3, 2016 edited

(Much earlier today...) got that working enough to prove the bug (when fix reverted) and pass when fix implemented. Toot.

Also committed the change removing do_setlocale from getpreferredencoding.

Next up:

  • Triple checking whether do_setlocale should be made configurable or something, re: rest of #274
  • Double checking how things currently behave (real & test suite) with non-UTF-8 posix shell locales/langs like C (again as per some of #274).
  • Then continuing to skim the related tickets for obvious new test cases or other wrinkles that really want attention before I move on from this.
@bitprophet
Member

Looked deeper at the locale stuff and #274 actually states neither invocation of getpreferredencoding is fully safe under Python 2 (& non-Win): #274 (comment) - that PR uses getdefaultlocale instead (here).

Quickly verified that getdefaultlocale() gets 'UTF8' (no dash but doesn't matter afaict) on my LANG=en_US.UTF-8 default env, and None if I unset LANG. So using that redone default_encoding() wholesale seems wise.
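
The redone default_encoding() from #274 amounts to something like this (a paraphrased sketch, not a verbatim copy of that PR):

```python
import locale

def default_encoding():
    # getdefaultlocale() parses LANG & friends without ever calling
    # setlocale(); when LANG is unset it returns (None, None), so fall
    # back to getpreferredencoding(False) in that case.
    encoding = locale.getdefaultlocale()[1]
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return encoding
```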

@bitprophet
Member

Poking at that also raises the other question of how the tests should behave when the runner is in a non-UTF-8 world, as well as the yet-unanswered question of how we should behave when the data read from the subprocess (& then to be emitted to local terminal streams) isn't compatible with the local encoding (i.e. explode, display replace-driven junk, other).

For now I think just annotating the tests with an "only run under UTF-8" marker will prevent annoying explosions for users on e.g. C locales; that should be somewhat less complex than what #274 does there. Then, when we have time to narrow down the "how/whether to explode" bits, we can adjust the tests, or add new ones, to enforce that.
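
A minimal version of that annotation, assuming unittest-style skips (the actual test suite may use a different runner/decorator):

```python
import locale
import unittest

def requires_utf8(test_item):
    # Skip the decorated test unless the preferred encoding is UTF-8,
    # so runs under e.g. the C locale are skipped rather than exploding.
    enc = (locale.getpreferredencoding(False) or "").lower().replace("-", "")
    return unittest.skipUnless(enc == "utf8", "requires a UTF-8 locale")(test_item)
```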

@bitprophet bitprophet added a commit that referenced this issue Jun 3, 2016
@bitprophet bitprophet Start changelog re #350 66b380c
@bitprophet
Member
bitprophet commented Jun 3, 2016 edited

Pulled in the "use getdefaultlocale first on some platforms, fall back to getpreferredencoding(False) otherwise" logic from #274, after updating tests to account for it. Hopefully we're now more flexible there too.

That should do it for useful changes/concerns from that ticket; on to the rest.

@bitprophet
Member
bitprophet commented Jun 4, 2016 edited

Ugh, realized I've had things backwards re: the conditional-bytes stuff. The "Users might want their own objects" angle is for our stdout/err and proc stdin. Those spots don't have conditional decoding. The conditional decoding is for our stdin or proc stdout/err, and the conditional around that spot is only for helping the test suite be lazy.

Thankfully, fixing this the right way (instead of what I had just tried earlier & had issues with - stdin reading always yields strings on Python 3, so "always decode" in that spot was...problematic) is a lot less search & replace heavy.

@bitprophet
Member

@pfmoore I think this is done now. If you have time to look at the latest version of the branch (https://github.com/pyinvoke/invoke/compare/oh-boy-encoding) and give me your blessing, that'd be appreciated :)

I'm going to move on to actual feature work now either way (tho I'll wait to merge for a bit), so if there are remaining issues, as long as they're less critical than when I started this branch, I'm still going to consider this an improvement 😩

@pfmoore
Contributor
pfmoore commented Jun 5, 2016

Looks like you merged it in? The compare link shows "up to date with master" at any rate.

I'll try to get a chance to install from master at some point and kick the tyres a bit.

@bitprophet
Member
bitprophet commented Jun 5, 2016 edited

Yea, I did, forgot to update in here. Not worried, this isn't released yet (poking Fabric 2 today, will release Invoke 0.13 before that work goes public) so if you find some sort of horrid blocker, still time to try and fix it. Otherwise, bugfix releases are things that exist. Thanks!

@bitprophet
Member

0.13 is out now. Gonna close this for now but as always please comment if you find anything, we can reopen or make a new ticket. ❤️

@bitprophet bitprophet closed this Jun 9, 2016
@aleksihakli

Hey,

I don't know if this is still relevant, but I'm running an invoke task in an Elastic Beanstalk deployment on AWS, and the bower install step breaks the task because of the unicode output and utf-8 encoding unless I have hide=True enabled. Here's an excerpt from my tasks.py file:

import getpass
import logging

from invoke import task

log = logging.getLogger(__name__)  # logger used by the task below

root = ['root']
user = getpass.getuser()

@task
def bowercomponents(ctx):
    """ Run `bower install` with --allow-root if necessary """
    log.info('Installing libraries with bower')
    if user in root:
        command = 'bower --allow-root install'
    else:
        command = 'bower install'
    # Using AWS Unicode output breaks Invoke on deployments 
    # unless we ditch the output with the hide=True flag
    ctx.run(command, hide=True)

And these are my software versions:

[ec2-user@ip-x-y-z-w ~]$ date
Thu Jul 14 11:28:33 UTC 2016

[ec2-user@ip-x-y-z-w ~]$ cat /proc/version
Linux version 4.4.5-15.26.amzn1.x86_64 (mockbuild@gobi-build-60007) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Wed Mar 16 17:15:34 UTC 2016
[ec2-user@ip-x-y-z-w ~]$ bash --version
GNU bash, version 4.2.46(1)-release (x86_64-redhat-linux-gnu)

[ec2-user@ip-x-y-z-w ~]$ source /opt/python/run/venv/bin/activate

(venv)[ec2-user@ip-x-y-z-w ~]$ python --version
Python 3.4.3
(venv)[ec2-user@ip-x-y-z-w ~]$ invoke --version
Invoke 0.13.0
(venv)[ec2-user@ip-x-y-z-w ~]$ node --version
v4.4.7
(venv)[ec2-user@ip-x-y-z-w ~]$ bower --version
1.7.9

And this is what the failure approximately looks like:

[2016-07-11T16:20:35.075Z] INFO  [18180] - [Application update app-8436-160711_191916@101/AppDeployStage0/EbExtensionPostBuild/Infra-EmbeddedPostBuild/postbuild_0_trasim/Command 21_install] : Activity execution failed, because: 

bower no-home HOME environment variable not set. User config will not be loaded.
  bower bootstrap#3.3.6           cached https://github.com/twbs/bootstrap.git#3.3.6
  bower bootstrap#3.3.6         validate 3.3.6 against https://github.com/twbs/bootstrap.git#3.3.6
  bower jquery#1.12.4             cached https://github.com/jquery/jquery-dist.git#1.12.4
  bower jquery#1.12.4           validate 1.12.4 against https://github.com/jquery/jquery-dist.git#1.12.4
  bower jquery-ui#1.11.4          cached https://github.com/components/jqueryui.git#1.11.4
  bower jquery-ui#1.11.4        validate 1.11.4 against https://github.com/components/jqueryui.git#1.11.4
  bower jquery-ui#1.11.4         install jquery-ui#1.11.4
  bower jquery#1.12.4            install jquery#1.12.4
  bower bootstrap#3.3.6          install bootstrap#3.3.6

  Traceback (most recent call last):
  File "/opt/python/run/venv/bin/invoke", line 11, in <module>
  sys.exit(program.run())
  File "/opt/python/run/venv/local/lib/python3.4/site-packages/invoke/program.py", line 270, in run
  self.execute()
File "/opt/python/run/venv/local/lib/python3.4/site-packages/invoke/program.py", line 381, in execute
  executor.execute(*self.tasks)
  File "/opt/python/run/venv/local/lib/python3.4/site-packages/invoke/executor.py", line 113, in execute
  result = call.task(*args, **call.kwargs)
  File "/opt/python/run/venv/local/lib/python3.4/site-packages/invoke/tasks.py", line 111, in __call__
  result = self.body(*args, **kwargs)
  File "/opt/python/bundle/70/app/tasks.py", line 205, in build
  bowercomponents(ctx)
  File "/opt/python/run/venv/local/lib/python3.4/site-packages/invoke/tasks.py", line 111, in __call__
  result = self.body(*args, **kwargs)
  File "/opt/python/bundle/70/app/tasks.py", line 145, in bowercomponents
  ctx.run('bower --allow-root install', encoding='utf-8')
  File "/opt/python/run/venv/local/lib/python3.4/site-packages/invoke/context.py", line 53, in run
  return runner_class(context=self).run(command, **kwargs)
  File "/opt/python/run/venv/local/lib/python3.4/site-packages/invoke/runners.py", line 339, in run
  raise ThreadException(exceptions)
  invoke.exceptions.ThreadException:
  Saw 1 exceptions within threads (UnicodeEncodeError):

  Thread args: {'kwargs': {'buffer_': ['bower bootstrap#3.3.6           cached '
  'https://github.com/twbs/bootstrap.git#3.3.6\n',
  'bower bootstrap#3.3.6         validate 3.3.6 '
  'against '
  'https://github.com/twbs/bootstrap.git#3.3.6\n',
  'bower jquery#1.12.4             cached '
  'https://github.com/jquery/jquery-dist.git#1.12.4\n',
  'bower jquery#1.12.4           validate 1.12.4 '
  'against '
  'https://github.com/jquery/jquery-dist.git#1.12.4\n',
  'bower jquery-ui#1.11.4          cached '
  'https://github.com/components/jqueryui.git#1.11.4\n',
  'bower jquery-ui#1.11.4        validate 1.11.4 '
  'against '
  'https://github.com/components/jqueryui.git#1.11.4\n',
  'bower jquery-ui#1.11.4         install '
  'jquery-ui#1.11.4\n',
  'bower jquery#1.12.4            install '
  'jquery#1.12.4\n',
  'bower bootstrap#3.3.6          install '
  'bootstrap#3.3.6\n'],
  'hide': False,
  'output': <_io.TextIOWrapper name='<stdout>' mode='w' encoding='ANSI_X3.4-1968'>},
  'target': <bound method Local.handle_stdout of <invoke.runners.Local object at 0x7ff401ce5550>>}
  Traceback (most recent call last):

  File "/opt/python/run/venv/local/lib/python3.4/site-packages/invoke/util.py", line 118, in run
  super(ExceptionHandlingThread, self).run()

  File "/usr/lib64/python3.4/threading.py", line 868, in run
  self._target(*self._args, **self._kwargs)

  File "/opt/python/run/venv/local/lib/python3.4/site-packages/invoke/runners.py", line 512, in handle_stdout
  indices=threading.local(),

  File "/opt/python/run/venv/local/lib/python3.4/site-packages/invoke/runners.py", line 483, in _handle_output
  self.write_our_output(stream=output, string=data)

  File "/opt/python/run/venv/local/lib/python3.4/site-packages/invoke/runners.py", line 471, in write_our_output
  stream.write(string)

  UnicodeEncodeError: 'ascii' codec can't encode characters in position 45-47: ordinal not in range(128)

   (ElasticBeanstalk::ExternalInvocationError)

Sorry for the poor output formatting. I am currently working with a funny shell.

I hope this proves useful to somebody. The hide=True argument for ctx.run() invocation makes the unicode and utf-8 encoding problems go away with Node.js and Bower.

@bitprophet
Member

@aleksihakli Thanks for the report! I spun this off into its own issue: #377 - if you could respond there I'd be grateful.
