Fixed a number of encoding issues (in `run` context) #274

Closed
wants to merge 4 commits into
from

Projects

None yet

6 participants

@frol
frol commented Sep 11, 2015

Unicode handling is nasty, it is quite hard to make it right, but this PR makes Invoke reliable in non-ascii world.

Old Description

I've had a session with POSIX locale on Linux and a program that I've run pushed non-ASCII (UTF-8 in my case) output anyway. This revealed a bug when user-requested/locale.getpreferredencoding() is UTF-8, but sys.stdout is ASCII.

@betterdiff betterdiff and 1 other commented on an outdated diff Sep 11, 2015
invoke/runners.py
@@ -322,8 +322,21 @@ def get():
# Can't use six.b because that just assumes latin-1 :(
data = data.encode(self.encoding)
yield data
- # Decode stream using our generator & requested encoding
- for data in codecs.iterdecode(get(), self.encoding, errors='replace'):
+ if self.encoding == output.encoding:
+ # No need in re-encoding if source and target encodings are the same
@betterdiff
betterdiff Sep 11, 2015
  • E501 line too long (80 > 79 characters)
@frol
frol Sep 11, 2015

Please, suggest how I should fix this.

@frol
frol commented Sep 11, 2015

NOTE:

locale.getpreferredencoding(use_setlocale=False) is buggy in Python 2.x:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

$ python2
Python 2.7.9 |Continuum Analytics, Inc.| (default, Apr 14 2015, 12:54:25)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org
>>> import locale
>>> locale.getpreferredencoding(False)
'ANSI_X3.4-1968'
>>> locale.getpreferredencoding(True)
'UTF-8'
>>> locale.getpreferredencoding()
'UTF-8'

$ python3
Python 3.4.3 (default, Mar 25 2015, 17:13:50)
[GCC 4.9.2 20150304 (prerelease)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getpreferredencoding(False)
'UTF-8'
>>> locale.getpreferredencoding(True)
'UTF-8'
>>> locale.getpreferredencoding()
'UTF-8'

As you can see, Python 2 weirdly reports ANSI (ASCII) instead of UTF-8 when use_setlocale is False. This causes unicode characters being replaced with ?? on, for example, run('echo "Я"').

NOTE2:

We should avoid set_locale=True either, because it triggers global state changes:

@frol
frol commented Sep 11, 2015

@bitprophet Could you please review this small, but quite important fix?

@pfmoore pfmoore and 1 other commented on an outdated diff Sep 11, 2015
invoke/runners.py
+ if not hasattr(output, 'encoding') or self.encoding == output.encoding:
+ # No need in re-encoding if output is a BytesIO or CarbonCopy, or
+ # source and target encodings are the same
+ data_stream = get()
+ else:
+ # Re-encode from user requested encoding to the output encoding
+ data_stream = codecs.iterencode(
+ codecs.iterdecode(
+ get(),
+ self.encoding,
+ errors='replace'
+ ),
+ output.encoding,
+ errors='replace'
+ )
+ for data in data_stream:
@pfmoore
pfmoore Sep 11, 2015 Contributor

This seems wrong. The original code is pulling data from the results of iterdecode - i.e., data is a string. The replacement code is pulling data from iterencode or get, so it's bytes.

This would be the sort of error that Python 2 lets you make because of its incomplete separation of bytes and strings. But it should fail on Python 3 (and give potentially corrupted results on Python 2).

@frol
frol Sep 11, 2015
  1. get always yields bytes (str in Python 2.x)
  2. encode(decode(bytes)) returns bytes
  3. output.write works safer with bytes

I don't see any mistake here.

@pfmoore
pfmoore Sep 11, 2015 Contributor

Regarding (3), is output opened as a bytes stream or as a string stream? The original code clearly impies it's a string stream.

The original code did decode(get()), returning a string. You have changed the type of data without changing the code that uses it.

@frol
frol Sep 11, 2015

Regarding (3):

  1. There is no "text" mode on Linux and OS X, write will encode unicode to bytes under the hood if we don't, so it is usually advised to pass bytes when you write or send anything in Python
  2. In tests, it is used CarbonCopy, which is based on BytesIO (bytes)

I changed the code because it was buggy. It was just a luck that it used to work. There was the sort of error that you referred to in your first comment.

@pfmoore
pfmoore Sep 11, 2015 Contributor

OK. I've looked at the surrounding code. I think I see what you're trying to achieve here.

You're right, there's a bug, but your fix is wrong (see below - I'm taking a purist view here, I suspect your fix probably works, but it's very hard to analyze, so I can't be sure). According to the comment at the top of the io() function, reader should return bytes, and output should take bytes as the argument to its write() method. In Python 3, that's explicitly typed, and using the wrong types won't work. In Python 2, you can accidentally supply a string when you mean bytes and it will work, but give the wrong output. This is where Python 2's string/bytes model sucks - you have to track which objects are bytes and which are strings for yourself, the language doesn't do it for you.

The original code got bytes from get(), put them through iterdecode to get strings, then wrote the strings with output.write. That is broken, because you're writing strings to an output function that should get bytes.

But the comment about output is muddled:

If hide is False, write bytes to output, a stream such as sys.stdout.

That doesn't make sense because sys.stdout is not a stream you can write bytes to! You can write bytes to the test CarbonCopy stream, so the tests don't spot this. But you cannot supply bytes to sys.stdout.write. On Python 2 you get mojibake, on Python 3 you get a type error.

I think your code juggles its way round this, by making data be either bytes or string in a way that matches what output expects. But it's complicated, fragile code that doesn't keep a clean type model (something that in my experience is the source of a lot of bugs).

I would strongly recommend a more type-consistent fix, cleanly separating bytes and strings (conceptually in Python 2, but in terms of actual types in Python 3). That might mean needing to explicitly pass around encodings in places where it's implied at the moment.

But working round the issue in the way your patch does may be a "practicality beats purity" issue - I'll leave that decision to @bitprophet. But someone should at least test it on Python 3 and Windows (and by "test" I mean use it for real, with an actual stdout, not running the tests which stub out the string-based sys.stdout with a bytes-based CarbonCopy instance!!)

Actually, if the test harness is mocking out sys.stdout with a BytesIO object, that's pretty broken. It'll make it pretty much impossible to spot or check issues like this where strings and bytes get muddled.

@frol
frol Sep 12, 2015

You are right, there is indeed a difference between open(mode='w') and open(mode='wb') in Python 3.x and sys.stdout is opened in text mode. I will rework my PR.

@pfmoore
pfmoore Sep 12, 2015 Contributor

Thanks. And thanks for your patience - rereading, my initial explanations weren't very clear :-(

@pfmoore pfmoore and 1 other commented on an outdated diff Sep 11, 2015
invoke/runners.py
@@ -476,7 +490,7 @@ def start(self, command):
)
def default_encoding(self):
- return locale.getpreferredencoding(False)
+ return locale.getpreferredencoding()
@pfmoore
pfmoore Sep 11, 2015 Contributor

If I recall, False is supplied here because that's what the Python IO stack does when setting the encoding. I'd be careful making this change, it's possible it is just trading failures on one platform for failures on another.

I can't offer any specifics, I'm afraid, but I would strongly recommend (a) clearly understanding what difference switching from True (the default) to False makes, and (b) adding a comment clarifying the logic, so that we don't later get someone switching back to True because that works better on their platform.

@frol
frol Sep 11, 2015

I agree that a comment will be useful here, I was considering adding it. However, I believe that since Invoke was developed mostly on Linux, this is not a fix for any platform. I have found the first commit that introduced this behaviour: 2de58e9 BTW, it was your commit! Could you shed some light on it?

@pfmoore
pfmoore Sep 11, 2015 Contributor

I'm not sure what you mean by "this is not a fix for any platform". My point (as best I can recall it) in using False was simply to avoid messing with global state if it's not necessary.

But on reflection, it seems silly that the default behaviour would be the wrong thing to do. So I'm happy with this part of the change. I was probably over-thinking things in my original change...

@frol
frol Sep 12, 2015

@pfmoore I've seen some bug reports around this locale.getpreferredencoding thing, but I cannot think of any other way of getting a proper encoding. I would also prefer not to touch a global state... We can try using locale.getdefaultlocale(), but it reports (None, None) when I set LC_ALL=C, so we will probably need to combine these two for Python 2.x:

Linux UTF-8 locale:

$ python -c 'import locale; print(locale.getdefaultlocale())'
('en_US', 'UTF-8')
$ python -c 'import locale; print(locale.getpreferredencoding())'
UTF-8
$ python -c 'import locale; print(locale.getpreferredencoding(False))'
ANSI_X3.4-1968

Linux C (ASCII) locale:

$ python -c 'import locale; print(locale.getdefaultlocale())'
(None, None)
$ python -c 'import locale; print(locale.getpreferredencoding())'
ANSI_X3.4-1968
$ python -c 'import locale; print(locale.getpreferredencoding(False))'
ANSI_X3.4-1968

Mac OS:

salford:~ frol$ python -c 'import locale; print(locale.getdefaultlocale())'
('en_US', 'UTF-8')
salford:~ frol$ python -c 'import locale; print(locale.getpreferredencoding())'
UTF-8
salford:~ frol$ python -c 'import locale; print(locale.getpreferredencoding(False))'
US-ASCII

Windows cp1252 locale (bat and Cygwin sh):

$ python -c 'import locale; print(locale.getdefaultlocale())'
('en_US', 'cp1252')
$ python -c 'import locale; print(locale.getpreferredencoding())'
cp1252
$ python -c 'import locale; print(locale.getpreferredencoding(False))'
cp1252

I'll fix my PR based on this info.

@frol
frol Sep 12, 2015

NOTE: We have issues only in Python 2.x on Linux and Mac OS when uselocale is False.

NOTE: Python 3 works fine on all platforms!

@frol frol and 1 other commented on an outdated diff Sep 12, 2015
invoke/runners.py
@@ -216,7 +216,7 @@ def run(self, command, **kwargs):
err_stream = sys.stderr
# Echo
if opts['echo']:
- print("\033[1;37m{0}\033[0m".format(command))
+ print(u"\033[1;37m{0}\033[0m".format(command))
@frol
frol Sep 12, 2015

Windows is crazy: http://stackoverflow.com/questions/878972/windows-cmd-encoding-change-causes-python-crash/3259271
I'm still trying to get this work. As well as other bugs in this PR. It is not that easy as I thought at the beginning.

@pfmoore
pfmoore Sep 12, 2015 Contributor

Using UTF-8 in the Windows console is pretty buggy. You probably shouldn't do that.

There are better ways of getting full Unicode in the Windows console. Python doesn't use them by default (because doing so requires violating the Unix/C assumption that the console is just like any other file stream, and it's hard to do that in a compatible way). The 3rd party module win-unicode-console implements them in a nearly-transparent way, if you're interested.

@frol
frol commented Sep 15, 2015

@pfmoore @bitprophet I've just pushed a better improved version. Please, review it.

Tests pass for Python 2.6, 2.7, 3.3 and 3.4 on Linux, but it fails on 3.2. Please, advise on how to fix u"" Sytax Error. Do we need to support 3.2 at all?

This PR fixes a number of issues related to running non-ascii commands as well as handling non-ascii output.

NOTE: I've updated six.py for python_2_unicode_compatible decorator.

@bitprophet
Member

@frol - I seem to recall six having a u() function for use on 3.2 where there's no u literal - look there?

And thanks for poking this, sorry I haven't had time to dig into it. Once @pfmoore is satisfied I'll try to take a peek, he's my backstop for this general topic ;)

@pfmoore
Contributor
pfmoore commented Sep 16, 2015

@frol - This looks OK to me now. Thanks for updating. The unicode stuff in invoke is nasty stuff, and help in getting it right is much appreciated.

All that stuff switching to u"..." bugs me slightly, as I don't recall other projects needing to do this, but then again, invoke has a lot of low-level encoding hacking. As @bitprophet says, there's a u() function in six that should deal with the lack of u"..." strings in 3.2.

I'm slightly bothered about @python_2_unicode_compatible as the choice of UTF-8 for the encoding seems like it might be wrong on Windows. But that's a choice six has made, not you. And as I don't have time to do extensive testing of Python 2 on Windows, I'm happy to let it go for now.

So yeah, I'm OK with this.

@frol
frol commented Sep 16, 2015

@bitprophet @pfmoore My concern about u"" is mostly about the following points:

  1. Should we add from __future__ import unicode_literals and remove u prefixes? Django recommends to do so (https://docs.djangoproject.com/en/1.8/topics/python3/), but some people argue.
  2. Should we use a function six.u for Python 3.2 support? It makes code cumbersome and slows performance down.
  3. Should we give up on Python 3.2 support?

I personaly prefer options (1) and (3).

@bitprophet
Member

Amusingly, I ran into this myself while running the internal but rarely-used inv docs.tree task (which calls tree obviously). Bang, ascii complaints. Confirmed the current state of this branch fixes that particular instance on Python 2.6.

I also skimmed #242 and it reminded me about the 2.6 vs 2.7 issue; I confirmed also that under 2.7 on master, I get no exception but DO get garbage characters. This branch, under 2.7 - also looks clean.

Also mentioned in #242 is the issue of testing; I'm irritated that the old tests I wrote (which specifically include tree output as sentinel data) never caught this, and need to re-examine why :(


Re: unicode literals vs Python 3.2, another related discussion is http://python-future.org/unicode_literals.html . I don't have a strong opinion at this time; we waffle a lot on 3.2 support, there've been times when it was dropped and then added again.

Most of the good arguments against unicode_literals seem poised against applying it to an existing/stable codebase, which Invoke isn't yet. So I figure we could try that option first, it preserves 3.2 support and is also a bit cleaner syntax-wise.

@frol
frol commented Sep 18, 2015

@bitprophet All sounds good. I will prepare and test a patch with unicode_literals instead of u prefixes in a few days then.

@frol frol added a commit to frol/invoke that referenced this pull request Sep 18, 2015
@frol frol Replaced `u` prefixes with `from __future__ import unicode_literals` …
…as decided in PR #274
dbcb823
@frol
frol commented Sep 18, 2015

@bitprophet I had a couple of minutes right after I have wrote the previous comment, so the patch is already pushed. Travis is happy now!

@bitprophet
Member

Cool, thank you!

BTW, if you find time before I do to figure out why these integration tests were passing prior to your branch, and fix them so they fail without & pass with, that would be super great.

@frol
frol commented Sep 20, 2015

@bitprophet I guess, it is because of hide='both', isn't it?

@frol
frol commented Sep 22, 2015

@bitprophet I see you have pushed a number of commits recently. Could you please merge this PR (and probably some other PRs) while they can be merged successfully?

@frol
frol commented Sep 24, 2015

@bitprophet I have fixed the integration tests, so they will catch the encoding errors, but travis will fail on nonprintable_bytes_in_unicode_pty test without applying PR #275.

@frol frol added a commit to frol/invoke that referenced this pull request Sep 24, 2015
@frol frol Replaced `u` prefixes with `from __future__ import unicode_literals` …
…as decided in PR #274
702a012
@frol
frol commented Sep 24, 2015

I had to rebase my PR since integration/main.py could not get merged automatically.

@frol
frol commented Sep 24, 2015

Everything is tested and Travis is happy. Please, merge.

@bitprophet bitprophet added this to the 0.12 milestone Sep 25, 2015
@frol frol changed the title from Fixed an issue of `run` output encoding to Fixed encoding issues (in `run` context) Nov 19, 2015
@frol frol changed the title from Fixed encoding issues (in `run` context) to Fixed a number of encoding issues (in `run` context) Nov 19, 2015
@bitprophet bitprophet modified the milestone: 0.12 Dec 21, 2015
@bitprophet bitprophet added this to the 0.12.2 milestone Feb 4, 2016
@bitprophet
Member

This needs reconciliation with both recent master-merged changes and the notes from pfmoore in #289 re: centralizing and tweaking encoding stuff. Adding to next milestone so it doesn't get (re-)lost.

@bitprophet bitprophet modified the milestone: 0.12.2, 0.12.3 Feb 8, 2016
@frol
frol commented Feb 18, 2016

Just for your info:

  1. This PR is quite small if you don't look at the update of six module which was required for Python 2 & 3 compatibility support (I would not put the whole module into the code-base in the first place). The bug-fix itself is in invoke/runners.py and all weird places were documented there.
  2. I don't have time to redo my PR, so you can do whatever you want with it. It's a pity that the encoding issue was not fixed in half a year.
@pfmoore
Contributor
pfmoore commented Feb 18, 2016

@frol @bitprophet I'll pick this up and take a look at reconciling it if that's OK with everyone.

@frol
frol commented Feb 18, 2016

@pfmoore It is fine with me. Please, go ahead.

@pfmoore
Contributor
pfmoore commented Feb 18, 2016

@frol Just to clarify, could you please explain the issue you are getting, and how to reproduce it? If I follow the code and tests, it's related somehow to cases when the command is a bytestring (as opposed to a Unicode string), and when the command output is bytes not in the console encoding (e.g., ANSI escape sequences).

But I may be misunderstanding - so if you could explain how I can reproduce the issue, that'd be a great help.

From your initial description in this thread, you say you ran a program with your system set to POSIX locale, but the program output was Unicode (you don't clarify what encoding - I guess UTF8?) I'm not quite sure how that relates to the above?

@frol
frol commented Feb 19, 2016

@pfmoore First, just try to run tests (as usual and with LANG=C as it is added to .travis.yml) in my PR with reverted invoke/runners.py changes (BTW, you would better apply #275 PR, otherwise you will see that some tests go two, four, or even eight times!).

Having UTF-8 system encoding "hide" some crashes since it can handle a wide range of characters. To reproduce crashes, try "downgrading" your locale to C/POSIX/cp1252 (Windows default encoding for English-language installations), and run any command which will print non-latin characters:

import invoke
invoke.run('cat ./integration/_support/tree.out')
$ env LANG=C python bug_274.py
Traceback (most recent call last):
  File "/tmp/qq.py", line 3, in <module>
    invoke.run('cat /mnt/storage/www/invoke/integration/_support/tree.out')
  File "/tmp/env/lib/python2.7/site-packages/invoke/__init__.py", line 27, in run
    return Context().run(command, **kwargs)
  File "/tmp/env/lib/python2.7/site-packages/invoke/context.py", line 53, in run
    return runner_class(context=self).run(command, **kwargs)
  File "/tmp/env/lib/python2.7/site-packages/invoke/runners.py", line 280, in run
    raise ThreadException(exceptions)
invoke.exceptions.ThreadException:
Saw 1 exceptions within threads (UnicodeEncodeError):


Thread args: {'kwargs': {'buffer_': [],
            'hide': False,
            'output': <open file '<stdout>', mode 'w' at 0x7f55949ac150>},
 'target': <bound method Local.handle_stdout of <invoke.runners.Local object at 0x7f5590e04910>>}

Traceback (most recent call last):

  File "/tmp/env/lib/python2.7/site-packages/invoke/runners.py", line 861, in run
    super(_IOThread, self).run()

  File "/usr/lib64/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)

  File "/tmp/env/lib/python2.7/site-packages/invoke/runners.py", line 416, in handle_stdout
    indices=threading.local(),

  File "/tmp/env/lib/python2.7/site-packages/invoke/runners.py", line 386, in _handle_output
    output.write(data)

UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-26: ordinal not in range(128)

Even with UTF-8 locale the output is corrupted:

$ python bug_274.py
docs
��������� api
������� ��������� cli.rst
������� ��������� collection.rst
������� ��������� exceptions.rst
������� ��������� loader.rst
������� ��������� parser
������� ������� ��������� argument.rst
������� ������� ��������� context.rst
������� ������� ��������� parser.rst
������� ��������� parser.rst
������� ��������� runner.rst
������� ��������� tasks.rst
������� ��������� util.rst
��������� api.rst
��������� concepts
������� ��������� cli
������� ������� ��������� background.rst
������� ������� ��������� execution.rst
������� ������� ��������� intro.rst
������� ������� ��������� type_mapping.rst
������� ��������� cli.rst
������� ��������� execution.rst
������� ��������� loading.rst
������� ��������� namespaces.rst
��������� concepts.rst
��������� conf.py
��������� contributing.rst
��������� index.rst
��������� prior_art.rst

4 directories, 25 files

Fixed version:

python bug_274.py
docs
├── api
│   ├── cli.rst
│   ├── collection.rst
│   ├── exceptions.rst
│   ├── loader.rst
│   ├── parser
│   │   ├── argument.rst
│   │   ├── context.rst
│   │   └── parser.rst
│   ├── parser.rst
│   ├── runner.rst
│   ├── tasks.rst
│   └── util.rst
├── api.rst
├── concepts
│   ├── cli
│   │   ├── background.rst
│   │   ├── execution.rst
│   │   ├── intro.rst
│   │   └── type_mapping.rst
│   ├── cli.rst
│   ├── execution.rst
│   ├── loading.rst
│   └── namespaces.rst
├── concepts.rst
├── conf.py
├── contributing.rst
├── index.rst
└── prior_art.rst

4 directories, 25 files

Fixed version with LANG=C:

$ env LANG=C python bug_274.py
docs
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd api
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd cli.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd collection.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd exceptions.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd loader.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd parser
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd argument.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd context.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd parser.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd parser.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd runner.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd tasks.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd util.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd api.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd concepts
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd cli
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd background.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd execution.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd intro.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd type_mapping.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd cli.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd execution.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd loading.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd namespaces.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd concepts.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd conf.py
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd contributing.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd index.rst
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd prior_art.rst

4 directories, 25 files
@pfmoore
Contributor
pfmoore commented Feb 19, 2016

@frol OK, thanks. So just to help me understand (my environment is Windows, so I want to be clear on expected behaviour on Unix):

  • The file tree.out is UTF8, with escape sequences embedded.
  • Unix cat does a binary dump of the file to the console, without taking into account the console's set locale - so if the console is set to something other than UTF-8 you will see mojibake (but no error),
  • With a locale of UTF-8 you would expect invoke to show exactly what cat does when run directly.
  • With another locale you'd expect backslash-escaped Unicode, rather than mojibake.

OK. I'll need to set up a Unix VM to do some testing, but I think I see what you're after now. I'll take a look at how the latest invoke code fares, and see what needs to be changed from there. Then hopefully I'll be in a position to review and re-base your code changes :-)

@frol
frol commented Feb 19, 2016

@pfmoore Try type command on Windows instead of cat and you should reproduce the crash right away (I haven't tested it now, but I tested my PR on Windows, OS X, and Linux and I have been using this code for half a year now without a failure). Good luck, I hope your PR won't stuck like mine did.

@pfmoore
Contributor
pfmoore commented Feb 19, 2016

OK, I see the issue (handle_stdin is dreadfully confused over byte/string separation - it uses encode on a variable (which must therefore contain a string), and that variable is named bytes!!!). I'm sure I had a PR that tidied this up, but it looks like it was a casualty of the extensive work @bitprophet is doing in this area under #289 (I guess parts got applied manually, but not the whole).

@bitprophet what I'd like to do is to open a new issue specifically to get the string/bytes separation correct. I can then create a series of PRs to both document and address the issues once and for all. Would that be acceptable? If it's going to cause merge conflicts with work you're doing, I'd rather hold off - as you've already said you're still getting your head round the string/bytes stuff, so I don't want to leave you having to resolve conflicts with code that makes encoding assumptions that I haven't explained clearly enough for you to follow :-)

What I'd suggest then:

  1. Open an issue "Fix encoding and string/bytes separation problems"
  2. I'll start with a PR that just adds a document to the source explaining the issue and how I think we should address it.
  3. I'll then go through the code adding to the docstrings to formally state whether parameters are bytes or (unicode) strings. (I did this once already - it's not hard, but it got lost in the shuffle I think). This can be another PR, which again just documents stuff.
  4. We can then do some small, focused PRs actually fixing those bits of code that are wrong.
  5. While this is going on, any encoding issues that already exist (like this one) or new ones, can get linked to that master issue, and I'll write tests to make sure that whet they are reporting gets checked in future.

How does that sound?

BTW, @frol given where (I think) we've got to with the Unicode situation, I'm no longer sure I'm happy with your approach of wrapping the output stream. It's basically wrapping a "not quite sure if it's bytes or unicode" stream to make it a "definitely Unicode" stream. And I (now) think we should in fact be trapping all of the uncertainty about bytes vs unicode at the extreme edges of the program, not internally like this is doing. OTOH, I think we're a lot closer to that ideal now than we were when this patch was originally written, so at least there's progress :-)

@frol
frol commented Feb 19, 2016

@pfmoore

And I (now) think we should in fact be trapping all of the uncertainty about bytes vs unicode at the extreme edges of the program, not internally like this is doing.

You will need that wrap_output function anyway, wherever you put it doesn't really matter. sys.stdout in Python can automatically encode unicode type to bytes, but if it is not possible, it will raise an exception, so it is better to pass all unicode data through an explicitly defined encoder (with errors='backslashreplace'), which will write bytes to stdout (but due to the fact that Python 3 cannot accept bytes right to the sys.stdout.write(), you should get access to sys.stdout.buffer.write()). It is not a rare situation when the output from run command is in another encoding than the sys.stdout, and neither of the two encodings might be a system default.

@pfmoore
Contributor
pfmoore commented Feb 19, 2016

@frol agreed. But (IMO) the output wrapping is a question of "what do we want Invoke to do when it needs to print out arbitrary Unicode?" Python's answer is "encode it using the OS-defined encoding, error if not". Choosing to change that behaviour (wrapping sys.stdout) is one valid option. Another is "sanitise the Unicode so that it doesn't contain un-encodable characters" with

    enc = sys.getfilesystemencoding() # or whatever
    my_str.encode(enc, errors='backslashreplace').decode(enc)

Which is better depends on the application. (I won't comment on whether Python's default behaviour is ever the best approach - we live with what we're given on that one...)

On Unix, the behaviour of cat (mojibake) is another choice (but that choice involves choosing to make bytes your internal format rather than Unicode, which was basically the Python 2 choice). On Windows, it's a non-issue as the console supports Unicode natively (although you have to take extra steps to get Python to let you access that facility).

@frol
frol commented Feb 19, 2016

@pfmoore I chose to implement non-failing printing since none of the known to me tools like make/cmake/... has ever failed on printing something. Note that using str.encode(...).decode(...) will end up in stdout.write( run_output.read() (.decode(...)) (.encode(...).decode(...)) (.encode(...)) ), while using codecs writer will save time and memory not doing (.encode(...).decode(...)).

Please, also learn about default_encoding Python 2 & 3 workarounds for Linux & OS X: https://github.com/frol/invoke/blob/patch-1/invoke/runners.py#L512

@pfmoore
Contributor
pfmoore commented Feb 19, 2016

@frol Ta - agreed an encode/decode dance is not ideal (and I'm not necessarily recommending it, just deferring a decision for now). But the "have we got an encoding/a buffer" code in your wrapper is potentially an issue too (it's not str/bytes clean to sometimes use buffer and other times not).

Nothing's ideal.

Your default encoding workarounds aren't in master at the moment. If there's an issue they solve, they can be included either now or independently (they don't affect bytes/unicode separation directly). From the URL you gave I can't track back to a specific issue number, though.

Anyway, that's probably enough talk for now. I need @bitprophet to confirm he's OK with my suggestion before it's worth doing anything more.

@frol
frol commented Feb 19, 2016

@pfmoore

Your default encoding workarounds aren't in master at the moment.

Because all of those encoding workarounds are parts of this PR, which should have been merged a long time ago and later, if somebody wouldn't like the design (which I have not violated, as I have only patched the exact issue), it could have been refactored. Instead, we have this conversation while the issue is still there.

@pfmoore
Contributor
pfmoore commented Feb 19, 2016

Ah, sorry. The direct link didn't indicate that it was related to this PR. My bad.

You say in there:

Based on some experiments there is an issue with locale.getpreferredencoding(do_setlocale=False) in Python 2.x on Linux and OS X,

but you don't say what that issue is. Could you elaborate?

which should have been merged a long time ago

Well, that's not something I can affect - @bitprophet only has so much time to work on invoke, I guess. All I can offer is to look at this and prepare a new patch if I can understand your issue(s). (Given that you don't have the time to rework this PR so that it applies to the current codebase). So far it seems we have two, possibly three:

  1. If an invoked command produces Unicode output that can't be printed with the default sys.stdout encoding, we get a traceback. You suggest backslash encoded hex for the invalid characters.
  2. The above unspecified issue with getpreferredencoding.
  3. (Possibly) the implied issue from your cat example that an invoked subprocess produces bytes that we can't decode into valid Unicode using the preferred encoding.

Note that (1) and (3) combined could be taken as saying "why don't we just take the bytes from the subprocess and dump them straight out to the console unchanged? And indeed that is precisely what Python 2 originally did - but it hits all sorts of issues as soon as you try to interact with that byte stream as if it were a string (for example, inspecting the result like Invoke allows you to do) unless the encodings are all set correctly. (The latter is why, in many cases, Python 2 "works" - but ask someone Japanese if you want to hear about encoding problems!) In Python 3, however, the issue is forced - either you write all your internal interactions with the byte stream as bytes (which is a dreadful UI) or you encode/decode - and the only way of doing that while keeping your sanity (and standing any chance at all of knowing the right encoding) is to decode to Unicode as early as possible, and re-encode to bytes as late as you can. You can transport a byte stream cleanly via Unicode-based code, but you need to use the "surrogateescape" error handler, which is a Python 3 only solution.

To summarise, unless we want to drop support for Python 3, or force users to deal with things like Result.stdout being a bytes object rather than a string, then you need to use Unicode exclusively in the internals and decode as soon as you can and encode as late as you can, to push bytes out to be solely a low-level communication detail.

But this is the essay I need to write for that document I mentioned in my plan above. And I'm not ready to do that yet :-)

@frol
frol commented Feb 19, 2016

@pfmoore

but you don't say what that issue is. Could you elaborate?

Please, read my second comment on this issue (I have just slightly updated it).

two, possibly three:

In fact, four:
4. --echo=True switch leads to a crash if a run command contains non-supported characters (e.g. \xFF cannot be encoded to UTF-8):

import invoke
invoke.run(u'echo \xFF', echo=True)
$ python bug_274_2.py
Traceback (most recent call last):
  File "/tmp/qq.py", line 3, in <module>
    invoke.run(u'echo \xFF', echo=True)
  File "/tmp/env/lib/python2.7/site-packages/invoke/__init__.py", line 27, in run
    return Context().run(command, **kwargs)
  File "/tmp/env/lib/python2.7/site-packages/invoke/context.py", line 53, in run
    return runner_class(context=self).run(command, **kwargs)
  File "/tmp/env/lib/python2.7/site-packages/invoke/runners.py", line 223, in run
    print("\033[1;37m{0}\033[0m".format(command))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 5: ordinal not in range(128)

... decode to Unicode as early as possible, and re-encode to bytes as late as you can.

This is exactly what is done in this PR.

@pfmoore
Contributor
pfmoore commented Feb 19, 2016

OK. Regarding the encoding one, I don't have the Linux knowledge to comment.

Regarding invoke.run(u'echo \xFF', echo=True) that's weird. It's trying to print the command as if your terminal is ASCII. I really don't have the Linux knowledge to comment, but it seems to me that your configuration is broken. What happens if you run the following command in Python?

print(u'echo \xff')

If it doesn't work, your Python setup is apparently unable to print the Unicode character "LATIN SMALL LETTER Y WITH DIAERESIS". Which makes no sense because you say your terminal is set to use UTF-8. If your console was set to use ASCII, then I'd say that Python's (and as a consequence invoke's) behaviour is a little unfriendly, but valid. We can (if we choose) do better in invoke, but something's still lying to Python about your terminal's capabilities.

If the above does work, invoke is somehow corrupting the configuration of sys.stdout - which I doubt.

Sigh. This is why I hate Python 2. Let's just drop support for it and all move to Python 3 ;-)

@frol
frol commented Feb 19, 2016

@pfmoore Everything is fine with my Linux environment; it is just a broken implementation in invoke, here what happens:

>>> print(b"{0}".format(u"echo \xff"))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 5: ordinal not in range(128)

You may say, "OK, let's use u"{0}"", but then:

$ env LANG=C python
>>> print(u"{0}".format(u"echo \xff"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 5: ordinal not in range(128)

BTW, in the "usual" cases everything is "fine" indeed:

$ python
>>> print(u"echo \xFF")
echo ÿ
>>> print("echo \xFF")
echo �
@pfmoore
Contributor
pfmoore commented Feb 19, 2016

Well, your first example is obviously wrong - b"...".format(u"...") will encode the Unicode string using ASCII, that's as defined.

Your second example sets the locale to C, so again, yea, of course it only likes ASCII.

Your third example is what I'd expect.

Your fourth shows why Python 2 is awful :-) What it is doing is printing the byte stream b"echo \xFF" (because unqualified string literals in Python 2 are bytes) to sys.stdout. Presumably because we're dumping bytes to the notionally text stream, Python 2 just ignores the encoding, dumps out raw bytes, which appear as mojibake as you can see.

It's quite possible that your first and fourth example indicate what the problem is in invoke - and the solution is quite likely to use from __future__ import unicode_literals. But that's a pretty large change, so I'll need to think about it before proposing it in an actual PR...

@frol
frol commented Feb 19, 2016

Well, your first example is obviously wrong - b"...".format(u"...") will encode the Unicode string using ASCII, that's as defined.

And if you have somehow missed this, it is how it is implemented in invoke! https://github.com/pyinvoke/invoke/blob/master/invoke/runners.py#L223

the solution is quite likely to use from __future__ import unicode_literals. But that's a pretty large change, so I'll need to think about it before proposing it in an actual PR...

And if you would read the comments above, you will find that this was discussed before I made such a move!

@thebjorn

In Python 3, however, the issue is forced - either you write all your internal interactions with the byte stream as bytes (which is a dreadful UI) or you encode/decode - and the only way of doing that while keeping your sanity (and standing any chance at all of knowing the right encoding) is to decode to Unicode as early as possible, and re-encode to bytes as late as you can.

Do you think it is possible to decode the output of every command that invoke can call? Perhaps Result.stdout needs to keep the raw data (in addition to attempting a decode..)? Forcing Unicode decoding/encoding errors on unsuspecting users "is bad"(tm).

@pfmoore
Contributor
pfmoore commented Feb 22, 2016

Do you think it is possible to decode the output of every command that invoke can call?

tl;dr - We have to decode everything to meet the contracted API for invoke. That's hard, but not optional. IMO.

This is a difficult question to answer. The Unix and Windows philosophies are very different here, so I'll try to give a platform-neutral answer, but I am aware that there are nuances in each case (and I'd ask that anyone reading this also be careful to remember that Unix isn't the only environment here)

First of all, we need to know what the data returned by any command invoke runs means, because:

  • we send that data to the (text-based) sys.stdout stream
  • we make it available as Result.stdout for users to work with, and Result.stdout is defined as text ("The subprocess’ standard output, as a multiline string")

In particular, Result.stdout is a defined API, so we've got a contract with the user that we can't change without backward compatibility issues.

OK, so we have to be able to interpret the data returned by a command as text. Or we have to know that it's not intended as text (and in that case, we currently have no way in the API of dealing with that properly). To interpret the data as text we need to know the encoding. There is no encoding supplied with the data, which is a fundamental flaw in the stdout interface (but that interface goes back to the days when Unicode was unheard of, so it's not surprising). So we need to work out the encoding using other data.

Most of the time, systems are configured with a known encoding that we can assume is used by commands. On Windows, that's pretty reliable, as it's defined by the OS and the C runtime, so we really only need to watch out for programs that don't use the C runtime and don't follow its conventions. On Unix, it's defined by the user's environment variables - but there are then C runtime and OS APIs that interpret that data (and those have their own issues, such as getpreferredencoding not being thread safe). And also, use of the defined encoding is only a convention anyway - programs can do whatever they like should they wish to.

So it's a mess.

The solutions we have are:

  1. Assume programs follow the normal conventions. This covers most cases.
  2. Provide an encoding parameter which users can use to explicitly override the default. This solves specific cases where a program is known to behave differently, at the cost of the user having to be explicit.

The problem cases are systems where the "normal conventions" aren't working as expected. I'm being very careful not to describe such systems as "broken", because this is an area where pointing the finger of blame is particularly unhelpful. But my assumption is that the systems affected (i.e. cases that aren't covered by the 2 approaches above) are relatively few and unusual. That is something we need to confirm, though.

For those systems, we have the following options:

  1. Fix the systems themselves. It's possible the problem is a broken configuration. The problem here is that there are some very weird setups out there, and "works for me" is a reasonable approach to take - so expecting people to modify a fragile but working setup just to cater for one program isn't an easy sell.
  2. Fix the standard processing. If the setup is sensible, maybe the core Python processing is at fault? Of course, because we support old versions of Python (2.7 in particular) we'd still need to have our own backported workaround until we drop support for those versions, but at least we can track the core fix rather than developing and maintaining our own.
  3. Add special-case code into invoke for that situation. That's a pain, as precisely because it's an unusual situation, and it's about the user's environment rather than invoke's own code, it's going to be very hard to ensure that we don't break it again with future changes.
  4. Add "back door" workarounds that users can use to avoid the problem (such as saying that sometimes Result.stdout can be broken, but we offer Result.stdout_bytes which provides the raw data). I don't like this approach as it makes this the user's problem, and it breaks the API guarantee on the current API.

My preference is to make sure the code works reliably for all "normal" cases, and degrades gracefully for odd situations. I do not want the user to ever have to work around this issue in their own code, other than by explicitly stating the correct encoding to use. But having said that I am OK with us giving errors (but ones we choose, not raw Python encode/decode exceptions) if we find that we can't properly determine the encoding on the user's behalf.

Of course, it's not clear to me how much work there is in getting to the "ideal world" situation I describe above, and while I am offering to do that work, until it's done people will still have this type of issue to deal with. And this is all just my opinion anyway - ultimately it needs @bitprophet to say if he agrees with this approach, or if he prefers something else.

frol added some commits Apr 10, 2016
@frol frol Fixed a number of encoding issues (in `run` context)
5c8fc56
@frol frol Fixed encoding test so it includes Python2+Unix quirks 5847a5b
@frol frol Fixed tests.runners.exceptions_get_logged so mock_debug doesn't have …
…u'' prefix, which was the case because of unicode_literals
d95b2f0
@frol frol Integrated PR #275 due to the failing tests without it
2b5cb1b
@codecov-io

Current coverage is 92.50%

Merging #274 into master will decrease coverage by -0.95% as of 01fb163

@@            master   #274   diff @@
=====================================
  Files           20     20       
  Stmts         1773   1802    +29
  Branches       305    315    +10
  Methods          0      0       
=====================================
+ Hit           1657   1667    +10
- Partial         34     35     +1
- Missed          82    100    +18

Review entire Coverage Diff as of 01fb163

Powered by Codecov. Updated on successful CI builds.

@frol
frol commented Apr 11, 2016

For those who care, I have rebased and merged together this and #275 PRs. All tests are green.

@bitprophet bitprophet added a commit that referenced this pull request Jun 3, 2016
@bitprophet bitprophet Switch default_encoding to use a cleaned-up variant of #274 code
This brings us back to using `getpreferredencoding(False)` but the addition of preferring `getdefaultlocale` presumably catches any cases where the former was giving bad info.
8440f38
@bitprophet bitprophet closed this Jul 25, 2016
@bitprophet bitprophet modified the milestone: 0.15.x Feb 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment