subprocess seems to use local encoding and give no choice #50385
Comments
When I start a process with subprocess.Popen() and pipe its stdin and stdout, the pipes seem to use the local encoding, with no way to change it. I tried setting process.stdin.encoding = "utf8" (and the same for stdout), and I also tried using shell=True, since on Mac, Terminal.app is fine with UTF-8. So basically, I have programs that output Unicode and run fine in the terminal, but not when their output is piped. I wish it were possible to specify the stdin and stdout encoding that is used. |
Related discussion thread: https://answers.launchpad.net/bzr/+question/63601 |
I propose to add two parameters (encoding, error) to the
I'll try to come up with a patch. |
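The two parameters proposed here did eventually ship in Python 3.6, so a minimal sketch of the resulting API is runnable today. The child command below is purely illustrative: it writes known UTF-8 bytes so the decoding result is deterministic.

```python
import subprocess
import sys

# With encoding/errors passed to Popen (Python 3.6+), the pipes are
# wrapped in text mode and communicate() yields str instead of bytes.
proc = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stdout.buffer.write('caf\\u00e9'.encode('utf-8'))"],
    stdout=subprocess.PIPE,
    encoding="utf-8",
    errors="strict",
)
out, _ = proc.communicate()
# out is the str 'café', decoded by the parent, not raw bytes
```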
I wrote a patch to add encoding and error to subprocess.Popen in Python |
Cosmetic update. |
Two things:
|
Should we also cover the unusual case where stdout, stderr and stdin |
I agree with Florian Mayer that the encoding handling should be configurable. One solution builds on what Amaury Forgeot d'Arc has done (i.e., his patch). |
I don't understand. How is the subprocess stdout related to the main |
On Thu, Dec 31, 2009 at 1:30 PM, Amaury Forgeot d'Arc
Yes, you're right. (What I had in mind was a scenario where you read one process's stdout and feed it to another process's stdin.) |
Ran the new unit test before and after patching subprocess on Windows Vista against a 3.1 debug maintenance release; all OK apart from this at the end of the latter:

File "test\test_subprocess.py", line 568, in test_encoded_stderr

I'm sure I've seen a ref to this somewhere, can anyone remember where? |
In 2.7 and py3k, test_subprocess has a BaseTestCase class with an assertStderrEqual method. These don't exist in 3.1. I believe that the py3k code will need to be backported to 3.1. Can this be done on this issue, or do we need a new one to keep things clean? |
About the topic:
I don't see how you can get unicode from a process using only subprocess. But with Python 3, you can get unicode if you set the universal_newlines option to True. So for Python 2, it's a new feature (get unicode), and for Python 3, it's a new option to specify the encoding. The title should be changed to something like "subprocess: add an option to specify stdin, stdout and/or stderr encoding and errors" and the type should be changed to "feature request". Or am I completely wrong? |
The locale encoding is not necessarily an 8-bit encoding; it can be a multibyte one like... UTF-8 :-) -- subprocess.patch: you should maybe use io.open(process.stdout.fileno(), encoding=..., errors=...) instead of codecs.getreader/getwriter. The code will be closer to Python 3. I think that the io module has better Unicode support than the codec reader/writer objects, and a nicer API. See: --
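A sketch of the io-based wrapping suggested above, which works on any Python 3 version without new Popen parameters. For a self-contained example, the pipe object proc.stdout is wrapped directly with io.TextIOWrapper rather than reopening its fileno; the child command is illustrative and writes known UTF-8 bytes.

```python
import io
import subprocess
import sys

proc = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stdout.buffer.write('ol\\u00e9'.encode('utf-8'))"],
    stdout=subprocess.PIPE,
)
# Layer a text wrapper over the binary pipe, choosing the encoding ourselves.
reader = io.TextIOWrapper(proc.stdout, encoding="utf-8", errors="strict")
text = reader.read()
reader.close()
proc.wait()
```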
I like this idea. But what about other TextIOWrapper (or other file class) options: buffer size, newline, line_buffering, etc.? Why not use a dict for the existing stdin, stdout and stderr arguments? Dummy example:

process = Popen(
command,
stdin={'file': PIPE, 'encoding': 'iso-8859-1', 'newline': False},
stdout={'file': PIPE, 'encoding': 'utf-8', 'buffering': 0, 'line_buffering': False},
...)

If stdin, stdout or stderr is a dict, the default value of its 'file' key can be set to PIPE. I don't think that it's possible to choose the encoding, buffer size, or anything else if stdin, stdout or stderr is not a pipe. With this solution, you cannot specify the encoding for stdin, stdout and stderr at once; you have at least to repeat the encoding for stdin and stdout (and use stderr=STDOUT). -- I still hesitate to accept this feature request. Is it really needed to add extra arguments for TextIOWrapper? Can't the developer create their own TextIOWrapper object with all the interesting options? In Python 3, being able to specify the stdout encoding is an interesting feature. Controlling the buffer size is also an interesting option. My problem is maybe the usage of a dict to specify various options; I'm not sure that it is extensible to support future needs. |
I discovered this same problem recently when updating the subprocess docs, and also in working on the improved shell invocation support I am proposing for 3.3 (bpo-13238). I initially posted an earlier variant of this suggestion as a new issue (bpo-13442), but Victor redirected me here. Firstly, I don't think it makes any sense to set encoding information globally for the Popen object. As a simple example, consider using Python to write a test suite for the iconv command line tool: there's only one Popen instance (for the iconv call), but different encodings for stdin and stdout. Really, we want to be able to make full use of Python 3's layered I/O model, but we want the subprocess pipe instances to be slotted in at the lowest layer rather than creating them ourselves. The easiest way to do that is to have a separate class that specifies the additional options for pipe creation and does the wrapping:

class TextPipe:
def __init__(self, *args, **kwds):
self.args = args
self.kwds = kwds
def wrap_pipe(self, pipe):
        return io.TextIOWrapper(pipe, *self.args, **self.kwds)

The stream creation process would then include a new "wrap = getattr(stream_arg, 'wrap_pipe', None)" check that is similar to the existing check for subprocess.PIPE, but invokes the method to wrap the pipe after creating it. So to read UTF-8 encoded data from a subprocess, you could just do:

data = check_stdout(cmd, stdout=TextPipe('utf-8'), stderr=STDOUT) |
Isn't that the exception rather than the rule? Perhaps the common case should be made extra easy. |
Is subprocess affected by PYTHONIOENCODING? |
Yes, as any Python process. |
So the users can control the encoding, and this is a doc bug. |
If you decide this is only a doc bug, please see also related bpo-12832. |
Not really. People can control the encoding in the child process (and only if it's a Python 3 process of course). |
FWIW, I recently encountered a scenario (albeit in a test situation) where the ability to set different encodings for stdout and stderr would have been useful to me. It was while creating a test case for bpo-15595. I was changing the locale encoding for stdout, but I also wanted to leave it unchanged for stderr because there didn't seem to be a way to control the encoding that the child used for stderr. |
To my previous comment, bpo-15648 shows the case where I was able to change the encoding for stdout in the child process but not stderr (which would require supporting two encodings in Popen to handle). |
I've found a workaround by specifying the environment variable:

my_env = os.environ.copy()
my_env['PYTHONIOENCODING'] = 'utf-8'
p = subprocess.Popen(shlex.split(command), stdout=subprocess.PIPE, stderr=subprocess.PIPE, stdin=subprocess.PIPE, env=my_env)

I've attached an example script for testing. It calls itself recursively 10 times. |
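A self-contained version of this workaround. Note that PYTHONIOENCODING only helps when the child is itself a Python 3 interpreter: it sets the encoding the *child* uses for its standard streams, regardless of whether they are consoles or pipes.

```python
import os
import subprocess
import sys

# Copy the environment so the parent's os.environ is left untouched,
# then force the child Python's stream encoding via PYTHONIOENCODING.
env = os.environ.copy()
env["PYTHONIOENCODING"] = "utf-8"

out = subprocess.check_output(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env,
)
# The child reports utf-8 even though its stdout is a pipe.
```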
This seems like a request for extra feature(s), rather than a bug report. FWIW I agree with Victor’s hesitation in <https://bugs.python.org/issue6135#msg123024>. I doubt it is worth adding special support to have multiple pipes that all use text mode (universal_newlines=True) but with different encoding settings. If you really want this, just add an external TextIOWrapper, or encode the strings yourself:

process = Popen(command, stdin=PIPE, stdout=PIPE, ...)
# Manually wrap with TextIOWrapper. Probably risks deadlocking in the general case, due to buffering, so this is a bad idea IMO.
input_writer = io.TextIOWrapper(process.stdin, "iso-8859-1")
output_reader = io.TextIOWrapper(process.stdout, "utf-8")
# Better: use communicate(), or a custom select() loop or similar:
input_string = input_string.encode("iso-8859-1")
[output_string, _] = process.communicate(input_string)
output_string = output_string.decode("utf-8")

Also I’m not enthusiastic about adding encoding etc. parameters for a single-PIPE scenario. What is wrong with leaving it in binary mode in the Popen call, and encoding, decoding, or using TextIOWrapper as a separate step? |
Martin Panter added the comment:
Right. But if encoding or errors is given, the universal_newlines value should not matter: text mode is implied. For example, I expect Unicode when writing:

output = subprocess.run(cmd, stdout=subprocess.PIPE, encoding='utf8').stdout |
Nick Coghlan added the comment:
If it's just text=True, I don't see the point of having two options
Can you give an example? I don't see how the API would be used. |
Given specifying an encoding will do the same thing as universal_newlines would have, should I just "hide" references to universal_newlines in the doc? (i.e. only mention it under a versionchanged banner, rather than front-and-centre) |
Ah, excellent point about "encoding='utf-8'" already being a less cryptic replacement for universal_newlines=True, so consider my questions withdrawn :) As far as universal_newlines goes, that needs to remain documented for the sake of folks reading code that uses it, and folks that need compatibility with older versions, but +1 for recommending specifying an encoding over using that flag for new 3.6+ code. |
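The equivalence noted above can be checked directly. The two calls only coincide here because the output is ASCII: universal_newlines decodes with the locale encoding, while encoding= is explicit, so for non-ASCII output they can differ.

```python
import subprocess
import sys

cmd = [sys.executable, "-c", "print('hello')"]

# Legacy spelling: text mode using the locale encoding.
legacy = subprocess.check_output(cmd, universal_newlines=True)

# 3.6+ spelling recommended above: explicit encoding, same text-mode wrapping.
modern = subprocess.check_output(cmd, encoding="utf-8")

# Both return str with newlines translated to '\n'.
assert legacy == modern == "hello\n"
```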
Steve Dower added the comment:
No. Don't hide universal_newlines, it's different than encoding. |
Well, some cases, and only really by accident. I won't hide the parameter, but it is promoted as "frequently used" and I'll make sure to document encoding before universal_newlines. Call it a very weak deprecation. The new patch also removes the openers, which I think were solving a problem we don't have right now. |
Addressed more feedback. |
6135_4.patch: Hum, what if you only set errors? I suggest to use TextIOWrapper but use the locale encoding: if (universal_newlines or errors) and not encoding:
encoding = getpreferredencoding(False) |
Sure, that's easy enough. Any other concerns once I've made that change? |
Steve Dower added the comment:
If errors enables Unicode and the doc is updated, the patch will be LGTM :-) |
Why do you need to call getpreferredencoding()? Isn't that already the default if you call TextIOWrapper with encoding as None? For example:

text_mode = encoding or errors or universal_newlines
self.stdin = io.open(p2cwrite, 'wb', bufsize)
if text_mode:
self.stdin = io.TextIOWrapper(self.stdin, write_through=True,
line_buffering=(bufsize == 1),
encoding=encoding, errors=errors) |
I proposed to do that, but I prefer your simple flag:
Steve: please use this simpler flag to avoid TextIOWrapper details in subprocess.py ;-) |
The TextIOWrapper details are already specified in multiple places in the documentation. Should we remove all of those and write it more like: "if *encoding*, *errors* or *universal_newlines* are specified, the streams will be opened in text mode using TextIOWrapper." That way we make the class part of the interface, but not the interpretation of the arguments. (Currently it's a big mess of encodings and newline transformations.) |
More doc updates - shouldn't be a difficult review this time, but I always like getting multiple opinions on doc changes. |
Maybe good to adjust the other mentions of universal_newlines, e.g. for check_output(). The Posix version of the multiple-pipe _communicate() method probably needs adjusting too. Test case:

>>> check_output(("true",), encoding="ascii", input="") # Should be text
b''
>>> check_output(("true",), encoding="ascii") # Correct result
'' |
New changeset 720f0cf580e2 by Steve Dower in branch 'default': |
I also added more tests, specifically for the ANSI and OEM encodings on Windows and the errors argument on all platforms. |
Still can't specify the encoding for getstatusoutput and getoutput. Also, they use the wrong encoding by default: most of the time the right choice would be the console's codepage, instead of the ANSI codepage which is used now. |
And looks like only way to specify console's codepage is with |
That's true, but it only applies when the subprocess is using the same call (GetConsoleCP) to determine what encoding to use to write to something *other* than the console, which is incorrect. The defaults in this case ought to be ACP, but it depends entirely on the target application and there is no good "99%" default.

To see this demonstrated, check the difference between these two commands (with 3.5 or earlier): python -c "import sys; print(sys.stdin.encoding)"

When stdin is redirected, the default encoding is not the console encoding (unless it happens to match your ACP). And ACP is not the correct encoding for a file anyway - it just happens to be the default for TextIOWrapper - so to do it properly you need to detect that it's not a console (sys.stdin.isatty()) and then use the raw stream rather than the default wrapper. (Or you specify that your program requires ACP files, or that PYTHONIOENCODING should be set, or that it requires any other encoding and you explicitly switch to that one.)

In short, character encoding is hard, and the only way to get it right is for the developer to be aware of it. No good defaults exist. |
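A hedged sketch of the isatty() detection described above. read_stdin_text is a hypothetical helper, not part of any stdlib API: it trusts the default wrapper only for a real console, and for a redirected stream drops to the raw buffer and decodes with an explicitly chosen encoding.

```python
import sys

def read_stdin_text(encoding="utf-8"):
    """Read all of stdin as text (illustrative helper, not a stdlib API).

    On a real console, the default text wrapper's encoding is trustworthy.
    On a redirected stream (pipe or file), the default is just a guess, so
    read raw bytes and decode with an encoding the caller chose deliberately.
    """
    if sys.stdin.isatty():
        return sys.stdin.read()
    return sys.stdin.buffer.read().decode(encoding)
```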
The commit for this bug (720f0cf580e2) introduces encoding and errors arguments but doesn't actually document what the values of these should be. In the case of the encoding it could be reasonably guessed, but the only way to determine what the value of 'errors' should be is to make the logical leap and look at the TextIOWrapper doc page.

And in reply to msg274510, there certainly should be a 'text=True' argument added. There are countless use cases where text rather than bytes is the expected behaviour, and currently this has to be triggered by using a more obscure option. In any case, it could be argued that text should be returned when shell=True since this mimics the behaviour of any shell program.

(Sorry if adding a comment to a closed bug is poor etiquette; it seemed like the best place to put this comment) |
Do you mean it doesn't document that they take the normal encoding/errors arguments that all the other functions that do character encoding take? Or they don't specify which encoding you should use? The latter is not feasible, and if you mean the former then feel free to open a new issue for documentation. Having a single "text" parameter is essentially what we had before - a magic option that modifies your data without giving you any ability to do it correctly. This change brought Popen in line with all the other places where we translate OS-level bytes into Python str, and it's done it in exactly the same way. Also see msg274562 where Nick withdraws his own suggestion of a "text" parameter. |
I meant the former; I'll look a bit more at the documentation and submit an issue/patch. As regards the 'text' flag - universal_newlines is actually exactly that already. I've just checked the code of subprocess.py and the universal_newlines argument is read in only two places:
So subprocess *already* has a text mode and a 'magic option' to trigger it. It works well, and is useful in most cases. When the default encoding guess is incorrect, it can easily be corrected by supplying the correct encoding. This is a good situation! What is not so good is the API. I'm teaching a Python course for scientists at the moment. Retrieving text from external processes is an extremely common use case. I would rather not teach them to just have to use 'encoding=utf-8', because arguably teaching a user to supply an encoding without knowing if it's correct is worse than the system guessing. Equally, teaching 'universal_newlines=True' is a bit obscure. The way forward seems to be:
|
This discussion should probably be submitted as a new RFE (requesting "text" as a more obvious and beginner friendly alias for universal_newlines), but I'll also add a note regarding precedent for a simple binary/text toggle: the mode settings for open(). There, the default is text (with the default encoding determined by the system), and you have to include "b" in the mode settings to say "I want raw binary". For the subprocess APIs, the default is different (i.e. binary), so the natural counterpart to the "b" mode flag in open() would be an explicit "text" mode flag. |
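The alias proposed here did land, as text=True in Python 3.7 (via bpo-31756). For example:

```python
import subprocess
import sys

# text=True (Python 3.7+) is the beginner-friendly alias for
# universal_newlines=True: the streams are opened in text mode
# using the locale encoding, mirroring open()'s default.
result = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    stdout=subprocess.PIPE,
    text=True,  # on 3.6, spell this universal_newlines=True
)
# result.stdout is the str 'ok\n', not bytes
```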
RFE submitted as bpo-31756 , thanks |