Add encoding & errors parameters to TextIOWrapper.reconfigure() #59421
Comments
As discussed on python-ideas, swapping out the sys.std* streams can be fraught with peril. This can make it difficult to write code that wants to support an "--encoding" option for pipeline processing. The proposal [1] is that TextIOWrapper support a set_encoding() method that is only usable between creation of the stream and the first read or write operation. Once the stream is "dirty", you can no longer change the encoding - you must replace the stream (e.g. by passing stream.fileno() to open()) [1] http://mail.python.org/pipermail/python-ideas/2012-June/015535.html |
IMHO "encoding", "errors", "buffering", "line_buffering" and "newline" should be writable properties (at least until the first read/write). |
That will be fragile. A bit of premature input or output (for whatever reason) and your program breaks. |
Indeed. However, the current alternatives (based on detach() and fileno()) are also problematic - using detach() breaks the corresponding sys.__std*__ entry, while using fileno() means you now have two independent IO stacks using the same underlying descriptor. My own preference is to have a set_encoding() method that *can* be used even if IO has already occurred. Yes, there's the chance of mixing encodings if data is processed before the encoding is changed, but that's also true for both of the alternatives. Pipeline applications written in Python should have an easy way to implement "--encoding" parameters for the standard IO streams, and a method to update the settings is the obvious choice. |
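For illustration, here is a minimal sketch of the detach()-based rewrap discussed above, using an in-memory BytesIO in place of the real sys.stdout; it shows how any second reference to the old wrapper (playing the role of sys.__std*__) becomes unusable after detach():

```python
import io

# Stand-in for sys.stdout: a text wrapper over an in-memory buffer.
raw = io.BytesIO()
old_stream = io.TextIOWrapper(raw, encoding="utf-8")
alias = old_stream  # plays the role of sys.__stdout__

# Rewrap the same underlying buffer with a new encoding via detach().
new_stream = io.TextIOWrapper(old_stream.detach(), encoding="latin-1")
new_stream.write("é")
new_stream.flush()
assert raw.getvalue() == b"\xe9"

# The old reference is now broken: its buffer has been detached.
try:
    alias.write("x")
except ValueError:
    pass  # "underlying buffer has been detached"
```

With the real standard streams the broken `alias` would be sys.__stdout__, which is exactly the problem described above.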
Would a "set_encoding" method be Pythonic? I would've preferred an "encoding" property which flushes the output when it's changed. |
I would prefer to have a method. The side-effect is too violent to be |
I fail to see why this is a release blocker; no rationale is given in the original message, nor in the quoted message. So unblocking. |
Here is a Python implementation of TextIOWrapper.set_encoding(). The main limitation is that it is not possible to set the encoding on a non-seekable stream after the first read (if the read buffer is not empty, i.e. if there are pending decoded characters). + # flush read buffer, may require to seek backward in the underlying -- I don't have a use case for setting the encoding of sys.stdout/stderr after startup, but I would like to be able to control the *error handler* after startup! My implementation allows changing either or both (encoding, errors, or encoding and errors). For example, Lib/test/regrtest.py uses the following function to force the backslashreplace error handler on sys.stdout. It changes the error handler to avoid UnicodeEncodeError when displaying the result of tests. def replace_stdout():
"""Set stdout encoder error handler to backslashreplace (as stderr error
handler) to avoid UnicodeEncodeError when printing a traceback"""
import atexit
stdout = sys.stdout
sys.stdout = open(stdout.fileno(), 'w',
encoding=stdout.encoding,
errors="backslashreplace",
closefd=False,
newline='\n')
def restore_stdout():
sys.stdout.close()
sys.stdout = stdout
atexit.register(restore_stdout)

The doctest module uses another trick to change the error handler:

save_stdout = sys.stdout
if out is None:
encoding = save_stdout.encoding
if encoding is None or encoding.lower() == 'utf-8':
out = save_stdout.write
else:
# Use backslashreplace error handling on write
def out(s):
s = str(s.encode(encoding, 'backslashreplace'), encoding)
save_stdout.write(s)
sys.stdout = self._fakeout |
I agree that it is not the most pure solution, but sometimes practicality beats purity (rule #9) ;-) We can add an ugly big red warning in the doc ;-) |
We may also add a set_errors() method:

def set_errors(self, errors):
self.set_encoding(self.encoding, errors) |
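For reference, the error-handler use case is covered by the reconfigure() API that eventually shipped in Python 3.7; a minimal sketch with an in-memory stream standing in for sys.stdout:

```python
import io

raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="ascii", errors="strict")
stream.write("plain ")

# reconfigure() flushes any pending output, then switches the
# error handler for subsequent writes.
stream.reconfigure(errors="backslashreplace")
stream.write("é")
stream.flush()
assert raw.getvalue() == b"plain \\xe9"
```

This is exactly the regrtest scenario above (forcing backslashreplace on stdout) without rebuilding the stream around fileno().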
The reason I marked this as a release blocker for 3.4 is because it's a key piece of functionality for writing command line apps which accept an encoding argument. I'll use "high" instead. An interesting proposal was posted to the python-dev thread [1]: using self.detach() and self.__init__() to reinitialise the wrapper *in-place*. With that approach, the pure Python version of set_encoding() would look something like this:

_sentinel = object()
def set_encoding(self, encoding=_sentinel, errors=_sentinel):
if encoding is _sentinel:
encoding = self.encoding
if errors is _sentinel:
errors = self.errors
self.__init__(self.detach(),
encoding, errors,
self._line_buffering,
self._readnl,
self._write_through)

(The pure Python version currently has no self._write_through attribute - see bpo-15571) Note that this approach addresses my main concern with the use of detach() for this: because the wrapper is reinitialised in place, old references (such as the sys.__std*__ attributes) will also see the change. Yes, such a function would need a nice clear warning to say "Calling this may cause data loss or corruption if used without due care and attention", but it should *work*. (Without automatic garbage collection, the C implementation would need an explicit internal "reinitialise" function rather than being able to just use the existing init function directly, but that shouldn't be a major problem.) [1] http://mail.python.org/pipermail/python-ideas/2012-August/015898.html |
To bring back Victor's comments from the list:
I think the guiding use case here really needs to be this one: "How do I implement the equivalent of 'iconv' as a Python 3 script, without breaking internal interpreter state invariants?" My current thought is that, instead of seeking, the input case can better be handled by manipulating the read ahead buffer directly. Something like (for the pure Python version):

self._encoding = new_encoding
if self._decoder is not None:
old_data = self._get_decoded_chars().encode(old_encoding)
old_data += self._decoder.getstate()[0]
decoder = self._get_decoder()
new_chars = ''
if old_data:
new_chars = decoder.decode(old_data)
self._set_decoded_chars(new_chars)

(A similar mechanism could actually be used to support an "initial_data" parameter to TextIOWrapper, which would help in general encoding detection situations where changing encoding *in-place* isn't needed, but the application would like an easy way to "put back" the initial data for inclusion in the text stream without making assumptions about the underlying buffer implementation.) Also, StringIO should implement this new API as a no-op. |
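The rewind idea can be demonstrated with the public codecs APIs alone (a sketch, not the wrapper's actual internals): re-encode the wrongly decoded characters with the old encoding to recover the raw bytes, then decode them again with the new encoding. This only works when the old codec round-trips every byte, as Latin-1 does; non-bijective codecs defeat it.

```python
import codecs

# Bytes that are really UTF-8, initially decoded as Latin-1 by mistake.
data = "héllo".encode("utf-8")
old_decoder = codecs.getincrementaldecoder("latin-1")()
wrong_chars = old_decoder.decode(data)      # mojibake: 'hÃ©llo'

# Rewind: re-encode with the old encoding to recover the raw bytes,
# then decode them again with the new encoding.
recovered = wrong_chars.encode("latin-1")
new_decoder = codecs.getincrementaldecoder("utf-8")()
assert new_decoder.decode(recovered) == "héllo"
```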
That is not what I said. My patch raises an exception if you already |
Ah, you're right - peeking into the underlying buffer would be enough to |
Oh, set_encoding.patch is wrong: + offset = self._decoded_chars_used - len(next_input) self._decoded_chars_used is a number of Unicode characters, while len(next_input) is a number of bytes. It only works with 7- and 8-bit encodings like ascii or latin1, not with multibyte encodings like utf-8 or ucs-4.
I wrote a new patch using this idea. It does not work (yet?) with non-seekable streams. The raw read buffer (a bytes string) is not stored in the _snapshot attribute if the stream is not seekable. That may be changed to solve this issue. set_encoding-2.patch is still a work in progress. It does not patch the _io module, for example. |
Note: it is not possible to reencode the buffer of decoded characters to compute the offset in bytes. Some codecs are not bijective. Examples:
|
That's a fair point - I think it's acceptable to throw an error in the case of *already decoded* characters that haven't been read. There's also a discussion on python-ideas about an explicit API for clearing the internal buffers and pushing data back into a stream. If that is added, then set_encoding() would be free to error out if there was any already buffered data - it would be up to the application to call clear_buffer() before calling set_encoding(), and to deal with any such data appropriately (such as calling push_data() with the results of the clear_buffer() call). |
Oops, meant to link to my post in the thread about a buffer manipulation API: http://mail.python.org/pipermail/python-ideas/2013-March/019769.html |
It would be good to have this addressed in Python 3.4 (I'm following up on a few other encoding related issues at the moment). Is that a reasonable possibility before beta 1, or do we need to bump this one to 3.5? |
My patch was not reviewed yet and must be reimplemented in C. I will not have time before the beta1 to finish the work. |
Just reviewed Victor's patch - aside from a trivial typo and not covering the C implementation or documentation yet, it looks good to me. Most importantly, the tests in that patch could be used to validate a C implementation as well. I'll see if I can find anyone interested in creating a patch for the C io implementation, otherwise we can postpone the addition until 3.5. |
I am interested in working on this, but I might need guidance at times. Is that acceptable? If yes, I'm willing to start as soon as possible. |
Thanks, Andrei, that would be great. The tests and the Python version in Victor's patch show the desired API and behaviour. In theory, the C version should just be a matter of adding an equivalent textiowrapper_set_encoding method as a static function in |
I have to postpone this for a few weeks. Sorry if I got in anyone's way. |
I am working on this now. |
Question: What is the point of the

old_encoding = codecs.lookup(self._encoding).name
encoding = codecs.lookup(encoding).name
if encoding == old_encoding and errors == self._errors:
# no change
return

dance? Isn't this equivalent to

if encoding == self.encoding and errors == self._errors:
# no change
return

except that it doesn't validate the given encoding? But if the encoding is invalid, isn't it enough that the exception is raised a few lines further down? |
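One thing the lookup dance does buy (a small standalone illustration, not code from the patch): codecs.lookup(...).name normalises codec aliases, so spellings that differ as plain strings can refer to the same codec.

```python
import codecs

# lookup(...).name maps any registered alias to a canonical codec name.
assert codecs.lookup("UTF8").name == "utf-8"
assert codecs.lookup("latin-1").name == "iso8859-1"

# A plain string comparison would treat these as different encodings,
# so the normalisation changes the "no change" short-circuit behaviour.
assert "UTF8" != "utf-8"
```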
Second question: The following looks as if a BOM might be written for writable, non-seekable streams:

# don't write a BOM in the middle of a file
if self._seekable and self.writable():
position = self.buffer.tell()
if position != 0:
try:
self._get_encoder().setstate(0)
except LookupError:
# Sometimes the encoder doesn't exist
pass

Is that really desirable? It seems to me the safe choice is to *not* write the BOM, except when we know it's safe. E.g.:
|
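The setstate(0) trick being questioned here can be seen in isolation with the incremental UTF-16 encoder (a standalone illustration, not the wrapper code itself): state 0 tells the encoder it is not at the start of the stream, so it suppresses the BOM.

```python
import codecs

# A fresh UTF-16 encoder emits a BOM before the first character...
enc = codecs.getincrementalencoder("utf-16")()
assert len(enc.encode("a")) == 4   # 2-byte BOM + 2-byte 'a'

# ...but after setstate(0) it behaves as if mid-stream: no BOM.
enc2 = codecs.getincrementalencoder("utf-16")()
enc2.setstate(0)
assert len(enc2.encode("a")) == 2
```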
It's a blunt instrument to be sure, but a necessary one, I think -

Here's a thought, though: what if we went with Serhiy's "reconfigure"

The main problem I see with that idea is that some changes (binary vs text,

However the "seekable streams only by default, flag or separate method to

I'm beginning to think that this is all complicated enough that it should |
I think that any API that gives you a TextIOWrapper object with undesired properties is broken and needs to be fixed, rather than a workaround added to TextIOWrapper. That said, I defer to the wisdom of the core developers, and I'll be happy to implement whatever is deemed to be the right thing(tm). However, my opinion on this feature probably makes me unsuitable to drive a PEP for it :-). |
Newest version of the patch attached, the reference leak problem has been fixed. |
If operating systems always exposed accurate metadata and configuration settings, I'd agree with you. They don't though, so sometimes developers need to be able to override whatever the interpreter figured out automatically. In addition, needing to cope with badly designed APIs is an unfortunate fact of life - that's why monkeypatching is only *discouraged*, rather than disallowed :) |
On 02/04/2014 03:28 AM, Nick Coghlan wrote:
Where does the operating system come into play? What I'm trying to say is that

fh = some_python_function()
fh.set_encoding(foo)

should really be written as

fh = some_python_function(encoding=foo)

or

fh = TextIOWrapper(some_python_function(raw=True), encoding=foo)

The only exception to this is sys.std{in,out,err}. But this special case

Well, yes, but introducing a set_encoding() or reconfigure() function is
|
Metaclasses and monkeypatching encourage bad API design, too, but we still

For the standard streams, the problem is that the stream is created
|
Some more use cases for temporarily switching error handler in the middle of writing to a stream:
A use case for changing a reader's newline translation mode is to use standard input with the built-in "csv" module. My current idea is to do something like this:

encoding = sys.stdin.encoding
errors = sys.stdin.errors
line_buffering = sys.stdin.line_buffering
# No way to retain write_through mode, but shouldn’t matter for reading
sys.stdin = TextIOWrapper(sys.stdin.detach(), encoding, errors,
newline="", line_buffering=line_buffering)
for row in csv.reader(sys.stdin):
...

On the other hand, I wonder about rewinding an input file after already having read and buffered text in the wrong encoding. From a distance, the Python version of the code seems rather complex and full of dark magic. Is there a use case, or maybe it would be simpler to have it only work when nothing has been read or buffered? |
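The newline="" requirement can be checked in isolation (a sketch over an in-memory buffer rather than real standard input):

```python
import csv
import io

# With newline="" the text layer passes "\r\n" through untranslated,
# which is what the csv module expects from its input file object.
raw = io.BytesIO(b"a,b\r\nc,d\r\n")
stream = io.TextIOWrapper(raw, encoding="utf-8", newline="")
rows = list(csv.reader(stream))
assert rows == [["a", "b"], ["c", "d"]]
```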
Reviewing the items I had flagged as dependencies of bpo-22555 for personal tracking purposes, I suggest we defer further consideration of this idea to 3.6 after folks have had a chance to get some experience with the interpreter defaulting to using "surrogateescape" on the standard streams when an underlying *nix operating system (other than Mac OS X) claims that the system encoding is ASCII. |
Revisiting the idea Nikolaus raised last year regarding whether or not this could be done using a dedicated API in the sys module, I realised today that even if we decided to use a separate public API, *that API* would still need a defined way to modify the stream encoding and error handling in place. Since we're going to need the object level in-place modification API regardless, a separate sys module API for modifying the standard streams in particular would then introduce additional complexity without providing a proportional benefit. |
quad, please don't change issue attributes to unrelated values. |
Updated the patch for default branch. |
Reviewing Inada-san's latest version of the patch, we seem to be in a somewhat hybrid state where:
I'm starting to wonder if the best option here might be to attempt to make the API work for arbitrary codecs and non-seekable streams, and then simply accept that it may take a few maintenance releases before that's actually true. If we decide to go down that path, then I'd suggest the following stress test:
Optionally:
That way, confidence in the reliability of the feature (including across Python implementations) can be based on the strength of the test cases covering it. |
I had not read the patch carefully when updating. For the writing side, this patch is very simple. How about starting with this API, which works only when

We already failed to ship this important API in 3.6. |
This patch doesn't change the current behavior of "decode by chunk".

>>> import io
>>> buff = io.BytesIO("hello\nこんにちは".encode('utf-8'))
>>> f = io.TextIOWrapper(buff, encoding='ascii')
>>> f.readline()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/inada-n/work/python/cpython/Lib/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 6: ordinal not in range(128)
>>>
>>> buff = io.BytesIO("hello\nこんにちは".encode('utf-8'))
>>> f = io.TextIOWrapper(buff, encoding='ascii')
>>> f.read(6)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/inada-n/work/python/cpython/Lib/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 6: ordinal not in range(128) |
Inada, I think you messed up the positioning of bits of the patch. E.g. there are now test methods declared inside a helper function (rather than a test class). Since it seems other people are in favour of this API, I would like to expand it a bit to cover two uses cases (see set_encoding-newline.patch):
Regarding Serhiy's other suggestion about buffering parameters, perhaps TextIOWrapper.line_buffering could become a writable attribute instead, and the class could grow a similar write_through attribute. I don't think these affect encoding or decoding, so they could be treated independently.

The algorithm for rewinding unread data is complicated and can fail. What is the advantage of using it? What is the use case for reading from a stream and then changing the encoding, without a guarantee that it will work? Even if it is enhanced to never "fail", it will still have strange behaviour, such as data loss when a decoder is fed a single byte and produces multiple characters (e.g. CR newline, backslashreplace, UTF-7).

One step in the right direction IMO would be to only support calling set_encoding() when no extra read data has been buffered (or to explicitly say that any buffered data is silently dropped). So there is no support for changing the encoding halfway through a disk file, but it may be appropriate if you can regulate the bytes being read, e.g. from a terminal (user input), pipe, socket, etc. But I would be happy enough without set_encoding(), and with something like my rewrap() function at the bottom of <https://github.com/vadmium/data/blob/master/data.py#L526>. It returns a fresh TextIOWrapper, but when you exit the context manager you can continue to reuse the old stream with the old settings. |
I'm sorry.

+1. Since stdio is configured before the Python program runs, TextIOWrapper should be configurable after creation as far as possible.

Could they go into another new issue?

When I posted set_encoding-7.patch, I hadn't read the io module deeply. I just resolved the conflicts and ran the tests.

Totally agree.

I want one obvious way to control the encoding and error handler from Python (not from environment variables). |
set_encoding-8.patch dropped support for changing the encoding after a read. |
Antoine posted a simpler reconfiguration RFE over in https://bugs.python.org/issue30526 (for setting line buffering). While technically orthogonal to this RFE, we're thinking it might be possible to expose them to users as a common "reconfigure()" API, rather than as independent methods. |
|
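This combined API did ship as TextIOWrapper.reconfigure() in Python 3.7. Because it modifies the wrapper in place, old references (such as sys.__stdout__ in the standard-stream case) see the change too; a sketch with an in-memory stream:

```python
import io

raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="ascii")
alias = stream  # plays the role of sys.__stdout__

# Reconfigure in place: both references now use the new encoding.
stream.reconfigure(encoding="utf-8")
alias.write("é")
alias.flush()
assert raw.getvalue() == "é".encode("utf-8")
```

This in-place behaviour is exactly the advantage claimed for the detach()+__init__() approach earlier in the thread.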