Encoding error with sax and codecs #62115
Comments
There is an encoding issue between codecs.open and sax (see attached file). The issue is reproducible on Python 3.3.1; it works fine on Python 3.3.0. |
Since this is a regression, setting (temporarily perhaps) as release blocker. |
It looks like a regression introduced by the fix for issue bpo-1470548, changeset 66f92f76b2ce. |
Extracted test from report.txt. Test with Python 3.4:

    $ ./python test_codecs.py
    Traceback (most recent call last):
      File "test_codecs.py", line 7, in <module>
        xml.startDocument()
      File "/home/haypo/prog/python/default/Lib/xml/sax/saxutils.py", line 148, in startDocument
        self._encoding)
      File "/home/haypo/prog/python/default/Lib/codecs.py", line 699, in write
        return self.writer.write(data)
      File "/home/haypo/prog/python/default/Lib/codecs.py", line 355, in write
        data, consumed = self.encode(object, self.errors)
    TypeError: Can't convert 'bytes' object to str implicitly

_gettextwriter() of xml.sax.saxutils does not recognize codecs classes. (See also the PEP-400 :-)). |
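The `TypeError` at the bottom of the traceback can be reproduced in isolation. The sketch below is only an illustration, not the attached test: it uses an in-memory `io.BytesIO` as a stand-in for the file from the report, and the variable names are made up. It shows that a codecs stream writer is not part of the `io` text hierarchy and that it rejects `bytes`, which appears to be what the regressed `startDocument()` passed to it.

```python
import codecs
import io

# Build a codecs stream writer over an in-memory binary buffer
# (an assumed stand-in for what codecs.open(..., 'w', encoding=...) returns).
raw = io.BytesIO()
writer = codecs.getwriter('iso-8859-1')(raw)

# A codecs writer is not an io.TextIOBase, so a check based only on the
# io class hierarchy will not classify it as a text stream.
is_io_text = isinstance(writer, io.TextIOBase)

# Writing str works; writing bytes fails because the encoder expects str.
writer.write('<root/>')
try:
    writer.write(b'<root/>')
    bytes_rejected = False
except TypeError:
    bytes_rejected = True

print(is_io_text, bytes_rejected, raw.getvalue())
```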
It is not working fine on Python 3.3.0.

    >>> with codecs.open('/tmp/test.txt', 'w', encoding='iso-8859-1') as f:
    ...     xml = XMLGenerator(f, encoding='iso-8859-1')
    ...     xml.startDocument()
    ...     xml.startElement('root', {'attr': u'\u20ac'})
    ...     xml.endElement('root')
    ...     xml.endDocument()
    ...
    Traceback (most recent call last):
      File "<stdin>", line 4, in <module>
      File "/home/serhiy/py/cpython-3.3.0/Lib/xml/sax/saxutils.py", line 141, in startElement
        self._write(' %s=%s' % (name, quoteattr(value)))
      File "/home/serhiy/py/cpython-3.3.0/Lib/xml/sax/saxutils.py", line 96, in _write
        self._out.write(text)
      File "/home/serhiy/py/cpython-3.3.0/Lib/codecs.py", line 699, in write
        return self.writer.write(data)
      File "/home/serhiy/py/cpython-3.3.0/Lib/codecs.py", line 355, in write
        data, consumed = self.encode(object, self.errors)
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 7: ordinal not in range(256)

And shouldn't. On Python 2, XMLGenerator works only with binary files, and "works" with text files only due to implicit str->unicode conversion. On Python 3, working with binary files was broken. bpo-1470548 restores working with binary files (the only kind of file with which XMLGenerator can work correctly), but accepting text files was left in for backward compatibility. The problem is that there is no trustworthy method to determine whether a file-like object is binary or text.

Accepting text streams in XMLGenerator should be deprecated in future versions. |
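For comparison, here is a minimal sketch of the binary-stream usage described above, run against a CPython that includes the bpo-1470548 fix. The `io.BytesIO` buffer is an assumed stand-in for a file opened with `open(path, 'wb')`. Because XMLGenerator does the encoding itself in this mode, a character that the target encoding cannot represent should come out as a numeric character reference instead of raising `UnicodeEncodeError`.

```python
import io
from xml.sax.saxutils import XMLGenerator

# Binary in-memory buffer standing in for open(path, 'wb').
buf = io.BytesIO()

# XMLGenerator encodes the output itself when given a binary stream.
gen = XMLGenerator(buf, encoding='iso-8859-1')
gen.startDocument()
gen.startElement('root', {'attr': '\u20ac'})  # EURO SIGN, not in ISO-8859-1
gen.endElement('root')
gen.endDocument()

data = buf.getvalue()
print(data)
```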
I agree that the following pattern is strange:

    with codecs.open('/tmp/test.txt', 'w', encoding='iso-8859-1') as f:
        xml = XMLGenerator(f, encoding='iso-8859-1')

Why would I specify a codec twice? What happens if I specify two different codecs?

    with codecs.open('/tmp/test.txt', 'w', encoding='utf-8') as f:
        xml = XMLGenerator(f, encoding='iso-8859-1')

It may be simpler (and safer?) to reject text files. If you cannot
|
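The "two different codecs" question can be explored with a small experiment. The sketch below assumes a CPython in which this issue is fixed, so that codecs stream writers are accepted and used as-is; it uses an in-memory buffer instead of `/tmp/test.txt`, and the variable names are illustrative. Under that assumption, the XML declaration carries the encoding given to XMLGenerator while the bytes are produced by the codecs writer, so the two can disagree.

```python
import codecs
import io
from xml.sax.saxutils import XMLGenerator

# The codecs writer encodes to UTF-8 ...
raw = io.BytesIO()
stream = codecs.getwriter('utf-8')(raw)

# ... while XMLGenerator is told the document is ISO-8859-1.
gen = XMLGenerator(stream, encoding='iso-8859-1')
gen.startDocument()
gen.startElement('root', {'attr': '\u00e9'})  # LATIN SMALL LETTER E WITH ACUTE
gen.endElement('root')
gen.endDocument()

data = raw.getvalue()
# The declaration names one encoding, the payload uses the other.
declares_latin1 = b'encoding="iso-8859-1"' in data
contains_utf8_bytes = '\u00e9'.encode('utf-8') in data
print(declares_latin1, contains_utf8_bytes)
```

A mismatch of this kind is one argument for passing a binary stream and letting XMLGenerator handle the encoding on its own.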
Here is a patch which adds explicit checks for codecs stream writers and adds tests for these cases. The tests are not entirely honest: they test only that XMLGenerator works with some specially prepared streams. XMLGenerator doesn't work with a stream that has an arbitrary encoding and error handler. |
Of course, if this patch is committed, it may be worth applying it to 3.2 as well, which has the same regression. |
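As a rough illustration of what "explicit checks for codecs stream writers" could look like, here is a sketch. It is not the patch itself: `classify_stream` is a hypothetical helper introduced only for this example, and it relies solely on `isinstance` checks against public `io` and `codecs` classes.

```python
import codecs
import io


def classify_stream(out):
    """Hypothetical helper: guess whether `out` expects str or bytes."""
    # Text streams from the io module accept str.
    if isinstance(out, io.TextIOBase):
        return 'text'
    # codecs stream writers also accept str but are outside the io
    # hierarchy, so they need an explicit check of their own.
    if isinstance(out, (codecs.StreamWriter, codecs.StreamReaderWriter)):
        return 'text'
    # Anything else is assumed to be a binary stream.
    return 'binary'


kinds = [
    classify_stream(io.StringIO()),
    classify_stream(codecs.getwriter('utf-8')(io.BytesIO())),
    classify_stream(io.BytesIO()),
]
print(kinds)
```

As noted earlier in the thread, a check like this is a heuristic: an arbitrary file-like object with only a `write` method cannot be reliably classified.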
Perhaps we should add a deprecation warning for codecs streams right in this patch? |
New changeset 1c01571ce0f4 by Georg Brandl in branch '3.2': |
Fixed in 3.2, 3.3 and default. |
thanks everybody ! |