Encoding error with sax and codecs #62115

Closed
sconseil mannequin opened this issue May 6, 2013 · 12 comments
Labels
release-blocker stdlib Python modules in the Lib dir topic-XML type-bug An unexpected behavior, bug, or error

Comments

@sconseil
Mannequin

sconseil mannequin commented May 6, 2013

BPO 17915
Nosy @birkenfeld, @pitrou, @vstinner, @larryhastings, @serhiy-storchaka
Files
  • report.txt: Minimal example to reproduce the issue
  • test_codecs.py
  • XMLGenerator_codecs_stream.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2013-05-12.10:45:59.882>
    created_at = <Date 2013-05-06.11:14:06.500>
    labels = ['expert-XML', 'type-bug', 'library', 'release-blocker']
    title = 'Encoding error with sax and codecs'
    updated_at = <Date 2013-05-12.21:19:48.930>
    user = 'https://bugs.python.org/sconseil'

    bugs.python.org fields:

    activity = <Date 2013-05-12.21:19:48.930>
    actor = 'sconseil'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-05-12.10:45:59.882>
    closer = 'georg.brandl'
    components = ['Library (Lib)', 'XML']
    creation = <Date 2013-05-06.11:14:06.500>
    creator = 'sconseil'
    dependencies = []
    files = ['30146', '30158', '30164']
    hgrepos = []
    issue_num = 17915
    keywords = ['patch']
    message_count = 12.0
    messages = ['188508', '188587', '188599', '188600', '188640', '188642', '188650', '188654', '188657', '189003', '189009', '189063']
    nosy_count = 7.0
    nosy_names = ['georg.brandl', 'pitrou', 'vstinner', 'larry', 'python-dev', 'serhiy.storchaka', 'sconseil']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue17915'
    versions = ['Python 3.2', 'Python 3.3', 'Python 3.4']

    @sconseil
    Mannequin Author

    sconseil mannequin commented May 6, 2013

    There is an encoding issue between codecs.open and sax (see attached file). The issue is reproducible on Python 3.3.1; it works fine on Python 3.3.0.

    @sconseil sconseil mannequin added the stdlib Python modules in the Lib dir label May 6, 2013
    @pitrou pitrou added the type-bug An unexpected behavior, bug, or error label May 6, 2013
    @pitrou
    Member

    pitrou commented May 6, 2013

    Since this is a regression, setting (temporarily perhaps) as release blocker.

    @vstinner
    Member

    vstinner commented May 6, 2013

    It looks like a regression introduced by the fix for issue bpo-1470548, changeset 66f92f76b2ce.

    @vstinner
    Member

    vstinner commented May 6, 2013

    Extracted test from report.txt. Test with Python 3.4:

    $ ./python test_codecs.py 
    Traceback (most recent call last):
      File "test_codecs.py", line 7, in <module>
        xml.startDocument()
      File "/home/haypo/prog/python/default/Lib/xml/sax/saxutils.py", line 148, in startDocument
        self._encoding)
      File "/home/haypo/prog/python/default/Lib/codecs.py", line 699, in write
        return self.writer.write(data)
      File "/home/haypo/prog/python/default/Lib/codecs.py", line 355, in write
        data, consumed = self.encode(object, self.errors)
    TypeError: Can't convert 'bytes' object to str implicitly

    _gettextwriter() in xml.sax.saxutils does not recognize codecs classes. (See also PEP 400 :-)).
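
As a rough illustration of the failing scenario, here is a sketch. It is not the attached test_codecs.py, whose contents are not shown in this thread. It passes a codecs stream to XMLGenerator. The helper name `write_minimal_document` and the in-memory `BytesIO` standing in for a real file are assumptions made for this example, and the stream is built by hand as the same kind of object that `codecs.open()` returns.

```python
import codecs
import io
from xml.sax.saxutils import XMLGenerator


def write_minimal_document(encoding="iso-8859-1"):
    """Write a tiny XML document through a codecs stream; return the bytes."""
    raw = io.BytesIO()  # assumption: stands in for the file on disk
    # A StreamReaderWriter wrapping a binary stream, i.e. the kind of
    # object codecs.open(path, 'w', encoding=...) hands back.
    stream = codecs.StreamReaderWriter(
        raw, codecs.getreader(encoding), codecs.getwriter(encoding)
    )
    xml = XMLGenerator(stream, encoding=encoding)
    xml.startDocument()  # the call that raised TypeError in the traceback
    xml.startElement("root", {})
    xml.endElement("root")
    xml.endDocument()
    return raw.getvalue()


if __name__ == "__main__":
    print(write_minimal_document())
```

On an interpreter that contains the fix, this should run to completion and produce the XML declaration followed by the empty root element. On 3.3.1, `startDocument()` is where the traceback above was raised.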

    @serhiy-storchaka
    Member

    It is not working fine on Python 3.3.0.

    >>> import codecs
    >>> from xml.sax.saxutils import XMLGenerator
    >>> with codecs.open('/tmp/test.txt', 'w', encoding='iso-8859-1') as f:
    ...     xml = XMLGenerator(f, encoding='iso-8859-1')
    ...     xml.startDocument()
    ...     xml.startElement('root', {'attr': u'\u20ac'})
    ...     xml.endElement('root')
    ...     xml.endDocument()
    ... 
    Traceback (most recent call last):
      File "<stdin>", line 4, in <module>
      File "/home/serhiy/py/cpython-3.3.0/Lib/xml/sax/saxutils.py", line 141, in startElement
        self._write(' %s=%s' % (name, quoteattr(value)))
      File "/home/serhiy/py/cpython-3.3.0/Lib/xml/sax/saxutils.py", line 96, in _write
        self._out.write(text)
      File "/home/serhiy/py/cpython-3.3.0/Lib/codecs.py", line 699, in write
        return self.writer.write(data)
      File "/home/serhiy/py/cpython-3.3.0/Lib/codecs.py", line 355, in write
        data, consumed = self.encode(object, self.errors)
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 7: ordinal not in range(256)

    And it shouldn't. On Python 2, XMLGenerator works only with binary files, and "works" with text files only because of the implicit str->unicode conversion. On Python 3, working with binary files was broken. bpo-1470548 restored support for binary files (the only kind XMLGenerator can handle correctly), but text files are still accepted for backward compatibility. The problem is that there is no trustworthy way to determine whether a file-like object is binary or text.

    Accepting text streams in XMLGenerator should be deprecated in future versions.
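
For comparison, a minimal sketch of the binary-file usage described above, assuming a current CPython release. The function name `generate_to_binary` and the use of `BytesIO` in place of a file opened with `'wb'` are choices made for this example only. XMLGenerator receives a binary stream and does the encoding itself, so the codec is named once.

```python
import io
from xml.sax.saxutils import XMLGenerator


def generate_to_binary(encoding="iso-8859-1"):
    """Let XMLGenerator do the encoding by giving it a binary stream."""
    out = io.BytesIO()  # assumption: any binary file object, e.g. open(path, "wb")
    xml = XMLGenerator(out, encoding=encoding)
    xml.startDocument()
    # EURO SIGN cannot be represented in Latin-1.
    xml.startElement("root", {"attr": "\u20ac"})
    xml.endElement("root")
    xml.endDocument()
    return out.getvalue()


if __name__ == "__main__":
    print(generate_to_binary())
```

On the CPython versions I am aware of that include the bpo-1470548 change, the euro sign is emitted as a numeric character reference rather than raising UnicodeEncodeError, because XMLGenerator controls the error handling when it owns the encoding step.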

    @vstinner
    Member

    vstinner commented May 7, 2013

    > Accepting of text streams in XMLGenerator should be deprecated in future versions.

    I agree that the following pattern is strange:

    with codecs.open('/tmp/test.txt', 'w', encoding='iso-8859-1') as f:
        xml = XMLGenerator(f, encoding='iso-8859-1')

    Why would I specify a codec twice? What happens if I specify two
    different codecs?

    with codecs.open('/tmp/test.txt', 'w', encoding='utf-8') as f:
        xml = XMLGenerator(f, encoding='iso-8859-1')

    It may be simpler (and safer?) to reject text files. If you cannot
    detect that f is a text file, just make it explicit in the
    documentation that f must be a binary file.
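
To make the question about two different codecs concrete, here is a small sketch, assuming a current CPython in which XMLGenerator writes to a codecs stream as-is. The function name `mismatched_codecs` and the in-memory `BytesIO` are assumptions for illustration. The stream encodes as UTF-8 while XMLGenerator is told iso-8859-1.

```python
import codecs
import io
from xml.sax.saxutils import XMLGenerator


def mismatched_codecs():
    """Declare Latin-1 in the XML header while the stream encodes UTF-8."""
    raw = io.BytesIO()
    stream = codecs.getwriter("utf-8")(raw)  # codecs stream, encodes UTF-8
    xml = XMLGenerator(stream, encoding="iso-8859-1")  # header says Latin-1
    xml.startDocument()
    xml.startElement("root", {"attr": "\u20ac"})
    xml.endElement("root")
    xml.endDocument()
    return raw.getvalue()


if __name__ == "__main__":
    print(mismatched_codecs())
```

Under that assumption, the declaration names iso-8859-1 but the attribute value is stored as UTF-8 bytes, so the file contradicts its own header. That is one argument for requiring a binary stream and naming the codec only once.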


    @serhiy-storchaka
    Member

    Here is a patch which adds explicit checks for codecs stream writers and adds tests for these cases. The tests are not entirely honest: they only check that XMLGenerator works with some specially prepared streams. XMLGenerator doesn't work with a stream that has an arbitrary encoding and error handler.
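
The kind of explicit check mentioned here might look roughly like the sketch below. This is an illustration only, not the attached patch. The names `looks_like_text_stream` and `DuckTextWriter` are hypothetical, and the set of classes tested is an assumption.

```python
import codecs
import io


def looks_like_text_stream(out):
    """Heuristic: is `out` a known text-writing stream type?"""
    return isinstance(
        out, (io.TextIOBase, codecs.StreamWriter, codecs.StreamReaderWriter)
    )


class DuckTextWriter:
    """A text sink outside both class hierarchies (hypothetical)."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)
        return len(text)


if __name__ == "__main__":
    for candidate in (
        io.StringIO(),
        io.BytesIO(),
        codecs.getwriter("utf-8")(io.BytesIO()),
        DuckTextWriter(),
    ):
        print(type(candidate).__name__, looks_like_text_stream(candidate))
```

The last case shows the limitation raised earlier in the thread: an object that merely has a `write()` method accepting str is not recognized, so a check of this shape cannot be fully trustworthy.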

    @serhiy-storchaka
    Member

    Of course, if this patch is committed, it may also be worth applying it to 3.2, which has the same regression.

    @serhiy-storchaka
    Member

    Perhaps we should add a deprecation warning for codecs streams right in this patch?

    @python-dev
    Mannequin

    python-dev mannequin commented May 12, 2013

    New changeset 1c01571ce0f4 by Georg Brandl in branch '3.2':
    Issue bpo-17915: Fix interoperability of xml.sax with file objects returned by
    http://hg.python.org/cpython/rev/1c01571ce0f4

    @birkenfeld
    Member

    Fixed in 3.2, 3.3 and default.

    @sconseil
    Mannequin Author

    sconseil mannequin commented May 12, 2013

    Thanks, everybody!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022