Sax parser crashes if given unicode file name #55368

Closed
ricli85 mannequin opened this issue Feb 9, 2011 · 10 comments
Labels: topic-XML, type-bug (An unexpected behavior, bug, or error)


@ricli85
Mannequin

ricli85 mannequin commented Feb 9, 2011

BPO 11159
Nosy @tiran, @ezio-melotti, @serhiy-storchaka
Files
  • sax_unicode_fn-2.7.patch
  • sax_unicode_fn-3.x.patch
  • sax_unicode_fn_alt-2.7.patch: Use the file system encoding only for file opening
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2013-02-02.08:53:03.217>
    created_at = <Date 2011-02-09.14:20:01.747>
    labels = ['expert-XML', 'type-bug']
    title = 'Sax parser crashes if given unicode file name'
    updated_at = <Date 2013-02-02.10:19:59.167>
    user = 'https://bugs.python.org/ricli85'

    bugs.python.org fields:

    activity = <Date 2013-02-02.10:19:59.167>
    actor = 'python-dev'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2013-02-02.08:53:03.217>
    closer = 'serhiy.storchaka'
    components = ['XML']
    creation = <Date 2011-02-09.14:20:01.747>
    creator = 'ricli85'
    dependencies = []
    files = ['28268', '28714', '28722']
    hgrepos = []
    issue_num = 11159
    keywords = ['patch']
    message_count = 10.0
    messages = ['128212', '142666', '177211', '179866', '179919', '179926', '179932', '181145', '181146', '181157']
    nosy_count = 8.0
    nosy_names = ['christian.heimes', 'cgrohmann', 'ezio.melotti', 'John.Chandler', 'ricli85', 'python-dev', 'serhiy.storchaka', 'Sergey.Prokhorov']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue11159'
    versions = ['Python 2.7']

    @ricli85
    Mannequin Author

    ricli85 mannequin commented Feb 9, 2011

    The error is the following:

        Traceback (most recent call last):
          File "<stdin>", line 4, in <module>
          File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/__init__.py", line 31, in parse
            parser.parse(filename_or_stream)
          File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
            xmlreader.IncrementalParser.parse(self, source)
          File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/xmlreader.py", line 119, in parse
            self.prepareParser(source)
          File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 121, in prepareParser
            self._parser.SetBase(source.getSystemId())
        UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128)

    The following bash script can be used to reproduce the error:

    #!/bin/sh
    
    cat > å.timeline <<EOF
    <?xml version="1.0" encoding="utf-8"?>
    <timeline>
      <version>0.13.0devb38ace0a572b+</version>
      <categories>
      </categories>
      <events>
        <event>
          <start>2011-02-01 00:00:00</start>
          <end>2011-02-03 08:46:00</end>
          <text>asdsd</text>
        </event>
      </events>
      <view>
        <displayed_period>
          <start>2011-01-24 16:38:11</start>
          <end>2011-02-23 16:38:11</end>
        </displayed_period>
        <hidden_categories>
        </hidden_categories>
      </view>
    </timeline>
    EOF
    
    python <<EOF
    # encoding: utf-8
    from xml.sax import parse
    from xml.sax.handler import ContentHandler
    parse(open(u"å.timeline", 'r'), ContentHandler())
    EOF
    

    If I instead do this, it works fine:

    parse(u"å.timeline".encode("utf-8"), ContentHandler())
    

    Also:

        >>> sys.getfilesystemencoding()
        'UTF-8'

    I heard from another user that this was not a problem with Python 3.1.2.

    @ricli85 ricli85 mannequin added type-crash A hard crash of the interpreter, possibly with a core dump topic-XML labels Feb 9, 2011
    @JohnChandler
    Mannequin

    JohnChandler mannequin commented Aug 22, 2011

    Confirmed about not being an issue in Python 3. Just checked with Python 3.3.0a0 and the example works fine - no exception raised.
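
The Python 3 check described above can be sketched roughly as follows. This is a minimal illustration, assuming a file system that accepts non-ASCII names; the temporary directory, the shortened XML document, and the `CountingHandler` class are hypothetical and not part of the original report.

```python
import os
import tempfile
from xml.sax import parse
from xml.sax.handler import ContentHandler


class CountingHandler(ContentHandler):
    """Hypothetical handler that counts start tags, to show parsing ran."""

    def __init__(self):
        super().__init__()
        self.elements = 0

    def startElement(self, name, attrs):
        self.elements += 1


def parse_non_ascii_name():
    # Write a tiny document under a non-ASCII file name, then parse it by name.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "\u00e5.timeline")
        with open(path, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="utf-8"?>\n'
                    "<timeline><events><event/></events></timeline>\n")
        handler = CountingHandler()
        parse(path, handler)  # plain str file name, no manual encoding
        return handler.elements


element_count = parse_non_ascii_name()
```

On Python 3 the file name is an ordinary `str`, so no explicit `.encode()` call is needed before handing it to `parse`.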

    @durban durban mannequin added type-bug An unexpected behavior, bug, or error and removed type-crash A hard crash of the interpreter, possibly with a core dump labels Dec 8, 2012
    @serhiy-storchaka
    Member

    However, Python 3 doesn't work with bytes filenames (I don't think this is a bug).

    The proposed patch allows unicode filenames to be used in the SAX parser.

    @serhiy-storchaka serhiy-storchaka self-assigned this Dec 29, 2012
    @serhiy-storchaka
    Member

    Ported tests for nonascii System-Id on 3.x.

    If no one objects I'll commit this next week.

    @tiran
    Member

    tiran commented Jan 14, 2013

    I don't think that the file system encoding is the correct answer here. AFAIR expat uses UTF-8 encoded strings. Python 3.x uses PyArg_ParseTupleAndKeywords() with "s" which converts PyUnicode to PyBytes with the utf-8 codec.
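
To illustrate why the choice of codec matters here, the sketch below encodes the file name from the report with two different codecs. Latin-1 is only an assumed stand-in for a non-UTF-8 file system encoding; it is not mentioned in the thread.

```python
name = "\u00e5.timeline"  # the file name from the report

# UTF-8, the encoding the comment above attributes to expat.
utf8_name = name.encode("utf-8")

# Latin-1, used here only as an example of a non-UTF-8 file system encoding.
latin1_name = name.encode("latin-1")

# The two byte strings differ, so a system id encoded with one codec
# does not name the same bytes as one encoded with the other.
same_bytes = utf8_name == latin1_name
```

Here `utf8_name` is `b'\xc3\xa5.timeline'` and `latin1_name` is `b'\xe5.timeline'`, so the system id passed to the parser and the name used to open the file only agree when both use the same encoding.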

    @serhiy-storchaka
    Member

    Yes, I had doubts about this too. I proceeded from the following considerations.

    1. The system id is often used for file operations, and in that case the file system encoding is needed. Unfortunately, Python 2 has no 'surrogateescape' error handler, which would allow encoding an arbitrary name and later restoring and re-encoding it for file operations.

    2. Python 2, in contrast to Python 3, accepts bytes, and they may not be valid UTF-8.

    We have to choose between compatibility with Python 2 and Python 3. I chose the first, because it matters more for a bugfix.

    Maybe I am wrong.
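
For readers unfamiliar with the 'surrogateescape' handler mentioned in point 1, here is a minimal Python 3 sketch of the round trip it makes possible. The byte string is an assumed example (the Latin-1 bytes of the file name), not data from the thread.

```python
raw = b"\xe5.timeline"  # assumed example: Latin-1 bytes, not valid UTF-8

# A strict decode rejects the name outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape the undecodable byte is kept as a lone surrogate...
text = raw.decode("utf-8", "surrogateescape")

# ...and encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")
```

The byte `0xE5` becomes the code point U+DCE5 in `text`, and `restored` equals `raw`. Python 2 has no equivalent handler, which is the limitation the comment refers to.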

    @serhiy-storchaka
    Member

    Here is an alternative patch. It doesn't encode the system id when it is set; instead, the system id attribute can be bytes or unicode, and encoding/decoding happens only when a file is opened.
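
The idea behind the alternative patch might look roughly like the sketch below. This is not the committed code: `open_system_id` is a hypothetical helper, and the sketch only shows the general shape of deferring the encoding step until the file is opened.

```python
import sys


def open_system_id(system_id):
    """Hypothetical helper: encode a text system id only at open time."""
    if isinstance(system_id, bytes):
        # Bytes are passed through untouched, whatever their encoding.
        path = system_id
    else:
        # Text is encoded with the file system encoding, and only here.
        path = system_id.encode(sys.getfilesystemencoding())
    return open(path, "rb")
```

With this shape, the stored system id stays exactly as the caller supplied it, and the file system encoding is involved only in the file operation itself.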

    @python-dev
    Mannequin

    python-dev mannequin commented Feb 2, 2013

    New changeset d3e7aea8a550 by Serhiy Storchaka in branch '2.7':
    Issue bpo-11159: SAX parser now supports unicode file names.
    http://hg.python.org/cpython/rev/d3e7aea8a550

    New changeset d2622ca8493a by Serhiy Storchaka in branch '3.2':
    Issue bpo-11159: Add tests for testing SAX parser support of non-ascii file names.
    http://hg.python.org/cpython/rev/d2622ca8493a

    New changeset b85ba45b9579 by Serhiy Storchaka in branch '3.3':
    Issue bpo-11159: Add tests for testing SAX parser support of non-ascii file names.
    http://hg.python.org/cpython/rev/b85ba45b9579

    New changeset 107a06f1a542 by Serhiy Storchaka in branch 'default':
    Issue bpo-11159: Add tests for testing SAX parser support of non-ascii file names.
    http://hg.python.org/cpython/rev/107a06f1a542

    @serhiy-storchaka
    Member

    Fixed. Thank you for the report.

    @python-dev
    Mannequin

    python-dev mannequin commented Feb 2, 2013

    New changeset 706218e0facb by Serhiy Storchaka in branch '2.7':
    Fix tests for issue bpo-11159.
    http://hg.python.org/cpython/rev/706218e0facb

    New changeset a7c074d9cbfb by Serhiy Storchaka in branch '3.2':
    Fix tests for issue bpo-11159.
    http://hg.python.org/cpython/rev/a7c074d9cbfb

    New changeset 2bf01f03ff40 by Serhiy Storchaka in branch '3.3':
    Fix tests for issue bpo-11159.
    http://hg.python.org/cpython/rev/2bf01f03ff40

    New changeset 4ab386b00aaf by Serhiy Storchaka in branch 'default':
    Fix tests for issue bpo-11159.
    http://hg.python.org/cpython/rev/4ab386b00aaf

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022