Sax parser crashes if given unicode file name #55368
Comments
The error is the following:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/__init__.py", line 31, in parse
    parser.parse(filename_or_stream)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/xmlreader.py", line 119, in parse
    self.prepareParser(source)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 121, in prepareParser
    self._parser.SetBase(source.getSystemId())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128)

The following bash script can be used to reproduce the error:
If I instead do this, it works fine:
Also:

>>> sys.getfilesystemencoding()
'UTF-8'

I heard from another user that this was not a problem with Python 3.1.2.
Confirmed: this is not an issue in Python 3. I just checked with Python 3.3.0a0 and the example works fine, with no exception raised.
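For reference, a minimal sketch of this kind of check on Python 3. The file name, the document contents, and the `NameCollector` handler are illustrative assumptions; this is not the reporter's original script.

```python
import os
import tempfile
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Record the name of every element the parser reports."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_non_ascii_path():
    # Create a small XML file whose name starts with U+00E5, the
    # character mentioned in the traceback (the name is an assumption).
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "\xe5\xe4\xf6.xml")
        with open(path, "w", encoding="utf-8") as f:
            f.write("<root><child/></root>")
        handler = NameCollector()
        # Pass the str path directly to the SAX parser.
        xml.sax.parse(path, handler)
        return handler.names
```

On a Python 3 interpreter whose file system accepts non-ASCII names, `parse_non_ascii_path()` should return the element names without raising.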
However, Python doesn't work with bytes filenames (I don't think this is a bug). The proposed patch allows unicode filenames to be used in the SAX parser.
Ported the tests for non-ASCII system IDs to 3.x. If no one objects, I'll commit this next week.
I don't think that the file system encoding is the correct answer here. AFAIR expat uses UTF-8 encoded strings. Python 3.x uses PyArg_ParseTupleAndKeywords() with "s", which converts PyUnicode to PyBytes with the UTF-8 codec.
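To illustrate why the choice of codec matters, here is a small sketch in Python 3. The system id value is made up, and Latin-1 stands in for a hypothetical non-UTF-8 file system encoding.

```python
# Illustrative system id; U+00E5 is the character from the traceback.
system_id = "\xe5.xml"

# Bytes produced by the UTF-8 codec, which is what expat is said to expect.
as_utf8 = system_id.encode("utf-8")

# Bytes produced on a hypothetical system whose file system encoding
# is Latin-1 (an assumption chosen only to show the difference).
as_latin1 = system_id.encode("latin-1")

# The two byte strings differ, so encoding the system id with the file
# system encoding would hand expat non-UTF-8 data on such a system.
encodings_differ = as_utf8 != as_latin1
```

When the file system encoding is itself UTF-8, as in the report, both choices yield the same bytes, so the difference only shows up on other configurations.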
Yes, I had doubts about this too. I proceeded from the following considerations.
We have to choose between compatibility with Python 2 and Python 3. I chose the former, because it is more important for a bugfix. Maybe I am wrong.
Here is an alternative patch. It doesn't encode the system id when it is set; instead, the system id attribute can be bytes or unicode, and encoding/decoding happens only when a file is opened.
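The idea of deferring the conversion until the file is opened can be sketched as follows, in Python 3 terms. `open_system_id` and `demo` are hypothetical names used for illustration; this is not the code from the patch.

```python
import os
import tempfile


def open_system_id(system_id):
    """Open the file named by a system id given as bytes or str.

    Hypothetical helper: the system id is stored unchanged and only
    converted here, at the point where the file is actually opened.
    """
    path = os.fsdecode(system_id)  # accepts both bytes and str
    return open(path, "rb")


def demo():
    # Open the same file once via a str id and once via a bytes id.
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "\xe5.xml")
        with open(path, "wb") as f:
            f.write(b"<root/>")
        results = []
        for system_id in (path, os.fsencode(path)):
            with open_system_id(system_id) as f:
                results.append(f.read())
        return results
```

Both forms of the system id resolve to the same file, and the stored value is never re-encoded before it is needed.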
New changeset d3e7aea8a550 by Serhiy Storchaka in branch '2.7':
New changeset d2622ca8493a by Serhiy Storchaka in branch '3.2':
New changeset b85ba45b9579 by Serhiy Storchaka in branch '3.3':
New changeset 107a06f1a542 by Serhiy Storchaka in branch 'default':
Fixed. Thank you for the report.
New changeset 706218e0facb by Serhiy Storchaka in branch '2.7':
New changeset a7c074d9cbfb by Serhiy Storchaka in branch '3.2':
New changeset 2bf01f03ff40 by Serhiy Storchaka in branch '3.3':
New changeset 4ab386b00aaf by Serhiy Storchaka in branch 'default':