xml.dom.minidom cannot parse ISO-2022-JP #60081

Closed
dcallagh mannequin opened this issue Sep 7, 2012 · 4 comments

@dcallagh
Mannequin

dcallagh mannequin commented Sep 7, 2012

BPO 15877
Nosy @amauryfa, @iritkatriel

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2020-10-19.19:27:05.193>
created_at = <Date 2012-09-07.06:38:03.026>
labels = ['expert-XML']
title = 'xml.dom.minidom cannot parse ISO-2022-JP'
updated_at = <Date 2020-10-19.19:27:05.193>
user = 'https://bugs.python.org/dcallagh'

bugs.python.org fields:

activity = <Date 2020-10-19.19:27:05.193>
actor = 'iritkatriel'
assignee = 'none'
closed = True
closed_date = <Date 2020-10-19.19:27:05.193>
closer = 'iritkatriel'
components = ['XML']
creation = <Date 2012-09-07.06:38:03.026>
creator = 'dcallagh'
dependencies = []
files = []
hgrepos = []
issue_num = 15877
keywords = []
message_count = 4.0
messages = ['169974', '169982', '377715', '378996']
nosy_count = 3.0
nosy_names = ['amaury.forgeotdarc', 'dcallagh', 'iritkatriel']
pr_nums = []
priority = 'normal'
resolution = 'works for me'
stage = 'resolved'
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue15877'
versions = ['Python 2.7']

@dcallagh
Mannequin Author

dcallagh mannequin commented Sep 7, 2012

Python 2.7.3 (default, Jul 24 2012, 10:05:38) 
[GCC 4.7.0 20120507 (Red Hat 4.7.0-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> c = u'\u65e5\u672c\u8a9e'
>>> import xml.dom.minidom

Encoded as UTF-8, everything is fine:

>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="UTF-8" ?><x>%s</x>' % c.encode('UTF-8'))
<xml.dom.minidom.Document instance at 0x7f310d27dcf8>

but not ISO-2022-JP:

>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % c.encode('ISO-2022-JP'))
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 942, in parseString
    return builder.parseString(string)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 48

lxml can handle it fine though:

>>> import lxml.etree
>>> lxml.etree.fromstring('<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % c.encode('ISO-2022-JP'))
<Element x at 0x7f310d284960>
>>> _.text == c
True

@dcallagh dcallagh mannequin added the topic-XML label Sep 7, 2012
@amauryfa
Member

amauryfa commented Sep 7, 2012

This is similar to bpo-13612: pyexpat does not support multibyte encodings.
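
One possible workaround, sketched here for Python 3 only, is to let Python's own codec decode the bytes and then hand the parser a `str`, so that expat never has to handle the multibyte encoding itself. The helper name `parse_with_python_codec` is hypothetical and not part of any library. The sketch also assumes the caller already knows the document's encoding, and it relies on the parser ignoring the encoding declaration when given a `str`.

```python
import xml.dom.minidom


def parse_with_python_codec(raw: bytes, encoding: str):
    """Hypothetical helper: decode with Python's codec, then parse the str."""
    # Python's codec handles the ISO-2022-JP escape sequences here,
    # so the XML parser only ever sees already-decoded text.
    text = raw.decode(encoding)
    return xml.dom.minidom.parseString(text)


c = '\u65e5\u672c\u8a9e'
raw = (b'<?xml version="1.0" encoding="ISO-2022-JP" ?><x>'
       + c.encode('ISO-2022-JP')
       + b'</x>')

doc = parse_with_python_codec(raw, 'ISO-2022-JP')
parsed_text = doc.documentElement.firstChild.data
print(parsed_text == c)
```

The trade-off is that the encoding has to be determined outside the parser, for example from a transport header or by inspecting the XML declaration by hand.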

@iritkatriel
Member

I don't see this problem on 3.10. Is this still an issue or can this issue be closed?

Running Release|Win32 interpreter...
Python 3.10.0a0 (heads/bpo17490-dirty:00eb063b66, Sep 27 2020, 13:20:24) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> c = u'\u65e5\u672c\u8a9e'
>>> import xml.dom.minidom
>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="UTF-8" ?><x>%s</x>' % c.encode('UTF-8'))
<xml.dom.minidom.Document object at 0x015FC9E8>
>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % c.encode('ISO-2022-JP'))
<xml.dom.minidom.Document object at 0x01493208>
>>>
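
A caveat about reproducing the original report on Python 3: there, `'...%s...' % c.encode(...)` formats a `bytes` object into a `str`, so what gets embedded is the textual repr (`b'...'`), not the encoded bytes. The sketch below, which assumes CPython 3 with default warning settings, illustrates this. It suggests that a check written this way may not exercise the ISO-2022-JP decoding path at all.

```python
import xml.dom.minidom

c = '\u65e5\u672c\u8a9e'
payload = c.encode('ISO-2022-JP')

# On Python 3 this yields a str; %s inserts str(payload), i.e. "b'...'".
document = '<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % payload

doc = xml.dom.minidom.parseString(document)
parsed = doc.documentElement.firstChild.data

print(type(document).__name__)    # the input is text, not bytes
print(parsed == repr(payload))    # element text is the bytes repr
print(parsed == c)                # the original characters are not recovered
```

To test the multibyte path itself, the document would presumably need to be built as `bytes` (for example by concatenating `bytes` literals with `c.encode('ISO-2022-JP')`) before being passed to `parseString`.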

@iritkatriel
Member

Closing - this now works for me on Python 3.8 and 3.10. It was fixed sometime in the last 8 years.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022