xml.dom.minidom cannot parse ISO-2022-JP #60081

dcallagh · 2012-09-07T06:38:03Z

BPO	15877
Nosy	@amauryfa, @iritkatriel

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2020-10-19.19:27:05.193>
created_at = <Date 2012-09-07.06:38:03.026>
labels = ['expert-XML']
title = 'xml.dom.minidom cannot parse ISO-2022-JP'
updated_at = <Date 2020-10-19.19:27:05.193>
user = 'https://bugs.python.org/dcallagh'

bugs.python.org fields:

activity = <Date 2020-10-19.19:27:05.193>
actor = 'iritkatriel'
assignee = 'none'
closed = True
closed_date = <Date 2020-10-19.19:27:05.193>
closer = 'iritkatriel'
components = ['XML']
creation = <Date 2012-09-07.06:38:03.026>
creator = 'dcallagh'
dependencies = []
files = []
hgrepos = []
issue_num = 15877
keywords = []
message_count = 4.0
messages = ['169974', '169982', '377715', '378996']
nosy_count = 3.0
nosy_names = ['amaury.forgeotdarc', 'dcallagh', 'iritkatriel']
pr_nums = []
priority = 'normal'
resolution = 'works for me'
stage = 'resolved'
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue15877'
versions = ['Python 2.7']

dcallagh · 2012-09-07T06:38:02Z

Python 2.7.3 (default, Jul 24 2012, 10:05:38) 
[GCC 4.7.0 20120507 (Red Hat 4.7.0-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> c = u'\u65e5\u672c\u8a9e'
>>> import xml.dom.minidom

Encoded as UTF-8, everything is fine:

>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="UTF-8" ?><x>%s</x>' % c.encode('UTF-8'))
<xml.dom.minidom.Document instance at 0x7f310d27dcf8>

but not ISO-2022-JP:

>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % c.encode('ISO-2022-JP'))
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 942, in parseString
    return builder.parseString(string)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 48

lxml can handle it fine though:

>>> import lxml.etree
>>> lxml.etree.fromstring('<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % c.encode('ISO-2022-JP'))
<Element x at 0x7f310d284960>
>>> _.text == c
True

amauryfa · 2012-09-07T09:22:33Z

This is similar to bpo-13612: pyexpat does not support multibyte encodings.
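
One possible workaround, sketched here for Python 3 rather than the 2.7 interpreter in the report, is to decode the bytes in Python and pass the resulting str to minidom, so that expat never has to handle the multibyte encoding itself. The helper name `parse_bytes_with_codec` is hypothetical, and the sketch assumes the caller already knows which codec the document uses.

```python
import xml.dom.minidom


def parse_bytes_with_codec(raw, codec):
    """Decode `raw` in Python, then hand the resulting str to minidom.

    Hypothetical helper: the caller must already know the codec, for
    example from the XML declaration or from transport metadata.
    """
    text = raw.decode(codec)
    # For str input, pyexpat feeds the parser UTF-8 and overrides the
    # encoding named in the XML declaration, so the declaration can stay.
    return xml.dom.minidom.parseString(text)


c = '\u65e5\u672c\u8a9e'
raw = (b'<?xml version="1.0" encoding="ISO-2022-JP" ?><x>'
       + c.encode('ISO-2022-JP') + b'</x>')
doc = parse_bytes_with_codec(raw, 'ISO-2022-JP')
parsed_text = doc.documentElement.firstChild.data
```

This trades the parser's own encoding detection for an explicit decode step, which may be acceptable when the encoding is known up front.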

iritkatriel · 2020-09-30T18:06:32Z

I don't see this problem on 3.10. Is this still an issue or can this issue be closed?

Running Release|Win32 interpreter...
Python 3.10.0a0 (heads/bpo17490-dirty:00eb063b66, Sep 27 2020, 13:20:24) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> c = u'\u65e5\u672c\u8a9e'
>>> import xml.dom.minidom
>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="UTF-8" ?><x>%s</x>' % c.encode('UTF-8'))
<xml.dom.minidom.Document object at 0x015FC9E8>
>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % c.encode('ISO-2022-JP'))
<xml.dom.minidom.Document object at 0x01493208>
>>>
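
A caveat when rerunning the original commands on Python 3: formatting a bytes object into a str with `%s` inserts its repr (`b'...'`), so the document passed to `parseString` above is a pure-ASCII str and may not exercise ISO-2022-JP decoding at all. The sketch below shows that effect and builds a bytes document that does carry the encoded text. The `try_parse` helper is hypothetical, and no particular outcome of the bytes parse is assumed.

```python
import xml.dom.minidom

c = '\u65e5\u672c\u8a9e'
encoded = c.encode('ISO-2022-JP')

# Python 3: "%s" % bytes inserts the repr, so this document is an ASCII str.
as_str = '<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % encoded
str_doc = xml.dom.minidom.parseString(as_str)
str_doc_text = str_doc.documentElement.firstChild.data

# Concatenating bytes keeps the real ISO-2022-JP escape sequences.
as_bytes = (b'<?xml version="1.0" encoding="ISO-2022-JP" ?><x>'
            + encoded + b'</x>')


def try_parse(document):
    """Return the parsed Document, or the exception raised while parsing."""
    try:
        return xml.dom.minidom.parseString(document)
    except Exception as exc:  # broad on purpose: this is only a probe
        return exc


result = try_parse(as_bytes)
print(type(result).__name__, result)
```

Comparing `str_doc_text` with `c` shows whether the str-based run tested the encoding, while `result` reports what the interpreter in use does with the bytes document.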

iritkatriel · 2020-10-19T19:27:05Z

Closing - this now works for me on Python 3.8 and 3.10. It was fixed sometime in the last 8 years.

dcallagh mannequin added the topic-XML label Sep 7, 2012

iritkatriel closed this as completed Oct 19, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022
