ElementTree attributes replace "\r" with "\n" #83192

mefistotelis · 2019-12-09T23:40:50Z

BPO	39011
Nosy	@scoder, @serhiy-storchaka, @mefistotelis
PRs	bpo-39011: Preserve line endings within ElementTree attributes #18468
Files	0001-bpo-39011-Preserve-line-endings-within-attributes.patch: Patch v1 0002-bpo-39011-Test-white-space-preservation-in-attribs.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2020-04-12.12:52:57.755>
created_at = <Date 2019-12-09.23:40:50.208>
labels = ['expert-XML', 'type-bug', '3.9']
title = 'ElementTree attributes replace "\\r" with "\\n"'
updated_at = <Date 2020-04-12.12:52:57.755>
user = 'https://github.com/mefistotelis'

bugs.python.org fields:

activity = <Date 2020-04-12.12:52:57.755>
actor = 'scoder'
assignee = 'none'
closed = True
closed_date = <Date 2020-04-12.12:52:57.755>
closer = 'scoder'
components = ['XML']
creation = <Date 2019-12-09.23:40:50.208>
creator = 'mefistotelis'
dependencies = []
files = ['48885', '48890']
hgrepos = []
issue_num = 39011
keywords = ['patch']
message_count = 9.0
messages = ['358154', '358181', '358219', '358831', '361664', '361681', '361682', '361795', '366244']
nosy_count = 4.0
nosy_names = ['scoder', 'serhiy.storchaka', 'mefistotelis', 'nows']
pr_nums = ['18468']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue39011'
versions = ['Python 3.9']

mefistotelis · 2019-12-09T23:40:50Z

TLDR:
If I place "\r" in an Element attribute, it is handled and idiomized to " " in the XML file. But wait - \r is not really code 10, right?

Real description:

If I create ElementTree and read it just after creation, I'm getting what I put there - "\r". But if I save and re-load, it transforms into "\n". The character is incorrectly converted before being idiomized, and saved XML file has invalid value stored.

Quick repro:

# python3 -i
Python 3.8.0 (default, Oct 25 2019, 06:23:40)  [GCC 9.2.0 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> elem = ET.Element('TEST')
>>> elem.set("Attt", "a\x0db")
>>> tree = ET.ElementTree(elem)
>>> with open("_test1.xml", "wb") as xml_fh:
...     tree.write(xml_fh, encoding='utf-8', xml_declaration=True)
...
>>> tree.getroot().get("Attt")
'a\rb'
>>> tree = ET.parse("_test1.xml")
>>> tree.getroot().get("Attt")
'a\nb'
>>>

Related issue: https://bugs.python.org/issue5752
(keeping this one separate as it seem to be a simple bug, easy to fix outside of the discussion there)

If there's a good workaround - please let me know.

Tested on Windows, v3.8 and v3.6

serhiy-storchaka · 2019-12-10T11:14:09Z

See https://www.w3.org/TR/REC-xml/#sec-line-ends.

mefistotelis · 2019-12-10T20:04:34Z

Disclaimer: I'm not at all an expert in XML specs.

The linked spec chapter, "End-of-Line Handling", says all line endings should behave like they were converted to "\n" _before_ parsing.

This means:

This part of spec does not apply to the behavior described in the issue , because line endings are converted before the file is saved. The spec describes loading process, not saving.
Before parsing, the line endings within attributes are replaced by idioms - so they are no longer line endings in the meaning assigned by the spec. The chapter starts with clear indication that it only applies to line endings which are used to give structure to physical file. The affected line endings are narrowed by stating: "files [...], for editing convenience, are organized into lines.". Since line endings in attributes are idiomized, they don't take part of organizing file into lines.

Then again, I'm not an expert. From the various specs I worked with, I know that the affected industry always comes out with unified interpretation of specs. If it was widely accepted to apply this chapter to values of attributes, I'd understand.

scoder · 2019-12-23T19:19:53Z

I think we did it wrong in bpo-17582. Parser behaviour is not a reason why the *serialisation* should modify the content.

Luckily, fixing this does not impact the C14N serialisation (which aims to guarantee byte identical serialisation), but it changes the "normal" serialisation. I would therefore suggest that we remove the newline replacement code in the next release only, Py3.9.

@mefistotelis, do you want to submit a PR?

mefistotelis · 2020-02-10T00:24:28Z

Patch attached.

I was thinking about one for() instead, but didn't wanted to introduce too large changes..

Let me know if you would prefer something like:

    for i in (9,10,13,):
        if chr(i) not in text: continue
        text = text.replace(chr(i), "&#{:02d};".format(i))

That would also make it easy to extend for other chars, ie. if we'd like the parser to be always able to re-read the XML we've created. Currently, placing control chars in attributes will prevent that. But I'm getting out of scope of this issue now.

scoder · 2020-02-10T12:13:41Z

Your patch looks good to me. Could you please add (or adapt) the tests and then create a PR from it? You also need to write a NEWS entry for this change, and it also seems worth an entry in the "What's new" document.

https://devguide.python.org/committing/

nows · 2020-02-10T12:20:52Z

Hope it is fixed now.

mefistotelis · 2020-02-11T10:39:02Z

I'm on it.

Test update attached.

scoder · 2020-04-12T12:52:18Z

New changeset 5fd8123 by mefistotelis in branch 'master':
bpo-39011: Preserve line endings within ElementTree attributes (GH-18468)
5fd8123

mefistotelis mannequin added 3.8 only security fixes topic-XML type-bug An unexpected behavior, bug, or error labels Dec 9, 2019

scoder added 3.9 only security fixes and removed 3.8 only security fixes labels Dec 23, 2019

scoder closed this as completed Apr 12, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ElementTree attributes replace "\r" with "\n" #83192

ElementTree attributes replace "\r" with "\n" #83192

mefistotelis mannequin commented Dec 9, 2019

mefistotelis mannequin commented Dec 9, 2019

serhiy-storchaka commented Dec 10, 2019

mefistotelis mannequin commented Dec 10, 2019

scoder commented Dec 23, 2019

mefistotelis mannequin commented Feb 10, 2020

scoder commented Feb 10, 2020

nows mannequin commented Feb 10, 2020

mefistotelis mannequin commented Feb 11, 2020

scoder commented Apr 12, 2020

ElementTree attributes replace "\r" with "\n" #83192

ElementTree attributes replace "\r" with "\n" #83192

Comments

mefistotelis mannequin commented Dec 9, 2019

mefistotelis mannequin commented Dec 9, 2019

serhiy-storchaka commented Dec 10, 2019

mefistotelis mannequin commented Dec 10, 2019

scoder commented Dec 23, 2019

mefistotelis mannequin commented Feb 10, 2020

scoder commented Feb 10, 2020

nows mannequin commented Feb 10, 2020

mefistotelis mannequin commented Feb 11, 2020

scoder commented Apr 12, 2020