ElementTree attributes replace "\r" with "\n" #83192
TLDR:
Real description: If I create an ElementTree and read it right after creation, I get back what I put there: "\r". But if I save and re-load it, the value turns into "\n". The character is incorrectly converted before being escaped as a character reference, so the saved XML file has an invalid value stored.
Quick repro:
# python3 -i
Python 3.8.0 (default, Oct 25 2019, 06:23:40) [GCC 9.2.0 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> elem = ET.Element('TEST')
>>> elem.set("Attt", "a\x0db")
>>> tree = ET.ElementTree(elem)
>>> with open("_test1.xml", "wb") as xml_fh:
...     tree.write(xml_fh, encoding='utf-8', xml_declaration=True)
...
>>> tree.getroot().get("Attt")
'a\rb'
>>> tree = ET.parse("_test1.xml")
>>> tree.getroot().get("Attt")
'a\nb'
>>>
Related issue: https://bugs.python.org/issue5752
If there's a good workaround, please let me know.
Tested on Windows, v3.8 and v3.6.
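The repro above can also be run without touching the file system. The following is a minimal sketch, assuming an in-memory buffer is an acceptable stand-in for `_test1.xml`; `roundtrip_attr` is a hypothetical helper name, not part of ElementTree. It makes no claim about which result a given Python version produces; it only reports what the running interpreter does.

```python
import io
import xml.etree.ElementTree as ET


def roundtrip_attr(value, name="Attt"):
    """Serialize an element carrying one attribute, parse the bytes back,
    and return (serialized_bytes, reparsed_value)."""
    elem = ET.Element("TEST")
    elem.set(name, value)
    buf = io.BytesIO()
    ET.ElementTree(elem).write(buf, encoding="utf-8", xml_declaration=True)
    data = buf.getvalue()
    reparsed = ET.fromstring(data).get(name)
    return data, reparsed


if __name__ == "__main__":
    data, reparsed = roundtrip_attr("a\rb")
    print(data)  # shows how "\r" was actually written
    print(repr(reparsed), "preserved" if reparsed == "a\rb" else "changed")
```

Printing the serialized bytes next to the reparsed value shows whether the change happens on write or on read.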
Disclaimer: I'm not at all an expert in XML specs. The linked spec chapter, "End-of-Line Handling", says all line endings should behave as if they had been converted to "\n" _before_ parsing. This means:
Then again, I'm not an expert. From the various specs I have worked with, I know that the affected industry always comes up with a unified interpretation of the specs. If it were widely accepted to apply this chapter to attribute values, I'd understand.
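The distinction between line ends in the input and characters written as references can be checked against the parser directly. This is a small sketch, assuming the default expat-based parser that ships with CPython; the element and attribute names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Literal CR LF in element text: end-of-line handling turns it into one "\n".
text = ET.fromstring(b"<T>x\r\ny</T>").text

# Literal CR inside an attribute value: it is treated as a line end first,
# and attribute-value normalization then applies, so no "\r" survives.
literal = ET.fromstring(b'<T a="x\ry"/>').get("a")

# CR written as a character reference: it is not a line end in the input,
# so the parser hands back a real "\r".
reference = ET.fromstring(b'<T a="x&#13;y"/>').get("a")

print(repr(text), repr(literal), repr(reference))
```

Only the character reference form keeps the carriage return, which is why the way the serializer writes "\r" decides whether the value survives a round trip.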
I think we did it wrong in bpo-17582. Parser behaviour is not a reason why the *serialisation* should modify the content. Luckily, fixing this does not impact the C14N serialisation (which aims to guarantee byte identical serialisation), but it changes the "normal" serialisation. I would therefore suggest that we remove the newline replacement code in the next release only, Py3.9. @mefistotelis, do you want to submit a PR?
Patch attached. I was thinking about a single for loop instead, but didn't want to introduce too large a change. Let me know if you would prefer something like:
for i in (9, 10, 13):
    if chr(i) not in text: continue
    text = text.replace(chr(i), "&#{:02d};".format(i))
That would also make it easy to extend to other chars, i.e. if we'd like the parser to always be able to re-read the XML we've created. Currently, placing control chars in attributes will prevent that. But I'm getting out of scope of this issue now.
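The loop proposed above can be tried out in isolation. The following is a sketch only: `escape_attr_sketch` is a hypothetical standalone helper, not the attached patch and not ElementTree's internal code, and the markup escaping in front of the loop is an assumption added so that the output can be pasted into an attribute and parsed back.

```python
import xml.etree.ElementTree as ET


# Hypothetical helper for illustration; not the attached patch.
def escape_attr_sketch(text):
    # Escape markup characters first, "&" before anything else, so the
    # character references produced below are not escaped a second time.
    for char, entity in (("&", "&amp;"), ("<", "&lt;"),
                         (">", "&gt;"), ('"', "&quot;")):
        text = text.replace(char, entity)
    # Tab, line feed and carriage return become numeric character references.
    for i in (9, 10, 13):
        text = text.replace(chr(i), "&#{:02d};".format(i))
    return text


value = 'a\rb\tc\nd & "e"'
xml_text = '<TEST Attt="{}"/>'.format(escape_attr_sketch(value))
print(xml_text)

# The parser resolves the references back to the original characters.
assert ET.fromstring(xml_text).get("Attt") == value
```

The `if chr(i) not in text: continue` guard from the proposal is left out here because `str.replace` is already a no-op when the character is absent; keeping it would only skip the call.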
Your patch looks good to me. Could you please add (or adapt) the tests and then create a PR from it? You also need to write a NEWS entry for this change, and it also seems worth an entry in the "What's new" document.
Hope it is fixed now.
I'm on it. Test update attached.