-
-
Notifications
You must be signed in to change notification settings - Fork 29.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ElementTree not preserving attribute order #78341
Comments
Starting with Python3.6, the order of keyword arguments has been guaranteed. Something in ElementTree is not respecting that order. $ python3.7
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 26 2018, 23:26:24)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.etree.ElementTree import Element, dump
>>> dump(Element('cirriculum', status='public', company='example'))
<cirriculum company="example" status="public" />
>>> # ^-----------------^-------------------- These are swapped |
Attributes are sorted by name when converted an element to the text representation for making it stable. Changing this may break existing software. I think it makes sense now to add an optional keyword-only parameter sort_attrs (True by default) for functions tostring() and tostringlist() and method ElementTree.write(). If it is False, attributes will keep the original order. Since dump() is used for debugging only, it could change the behavior by default, even in maintained releases. But currently it just uses ElementTree.write(), which should sort attributes by default for compatibility. |
At least for lxml, attributes were never specified to have a sorted order (although attribute dicts are sorted on *input*, to give a *predictable* order as in ET), and the tutorial says: "Attributes are just unordered name-value pairs". However, I still think that Serhiy is right: this change would break code, and in particular test code that compares XML output. Having to deal with two different "correct" serialisations in tests depending on the Python version is annoying at best. But OTOH, comparing XML output should always be based on C14N and not on an arbitrary serialisation. XML C14N serialisation was specifically designed to provide a deterministic output byte sequence, regardless of how the XML document was created. (This is also used for cryptographic fingerprinting.) It is interesting that ET sorts the attributes on output. lxml does it only on API *input*, because the underlying tree model is order preserving. Parsed attributes are always written in the original document order, followed by the attributes set through the API in the order in which they were set. For lxml, a serialisation flag is not going to help, because the serialisation order is always deterministic from the time where an attribute is added to the tree. Given that dicts are order preserving in Python now, ET could follow this path as well. The attributes are already in parse order and could easily be serialised that way, and attributes that were added through the API would end up being serialised last, also preserving the order. This would leave lxml and ET with the same serialisation order, which seems attractive. In short, I like this change, it's just difficult to see a way how this could preserve full backwards compatibility. And it's difficult to say how important backwards compatibility really is here. |
This doesn't strike me as a bug, or even clearly being a potential improvement. If this about round-trip parsing/serializing not preserving order, that would be significant. As it is, if this is only intended as a debugging tool, why change it? Is there a concrete reason to change this and break backwards compatibility? If not I suggest we close this as "won't fix". |
Consider this as a feature request. It is perfectly reasonable for a user to want to generate specific XML output and to want to control the order that the attributes are listed. It is perfectly reasonable for the API to preserve the order that the user specified rather than change it into some other order that they either don't expect or can't control. Starting in Python 3.6, the language guarantees that the order of keyword arguments is preserved. Knowing that, it will be natural for users to start viewing as buggy APIs that discard that order when everything else in the eco-system respects the ordering. When I teach ElementTree in Python courses, students notice this behavior and ask why the order was swapped. My only answer is that we didn't used to have a way to control it, so the input order was ignored (i.e. a reason that no longer makes sense). |
Thanks for the clarification, Raymond. I've spent a few minutes searching for uses of dump(), and have only come up with uses for debugging or printing helpful info in command line tools. So, it seems that changing this as suggested wouldn't break much existing code, since I couldn't easily find even one such example. |
This is an easy issue and I left it for beginning contributors. The PR should include the following changes:
|
I'm working on this issue, but I have some questions:
|
Though it is compatible with earlier versions, I don't see any reason to keep sorting at all, especially as a default. The reasonable default is that whatever order the user specifies is what the tool does. The backwards compatible argument is specious. The ordering was never guaranteed. The only reason that we had sorting was to make the XML generation deterministic, and that is something we would still have. I vote against the permanent API expansion with an unhelpful default. The sorting is no longer necessary or desirable. Appropriately, we never guaranteed it, leaving us the implementation flexibility to make exactly this change. AFAICT, no XML user has ever expressed an interest in having attributes appear in alphabetical order -- for that matter, I can't recall any HTML user having expressed this preference (instead, they prefer to put class and id tags ahead of other tags for example). |
Diego, it would be premature to write an implementation until we've agreed on an API. I object to the Serhiy's proposal, and if no one agrees to my proposal to preserve the user's specific actions, then it would be better to do nothing at all, leaving the current API (deterministic, but with no particular order being guaranteed). |
There is also a middle ground: Keep .write() as it is *and* have .dump() preserve the order in which the parameters were supplied in the constructor. They could both use a common *internal* method with a flag to select the desired ordering. |
Okay, lets just remove sorting. The method for creating XML file in the minidom module also sorts attributes. It should be changed as well. XMLGenerator in xml.sax.saxutils doesn't sort attributes. |
Diego, I've taken care of the ElementTree case. Would you like to make a PR for the other occurrences (minidom, html, etc)? |
Nice enhancement! I was always annoyed by XML libraries in PHP and Python which don't respect the "insertion order". |
Raymond, sure. I could do the rest, I'll start tomorrow Monday. |
Serhiy Storchaka, Raymond. I already made the 10219 PR. Preserves order in html Element attributes and minidom. |
Thanks for doing the additional work to finish this up. |
Hi, this broke my tests, just as earlier comments predicted. Can we get an optional argument to sort the keys? The json module lets us sort keys even though it's irrelevant to JSON. Being able to sort attributes would be a very useful addition to the prettyxml method. |
(sorry, to sort the attributes.) |
Sorry for being unclear. The user needed to *overcome* the sorting behavior so that they could produce an ordering of their own choosing. The problem then became that we had to work out a way to defeat the sorting to allow the requested attribute ordering to be preserved. Roughly: casefile = Element('casefile', clearance="confidential",
access="internal", valid_through="31 Mar 2022") The desired result is that the attribute order be preserved. By removing the sort, a user can specify an order they want and have that specification honored. |
in fact, there is no easy way for the fix for python-docutils. |
Please don't break the backward compatibility without an easy way to retrieve Python 3.7 behavior. I set the priority to release blocker until a consensus can be found, and I add Lukasz (Python 3.8 release manager) in the loop. Ned Batchelder: "I'm a bit mystified why people are still opposed to providing sorting control, even after people are resorting to "hack work-arounds." The (two) pull requests that provide the control are simple, easy to understand, and easy to test." I concur. I'm fine with sorting by default, but it breaks the backward compatibility on purpose without providing an option to opt-in for the old behavior :-( Many XML parsers rely on the order of attributes. It's part of the XML "semantics". Well, it shouldn't, but I cannot fix all applications around the world to explain them that Python is right, and they are all wrong :-) It's not straighforward to fix an application to get Python 3.7 behavior. I would prefer to not have to copy-paste Stefan Behnel's recipe in every project using XML who wants to sort attributes: """ def sort_attributes(root):
for el in root.iter():
attrib = el.attrib
if len(attrib) > 1:
attribs = sorted(attrib.items())
attrib.clear()
attrib.update(attribs)
""" This recipe does modify the document and so changes the behavior of the application when it iterates on attributes later, whereas in Python 3.7 attributes are only sorted while writing the XML into a file. |
Victor, as much as I appreciate backwards compatibility, I really don't think it's a big deal in this case. In fact, it might not even apply. My (somewhat educated) gut feeling is that most users simply won't care or won't even notice the change. Out of those who do (or have to) care, many are better off by fixing their code to not rely on an entirely arbitrary (sorted by name) attribute order than by getting the old behaviour back. And for those few who really need attributes to be sorted by name, there's the recipe I posted which works on all ElementTree implementations out there with all alive CPython versions.
This is actually a very rare thing. If I were to make up numbers, I'd guess that some 99% of the applications do XML serialisation as the last thing and then throw away the tree afterwards, without touching it (or its attributes) again. And the remaining cases are most probably covered by the "don't need to care" type of users. I don't think we should optimise for 0.05% of our user base by providing a new API option for them. Especially in ElementTree, which decidedly aims to be simple. The example that Ned gave refers to a very specific and narrow case: comparing serialised XML, at the byte level, in tests. He was very lucky that ElementTree was so stable over the last 10 Python releases that the output did not change at all. That is not something that an XML library needs to guarantee. There is some ambiguity in XML for everything that's outside of the XML Information set, and there is a good reason why the W3C has tackled this ambiguity with an explicit and *separate* specification: C14N. So, when you write:
It's certainly not many parsers, and could even be close to none. The order of attributes is explicitly excluded from the XML Information set: https://www.w3.org/TR/xml-infoset/#omitted Despite this, cases where the order of the attributes matters to the *application* are not unheard of. But for them, attributes sorted by their name are most likely the problem and not a solution. Raymond mentioned one such example. Sorting attributes by their name really only fulfils a single purpose: to get reproducible output in cases where the order does *not* matter. For all the (few) cases where the order *does* matter, it gets in the way. But by removing the sorting, as this change does, we still get predictable output due to dict ordering. So this use case is still covered. It's just not necessarily the same output as before, because now the ordering is entirely in the hands of the users. Meaning, those users who *do* care can now actually influence the ordering, which was very difficult and hackish to achieve before. We are allowing users to remove these hacks, not forcing them to add new ones. |
In short, the XML serialization is no longer deterministic. It depends in which order attributes are defined. Example: from xml.etree import ElementTree as ET
document = ET.Element('document')
document.attrib['a'] = '1'
document.attrib['b'] = '2'
ET.dump(document)
document = ET.Element('document')
document.attrib['b'] = '2'
document.attrib['a'] = '1'
ET.dump(document) Python 3.7 output: Python 3.8 output: On this example, it's obvious that the attributes are defined in a different order, but the code which generates the XML document can be way more complex and use unordered data structures like set(). Why does it matter? Most programs compare XML using basic string comparison, they don't implement smart XML comparison which ignore attributes order. Concrete example:
It's not just a matter of parsers.
... Really? Why do you think that so many people are involved in this issue if nobody cares of XML attribute order? :-) |
I'll make a post to python-dev so that we can escalate the discussion and get broader participation. Personally, I'm not wedded to any one particular outcome and just want to do what is best for the users in the long run. |
Here is a basic comparison between two xml streams. Maybe we could propose this solution when we want to compare an XML with xml.dom.minidom instead of the byte-to-byte comparison. |
We found two occurrences of external code breaking due to this change, and one looks tricky to fix (docutils), this indicates there are other library that will break.
I concur with this. Why changing the default behavior? The costs of changing this looks high, and the benefits looks very low (people not noticing). As already proposed, a keyword-only On the plus side |
Actually, it is really easy to fix just by updating the target comparison files (that is the procedure described in the comments for the test). What is taking more thought is coming up with a better test that verifies intended semantic content rather than the exact serialization. I'm working with Stéphane to figure out how to do this. |
By answering this, it looks like you're currently going this way, and obviously you'll succeed fixing docutils, this still mean voluntarily leaving some (or many?) other code broken (open source and closed source), how that's acceptable? |
For docutils, we'll most likely propose some variant of Stéphane Wirtel's script to test semantic equivalence for docutils. For other cases, Serhiy is working on a C14N canonicalization tool which is specifically designed for the task of creating reproducible output, in a cross-language standards compliant way. As Stefan Behnel clearly articulated, there are multiple reasons why Python should not guarantee byte-for-byte serialization across point releases. That said, we'll likely make the guarantee across micro-releases. That will make it possible a third mitigation strategy of generating new baseline files for a new point releases and adding a version check to decide which baseline to test against. FWIW, we had a similar discussion regarding hash randomization. While there are a number of significant differences, the outcome is relevantL User tests that depended on non-guaranteed implementation details had to be fixed. |
I set the priority to release blocker to make sure that the issue got enough attention. Raymond started a thread on python-dev, so I reset the priority: Moreover, the change is now documented in "What's New in Python 3.8?" documentation ("Changes in the Python API" section): |
Thanks to the discussion that rhettinger triggered on python-dev, it seems that there is a majority accordance for considering the previous ordering "undefined but deterministic" rather than "sorted", and for making the change from "arbitrarily sorted" to "user defined" in Py3.8. That matches the current status after the merge of the four pull requests, for both ElementTree and MiniDOM. There also seems to be a large fraction of participants that would like to see a stdlib way to safely/correctly/easily compare XML output. IMHO, that is covered by bpo-13611, adding a C14N module to ElementTree. There is also a (Py2) C14N implementation for minidom, from the now unmaintained PyXML package¹, which, according to the license, is already distributed under Python copyright. I will try to bring at least the ET module in shape for integration before 3.8 beta1. During the discussions, several ways were shown to achieve deterministic behaviour, also across Python releases. My proposal is therefore to *not* add a new "sort_attrs" option to the API, with the reasoning that it is a) not required for any of the use cases, b) not maintaining compatibility without code changes (neither backwards nor forward), and c) not what most people actually want who strive for deterministic output. Instead, I think we should concentrate on providing and documenting better ways to compare XML output. ¹ https://github.com/actmd/PyXML/blob/97f144152878a67cd95dd229fe72e86f19bce706/xml/dom/ext/c14n.py |
Stefan, as the module maintainer, you get to make the final decision on this. What you've proposed makes sense in light of the prior discussion and the python-dev discussion. Unless you think something new will come along to change your mind, just mark as closed/fixed, then continue work on c14n in another tracker issue. |
This is done now. Thanks everyone who helped in discussing and implementing this change. I will leave Serhiy's last PR (adding the "sort_attrs" flag option) open for a while until I'm sure we have a better solution for comparing XML in 3.8, at which point I would reject it. |
FYI pungi project is also broken by this change and it blocks Fedora Rawhide to upgrade to Python 3.8: |
PR 10452 is still open. Should it be closed if you consider that we must not add an optional sort_attrs=False attribute? |
I rejected the (now conflicting) PR that adds a sorting option. |
I implemented (most of) C14N 2.0 in bpo-13611. Please give it a try if you are interested in the canonical serialisation feature. I would like to include it in Py3.8. |
Other examples of projects broken by this change:
What's New In Python 3.8 only briefly mention this change: Would it be possible to suggest solutions for this backward incompatible change? Like https://bugs.python.org/issue34160#msg338102 recipe. Maybe we should even put it in the XML documentation, somewhere. |
Victor, you mean place again this code? https://github.com/python/cpython/pull/10163/files#diff-d5a064acb6ae44dcb7e01fee148c733dR926 And the recipe proposed in https://bugs.python.org/issue34160#msg338102 place it as a method of ElementTree or just as a helper function? |
I created a PR that adds a couple of paragraphs to the documentation. Comments welcome. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: