-
-
Notifications
You must be signed in to change notification settings - Fork 31.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate ElementC14N module into xml.etree package #57820
Comments
The ElementC14N.py module by Fredrik Lundh implements XML canonicalisation for the ElementTree serialiser. Given that the required API hooks to use it are already in xml.etree.ElementTree, this would make a nice, simple and straight forward addition to the existing xml.etree package. The source can be found here (unchanged since at least 2009): https://bitbucket.org/effbot/et-2009-provolone/src/tip/elementtree/elementtree/ElementC14N.py Note that the source needs some minor modifications to use relative imports at the top. Also, the "2.3 compatibility" code section can be dropped. |
Code added to the standard library should be contributed by its author, with an explicit statement of plans to support it in an ongoing manner, and preferably also with plans to stop providing standalone releases over time. |
Whilst in most cases this would be correct, in this case it looks like the original contributor took a subset of what the original author wrote and put it into the python libraries. Until relatively recently the ElementTree.py file included a stanza that attempted to import the ElementC14N module and conditionally set up the 'c14n' key value in _serialize |
"c14n" is documented as an accepted serialization method of write() and there is some (non-working) code for support of C14N. e6a951b removed optional import of ElementC14N. |
References: Canonical XML Version 2.0 -- https://www.w3.org/TR/xml-c14n2/ |
Turns out, it was not that easy. :-/ ElementTree lacks prefixes in its tree model, so they would have to be either registered globally (via register_namespace()) or come from the parser. I tried the latter since that is the most generic way when the input is serialised already. See bpo-36673 and bpo-36676 for extensions to the parser target interface that this implementation relies on. Note that this is a new implementation, only marginally based off the original ElementC14N implementation. I only implemented C14N 2.0 (which lxml also does not have, but I'll add it there). I got most of the official test cases working, including prefix rewriting and prefix resolution in tag and attribute content. https://www.w3.org/TR/xml-c14n2-testcases/ What's not supported? The original namespace prefixes may not be preserved when namespaces are declared with multiple prefixes. In that case, one of them is picked. That's difficult to implement in ET because the parser resolves and discards prefixes. I think that's acceptable, as long as the prefix selection is deterministic. Also, qname rewriting in XPath expressions that appear in XML text is not currently supported. I guess that's a bit of an esoteric feature which can still be added later if it's needed. While testing, I noticed that ET and cET behave differently when it comes to resolving default attributes from an internal DTD subset. The parser in cET does it, ET does not. That should probably get aligned. For now, the tests hack around that difference. Comments and reviews welcome. |
Review of what? There is no PR attached to this issue. |
It took me a couple of minutes longer to submit it, but it's there now. :) I'm aware that there is a lot of new code involved, covering really three new features, which makes reviewing it a non-trivial task. I personally think it's ready to go into the last alpha release on Monday to receive some initial visibility, but I would like to have at least a little feedback before that. Even just a general opinion whether you support pushing this into 3.8 or not. Postponing it to the first beta would be ok, if you need more time to form an opinion, but having it in an alpha would improve the chance of getting user feedback. |
Playing around with it a bit more, I ended up changing the interface of the canonicalize() function to return its output as a string by default. It's really nice to be able to say c14n_xml = canonicalize(plain_xml) To write to a file, you now do this: with open("c14n_output.xml", mode='w', encoding='utf-8') as out_file:
canonicalize(xml_data, out=out_file) and to read from a file: canonicalize(from_file=fileobj) I think that nicely handles all use cases. |
Since I didn't get any negative comments or requests for deferral, I'll merge this today to get the feature into the last (still unreleased) alpha. We still have the beta phase to resolve issues with it. |
The PR has reference leaks. |
Thanks for testing, Zackery. I resolved the reference leaks. They were already in the PR for bpo-36676. Both PRs updated. |
A buildbot failure made me notice that the test files were not part of the CPython installation yet, so I added them. I also took the opportunity to add a README file that describes where they come from and under which conditions they were originally provided by the W3C (IANAL, but basically free use with copyright notice). https://www.w3.org/Consortium/Legal/2015/doc-license Is there anything else I have to take care of when adding externally provided/licensed files to the source tree? |
Maybe complete Doc/license.rst? |
Thanks, done. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: