Skip to content

Conversation

dzeban
Copy link
Contributor

@dzeban dzeban commented Oct 17, 2017

This PR adds pretty printing to the xml.etree.ElementTree module, namely to the tostring, tostringlist functions and write method of the ElementTree object.

Pretty printing feature saves backward compatibility in a sense that old output won't be changed.

https://bugs.python.org/issue14465


Generates a string representation of an XML element, including all
subelements. *element* is an :class:`Element` instance. *encoding* [1]_ is
the output encoding (default is US-ASCII). Use ``encoding="unicode"`` to
generate a Unicode string (otherwise, a bytestring is generated). *method*
is either ``"xml"``, ``"html"`` or ``"text"`` (default is ``"xml"``).
*short_empty_elements* has the same meaning as in :meth:`ElementTree.write`.
*pretty* enables pretty printing with default indent. Keyword-only *indent*
allows to set custom indent.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allows to setallows setting (or allows one to set, etc)

@@ -935,7 +946,8 @@ ElementTree Objects
The keyword-only *short_empty_elements* parameter controls the formatting
of elements that contain no content. If ``True`` (the default), they are
emitted as a single self-closed tag, otherwise they are emitted as a pair
of start/end tags.
of start/end tags. *indent* enables and set indentation level for the
human readable output.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enables and sets
human-readable

@dzeban
Copy link
Contributor Author

dzeban commented Oct 18, 2017

@vadmium Thanks for the doc review - I've fixed the mistakes.

@serhiy-storchaka serhiy-storchaka self-assigned this Oct 18, 2017

Generates a string representation of an XML element, including all
subelements. *element* is an :class:`Element` instance. *encoding* [1]_ is
the output encoding (default is US-ASCII). Use ``encoding="unicode"`` to
generate a Unicode string (otherwise, a bytestring is generated). *method*
is either ``"xml"``, ``"html"`` or ``"text"`` (default is ``"xml"``).
*short_empty_elements* has the same meaning as in :meth:`ElementTree.write`.
*pretty* enables pretty printing with default indent. Keyword-only *indent*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please name it pretty_print instead for lxml compatibility.

@scoder
Copy link
Contributor

scoder commented Oct 20, 2017

Does this slow down the normal serialisation in any noticeable way?

@dzeban
Copy link
Contributor Author

dzeban commented Oct 21, 2017

As for serialization performance - here are some simple tests. I've tested with current master and my branch with and without pretty printing. I did tostring serialization on 207KB xml file.

$ git checkout master
$ ./configure --enable-optimizations
$ make
$ ./python 
Python 3.7.0a2+ (heads/master:73c4708, Oct 22 2017, 00:03:16) 
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux

$ ./python -m timeit -s "import xml.etree.ElementTree as ET; tree = ET.parse('xml.xml'); root = tree.getroot()" "ET.tostring(root, 'unicode', 'xml')"
20 loops, best of 5: 12.4 msec per loop

$ git checkout etree-pretty-print 
$ ./configure --enable-optimizations
$ make

$ ./python 
Python 3.7.0a1+ (heads/etree-pretty-print:f114e68, Oct 21 2017, 23:35:43) 
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux

$ ./python -m timeit -s "import xml.etree.ElementTree as ET; tree = ET.parse('xml.xml'); root = tree.getroot()" "ET.tostring(root, 'unicode', 'xml')"                                                                                                                           
20 loops, best of 5: 14.2 msec per loop

$ ./python -m timeit -s "import xml.etree.ElementTree as ET; tree = ET.parse('xml.xml'); root = tree.getroot()" "ET.tostring(root, 'unicode', 'xml', pretty_print=True)"
20 loops, best of 5: 16.7 msec per loop

So the new version is 14% slower. Pretty printing is 17% slower than non pretty printing. Overall, pretty printing enabled code is 34% slower than original version from master.

I'm not sure if it's really scary. Do we have such comparison for lxml?

@vstinner
Copy link
Member

./python -m timeit

I suggest you to use perf timeit, timeit is not reliable.
http://perf.readthedocs.io/en/latest/cli.html#timeit-cmd

@dzeban
Copy link
Contributor Author

dzeban commented Oct 24, 2017

Here are the benchmarks using perf module:

$ # Master branch
$ ./python -m perf timeit -s "import xml.etree.ElementTree as ET; tree = ET.parse('xml.xml'); root = tree.getroot()" "ET.tostring(root, 
'unicode', 'xml')"
.....................
Mean +- std dev: 12.8 ms +- 0.3 ms

$ # etree-pretty-print branch without pretty printing
$ ./python -m perf timeit -s "import xml.etree.ElementTree as ET; tree = ET.parse('xml.xml'); root = tree.getroot()" "ET.tostring(root, 'unicode', 'xml')"
.....................
Mean +- std dev: 14.3 ms +- 0.4 ms

$ # etree-pretty-print branch with pretty printing
$ ./python -m perf timeit -s "import xml.etree.ElementTree as ET; tree = ET.parse('xml.xml'); root = tree.getroot()" "ET.tostring(root, 'unicode', 'xml', pretty_print=True)"
.....................
Mean +- std dev: 15.7 ms +- 0.5 ms

This one shows that new version is 12% slower. Pretty printing adds 10%.

I don't think that slowing down because of pretty printing is a big deal because if you need fast serialization you don't need a pretty output. I'm gonna see how lxml behaves with and without pretty printing, cause I think I'll see the same percentage.

What I'm worried about is 12% slowing against master version. Should we really stress it? Because I think if you really need to serialize a lot of XML and need it fast you'll gonna use lxml anyway.

@dzeban dzeban force-pushed the etree-pretty-print branch from 97783fb to 1e9149c Compare October 28, 2017 10:20
@Stevoisiak
Copy link
Contributor

Stevoisiak commented Apr 24, 2018

@dzeban I think it's unfair to assume users will always use lxml over xml.etree.

  • xml.etree is built into Python
  • Many Python guides teach xml.etree over lxml
  • The API for xml.etree is arguably better documented
  • xml.etree allows sub-classes of Element, which lxml strongly discourages.

@scoder
Copy link
Contributor

scoder commented Apr 24, 2018

new version is 12% slower

I don't see how a rare and "mostly for debugging" feature like pretty printing could possibly justify such a big slow-down for a time critical feature like serialisation.

Many Python guides teach xml.etree over lxml

"Many" seems highly exaggerated.

The API for xml.etree is arguably better documented

A surprising argument, given that almost all of the documentation (and tutorials and books and ...) for xml.etree applies to lxml.etree just as well.

xml.etree allows sub-classes of Element, which lxml strongly discourages.

You shouldn't believe everything that random people write on the Internet. Why do you think there is a dedicated subclassing API in lxml?

@Stevoisiak
Copy link
Contributor

Stevoisiak commented Apr 25, 2018

@scoder Valid counterpoints. Though, on your last point:

You shouldn't believe everything that random people write on the Internet.

I was referencing an answer I wrote myself based on the documentation for lxml.etree.ElementBase.

Why do you think there is a dedicated subclassing API in lxml?

Is there a dedicated subclassing API in lxml? I would be glad to amend my point if this were the case.

Upon further review, it appears lxml.etree.ElementBase does support subclassing.

@Stevoisiak
Copy link
Contributor

Stevoisiak commented Jun 4, 2018

While I'm in favor of adding an option for pretty printing, I'm not sure it's worth a 12% decrease in performance.

Could the performance impact be lessened if the indent attribute was replaced with a constant?

@dzeban
Copy link
Contributor Author

dzeban commented Jun 5, 2018

@Stevoisiak do you mean to remove indent as an argument and make it a module level constant like indent = ' ' so the user couldn't change it? I could give it a shot but don't think that it's what users expect from Python APIs.

@mcepl
Copy link
Contributor

mcepl commented Aug 17, 2018

While I'm in favor of adding an option for pretty printing, I'm not sure it's worth a 12% decrease in performance.

@Stevoisiak This feels like a serious ungrateful bikeshedding to me. Could we first add this feature, which will finally remove for me the last and only reason why to use lxml at all (and close six years old bug)? Then you can fiddle with the performance. After all, if you really care about performance so much, than you probably won't beat lxml with its dependence on libxml2 which is the fastest XML library around.

@scoder
Copy link
Contributor

scoder commented Aug 18, 2018

lxml ... is the fastest XML library around.

That's true for many use cases but not for all. ElementTree is actually very memory efficient and very fast on certain API operations. Building trees is something that can be done very fast in ElementTree but is not quite as fast in lxml. If the serialisation then slows it down so much afterwards, that is a problem. I would rather like to see the serialisation performance of ElementTree improved, definitely not further slowed down.

@mcepl
Copy link
Contributor

mcepl commented Aug 18, 2018

That's true for many use cases but not for all.

The point I wanted to make is, that the question of performance of the ElementTree as found in the standard library is quite orthogonal to whether I would still have to pepper my code with silliness like:

# from http://infix.se/2007/02/06/gentlemen-indent-your-xml
    def __xmlIndent(self,elem,level=0):
        i = "\n" + level*"  "
        if len(elem):
            if not elem.text or not elem.text.strip():
                elem.text = i + "  "
            for e in elem:
                self.__xmlIndent(e, level+1)
                if not e.tail or not e.tail.strip():
                    e.tail = i + "  "
            if not e.tail or not e.tail.strip():
                e.tail = i
        else:
            if level and (not elem.tail or not elem.tail.strip()):
                elem.tail = i

Performance is the never ending story (https://bugs.python.org/issue14465#msg323690 promises more holy wars over C versus Cython versus some other hobby horse) and this particular issue will never be fixed.

@scoder
Copy link
Contributor

scoder commented Aug 19, 2018

Performance is the never ending story

Well, you can't just drop a performance regression on everyone just to get a feature implemented that many people won't use, and that is particularly uninteresting for performance relevant cases (where the amount of data is large and you wouldn't want to inflate it further). This is simply the wrong tradeoff.

pepper my code with silliness

What about providing a pull request that takes that code and either a) adds it as an indent(elem) function to xml.etree.ElementTree, or b) adds it to the documentation as a recipe? ISTM that the first would solve all problems: it's available in the stdlib for everyone to use, and it's out of the way for everyone who cares more about speed than indentation. You could even optimise it for memory efficiency along the way, by interning and reusing the indentation strings.

@mcepl
Copy link
Contributor

mcepl commented Aug 25, 2018

Well, you can't just drop a performance regression on everyone just to get a feature implemented that many people won't use

Well, of course, the pretty_print defaults to False, so it would happen only on request.

ISTM that the first would solve all problems: it's available in the stdlib for everyone to use, and it's out of the way for everyone who cares more about speed than indentation.

It wouldn't solve all problems. I would really prefer having compatible API (of course, again, defaulting to False). And yes, if there is a chance that PR is accepted, I will gladly use my function as the base for the pretty_print option.

@scoder
Copy link
Contributor

scoder commented Aug 25, 2018

the pretty_print defaults to False, so it would happen only on request.

Ah, but the performance regression of this PR was measured to be 12% when it the option is disabled and another 10% when enabling the option. I wouldn't mind the pretty printing case being slower, but I am arguing against making the normal case so much slower.

@scoder
Copy link
Contributor

scoder commented Aug 15, 2019

Rejecting this PR since it slows down the normal (non pretty-printed) serialisation.

@scoder scoder closed this Aug 15, 2019
@serhiy-storchaka serhiy-storchaka removed their assignment Sep 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants