event-driven XML parser #61941

pitrou · 2013-04-15T20:49:42Z

BPO	17741
Nosy	@jcea, @ncoghlan, @scoder, @jkloth, @bitdancer, @florentx, @ethanfurman
Files	etree_incparser.patch iterparse_cleanup.patch: make _IterParseIterator use an IncrementalParser instead of being one iterparse_api_cleanup.patch: new patch that also renames the methods to the normal ET names iterparse_TreeEventBuilder.patch: new complete patch that replaces the IncrementalParser by a TreeEventBuilder remove_IncrementalParser.patch integrate_IncrementalParser_into_XMLParser.patch issue17741.api-renames.1.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2013-09-02.00:05:25.756>
created_at = <Date 2013-04-15.20:49:42.016>
labels = ['expert-XML', 'type-feature', 'library']
title = 'event-driven XML parser'
updated_at = <Date 2014-03-11.13:18:02.443>
user = 'https://github.com/pitrou'

bugs.python.org fields:

activity = <Date 2014-03-11.13:18:02.443>
actor = 'r.david.murray'
assignee = 'none'
closed = True
closed_date = <Date 2013-09-02.00:05:25.756>
closer = 'eli.bendersky'
components = ['Library (Lib)', 'XML']
creation = <Date 2013-04-15.20:49:42.016>
creator = 'pitrou'
dependencies = []
files = ['29874', '31205', '31206', '31211', '31460', '31475', '31497']
hgrepos = []
issue_num = 17741
keywords = ['patch']
message_count = 109.0
messages = ['187029', '187030', '187042', '187058', '187244', '187271', '187273', '194722', '194723', '194724', '194728', '194729', '194730', '194731', '194732', '194733', '194734', '194735', '194736', '194737', '194738', '194739', '194740', '194741', '194742', '194743', '194744', '194745', '194746', '194747', '194752', '194754', '194758', '194770', '194772', '194784', '194786', '195945', '196025', '196026', '196027', '196028', '196031', '196032', '196033', '196035', '196046', '196053', '196064', '196079', '196105', '196116', '196133', '196134', '196135', '196136', '196137', '196139', '196143', '196146', '196148', '196149', '196155', '196168', '196177', '196178', '196180', '196185', '196186', '196187', '196188', '196189', '196190', '196191', '196193', '196220', '196231', '196236', '196237', '196238', '196244', '196250', '196262', '196267', '196270', '196274', '196290', '196295', '196322', '196324', '196327', '196347', '196384', '196388', '196389', '196532', '196533', '196639', '196759', '197309', '197365', '197384', '197553', '213092', '213104', '213123', '213139', '213146', '213147']
nosy_count = 9.0
nosy_names = ['jcea', 'ncoghlan', 'scoder', 'jkloth', 'r.david.murray', 'eli.bendersky', 'flox', 'ethan.furman', 'python-dev']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue17741'
versions = ['Python 3.4']

pitrou · 2013-04-15T20:49:41Z

iterparse() is a blocking operation. It is not really suitable for event-driven applications (e.g. non-blocking I/O). Here is a patch adding a IncrementalParser class.

pitrou · 2013-04-15T20:51:38Z

Note that basing iterparse() on IncrementalParser and removing the API discrepancy between the Python and C modules helps make the etree code smaller.

elibendersky · 2013-04-16T01:24:10Z

Thanks Antoine. This looks interesting - I'm somewhat swamped ATM but will try to review the patch in the next few days.

Incidentally, since it is a new feature would it be worthwhile to discuss it on python-ideas?

pitrou · 2013-04-16T09:05:28Z

I've posted on the tulip mailing-list already:
https://groups.google.com/forum/?fromgroups=#!topic/python-tulip/SNOnS27Bctc

elibendersky · 2013-04-18T12:45:19Z

Antoine, the patch LGTM. There's some more cleaning that needs to be done in surrounding code, but I can do that later. Also I should probably update the documentation with a bit more details.

Just add a NEWS entry when you commit.

Thanks for working on this.

python-dev · 2013-04-18T17:37:14Z

New changeset f903cf864191 by Antoine Pitrou in branch 'default':
Issue bpo-17741: Add ElementTree.IncrementalParser, an event-driven parser for non-blocking applications.
http://hg.python.org/cpython/rev/f903cf864191

pitrou · 2013-04-18T17:42:10Z

Thank you, Eli. Now committed :)

scoder · 2013-08-09T10:15:57Z

Here is a patch that cleans up the current implementation to avoid making the result of iterparse() an IncrementalParser (with all of its new API).

Please note that the tulip mailing list is not an appropriate place to discuss additions to the XML libraries, and ElementTree in particular.

scoder · 2013-08-09T11:14:59Z

Copying the discussion between Antoine and me from python-dev:

> IMO it should mimic the interface of the TreeBuilder, which
> calls the data reception method "feed()" and the termination method
> "close()". There is no reason to add yet another set of methods names
> just to do what others do already.

Well, the difference here is that after calling eof_received() you can
still (and should) call events() once to get the last events. I think
it would be weird if you could still do something useful with the object
after calling close().

Also, the method names are not invented, they mimick the PEP-3156
stream protocols:
http://www.python.org/dev/peps/pep-3156/#stream-protocols

I see your point about close(). I assume your reasoning was to make the IncrementalParser an arbitrary stream end-point. However, it doesn't really make all that much sense to connect an arbitrary data source to it, as the source wouldn't know that, in addition to passing in data, it would also have to ask for events from time to time. I mean, you could do it, but then it would just fill up the memory with parser events and loose the actual advantages of incremental parsing. So, in a way, the whole point of the class is to *not* be an arbitrary stream end-point.

Anyway, given that there isn't really the One Obvious Way to do it, maybe you should just add a docstring to the class (ahem), reference the stream protocol as the base for its API, and then rename it to IncrementalStreamParser. That would at least make it clear why it doesn't really fit with the rest of the module API (which was designed some decade before PEP-3156) and instead uses its own naming scheme.

scoder · 2013-08-09T11:32:06Z

Actually, let me take that last paragraph back. There is an Obvious Way to do it, and that's the feed() and close() methods. They are the existing and established ElementTree interface for incremental parsing. The fact that close() doesn't clean up all state is IMHO a minor issue. The state will be cleaned up automatically once the iteration terminates, and that's the normal behaviour of iterators. So it actually fits both protocols quite nicely.

I'd just leave the stream protocol out completely here. For one, the implementation isn't complete anyway (the connection_*() methods don't make much sense), and as I said, it's not very useful to consider the parser a general end-point to that protocol.

I'd also suggest returning the iterator over the remaining events from close(), just like the TreeBuilder returns the tree. That covers the (less common) use case of first parsing everything and then processing the events. I'll add another patch that does that.

scoder · 2013-08-09T13:37:30Z

Thinking about the original patch some more - I wonder why it doesn't use a wrapper for TreeBuilder to process the events. Antoine, is there a reason why you had to add this _setevents() method to the XMLParser, instead of making the IncrementalParser an IncrementalTreeBuilder?

scoder · 2013-08-09T13:38:00Z

Oh, and could we reopen this ticket, please?

pitrou · 2013-08-09T13:51:05Z

Antoine, is there a reason why you had to add this _setevents() method
to the XMLParser, instead of making the IncrementalParser an
IncrementalTreeBuilder?

The point is not to build a tree of potentially unbounded size (think XMPP). The point is to yield events in a non-blocking way (iterparse() is blocking, which makes it useless for non-blocking applications).

An IncrementalTreeBuilder wouldn't have much point IMO. It is not more costly to accumulate a string and parse it at the end, than to progressively build a growing XML tree.

pitrou · 2013-08-09T13:53:04Z

About the patch: I think changing the API names now that alpha1 has been released is a bit gratuitous. I would like to keep the encapsulation part but ditch the naming changes.

scoder · 2013-08-09T14:09:38Z

The point is not to build a tree of potentially unbounded size (think
XMPP). The point is to yield events in a non-blocking way (iterparse()
is blocking, which makes it useless for non-blocking applications).

Ok, but that's the only difference. Instead of getting the events from the parser, you could equally well get them from the TreeBuilder, also in a non-blocking way.

Sticking this functionality into a parser target object has the advantage that the parser interface wouldn't have to change. So, instead of introducing an entirely new parser interface, we'd just add a class that can be used as a parser target and provides an additional events() method. That's a substantially less invasive API change.

An IncrementalTreeBuilder wouldn't have much point IMO. It is not more
costly to accumulate a string and parse it at the end, than to
progressively build a growing XML tree.

I don't think I understand what you mean.

scoder · 2013-08-09T14:15:00Z

About the patch: I think changing the API names now that alpha1 has been
released is a bit gratuitous.

Sorry for being late, but I can't see it being my fault.

A change in an alpha release is still way better than a change after a final release. Eventually, lxml will have to go on par here, so it's better to get it right now, before any harm is done.

pitrou · 2013-08-09T14:28:22Z

> The point is not to build a tree of potentially unbounded size
> (think
> XMPP). The point is to yield events in a non-blocking way
> (iterparse()
> is blocking, which makes it useless for non-blocking applications).

Ok, but that's the only difference. Instead of getting the events
from the parser, you could equally well get them from the
TreeBuilder, also in a non-blocking way.

But your TreeBuilder is also growing a tree internally, and therefore
eating more and more memory, right?

I don't see the point of stuffing different kinds of functionality
inside a single class. It makes the intended use less obvious, and
errors more likely. Right now, IncrementalParser has a simple API,
and there's a single way to use it.

pitrou · 2013-08-09T14:31:22Z

A change in an alpha release is still way better than a change after
a final release.

But worse than no change at all. Arguing about API naming is a bit futile,
*especially* when the ship has sailed.

scoder · 2013-08-09T14:32:34Z

But IncrementalParser uses an XMLParser internally, which in turn uses a TreeBuilder internally. So what exactly is your point?

scoder · 2013-08-09T14:44:15Z

But worse than no change at all. Arguing about API naming is a bit
futile, *especially* when the ship has sailed.

<rant>It's easy to say that as a core developer with commit rights who can simply hide changes in a low frequented bug tracker without notifying those who have to know about these changes.</rant> It's pure luck I noticed this change during the alpha release cycle.

I'm also not arguing about naming. I'm questioning the design, trying to get it into a shape that fits the existing APIs. Why do we need two incremental parsing interfaces in one and the same library that use completely different method names, although doing otherwise exactly the same? Why should the type of the *result* of the parsing process change the method name that you need to call to *insert* data into the parser?

pitrou · 2013-08-09T14:47:06Z

But IncrementalParser uses an XMLParser internally, which in turn
uses a TreeBuilder internally. So what exactly is your point?

Well, I would rather like to understand yours. Whatever
IncrementalParser uses internally needn't impact what API it
exposes to the user.

(as a matter of fact, iterparse() doesn't expose a
TreeBuilder-like API, either)

scoder · 2013-08-09T14:59:46Z

Well, I would rather like to understand yours.

My point is that the IncrementalParser uses a TreeBuilder that builds an XML tree in the back. So I'm wondering why you are saying that it doesn't build a tree.

Whatever IncrementalParser uses internally needn't impact what API it
exposes to the user.

And in fact, we don't even need an IncrementalParser, because XMLParser already *has* an incremental parsing interface. All that's missing is the bit that collects and exposes the events. And my point is that we shouldn't duplicate the existing *data entry* interface for that, especially not under different names for identical functionality, but only add an interface to access those events as *output*, i.e. to add the bit of the API that's actually missing.

As an analogon, what would you say if I asked for adding a new, separate Mapping interface to Python that uses the method name "set_value()" instead of "__setitem__()" because I want it to read data from the hard drive and not from memory?

pitrou · 2013-08-09T15:02:41Z

<rant>It's easy to say that as a core developer with commit rights
who can simply hide changes in a low frequented bug tracker without
notifying those who have to know about these changes.</rant> It's
pure luck I noticed this change during the alpha release cycle.

It is the rule for most stdlib improvements that they go directly
through the bug tracker. Most core developers and outsiders
would feel swamped by the traffic if all feature additions went
through the mailing-list.

I'm honestly baffled that you think I am trying to "hide" things.
Why do you think I would feel guilty about proposing an addition,
or try to sneak things inside xml.etree?

(yes, we could theoretically run polls for every addition
we propose, collect and discuss the results, and iterate several
times until the outcome is successful; I don't think any of us
has the bandwidth to do that, which is why that practice is
only used for game-making changes (i.e. PEP material))

I'm also not arguing about naming. I'm questioning the design, trying
to get it into a shape that fits the existing APIs. Why do we need
two incremental parsing interfaces in one and the same library that
use completely different method names, although doing otherwise
exactly the same?

Well, unless I'm missing something, TreeBuilder doesn't do parsing,
it takes already parsed data: it has a start() method to open a tag,
and a data() method to add raw text inside that tag. IncrementalParser,
OTOH, has a data_received() method to which you pass a piece of
non-parsed XML string.

The other incremental parsing API is actually iterparse(). The
reason I proposed IncrementalParser is that iterparse() is useless
for non-blocking applications. IncrementalParser produces the same
kind of output as iterparse(), but control of when to feed it data
is transferred to the user of the API.

scoder · 2013-08-09T15:12:46Z

TreeBuilder doesn't do parsing, it takes already parsed data: it has a
start() method to open a tag, and a data() method to add raw text
inside that tag.

That is correct. However, the XMLParser has a feed() method that sends new data into the parser, and a close() method that tells the parser that it's done. So there already is an incremental parsing interface, and your change is duplicating that interface under a different name. Specifically, IncrementalParser has exactly the same interface as XMLParser when it comes to entering data, but uses different method names for it. This is a Bad Design because it introduces an unnecessary inconsistency in the API.

However, what you are trying to change is not the way data is *entered* into the parser. What you are after is to change the way the *parsed* data is *presented* to the user. That is not the responsibility of the parser, it's the responsibility of the TreeBuilder.

In your code, the TreeBuilder builds a tree, and the new interface *additionally* collects parse events in a list. So, the right way to do it would be to change the parser *target* to do both, i.e. to build a tree and collect events at the same time.

pitrou · 2013-08-09T15:14:39Z

> Well, I would rather like to understand yours.

My point is that the IncrementalParser uses a TreeBuilder that builds
an XML tree in the back. So I'm wondering why you are saying that it
doesn't build a tree.

Unless I'm reading it wrong, when _setevents() is called, the internal
hooks are rewired to populate the events list, rather than call the
corresponding TreeBuilder methods. So, yes, there's a TreeBuilder
somewhere, but it stands unused.

(the _setevents() method already existed on the C impl, by the way.
I added it to the Python impl to make things more regular and avoid
two separate iterparse() implementations)

And my
point is that we shouldn't duplicate the existing *data entry*
interface for that, especially not under different names for
identical functionality, but only add an interface to access those
events as *output*, i.e. to add the bit of the API that's actually
missing.

What difference would that make? In the end, you mustn't mix
event-driven/non-blocking and cumulative/blocking styles of
programming, so having two separate APIs doesn't strike me as
a problem (it may be a good thing actually, for clarity reasons).

ethanfurman · 2013-08-26T21:31:02Z

Stefan Behnel wrote:

... I still wonder why I'm the one writing all the patches ...

I imagine for the same reasons that I offered to write Enum: I had definite ideas about how it should be, it is sometimes easier to explain with working code than with prose, and I wanted to contribute to Python.

I have to agree with Eli, though. Wading through the accusations to get to the ideas is not helping your cause.

elibendersky · 2013-08-27T04:01:46Z

On Mon, Aug 26, 2013 at 12:35 PM, Stefan Behnel <report@bugs.python.org>wrote:

Stefan Behnel added the comment:

Here is a proof-of-concept patch that integrates the functionality of the
IncrementalParser into the XMLParser. I ended up reusing most of Antoines
implementation and test suite. In case he'll look back into this ticket at
some point, I'll put a "thank you" here.

The patch certainly needs more cleanup, but it shows where things are
going and passes all tests.

OK, I looked at the patch. So I think I understood your verbal proposal
correctly earlier. And as I said then, I don't like it. I believe Nick has
summarized nicely why heaping everything into XMLParser is a bad idea, but
since you went the extra mile and made a patch, I will try to provide a
more expanded rationale.

Putting _setevents aside for the moment, XMLParser is a clean and simple
API. Its output is only "push" (by calling callbacks on the target). It
doesn't deal with Elements at all. It decomposes the input XML and produces
push events. That's it. It doesn't even have to interact with the target to
a larger extent than pushing callback calls to it.

You propose to add a completely different mode of operation to it,
controlled by a constructor flag. In "collect_events mode", XMLParser
undergoes a transformation. It becomes a pull API. It starts interacting
with the target more deeply by using it to produce elements for events.
Coupled with "weird" targets, its semantics become totally unclear.

So it's a single class that acts completely differently based on an
argument. This is very unpythonic. But things get worse. XMLParser is not
supposed to be an isolated entity. Users should be able to implement custom
parsers that provide its API. When the API is simple - feed/close and some
callbacks, replacing/ehnancing/faking it is easy. When the API is complex
-- two different modes of operation -- it's considerably more difficult.

That's the gist of it. Sticking to the current design of separation of
concerns between the parser and the target/treebuilder is IMO the better
way looking forward. In the mean-time, a helper class for asynchronous
parsing is useful and timely. By restricting its API, changing its
underlying implementation will be easy in the future.

scoder · 2013-08-27T07:15:51Z

Putting _setevents aside for the moment,

Agreed, obviously.

XMLParser is a clean and simple API. Its output is only "push" (by calling callbacks on the target). It doesn't deal with Elements at all.

We already agreed on that, too. Keep things simple.

It decomposes the input XML and produces push events. That's it. It doesn't even have to interact with the target to a larger extent than pushing callback calls to it.

It does, though. That's essentially how iterparse works. I.e. the API between parser and target that I'm using here is already there. And has been for years. And I think that the feature in question shows that this was a particularly good design. Thank you Fredrik!

You propose to add a completely different mode of operation to it, controlled by a constructor flag. In "collect_events mode", XMLParser undergoes a transformation.

We clearly disagree here.

Coupled with "weird" targets, its semantics become totally unclear.

Try to come up with a patch for your own proposal that avoids that. You will notice that you will be using the existing protocol, thus the target will continue to be able to do what it wants. If it's a weird target, then it's a weird target because the user said so. But it's the target's very own concern to do expected things (build a tree) or do weird things (whatever an example for this is).

Except when you disallow using user defined targets, but then, that's an arbitrary restriction that ignores what is there already.

As long as you want to keep up the existing separation of concern between parser and target (and we agreed on that being a good thing), and as long as you don't impose arbitrary (and arbitrarily complex) restrictions on user code, you will not be able to prevent the target from doing "weird" things that impact the events being collected by whatever interface you choose for this. And why should you?

XMLParser is not supposed to be an isolated entity. Users should be able to implement custom parsers that provide its API. When the API is simple - feed/close and some callbacks, replacing/ehnancing/faking it is easy. When the API is complex -- two different modes of operation -- it's considerably more difficult.

That's only true when you use it for incremental event parsing. If you don't, that part of the API doesn't have to bother you. If you want to implement a custom parser that doesn't support event collection, then simply don't support the "collect_events" keyword argument and you're done. Implement read_events() by returning an empty iterator, if you like, that's up to you. If you want to write a target that prevents parsers from keeping tree fragments alive in memory, simply return None from your callback methods and you're done. If you want to write a target that changes the Elements that the parser collects, then you can do that. Yes, those are actually features. And you could already do most of this before.

Remember that it's the users using the parser and providing (or implementing) the target. We should let them.

I'm sorry if I sound offensive, but most of what I read as counter arguments in this ticket so far was either incomplete or (sometimes) even plain wrong. Some seemed pretty close to FUD, but maybe that's just me (and I'm guilty here too, I guess - sorry for that). Now, how could this discussion possibly give the me impression that everyone has the same level of understanding? Even your comment about "weird" targets, right above, after going through all of this discussion, makes me wonder if you really know how all of this works.

Please take a closer look at the original code before Antoine changed it. Play with the parser targets a bit. That will help you understand this issue better.

Also, if you want to come up with a patch that implements approach 3), i.e. a class between parser and target, then please do. I tried it (the incomplete and partially fake patch is attached to this ticket), and my take on it is that it's too complex and requires some rather non-trivial changes, including user visible ones. So, based on that experience, I decided that it would be better to keep things simple and integrate the feature more closely with what is there anyway. I think it might help if you went through the same experience.

ncoghlan · 2013-08-27T09:25:57Z

The whole point of the new API is not to replace XMLParser, but to
provide a convenience API to set up a particular combination of an
XMLParser with a particular kind of custom target. It just happens
that actually *implementing it* that way doesn't work right with the
current code. It is *not* necessary (or even appropriate) for it to
serve as a new building block, but as a convenience API to compose
*existing* building blocks in a particular way (or should be, except
that composition doesn't currently work right).

Once the underlying deficiencies are fixed, then the event capturing
target can be exposed directly, but in the meantime, the *composed*
version can be exposed by relying on internal APIs.

We could call the new class XMLParserWithEventCapturingTarget, but
that's a little clumsy :)

scoder · 2013-08-27T10:37:21Z

The whole point of the new API is not to replace XMLParser, but to provide a convenience API to set up a particular combination of an XMLParser with a particular kind of custom target.

Ok, but I'm saying that we don't need that. It's all there, it all comes together at the interfaces of the parser. The two-way communication between parser and target already exists and is used by iterparse().

So, what I'm advocating is that we should not complicate the module interface with a new class (and eventually two classes) at all. Instead, we should just expose the *existing* interface of the parser in two ways, once as callbacks and once as collected events. I find that much easier to explain than any of the other proposals I've seen so far.

This is really not about doing it this way because it's technically too difficult to do differently. It's about keeping things simple for users and well integrated with what's there.

BTW, Eli asked for working code before we discuss. I've provided it. Now I would like to see working code for the proposed three-level design in order to make sure it actually works and feels right. I've already said that I've tried it and it didn't feel right. We can continue to discuss hot air for another month or two, or we can stick to discussing real code and real APIs.

elibendersky · 2013-08-27T13:51:49Z

I'm sorry if I sound offensive, but most of what I read as counter

arguments in this ticket so far was either incomplete or (sometimes) even
plain wrong. Some seemed pretty close to FUD, but maybe that's just me (and
I'm guilty here too, I guess - sorry for that). Now, how could this
discussion possibly give the me impression that everyone has the same level
of understanding? Even your comment about "weird" targets, right above,
after going through all of this discussion, makes me wonder if you really
know how all of this works.

Please take a closer look at the original code before Antoine changed it.
Play with the parser targets a bit. That will help you understand this
issue better.

Ah, awesome. Well, you can't say I didn't warn you, can you?

I am done discussing this issue with you, Stefan.

scoder · 2013-08-27T14:22:23Z

Eli, I agree that we've put way more than enough time into the discussion by now. We all know each other's arguments and failed to convince each other.

Please come up with working code that shows that the approach you are advocating for the final implementation is doable and also helpful from an API user's point of view.

ncoghlan · 2013-08-27T22:06:21Z

Stefan, your proposed merged design isn't going to happen. Two alternative
ways of using the one class is far more complicated to explain to users
than exposing a helper that composes the existing API components to address
a particular use case (particularly when the helper is hiding the fact that
the theoretical underlying composition doesn't currently work the way it
should according to the public APIs).

As a higher level helper, users that are comfortable with and prefer the
lower level API are then clearly free to ignore. This recasting of the
nature of the API as a composition of the lower level components is a
*direct* response to your concern about it being a capability that the
lower level API should support directly.

You clearly don't like that design choice, but this isn't a democracy.
Arguing further on this point will just be ranting for the sake of ranting,
rather than having any chance of changing Eli's mind as module maintainer.

As far as naming goes, I still think it's better to drop "Parser" from the
name entirely. Call it "XMLEventReader" or something like that. The point
is to use it when you want to get the events out without blocking when no
events have been triggered and don't want to set up your own class to
receive event notifications through the target push API. The docs should
make it clear that this API is "pull only" - if people need callbacks or
other push triggers, then that's still what custom targets are for.

ncoghlan · 2013-08-27T22:28:34Z

Sorry, Stefan, I missed your last comment before posting mine. It appears
you had already reached the same conclusion I had regarding further high
level design discussion being pointless :)

elibendersky · 2013-08-27T23:15:20Z

W.r.t. the summary in http://bugs.python.org/msg196177, after some further chat with Nick, a name that sounds good for us for the class is XMLPullParser. It has both XML and Parser in it, so the intention is clear. On the other hand, Pull works well for analogy with the existing pull API in xml.dom.pulldom, *and* suggests that the push API is inactive and gets redirected to a pull API (thanks Nick for this insight).

The method names remain as suggested there.

scoder · 2013-08-28T04:39:48Z

the push API is inactive and gets redirected to a pull API

Given that this is aimed to become a redundant 'convenience' wrapper around something else at a point, I assume that you are aware that the above is just an arbitrary restriction due to the fact that the parser does not accept arbitrary targets as input.

Also, the refrence to pulldom is a bad one, because pulldom reads data as you request it ("pull"). This class creates events based on the data you *push* into it. So it's really a push API from the POV of the user, not so much a pull API.

Anyway, it should be possible to emulate both this class and the pie-in-the-sky three-level API based on what I suggested, so I can't see this addition being a problem. I may even end up making the same visual split in lxml.etree deliberately, in order to make it clear that this feature only applies to the push parser API, i.e. when using feed() and close(), as opposed to what parse() and fromstring() use. That would mean that there'd also be a corresponding class for the HTMLParser then, and both would inherit from the base parsers and thus obviously (and conveniently) accept arbitrary parser targets.

I think I'm ok with adding that class. I liked Greg Ewing's suggestion of calling it AsyncParser, though, i.e. AsyncXMLParser. It makes it obvious what the use case is and avoids any references to push or pull that will always confuse some users, based on what part of the feature they are focussing on in their code.

elibendersky · 2013-08-28T14:25:20Z

This patch implements the renaming and updates the documentation.

scoder · 2013-08-28T16:53:52Z

Any comments regarding my naming suggestion?

Calling it a "push" parser is just too ambiguous.

scoder · 2013-08-28T16:54:19Z

Erm, "pull" parser, but you see what I mean.

python-dev · 2013-08-30T12:51:42Z

New changeset 8fd72b1bb262 by Eli Bendersky in branch 'default':
Issue bpo-17741: Rename IncrementalParser and its methods.
http://hg.python.org/cpython/rev/8fd72b1bb262

elibendersky · 2013-08-30T12:53:50Z

This issue has become too large and discusses a few different things; hence I'm inclined to close it as fixed and open a new one as a placeholder for discussing the new design of the internals.

I'll do this in a couple of days if there are no objections.

python-dev · 2013-08-31T14:37:48Z

New changeset 1faaec66c73d by Eli Bendersky in branch 'default':
Fix XMLPullParser documentation to say "non-blocking" instead of "asynchronous".
http://hg.python.org/cpython/rev/1faaec66c73d

elibendersky · 2013-09-02T00:05:26Z

I'm closing this issue as fixed because the feature was implemented. As discussed, opened a new issue (bpo-18902) to discuss re-designing some parts of the current code that would allow a nicer implementation of this feature. Anyone interested in the subject - feel free to subscribe yourself there.

scoder · 2013-09-08T18:26:14Z

While refactoring the iterparse() implementation in lxml to support this new interface, I noticed that the close() method of the XMLPullParser does not behave like the close() method of the XMLParser. Instead of setting some .root attribute on the parser instance, the method should return the root element that the tree builder generated.

scoder · 2013-09-09T14:53:46Z

Looks like we missed the alpha2 release for the close() API fix. I recommend not letting yet another deadline go by.

scoder · 2013-09-09T19:33:04Z

Created separate bpo-18990 to keep this one closed as is.

scoder · 2013-09-13T06:23:50Z

The way the XMLPullParser is implemented in lxml.etree now is that it simply inherits from XMLParser. This would also make sense for ElementTree, even before supporting arbitrary targets. The patch in ticket bpo-18990 makes this simple to do.

For reference, here is the implementation in lxml.

Iterparse:

https://github.com/lxml/lxml/blob/master/src/lxml/iterparse.pxi

XMLPullParser:

https://github.com/lxml/lxml/blob/d9f7cd8d12a27cafc4d65c6e280ea36156e3b837/src/lxml/parser.pxi#L1357

SAX based parser that collects events and/or maps parser callbacks to callbacks on the target object:

https://github.com/lxml/lxml/blob/master/src/lxml/saxparser.pxi

bitdancer · 2014-03-10T22:05:14Z

As far as I can tell from scanning this ticket, there was no agreement on deprecating the parser argument to iterparse, but it has been documented as deprecated with no documentation about what to replace it with, nor any DeprecationWarning in the code that I can see. Should I remove the deprecation documentation and someone can figure out what to do about it in 3.5, or was the decision just not recorded here (or I missed it)? (It doesn't look like there's a programatic deprecation for the html argument either.)

python-dev · 2014-03-10T22:11:27Z

New changeset 31e6adf5bfba by R David Murray in branch 'default':
whatsnew: deprecation of ElementTree XMLParser *html* and iterparse *parser*.
http://hg.python.org/cpython/rev/31e6adf5bfba

scoder · 2014-03-11T07:20:28Z

My latest status is that a decision on the future of the "parser" argument is still pending. See bpo-20219.

It's correctly deprecated in the sense that passing any previously existing parser isn't going to be supported anymore, but passing an XMLPullParser (which is new in Py3.4 and thus can't have been used in existing code) now makes perfect sense (IMHO). So, in a way, it might end up as a sort of argument recycling rather than argument deletion.

ncoghlan · 2014-03-11T11:22:31Z

Yeah, I'd agree with Stefan here - documented deprecation is reasonable at this point (so that new users avoid it for now), but we may still undeprecate it later if we decide it makes sense to do so.

bitdancer · 2014-03-11T13:16:29Z

I think the current status of whatsnew is OK for this, then.

bitdancer · 2014-03-11T13:18:02Z

Someone should add programatic deprecation for the html argument, though, since people need to switch to keyword-based calls to XMLParser to prepare for that going away.

pitrou added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Apr 15, 2013

pitrou closed this as completed Apr 18, 2013

scoder added the topic-XML label Aug 9, 2013

pitrou reopened this Aug 9, 2013

elibendersky mannequin closed this as completed Sep 2, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022

event-driven XML parser #61941

event-driven XML parser #61941

Comments

pitrou commented Apr 15, 2013

pitrou commented Apr 15, 2013

pitrou commented Apr 15, 2013

elibendersky mannequin commented Apr 16, 2013

pitrou commented Apr 16, 2013

elibendersky mannequin commented Apr 18, 2013

python-dev mannequin commented Apr 18, 2013

pitrou commented Apr 18, 2013

scoder commented Aug 9, 2013

scoder commented Aug 9, 2013

scoder commented Aug 9, 2013

scoder commented Aug 9, 2013

scoder commented Aug 9, 2013

pitrou commented Aug 9, 2013

pitrou commented Aug 9, 2013

scoder commented Aug 9, 2013

scoder commented Aug 9, 2013

pitrou commented Aug 9, 2013

pitrou commented Aug 9, 2013

scoder commented Aug 9, 2013

scoder commented Aug 9, 2013

pitrou commented Aug 9, 2013

scoder commented Aug 9, 2013

pitrou commented Aug 9, 2013

scoder commented Aug 9, 2013

pitrou commented Aug 9, 2013

ethanfurman commented Aug 26, 2013

elibendersky mannequin commented Aug 27, 2013

scoder commented Aug 27, 2013

ncoghlan commented Aug 27, 2013

scoder commented Aug 27, 2013

elibendersky mannequin commented Aug 27, 2013

scoder commented Aug 27, 2013

ncoghlan commented Aug 27, 2013

ncoghlan commented Aug 27, 2013

elibendersky mannequin commented Aug 27, 2013

scoder commented Aug 28, 2013

elibendersky mannequin commented Aug 28, 2013

scoder commented Aug 28, 2013

scoder commented Aug 28, 2013

python-dev mannequin commented Aug 30, 2013

elibendersky mannequin commented Aug 30, 2013

python-dev mannequin commented Aug 31, 2013

elibendersky mannequin commented Sep 2, 2013

scoder commented Sep 8, 2013

scoder commented Sep 9, 2013

scoder commented Sep 9, 2013

scoder commented Sep 13, 2013

bitdancer commented Mar 10, 2014

python-dev mannequin commented Mar 10, 2014

scoder commented Mar 11, 2014

ncoghlan commented Mar 11, 2014

bitdancer commented Mar 11, 2014

bitdancer commented Mar 11, 2014