-
-
Notifications
You must be signed in to change notification settings - Fork 29.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
event-driven XML parser #61941
Comments
iterparse() is a blocking operation. It is not really suitable for event-driven applications (e.g. non-blocking I/O). Here is a patch adding a IncrementalParser class. |
Note that basing iterparse() on IncrementalParser and removing the API discrepancy between the Python and C modules helps make the etree code smaller. |
Thanks Antoine. This looks interesting - I'm somewhat swamped ATM but will try to review the patch in the next few days. Incidentally, since it is a new feature would it be worthwhile to discuss it on python-ideas? |
I've posted on the tulip mailing-list already: |
Antoine, the patch LGTM. There's some more cleaning that needs to be done in surrounding code, but I can do that later. Also I should probably update the documentation with a bit more details. Just add a NEWS entry when you commit. Thanks for working on this. |
New changeset f903cf864191 by Antoine Pitrou in branch 'default': |
Thank you, Eli. Now committed :) |
Here is a patch that cleans up the current implementation to avoid making the result of iterparse() an IncrementalParser (with all of its new API). Please note that the tulip mailing list is not an appropriate place to discuss additions to the XML libraries, and ElementTree in particular. |
Copying the discussion between Antoine and me from python-dev:
I see your point about close(). I assume your reasoning was to make the IncrementalParser an arbitrary stream end-point. However, it doesn't really make all that much sense to connect an arbitrary data source to it, as the source wouldn't know that, in addition to passing in data, it would also have to ask for events from time to time. I mean, you could do it, but then it would just fill up the memory with parser events and loose the actual advantages of incremental parsing. So, in a way, the whole point of the class is to *not* be an arbitrary stream end-point. Anyway, given that there isn't really the One Obvious Way to do it, maybe you should just add a docstring to the class (ahem), reference the stream protocol as the base for its API, and then rename it to IncrementalStreamParser. That would at least make it clear why it doesn't really fit with the rest of the module API (which was designed some decade before PEP-3156) and instead uses its own naming scheme. |
Actually, let me take that last paragraph back. There is an Obvious Way to do it, and that's the feed() and close() methods. They are the existing and established ElementTree interface for incremental parsing. The fact that close() doesn't clean up all state is IMHO a minor issue. The state will be cleaned up automatically once the iteration terminates, and that's the normal behaviour of iterators. So it actually fits both protocols quite nicely. I'd just leave the stream protocol out completely here. For one, the implementation isn't complete anyway (the connection_*() methods don't make much sense), and as I said, it's not very useful to consider the parser a general end-point to that protocol. I'd also suggest returning the iterator over the remaining events from close(), just like the TreeBuilder returns the tree. That covers the (less common) use case of first parsing everything and then processing the events. I'll add another patch that does that. |
Thinking about the original patch some more - I wonder why it doesn't use a wrapper for TreeBuilder to process the events. Antoine, is there a reason why you had to add this _setevents() method to the XMLParser, instead of making the IncrementalParser an IncrementalTreeBuilder? |
Oh, and could we reopen this ticket, please? |
The point is not to build a tree of potentially unbounded size (think XMPP). The point is to yield events in a non-blocking way (iterparse() is blocking, which makes it useless for non-blocking applications). An IncrementalTreeBuilder wouldn't have much point IMO. It is not more costly to accumulate a string and parse it at the end, than to progressively build a growing XML tree. |
About the patch: I think changing the API names now that alpha1 has been released is a bit gratuitous. I would like to keep the encapsulation part but ditch the naming changes. |
Ok, but that's the only difference. Instead of getting the events from the parser, you could equally well get them from the TreeBuilder, also in a non-blocking way. Sticking this functionality into a parser target object has the advantage that the parser interface wouldn't have to change. So, instead of introducing an entirely new parser interface, we'd just add a class that can be used as a parser target and provides an additional events() method. That's a substantially less invasive API change.
I don't think I understand what you mean. |
Sorry for being late, but I can't see it being my fault. A change in an alpha release is still way better than a change after a final release. Eventually, lxml will have to go on par here, so it's better to get it right now, before any harm is done. |
But your TreeBuilder is also growing a tree internally, and therefore I don't see the point of stuffing different kinds of functionality |
But worse than no change at all. Arguing about API naming is a bit futile, |
But IncrementalParser uses an XMLParser internally, which in turn uses a TreeBuilder internally. So what exactly is your point? |
<rant>It's easy to say that as a core developer with commit rights who can simply hide changes in a low frequented bug tracker without notifying those who have to know about these changes.</rant> It's pure luck I noticed this change during the alpha release cycle. I'm also not arguing about naming. I'm questioning the design, trying to get it into a shape that fits the existing APIs. Why do we need two incremental parsing interfaces in one and the same library that use completely different method names, although doing otherwise exactly the same? Why should the type of the *result* of the parsing process change the method name that you need to call to *insert* data into the parser? |
Well, I would rather like to understand yours. Whatever (as a matter of fact, iterparse() doesn't expose a |
My point is that the IncrementalParser uses a TreeBuilder that builds an XML tree in the back. So I'm wondering why you are saying that it doesn't build a tree.
And in fact, we don't even need an IncrementalParser, because XMLParser already *has* an incremental parsing interface. All that's missing is the bit that collects and exposes the events. And my point is that we shouldn't duplicate the existing *data entry* interface for that, especially not under different names for identical functionality, but only add an interface to access those events as *output*, i.e. to add the bit of the API that's actually missing. As an analogon, what would you say if I asked for adding a new, separate Mapping interface to Python that uses the method name "set_value()" instead of "__setitem__()" because I want it to read data from the hard drive and not from memory? |
It is the rule for most stdlib improvements that they go directly I'm honestly baffled that you think I am trying to "hide" things. (yes, we could theoretically run polls for every addition
Well, unless I'm missing something, TreeBuilder doesn't do parsing, The other incremental parsing API is actually iterparse(). The |
That is correct. However, the XMLParser has a feed() method that sends new data into the parser, and a close() method that tells the parser that it's done. So there already is an incremental parsing interface, and your change is duplicating that interface under a different name. Specifically, IncrementalParser has exactly the same interface as XMLParser when it comes to entering data, but uses different method names for it. This is a Bad Design because it introduces an unnecessary inconsistency in the API. However, what you are trying to change is not the way data is *entered* into the parser. What you are after is to change the way the *parsed* data is *presented* to the user. That is not the responsibility of the parser, it's the responsibility of the TreeBuilder. In your code, the TreeBuilder builds a tree, and the new interface *additionally* collects parse events in a list. So, the right way to do it would be to change the parser *target* to do both, i.e. to build a tree and collect events at the same time. |
Unless I'm reading it wrong, when _setevents() is called, the internal (the _setevents() method already existed on the C impl, by the way.
What difference would that make? In the end, you mustn't mix |
Stefan Behnel wrote:
I imagine for the same reasons that I offered to write Enum: I had definite ideas about how it should be, it is sometimes easier to explain with working code than with prose, and I wanted to contribute to Python. I have to agree with Eli, though. Wading through the accusations to get to the ideas is not helping your cause. |
On Mon, Aug 26, 2013 at 12:35 PM, Stefan Behnel <report@bugs.python.org>wrote:
OK, I looked at the patch. So I think I understood your verbal proposal Putting _setevents aside for the moment, XMLParser is a clean and simple You propose to add a completely different mode of operation to it, So it's a single class that acts completely differently based on an That's the gist of it. Sticking to the current design of separation of |
Agreed, obviously.
We already agreed on that, too. Keep things simple.
It does, though. That's essentially how iterparse works. I.e. the API between parser and target that I'm using here is already there. And has been for years. And I think that the feature in question shows that this was a particularly good design. Thank you Fredrik!
We clearly disagree here.
Try to come up with a patch for your own proposal that avoids that. You will notice that you will be using the existing protocol, thus the target will continue to be able to do what it wants. If it's a weird target, then it's a weird target because the user said so. But it's the target's very own concern to do expected things (build a tree) or do weird things (whatever an example for this is). Except when you disallow using user defined targets, but then, that's an arbitrary restriction that ignores what is there already. As long as you want to keep up the existing separation of concern between parser and target (and we agreed on that being a good thing), and as long as you don't impose arbitrary (and arbitrarily complex) restrictions on user code, you will not be able to prevent the target from doing "weird" things that impact the events being collected by whatever interface you choose for this. And why should you?
That's only true when you use it for incremental event parsing. If you don't, that part of the API doesn't have to bother you. If you want to implement a custom parser that doesn't support event collection, then simply don't support the "collect_events" keyword argument and you're done. Implement read_events() by returning an empty iterator, if you like, that's up to you. If you want to write a target that prevents parsers from keeping tree fragments alive in memory, simply return None from your callback methods and you're done. If you want to write a target that changes the Elements that the parser collects, then you can do that. Yes, those are actually features. And you could already do most of this before. Remember that it's the users using the parser and providing (or implementing) the target. We should let them. I'm sorry if I sound offensive, but most of what I read as counter arguments in this ticket so far was either incomplete or (sometimes) even plain wrong. Some seemed pretty close to FUD, but maybe that's just me (and I'm guilty here too, I guess - sorry for that). Now, how could this discussion possibly give the me impression that everyone has the same level of understanding? Even your comment about "weird" targets, right above, after going through all of this discussion, makes me wonder if you really know how all of this works. Please take a closer look at the original code before Antoine changed it. Play with the parser targets a bit. That will help you understand this issue better. Also, if you want to come up with a patch that implements approach 3), i.e. a class between parser and target, then please do. I tried it (the incomplete and partially fake patch is attached to this ticket), and my take on it is that it's too complex and requires some rather non-trivial changes, including user visible ones. So, based on that experience, I decided that it would be better to keep things simple and integrate the feature more closely with what is there anyway. I think it might help if you went through the same experience. |
The whole point of the new API is not to replace XMLParser, but to Once the underlying deficiencies are fixed, then the event capturing We could call the new class XMLParserWithEventCapturingTarget, but |
Ok, but I'm saying that we don't need that. It's all there, it all comes together at the interfaces of the parser. The two-way communication between parser and target already exists and is used by iterparse(). So, what I'm advocating is that we should not complicate the module interface with a new class (and eventually two classes) at all. Instead, we should just expose the *existing* interface of the parser in two ways, once as callbacks and once as collected events. I find that much easier to explain than any of the other proposals I've seen so far. This is really not about doing it this way because it's technically too difficult to do differently. It's about keeping things simple for users and well integrated with what's there. BTW, Eli asked for working code before we discuss. I've provided it. Now I would like to see working code for the proposed three-level design in order to make sure it actually works and feels right. I've already said that I've tried it and it didn't feel right. We can continue to discuss hot air for another month or two, or we can stick to discussing real code and real APIs. |
I'm sorry if I sound offensive, but most of what I read as counter
Ah, awesome. Well, you can't say I didn't warn you, can you? I am done discussing this issue with you, Stefan. |
Eli, I agree that we've put way more than enough time into the discussion by now. We all know each other's arguments and failed to convince each other. Please come up with working code that shows that the approach you are advocating for the final implementation is doable and also helpful from an API user's point of view. |
Stefan, your proposed merged design isn't going to happen. Two alternative As a higher level helper, users that are comfortable with and prefer the You clearly don't like that design choice, but this isn't a democracy. As far as naming goes, I still think it's better to drop "Parser" from the |
Sorry, Stefan, I missed your last comment before posting mine. It appears |
W.r.t. the summary in http://bugs.python.org/msg196177, after some further chat with Nick, a name that sounds good for us for the class is XMLPullParser. It has both XML and Parser in it, so the intention is clear. On the other hand, Pull works well for analogy with the existing pull API in xml.dom.pulldom, *and* suggests that the push API is inactive and gets redirected to a pull API (thanks Nick for this insight). The method names remain as suggested there. |
Given that this is aimed to become a redundant 'convenience' wrapper around something else at a point, I assume that you are aware that the above is just an arbitrary restriction due to the fact that the parser does not accept arbitrary targets as input. Also, the refrence to pulldom is a bad one, because pulldom reads data as you request it ("pull"). This class creates events based on the data you *push* into it. So it's really a push API from the POV of the user, not so much a pull API. Anyway, it should be possible to emulate both this class and the pie-in-the-sky three-level API based on what I suggested, so I can't see this addition being a problem. I may even end up making the same visual split in lxml.etree deliberately, in order to make it clear that this feature only applies to the push parser API, i.e. when using feed() and close(), as opposed to what parse() and fromstring() use. That would mean that there'd also be a corresponding class for the HTMLParser then, and both would inherit from the base parsers and thus obviously (and conveniently) accept arbitrary parser targets. I think I'm ok with adding that class. I liked Greg Ewing's suggestion of calling it AsyncParser, though, i.e. AsyncXMLParser. It makes it obvious what the use case is and avoids any references to push or pull that will always confuse some users, based on what part of the feature they are focussing on in their code. |
This patch implements the renaming and updates the documentation. |
Any comments regarding my naming suggestion? Calling it a "push" parser is just too ambiguous. |
Erm, "pull" parser, but you see what I mean. |
New changeset 8fd72b1bb262 by Eli Bendersky in branch 'default': |
This issue has become too large and discusses a few different things; hence I'm inclined to close it as fixed and open a new one as a placeholder for discussing the new design of the internals. I'll do this in a couple of days if there are no objections. |
New changeset 1faaec66c73d by Eli Bendersky in branch 'default': |
I'm closing this issue as fixed because the feature was implemented. As discussed, opened a new issue (bpo-18902) to discuss re-designing some parts of the current code that would allow a nicer implementation of this feature. Anyone interested in the subject - feel free to subscribe yourself there. |
While refactoring the iterparse() implementation in lxml to support this new interface, I noticed that the close() method of the XMLPullParser does not behave like the close() method of the XMLParser. Instead of setting some .root attribute on the parser instance, the method should return the root element that the tree builder generated. |
Looks like we missed the alpha2 release for the close() API fix. I recommend not letting yet another deadline go by. |
Created separate bpo-18990 to keep this one closed as is. |
The way the XMLPullParser is implemented in lxml.etree now is that it simply inherits from XMLParser. This would also make sense for ElementTree, even before supporting arbitrary targets. The patch in ticket bpo-18990 makes this simple to do. For reference, here is the implementation in lxml. Iterparse: https://github.com/lxml/lxml/blob/master/src/lxml/iterparse.pxi XMLPullParser: https://github.com/lxml/lxml/blob/d9f7cd8d12a27cafc4d65c6e280ea36156e3b837/src/lxml/parser.pxi#L1357 SAX based parser that collects events and/or maps parser callbacks to callbacks on the target object: https://github.com/lxml/lxml/blob/master/src/lxml/saxparser.pxi |
As far as I can tell from scanning this ticket, there was no agreement on deprecating the parser argument to iterparse, but it has been documented as deprecated with no documentation about what to replace it with, nor any DeprecationWarning in the code that I can see. Should I remove the deprecation documentation and someone can figure out what to do about it in 3.5, or was the decision just not recorded here (or I missed it)? (It doesn't look like there's a programatic deprecation for the html argument either.) |
New changeset 31e6adf5bfba by R David Murray in branch 'default': |
My latest status is that a decision on the future of the "parser" argument is still pending. See bpo-20219. It's correctly deprecated in the sense that passing any previously existing parser isn't going to be supported anymore, but passing an XMLPullParser (which is new in Py3.4 and thus can't have been used in existing code) now makes perfect sense (IMHO). So, in a way, it might end up as a sort of argument recycling rather than argument deletion. |
Yeah, I'd agree with Stefan here - documented deprecation is reasonable at this point (so that new users avoid it for now), but we may still undeprecate it later if we decide it makes sense to do so. |
I think the current status of whatsnew is OK for this, then. |
Someone should add programatic deprecation for the html argument, though, since people need to switch to keyword-based calls to XMLParser to prepare for that going away. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: