Add xml5 parser to html5ever #125
|
It seems I'll have to upgrade Rust for Travis to pass. Depends on #124 |
|
@kmcallister Should I rebase now or wait for Tendril to be integrated? |
Changed elem_name in TreeSink to borrow instead of move. Rest of changes are caused by it. No change in behaviour detected. Change is prerequisite for XML5 parser; plus it avoids clones in a few places.
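To illustrate why borrowing in `elem_name` can avoid clones, here is a minimal sketch. It assumes a reference-counted handle and uses simplified, hypothetical names (`Node`, `Handle`, `Sink`, `elem_name_by_move`); it is not the actual html5ever trait definition.

```rust
use std::rc::Rc;

// Hypothetical, simplified node type; html5ever's real types differ.
pub struct Node {
    pub name: String,
}

// Reference-counted handle; cloning it bumps the reference count.
pub type Handle = Rc<Node>;

pub trait TreeSink {
    // Move-style signature: the call consumes the handle it is given.
    fn elem_name_by_move(&self, target: Handle) -> String;
    // Borrow-style signature: the caller keeps ownership of the handle.
    fn elem_name(&self, target: &Handle) -> String;
}

pub struct Sink;

impl TreeSink for Sink {
    fn elem_name_by_move(&self, target: Handle) -> String {
        target.name.clone()
    }

    fn elem_name(&self, target: &Handle) -> String {
        target.name.clone()
    }
}

// With the move-style method, every call needs its own clone of the handle.
pub fn name_twice_by_move(sink: &Sink, node: &Handle) -> (String, String) {
    (
        sink.elem_name_by_move(node.clone()),
        sink.elem_name_by_move(node.clone()),
    )
}

// With the borrow-style method, the same handle is reused without cloning.
pub fn name_twice_by_borrow(sink: &Sink, node: &Handle) -> (String, String) {
    (sink.elem_name(node), sink.elem_name(node))
}
```

Both helpers return the same names; the borrowing version simply skips the two handle clones that the move-style signature forces on the caller.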
Adds XML5 parser based on [spec](https://github.com/annevk/xml5) by . The current working draft resides at https://github.com/Ygg01/xml5_draft and is rendered using [Bikeshed](https://github.com/tabatkins/bikeshed). This patch is only concerned with making it work right, according to the spec.
Things to be done:
1. Add some support for doctype as suggested by annevk here: Ygg01/xml5_draft#2
2. Finish references in xml5 - basically, use all entity replacements html5 uses and add tests for those.
3. Add Namespace support.
4. Unify the two parsers using associated types.
5. Add a C API for the xml5 parser.
- Add Processing Instruction as a separate type of Nodes. This is a prerequisite for proper XML support.
- Add basic tokenization and tree building tests
- Add example of xml_tokenizer, similar to examples/tokenizer.rs
- Add small commented out snippet that turns print-rcdom into an XML tree printer
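As a rough picture of what treating a Processing Instruction as a separate type of node could look like, here is a small sketch. The enum, its variant names, and the `serialize` helper are hypothetical and simplified; they are not taken from the patch or from rcdom.

```rust
// Hypothetical node representation; the names are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum NodeData {
    Text(String),
    Comment(String),
    Element {
        name: String,
        attrs: Vec<(String, String)>,
    },
    // Separate variant holding the target and data of `<?target data?>`.
    ProcessingInstruction { target: String, data: String },
}

// Naive serializer for a single node. It performs no escaping and writes
// elements as empty tags, which is enough to show the extra match arm.
pub fn serialize(node: &NodeData) -> String {
    match node {
        NodeData::Text(text) => text.clone(),
        NodeData::Comment(text) => format!("<!--{}-->", text),
        NodeData::Element { name, attrs } => {
            let mut out = format!("<{}", name);
            for (key, value) in attrs {
                out.push_str(&format!(" {}=\"{}\"", key, value));
            }
            out.push_str("/>");
            out
        }
        NodeData::ProcessingInstruction { target, data } => {
            format!("<?{} {}?>", target, data)
        }
    }
}
```

Giving processing instructions their own variant means that code matching on nodes exhaustively must handle them explicitly rather than folding them into comments or text.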
|
Sorry to only say this after you’ve done so much work, but does this really need to be part of html5ever? How much code is there in common? |
|
@SimonSapin Does it need to be part of html5ever? It depends on whether Servo needs something that parses lax XML files. I think Kuchiki might find it interesting to parse XML. Servo? I dunno, maybe for parsing XHTML files? html5ever and xml5ever share a surprising amount of code. They could share more, though. Lots of code is duplicated, but that is due to the way I handle Tokens (I have some tokens Major differences are the state machine and tree builder, and anything that is Token-related, like TokenSink. The rest of it is pretty similar. If there is no interest in parsing XML in Servo or other libraries, then that's that, but if the problem is lack of common code, I could attempt a second take that will share more code. |
|
Servo will probably want to have some form of XML parsing at some point, but it’s unclear yet whether XML5 is the way to go. As far as I know, XML5 has never been deployed in a web browser, or even used much beyond a few test cases. Don’t get me wrong, I like the ideas in XML5, but it’s not an established solution yet. My question regarding code in common is: could XML5 parsing be a separate library? (Either one that depends on html5ever if reusing some components is useful, or an independent one.) Or are there so many internals in common that you’d rather not duplicate them? |
I see. In that case could you close #43 ?
Difficult question. From my memory, it's plausible that I could separate it into a library. However there will be much code duplication. I will try during this weekend to separate it and see how it goes, but all signs point to that it's plausible. |
|
I see. Don’t worry about separating it yet, don’t spend too much time on it this weekend. The Servo team is meeting next week, I’ll bring this up then so we can decide what to do. I appreciate the work you’ve done so far and I’d like to figure out something that’s not just dumping a bunch more work on you :) |
|
Agreed that we should discuss it at Whistler, although really, we can't know how good XML5 is without actually doing it. So I am mostly interested in what the main drawbacks of landing this are, assuming we discover later that it doesn't satisfy our requirements. |
|
@SimonSapin Well whatever work I put into it now, it's not going to go in vain either way :)
Good in what way? Speed? Memory consumption? Compatibility with older XML? Ease of removal? The main drawback I see is searching for all the ways the XML parser is referenced and not confusing it with parts of html5ever. However, I could make it easier to remove. I could pull my additions into xml folders so removing them is essentially the same as finding and deleting all xml subfolders within src, removing tests/examples that use them, and removing all references that cause the compile to fail. It was one of the changes I had in mind, amongst other things. The only change that touches html5ever, that I know of, is replacing |
|
I think Jack means good in terms of compatibility with existing web content. |
|
Hard to say without some examples of what said web content entails (MathML? SVG? XHTML?). On token level xml5 and html5 are quite similar. However, when it comes to tree construction they might interpret things differently (e.g. adoption agency algorithm doesn't exist in xml5). |
|
@SimonSapin @metajack Has a decision regarding this PR been made? |
|
Sorry, it turned out that we spent more time meeting with other Mozilla teams than actually working on Servo last week, so this wasn’t discussed. I still feel that HTML parsing and XML(5) parsing are different things, and would prefer having different things in separate crates. Rather than ask you to do yet more work, I had a go at it: https://github.com/SimonSapin/xml5ever . I started from a rebase of this branch and a new empty crate, copied the lines shown as new in the diff into new files, and added/fixed stuff until it built and passed the tests. It depends on string-cache and tendril, but not html5ever. According to diff stats there are ~1042 lines duplicated with html5ever, including 289 (rcdom) only used in tests. What do you think?
Yes, predicting web-compatibility is hard. What Jack meant by actually doing it is that the only way to be sure is to ship a browser to enough users and wait for bug reports. The said web content is all of it. Any web site that users might care enough about to switch browsers if we break it. |
|
Well, it's good to know either way :) @SimonSapin Anyway, awesome that you separated it |
As I see it, Compare with Kuchiki, another implementation of html5ever’s At the moment, xml5ever’s
Well, my opinion is that it doesn’t need to be integrated and I’m not convinced it should. What do you think? If you think it should, why? |
Good question. I think since the specification is pliable, and XML5 could be modified, it would be possible to make it work with any web content and thus fit the role of XHTML parser. It's what I originally set out to do (add XML5 to html5ever). On the technical side, it occurs to me that most optimizations, at least when it comes to tokenization, could apply to xml5ever as well as html5ever (e.g. tendrils, SIMD, etc.). Although to be fair, it's possible you could do extra XML optimizations that probably couldn't apply to HTML. In that sense, I think there could be a lot of duplication between the two projects. Then again, the shared code could be divided into crates and shared between projects. PS. However, for the time being you're probably right :( I'll go ahead and close this PR. |
|
Tendril is a separate crate that both parsers can use. Could the same be done for other optimizations? |
|
Hard to make predictions, but I don't see why not. |
|
Could you click the Fork button on https://github.com/SimonSapin/xml5ever ? Then I’ll re-fork from you so that you appear as upstream. Thanks! |
|
@SimonSapin Done - I guess. Never forked a fork before. Anyway, thanks for sorting xml5ever out |
|
Alright, https://github.com/Ygg01/xml5ever it is :) |
Ygg01 commented Apr 13, 2015
This large commit concerns #43. It adds everything that is currently defined in spec.
Included inside:
Missing: