What should our XML story be, exactly? #820

domenic · 2014-07-05T20:51:41Z

(Spinning off of #818.)

I want to figure out our ideal XML story, and then see what changes we want to make before 1.0 in order to reduce upgrade pain, both from 0.x to 1.x and from 1.x to any eventual 2.x that might be necessary to fully reach the correct XML story.

As always, our guide should be, "what do browsers do?" I haven't found the specification here, but my understanding is that they switch based on mime type. So, image/svg+xml and application/xhtml+xml both go into "XML parsing mode." Of those two, SVGs produce SVGDocument instances, and and XHTML produces HTMLDocument instances.

My preference, I think, would be to only support XHTML, at least for now. That would mean, no SVG support at all. This would be a slight regression, but not much.

As for how we support both XHTML and HTML: I think we need an external mime-type switch. Probably with some inference, e.g. if we derive it from a file and the filename ends in .xhtml, we switch into XHTML mode? Or if we derive it from a URL, we check the headers? But it can also be overriden. So jsdom gets a new option, parsingMode, which is either "auto" (= default HTML, but try to infer), "xhtml", or "html".

I think for 1.0 at least, XHTML parsing mode just means "use htmlparser2", and not much else changes. Over time we could e.g. fix the serialization of XHTML-mode documents to be valid XHTML.

The text was updated successfully, but these errors were encountered:

domenic · 2014-07-06T00:20:13Z

This may be related... http://domparsing.spec.whatwg.org/

Sebmaster · 2014-07-07T09:33:01Z

That would mean, no SVG support at all.

SVG support comes basically mostly for free if we don't specifically block it. I mean, some resource loadings/properties might not be supported but the basic dom tree will work as soon as we get XHTML support.

This would be a slight regression, but not much.

I don't think so, SVG didn't work correctly at all up until now if #779 is anything to go by.

XHTML produces HTMLDocument instances

Currently we don't really change our instance types. We just have a Document type where we throw everything in and copy that over to HTMLDocument.

So jsdom gets a new option, parsingMode

Agreed, that API sounds nice.

I think for 1.0 at least, XHTML parsing mode just means "use htmlparser2", and not much else changes.

We have a problem with that since the user can currently configure the parser they want to use.
I vote for a clean cut here: drop all external parser support, bundle parse5 and htmlparser2 as dependency and use parse5 for general purpose, htmlparser2 for XML - done.
With this done, we make a clean cut to parsing really browser-like; it also allows us to use different parser outputs (should that ever be necessary) without having to consider other legacy parser outputs. If we want to be reaaaaally nice to legacy, we might add an option to use htmlparser2 even on non-XML documents.

To come back to Document types:

After getting 1.0 done (I hope I'll still work on this for some time, not sure though) I want to dive somewhat deeper into the DOM implementation and check spec compliance. The code in the depths of the levelX/ subdirs is a bit unwieldy, I actually want to check if a rewrite there might sensible.

Before that rewrite it might be good to get our w3c test framework up to scratch:

support all testmodes (there's a python server in there; either use that or rewrite in JS)
integrate as many available tests as possible

There's also a bug with document.write somewhere which I want to check in on. Apparently there's code to handle it, but it seems like it doesn't work (in all cases?).

This ToDo is in exactly reversed order.

domenic · 2014-07-08T20:16:23Z

SVG support comes basically mostly for free if we don't specifically block it. I mean, some resource loadings/properties might not be supported but the basic dom tree will work as soon as we get XHTML support.

Right, but it would work in the wrong way. For 1.0 I would feel better disallowing it, than adding (small) hacks to make it possible to shoehorn SVG into a HTMLDocument.

We have a problem with that since the user can currently configure the parser they want to use.
I vote for a clean cut here: drop all external parser support, bundle parse5 and htmlparser2 as dependency and use parse5 for general purpose, htmlparser2 for XML - done.

I am hesitant to do this. I like the configurable parser aspect of jsdom, and I know there is a small ecosystem of people using other parsers for their own purposes. It seems like a good separation of concerns. We should certainly have sane defaults, but I don't see too much benefit in dropping this feature.

The code in the depths of the levelX/ subdirs is a bit unwieldy, I actually want to check if a rewrite there might sensible.

That would be pretty great... hard, but great.

Sebmaster · 2014-07-08T20:42:54Z

For 1.0 I would feel better disallowing it, than adding (small) hacks to make it possible to shoehorn SVG into a HTMLDocument.

I don't think there are any hacks to make it work with SVG specifically. There's one to make it work with XML in general (documentElement patch), but nothing SVG specific in my PR.

So are we going to disallow my convenction, or disallow by throwing on detection of an SVG tag?

I am hesitant to do this.

Then we'll have to remove the getDefaultParser method and replace it with the parser detection based on the input.

domenic · 2014-07-08T20:48:54Z

There's one to make it work with XML in general (documentElement patch), but nothing SVG specific in my PR.

Hmm, so is this necessary for XHTML documents then? That would be the line for me.

Then we'll have to remove the getDefaultParser method and replace it with the parser detection based on the input.

I don't understand why? What is wrong with the current setup? Remember we'd differentiate XHTML vs. HTML via a parsingMode flag, not input-stream detection. (The "auto" setting tries to infer based on external signals like file extension or Content-Type of a HTTP response, not based on input stream.)

Sebmaster · 2014-07-08T20:58:05Z

Hmm, so is this necessary for XHTML documents then?

True, it's not, it's just for general non-HTML XML compatibility.

I don't understand why?

I think we'd have to have 2 different default parsers - one for pure HTML, one for XHTML, since auto could switch between either of those. If auto always used the default parser it wouldn't really matter, would it? I'm not sure I can explain it better right now - it's kind of late.

We also need to decide what to do if someone uses parsingMode = "xhtml" + parser: "parse5" since that's not possible.

domenic · 2014-07-08T21:46:34Z

I think we'd have to have 2 different default parsers - one for pure HTML, one for XHTML, since auto could switch between either of those.

Agreed.

After looking over things, I think the getDefaultParser/setDefaultParser methods can indeed be removed. They should not be public API.

We also need to decide what to do if someone uses parsingMode = "xhtml" + parser: "parse5" since that's not possible.

I think if parser is set, parsingMode gets ignored, and if things blow up, they deal with it. Perhaps if both are included we should show a warning.

domenic · 2014-09-19T01:10:17Z

OK, I think our story is figured out. Opening a new meta-issue #885 to catalogue the deficiencies in our new XML parsing mode.

Sebmaster mentioned this issue Jul 5, 2014

XML compat / parse5 #818

Closed

domenic added this to the 1.0 milestone Jul 9, 2014

domenic added the x(ht)ml label Jul 9, 2014

domenic mentioned this issue Jul 9, 2014

Determine of document is an XML DOC using non-naive logic #229

Closed

Sebmaster mentioned this issue Sep 17, 2014

Enable xhtml parsing #883

Closed

domenic closed this as completed Sep 19, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What should our XML story be, exactly? #820

What should our XML story be, exactly? #820

domenic commented Jul 5, 2014

domenic commented Jul 6, 2014

Sebmaster commented Jul 7, 2014

domenic commented Jul 8, 2014

Sebmaster commented Jul 8, 2014

domenic commented Jul 8, 2014

Sebmaster commented Jul 8, 2014

domenic commented Jul 8, 2014

domenic commented Sep 19, 2014

What should our XML story be, exactly? #820

What should our XML story be, exactly? #820

Comments

domenic commented Jul 5, 2014

domenic commented Jul 6, 2014

Sebmaster commented Jul 7, 2014

domenic commented Jul 8, 2014

Sebmaster commented Jul 8, 2014

domenic commented Jul 8, 2014

Sebmaster commented Jul 8, 2014

domenic commented Jul 8, 2014

domenic commented Sep 19, 2014