Skip to content

Loading…

Add urlmaster and testAutomated() for 4225 tests in url_resolution. #550

Closed
wants to merge 16 commits into from

4 participants

@deitch

Includes automated url testing. Needs to be reviewed for agreement that these are correct.

See #531

@domenic
Collaborator

OK, took a bit of time to review this. Sorry for the delay. Now that I've started to sink my teeth into it I should be able to correspond more frequently.

On major thing that sticks out to me: I think testing all those possible bases is probably not the way to go. Base, which is in my mind equivalent to document.location.href, should probably always be an absolute URL (whether file:// \or not.) That should cut down on the number of failures immensely.

More specific points from #531:

1) If an href is just an authority '//www.not.com', which is legit according to the RFC, it errors out on doc.getElementById("theID").href

This is definitely a bug.

2) Blank base: jsdom assigns the base of the current js file, which is definitely incorrect. It probably isn't a realistic scenario, but the RFC doesn't make it illegitimate. Either way, the URL of the JS file definitely is not right.

Well, I think the intent here is that the base should always be an absolute URL, as I say above, and if you haven't given it one, it should just use something reasonable like the JS file.

I guess the real issue here is that the detection logic is something like if (!config.url) instead of if (!isAbsoluteUrl(config.url)).

3) Hash base: base that is only a hash/fragment and a relative ref, e.g. base is '#abc' and ref is 'q/r'. In that case, jsdom just takes 'q/r', but my read of the RFC is that the URL should be absolutized.

I don't think this is important though, since document base's should always be absolute URLs.

@deitch

Except for the bug listed as #1 above, all of these relate to having weird/illegit base hrefs. According to http://www.w3.org/TR/html401/struct/links.html#h-12.4 you get your base from (in descending order):

  • base href in HTML
  • base href given by http headers (seriously, I have never seen this before, but apparently rfc2616, just strange)
  • location of current document

So it is very possible to have an illegit base, set by html or http header, and still be at a legit window.location.href.

On the other side of it, though, you are right, if the base really is blank, even if it is set to href="" in html, then it should default to current location, so you are right on the second one above. I will change the tests to show it. Looking forward to your feedback on the other.

@deitch

I just realized I didn't explain that very well (or maybe I did?):

1) Blank base: should fall back to document location, which is what you said, I will fix it.
2) Illegit (but non-blank) base: is possible by setting html base node or http header, so should be tested, is what I am positing.

@domenic
Collaborator

It looks like illegit base URLs set using <base> (or HTTP I suppose) are converted to absolute URLs:

http://www.w3.org/html/wg/drafts/html/master/infrastructure.html#base-urls

So I don't think it'll be necessary to test those cases, although it will be necessary for jsdom to ensure that the document's URL is always an absolute URL.

@deitch

It appears you are right. It looks like the steps for resolving any URL are:

1) Determine the base URL:
a) if there is no base element with an href attr, then use document URL
b) if there is a base element with an href attr, then resolve it relative to the document URL (exactly as if it were an anchor tag relative to the document or a base)
2) Resolve the anchor tag href relative to the base URL determined in 1 above.

I'll fix up the tests. Nice catch!

@deitch deitch Use latest urlmaster with proper testing for valid location URLs. Upd…
…ate url_resolution() tests automated to use better urlmaster.
729c90e
@deitch

OK, I am updating the pull request. urlmaster has moved up a number of revisions (0.2.5) to add numerous tests and add support for location+base+ref, and validated the root (location if provided, else base). Essentially, it conforms to the spec as discussed earlier.

Pushing to github now, take a look.

@domenic
Collaborator

Looking at this now. Having some minor issue with Windows backward slashes, will try to fix locally. Just wanted to let you know I didn't forget.

@domenic
Collaborator

For Windows compat, this should be locn = toFileUrl(__filename)

Agreed. I just pushed the change in. Makes it a lot easier and cleaner.

@domenic
Collaborator

After the above toFileUrl fix, I took a chance to look over them. Most look solid. Here's things I ran into:

ERROR file://www.not.com vs file://www.not.com/

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js base  with ref //www.not.com should resolve to file://www.not.com instead of file://www.not.com/

I think the extra slash is correct, especially since later tests on http/https include it. Similarly:

ERROR file://www.not.com#xyz vs file://www.not.com/#xyz

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js base /a/b with ref //www.not.com#xyz should resolve to file://www.not.com#xyz instead of file://www.not.com/#xyz

ERROR https://?when=now vs https:///?when=now

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js base  with ref https:?when=now should resolve to https://?when=now instead of https:///?when=now

Is it really correct that it throws away the entire path? Same for all the protocol ones, e.g. https: and https:#xyz


ERROR https:///q/r/c/f?when=now#xyz vs https://q/r/c/f?when=now#xyz

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js base //www.why.com/a/b/./c/d/.././e/../f?foo=bar with ref https:/q/r/./c/d/.././e/../f?when=now#xyz should resolve to
 https:///q/r/c/f?when=now#xyz instead of https://q/r/c/f?when=now#xyz

You don't seem to get the right number of slashes when transitioning from file to http.


ERROR file:/// vs file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js base  with ref / should resolve to file:/// instead of ...

You need to take into account Windows drive letters. It should actually be file:///C:/, not file:///.


But generally, this is in the right direction. It's uncovering lots of bugs.

Here's some patched jsdom code you can test against, in lib/level2/html.js:

  baseUrl: function(document) {
    var baseElements = document.getElementsByTagName('base'),
        baseUrl      = document.URL;

    if (baseElements.length > 0) {
      var baseHref = baseElements.item(0).href;
      if (baseHref) {
        baseUrl = URL.resolve(baseUrl, baseHref);
      }
    }

    return baseUrl;
  },
  resolve: function(document, href) {
    // When switching protocols, the path doesn't get canonicalized (i.e. .s and ..s are still left)
    var intermediate = URL.resolve(this.baseUrl(document), href);

    // This canonicalizes the path, at the cost of overwriting the hash.
    var nextStep = URL.resolve(intermediate, '#');

    // So, insert the hash back in, if there was one.
    var parsed = URL.parse(intermediate);
    return nextStep.slice(0, -1) + (parsed.hash || '');
  },

It seems to give the right thing most of the time?

@deitch

One of these days, I have to learn github-flavoured markdown correctly. I know how you did the code blocks, with 4 backticks before and after, or indented each line 4 times, but how did you get that cool separator between sections?

And how do you quote my comment in yours?

@deitch
ERROR file://www.not.com vs file://www.not.com/

I think the extra slash is correct, especially since later tests on http/https include it. Similarly:

This one is interesting. I just spent time hunting through RFC 3986 again. The scheme is followed by the authority. After that, it is followed by 4 path options. The first one is path-abempty, i.e. an empty path, which can be either '/' or empty. So both www.not.com and www.not.com/ are correct. This is true with a fragment/query and without a fragment/query. Neither of those are part of the path, so it is still path-abempty with or without them. Would be nice if the RFC just said, path-abempty is always '/' or is always empty. Probably lots of divergent browser behaviour history here.m

In our context, it is not correct of the tests to say that the above is an error, when either is legit. I think the test needs to be more tolerant, and accept both if the path is empty. I will change the tests to support it.


ERROR https:///q/r/c/f?when=now#xyz vs https://q/r/c/f?when=now#xyz

You don't seem to get the right number of slashes when transitioning from file to http.

I will investigate this.


ERROR https://?when=now vs https:///?when=now

Is it really correct that it throws away the entire path? Same for all the protocol ones, e.g. https: and https:#xyz

Yes, strange as it sounds. http://tools.ietf.org/html/rfc3986#section-5.2.2

The first section says that if the Reference has a scheme (e.g. https:) then we completely ignore the base.


ERROR file:/// vs file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js

You need to take into account Windows drive letters. It should actually be file:///C:/, not file:///.

Should it? The absolute path is not /Users/Domenic/... but /C:/Users/Domenic/... which means that if the reference is /, then it should replace the entire thing. AFAIK, the file URL has no special understanding of drive letters. The problem is URLs have no way of taking it into account.

@deitch

Screw it, writing the tests this way is way too hard. I am just shifting it to adding the slash for path-abempty.

Drat, it gets even worse. The resolution is actually different depending on the platform. I don't even get that error on Mac. On Mac, the resolution by jsdom gives www.not.com and so complains the other way around.

I am going to have to leave the urlmaster one way or the other consistently, and change the testing in jsdom to be more tolerant, which is just a pain. Need to parse the URL, check for a blank path, and then modify the test.

@deitch

Just pushed changes to handle the first trailing slash issue (1 of 4 you raised).
Still working on the second.
Looking for your feedback on the 3rd and 4th.

@deitch

Actually, it looks like urlmaster is correct for the second, and jsdom is incorrect.

The ref is https:/q/r/./c/d/.././e/../f?when=now#xyz which says scheme = https, authority (i.e. host) = '', and the path = '/q/r/./c/d/.././e/../f' (I am ignoring the hash and query for now).

If we use what jsdom gives https://q/r/c/f?when=now#xyz it would imply that the scheme = https (correct), host = 'q' and path = '/r/c/f' (incorrect). '/q' is the beginning of the path and there is no hostname. The URL-correct way to do this is either https:/q/... or (more correctly) https:///q/r/....

@domenic
Collaborator

When I run the following test in all my web browsers, it gives two slashes, not three:

<html>
<head>
<base href="//www.why.com/a/b/./c/d/.././e/../f?foo=bar" />
</head>
<body>
<a href="https:/q/r/./c/d/.././e/../f?when=now#xyz" id="x"></a>
<script>
alert(document.getElementById("x").href);
</script>
</body>
</html>

It's possible the RFC and browser behavior conflict; for this I would look at the in-progress URL spec maybe?


Sounds good on throwing things away.


For file:// URLs on Windows: there are special rules for drive letters, at least in web browsers. They're in the URL spec I mentioned above, although I'm sad they're not in the RFC. Probably because nobody bothered to spell out what exactly the path was and what "/" means. But yes, in Windows, we need to consider drive letters :-/.

@deitch

It's possible the RFC and browser behavior conflict; for this I would look at the in-progress URL spec maybe?

Possible. I couldn't figure out how to pull this out of the in-progress spec. But https://q/r means go to host q via protocol https on the default port to get path /r. It cannot mean anything else.


I see the drive letters now in the whatwg; they're not in the RFC including 3986, which is authoritative. The WHATWG seems to be saying:

If you are using an absolute path on a file URL relative to a base path, and the base path has a drive letter while the ref does not, then you need to grab the drive letter from the base and prepend it to the absolute path in the ref.

So the WHATWG recommends this, while the RFC doesn't do it? What do we do?

@domenic
Collaborator

Possible. I couldn't figure out how to pull this out of the in-progress spec. But https://q/r means go to host q via protocol https on the default port to get path /r. It cannot mean anything else.

Let's look at this in more detail:

  • filename: file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js
  • base //www.why.com/a/b/./c/d/.././e/../f?foo=bar
  • ref https:/q/r/./c/d/.././e/../f?when=now#xyz

The question is, what does it even mean to resolve the relative URL https:/q/r/... from file:///www.why.com? I guess you switch to https, then use /q/r/... to build a new absolute URL? To me, on a https protocol, /q/r/... would result in hostname q, path r/.... Maybe you can see this in the same way the browser eliminates triple slashes when you type them in a URL bar (try it).

Pinging @annevk for more insight.

So the WHATWG recommends this, while the RFC doesn't do it? What do we do?

We're trying to emulate browsers, so WHATWG is the way to go :).

@deitch

The question is, what does it even mean to resolve...

That's the real problem. https:/q/r makes no sense on its own. Logic would dictate, "take the https scheme, and then go back a level and grab the authority (host) from the base URL." But the RFC explicitly does not say that.

To me, on a https protocol, /q/r/... would result in hostname q, path r/..
Pretty sure that isn't correct, because a hostname always starts with //, whereas a single / means the beginning of an absolute path.

the same way the browser eliminates triple slashes when you type them in a URL bar
It might depend on the browser, but it doesn't matter: file:///a/b and file:/a/b are identical. If there are two slashes right after the : that ends the scheme, then the next characters until a / are the authority. If there is no authority, then /// and / are the same.


So the WHATWG recommends this, while the RFC doesn't do it? What do we do?
We're trying to emulate browsers, so WHATWG is the way to go :).

I can go for that. We'll be a little more intelligent about the Windows drive letters. The logic in the WHATWG is pretty sound.

@annevk

In browsers https:/q/r becomes https://q/r, except when the base URL is https itself. Say the base URL is https://x/ then https:/q/r becomes https://x/q/r. The URL Standard defines this in detail and annevk/url implements it to some extent, but needs updating as it does not yet cover the file URL scheme like the URL Standard does.

@deitch

Hey @annevk, welcome to the discussion!

What you are saying is: if (ref has scheme but not authority && base-scheme == ref-scheme) then result should use ref-scheme + base-authority + ref-path

I have no problem with that logic, but I don't see if in the URL Standard. http://tools.ietf.org/html/rfc3986#section-5.2.2 says

if defined(R.scheme) then
         T.scheme    = R.scheme;
         T.authority = R.authority;
         T.path      = remove_dot_segments(R.path);
         T.query     = R.query;
@annevk

http://url.spec.whatwg.org/#scheme-state step 2, substep 4 and 5. (I was not talking about RFC 3986, as @domenic said, it does not match browsers.) And just to link it, my implementation of that is https://github.com/annevk/url but it is out of date at the moment, especially for file URLs, but there are also some other issues.

@deitch

Ugh, I hate reading state machine logic. The RFC's if-then is so much cleaner.

Either way, I'll read it through and get back here.

The bigger question is, should we be RFC compliant or WHATWG compliant?

@domenic
Collaborator

We should be browser-compliant, which is WHATWG, not RFC.

Agreed that the state machine logic is undecipherable. Good luck, brave sir!?

@deitch

Or, you could spend a few days trying to decipher the federal tax code. After that, the state machine may look like child's play!

@deitch

Added windows drive support, per the WHATWG and your comments above.

@deitch

@annevk, I don't see it in scheme-state step 2, substeps 4-5. Can you explain what you see?

@annevk

What happens is that if input's scheme is equal to base's scheme and that scheme is a relative scheme (one of http/https/...) you continue parsing input as either a relative or authority (to be determined later). If they're not equal, but it's still a relative scheme you continue parsing input expecting authority, and failing the relative scheme criteria you continue parsing input as scheme data.

I'm not sure however if you can trivially patch a parser written towards the RFC to match what browsers do. I have not attempted to do such a thing myself.

@deitch

Seriously, could write a Big Bang Theory episode just around the logic in the WHATWG!

@deitch

In all seriousness, here is how I read it in scheme-state:

2.5: once I have my scheme (and it isn't "file"), go to 2 expected slashes, then from here until EOF or the next ending of authority character (slash, etc.) is the authority (including host, port, username, password, etc.)

2.4: once I have my scheme, it isn't "file", and I have a real base, and my scheme === base-scheme:

  • if I start with '//' then do like 2.5
  • if I do not start with '//' then I am in relative state, which means I use the scheme, authority, and even path of base, modified by my own path

Which appears to be what you said.

So if I have no authority in my ref, and ref-scheme === base-scheme, just use base-scheme + base-authority + ref-path.

You know, that WHATWG is a hell of a convoluted way to say something that simple! Modifying now...

@deitch

Added. Ball back in your court.

@annevk

Well in your "so simple" you didn't explain error reporting and all the other number of possible conditions, so yeah, that is a simpler answer to your specific question :-)

@domenic
Collaborator

I am still seeing urlmaster wanting three slashes in some https cases:

ERROR https:///q/r#xyz vs https://q/r#xyz

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js
base http://www.why.com/a/b/./c/d/.././e/../f
with ref https:/q/r#xyz

should resolve to https:///q/r#xyz instead of https://q/r#xyz

I'll try to be more responsive about testing after you fix it. We're closing in on this!!

@deitch

Yeah, that is probably the last piece. Working on it right now. Could use some input here, though.

According to the RFC: http:/q/r resolves to scheme = 'http', host = '', path = '/q/r'.
According to the WHATWG (and most browsers): http:/q/r resolves to scheme = 'http', host = 'q', path = '/r'

We agreed to follow WHATWG, which is fine, doing it now.

But FYI, nodejs require('url').parse('http:/q/r') resolves per the RFC, not the WHATWG. Does it matter?

@deitch

urlmaster updated to 0.2.10 with WHATWG-compliant (and not RFC-compliant) URL processing; jsdom urlmaster dependency bumped to require urlmaster >=0.2.10.

Ball in your court.

@domenic
Collaborator
ERROR https:///q/r vs https://q/r

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js base  with ref https:/q/r should resolve to https:///q/r instead of https://q/r

One too many slashes. Similar failures in other cases, I believe the common cause being empty base.

(And yes, we're targeting WHATWG, not nodejs URL parser.)

@deitch

@domenic is this after the latest 0.2.10 urlmaster update?

I just reran this manually, and I am getting the correct response:

> um.resolve('file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js','','https:/q/r');
'https://q/r'
@domenic
Collaborator

Oh, you're right, I forgot to run npm install!! :(.

Now I'm getting "maximum call stack size exceeded." Can't look into it right now unfortunately due to having some Valentine's weekend plans...

@domenic
Collaborator

OK, the infinite recursion is occurring at this line:

var expected = um.resolveTrack(locn,base,refs);

Here locn is

file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js

base is the empty string and refs is the large collection of things you test against.

@deitch

And what happened to those Valentine's weekend plans? :-)

@deitch

Now I am getting the same thing. Wonder why I didn't before... probably forgot to install or run something.

@deitch

I found it. It has nothing to do with the base being blank, which it handles quite nicely.

Here is the issue. What if the ref is just a protocol? https: and that's it. According to the RFC (which we follow unless the WHATWG overrides it), then the result should just be https: or more correctly https:///, which are essentially the same.

There is a bug introduced in 0.2.10 as a result of trying to handle the http:/q/r (i.e. no explicit authority) case that created the infinite loop. I am fixing this right now in 0.2.11.

The question is what it should resolve to? Nothing in the WHATWG deals with resolving these two URLs, just how to parse them. If we go back to the RFC for resolution, then, we end up with https:///. We get the same from node's require(url).resolve(locn,'https:') which usually follows the RFC.

I am fixing this according to that and bumping to 0.2.11, so we get rid of the infinite loop. If we want a different result, post it here.

@deitch

Had to bump 2 versions to 0.2.12. The infinite recursion bug was triggered by two similar but different use cases. Back in your court...

@annevk

Parsing and resolving is the same per the WHATWG document so it does handle that...

@domenic
Collaborator

Looks good. The remaining issues I see have to do with throwing away the query string, which I think is not intentional:

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js
base /
with ref https:?when=now#xyz
should resolve to https:///#xyz

I think it should be https:///?when=now#xyz

Similarly

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js
base /
with ref https:?when=now
should resolve to https:///

Also:

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js
base /
with ref ?when=now#xyz
should resolve to file:///C:?when=now#xyz

I think it should have a slash after the C:, i.e. file:///C:/?when=now#xyz

@deitch

OK, so 2 issues here. The first is the query string, the second is the slash after the windows drive.


Query string

RFC: Always use R.fragment (never B.fragment), but that is working OK, and if the R.scheme, then use R.scheme, R.authority, R.path, R.query. We use the WHATWG override if R.scheme === B.scheme and !R.authority, but that is not relevant here in the discussion of query string. I don't see anything in WHATWG to override that ( @annevk do you? ) So it looks like you are right. Will fix.


Slash after Windows drive

This is WHATWG only, doesn't appear in the RFC (since Windows-drive consideration is WHATWG only). I cannot figure out a clean yes/no on this reading WHATWG. But since in a clean and proper URL all paths after the host (and drive for Windows) should be absolute, I tend to agree.

@annevk? Weigh in?


Summary: I am fixing the query issue right now, want input from both of you on the slash after Windows drive issue, but tend to agree.

@deitch

Query string issue is pushed. Waiting for feedback from @domenic and @annevk on the slash after Windows drive.

@domenic
Collaborator

I can't decipher the spec either, but did run some tests in Chrome (which follows the WHATWG spec in most ways I've seen) and got the slash after the Windows drive letter.

@domenic
Collaborator

What about this one?

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js
base http:?foo=bar#abc
with ref /
should resolve to null

Also, more troubling:

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js
base http:?foo=bar#abc
with ref https://www.not.com/q/r
should resolve to null

and other similar incidents.

@deitch

OK, I am in Israel this week, so it is 9:09am. But you are in EST and it is 2:09am! I used to think I was dedicated. If you find some way to make (or supplement) a living off of OSS, let me know. :-)

Not that I am that mercenary, mind you, I really enjoy OSS, but...

I tend to agree on the slash. Let @annevk weigh in, and if all are in agreement, I will put it in.

@deitch

Actually, those don't surprise me at all. According to the specs, the steps to resolve a locn,base,ref are essentially:

  1. resolve base relative to locn to get my real base (call it B2)
  2. resolve ref relative to B2 to get my final result

If you resolve http:?foo=bar#abc relative to file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js you get http:///?foo=bar#abc which is an illegitimate complete URL (it is OK as a ref, or even a base, if the locn has the same scheme http and has an authority, as @annevk pointed out with the WHATWG).

So it shows an illegit base and kicks it back.

The question is, how should we resolve it? In the real world, this can happen with a legit (as a base) URL base and a legit locn (always must be legit by definition) with a ref. So how would you resolve it?

@domenic
Collaborator

Well, I feel like if there's an absolute ref, it should always use that, no matter what the base---right? So the second case seems obvious.

As for the first, not sure---time to fire up Chrome! It says... empty string! So null doesn't seem so bad, I guess.

@deitch

FWIW, I just checked it out in Safari, Chrome and FF.

Safari & FF: Just accept it as is and resolve the ref compared to the new base, even though the base is illegit.
Chrome: with the full ref URL, accept as is; with the relative, ignores it entirely and just goes to about:blank

@deitch

Ha, crossed wires! :-)

@deitch

Seriously, when B2 is illegit, you can get some really weird stuff happening.

@annevk

I don't really understand what the question is.

@deitch

@annevk 2 questions:

Slash after Windows drive

When you have

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js
base /
with ref ?when=now#xyz
should resolve to file:///C:?when=now#xyz

Should it resolve to file:///C:/?when=now or file:///C:?when=now

In other words, should there always be a slash after the Windows drive letter?

Illegit Base

If you have

locn file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js
base http:?foo=bar#abc
with ref /

Such that resolving base relative to locn gives a non-complete URL, and then you resolve ref against that, does it give null (like Chrome) or resolve it anyways, even though the resulting URL is non-usable (like FF & Safari)?

@annevk

In "Illegit Base" base would become http:///?foo=bar#abc and ref would then be http:///.

I think for "Slash after Windows drive" the WHATWG standard currently calls for removal of the slash as the drive occurs after three slashes rather than two, but that might be in error.

@annevk

(BTW, following the RFC except where WHATWG deviates seems like a weird strategy as the latter replaces the former.)

@deitch

@annevk you are totally wrong. It doesn't seem like a weird strategy; it is a weird strategy!

Except there are two mitigating factors:

  1. The RFC is far easier to understand, while the WHATWG is not only dense, but very densely written.
  2. I had already implemented the RFC, so modifying to suit WHATWG where necessary (an incremental strategy) led to a saner and survivable approach; redoing from scratch would have taken forever (i.e. never done).

End result is it works, I can live with it.

@deitch

@annevk yes, correct, on the "Illegit Base." Right now it just says, "that base is illegit, so why bother resolving, null response." That is roughly what Chrome does in some circumstances. @domenic is arguing, and I am starting to lean that way, that we should still resolve it completely (as much as possible), and if the resolved URL is not one we can go to, deal with it later.

Yeah, I just convinced myself. :-)

For "Windows Drive", I can go either way. "file:///C:" and "file:///C:/" mean the same thing. What should we do?

@deitch

Pushed out to handle invalid bases. Try now.

@domenic
Collaborator

Let's do file:///C:/ since it matches browsers.

Also... I think we're finally done!!! :O

Time to work on fixing resourceLoader.baseUrl and resourceLoader.resolve, I think!

@deitch

I can go for that.

@deitch

Trailing slash fixed (as well as tests in urlmaster for it, of course).

@domenic
Collaborator

So, tests look good. You want to take a shot at the implementation that makes them pass, or shall I?

@deitch

My travel schedule is nothing short of insane, done 18,000 miles in 4.5 days this past week, and it keeps going. I would love to, but don't rely on me until sometime in April at earliest. I am sorry.

@domenic
Collaborator

@deitch poke :).

@deitch

I guess neither of us has gotten to it yet, eh?

I'll see if I can get to it.

@deitch

Since it has been a while, can you review before I attack this?

As I understand it, we now all agree that the tests are valid and should pass, we just need to change the source jsdom code to allow the tests to pass, correct? Please confirm.

Since it has been a while, and there have been several commits, should I do a new fork and pull, and re-add the same tests, so the branch is pretty close to current master? Does it matter?

@domenic
Collaborator
@deitch

OK, so I know that it was just as much of a pain to merge in CVS and any other old system when you branched a while back, and then had to merge in, but not matter how many times I do it, I can never get it right and always get confused.

Seems to me the easiest process here is:

  1. git checkout upstream
  2. git fetch upstream (which points to tmpvar/jsdom)
  3. rerun tests to have a clean baseline
  4. git checkout url-testing
  5. git merge upstream
  6. rerun test to make sure it matches the baseline plus just the expected failures
  7. make the repairs

Does that make sense?

@deitch

Yes, hats off, 0.6.1 works like a charm, all tests pass.

deitch added some commits
@deitch deitch Merge branch 'upstream' into url-testing
Conflicts:
	package.json
c905b79
@deitch deitch Fix missing comma in package.json 4305be1
@deitch deitch Added test.equal error message for url_testing to include the normali…
…zed URL paths
0c3c7c4
@deitch deitch URL parsing properly handles all urlmaster test cases.
1) add changes recommended by @domenic at tmpvar#550 (comment) to properly handle all URL resolution
2) Make sure to URL.resolve() even absolute href URLs, so that /./ and /../ get resolved
1f570a9
@deitch deitch url_testing should standardize resultant paths for absolute URL testi…
…ng via um.addPathEmpty()

Different platforms resolve absolute root paths, e.g. http://www.example.com - some add trailing /, some do not. Either way, we add to empty path in test results so we can test that it is correct.
a9ef596
@deitch

It was mostly your fixes above, with some tweaks, plus some test changes to get it to work. But it now includes all potential URLs... and all tests pass 100%.

Pull! Pull! Pull! :-)

@tmpvar
Owner
@domenic
Collaborator

Still getting lots of test failures having to do with Windows drive letters :(. E.g.

ERROR file:///C:/ vs file:///

locn 'file:///C:/Users/Domenic/Dropbox/Programming/GitHub/jsdom/test/jsdom/index.js' base 'a/b' with ref '/' should resolve to 'file:///C:/' (file:///C:/) instead of 'file:///' (file:///)

I pushed a cleaner version of this branch up at https://github.com/tmpvar/jsdom/compare/url-resolution. In particular I squashed all the test commits down to one, and converted tabs to spaces.

Let me know if you think you can crack the Windows drive letters; otherwise I'll try to give it a shot this weekend (since I am, after all, the one with the Windows machine).

@deitch

I only have one Windows box available, my son's, on which I have never installed node.

But I can crack this without Windows:

  1. Please run the tests and post the entire output of errors (all those locn lines) here or elsewhere I can download it (.txt file is best).
  2. I can fix on that basis (I already extracted the jsdom resolution code into a separate module, so I can run them all).
  3. I will fix, repush, and you can run the tests again on your Windows box to validate.

Plan?

@domenic
Collaborator

I got it :).

Merged as 1be2448 and f14c41b for your changes, plus 278543b to get it working on Windows.

THANK YOU SO MUCH <3 for your extensive work on this saga, starting from that tiny little bug #531 six months ago. You deserve a medal, sir!!

I've filed nodejs/node-v0.x-archive#5452 and nodejs/node-v0.x-archive#5453 in Node to complain about why we can't just use URL.resolve directly, BTW.

@domenic domenic closed this
@deitch

Awesome! Is it really six months?

It is good to work with @domenic and @tmpvar.

Medal? Maybe we can come up with the "jsdom award of excellent!" LOL!

@deitch deitch deleted the deitch:url-testing branch
@domenic domenic referenced this pull request in nodejs/node-v0.x-archive
Open

url: resolve strips drive letters from Windows file URLs #5452

@papandreou papandreou pushed a commit to papandreou/jsdom that referenced this pull request
@deitch deitch URL parsing properly handles all urlmaster test cases.
1) add changes recommended by @domenic at tmpvar#550 (comment) to properly handle all URL resolution
2) Make sure to URL.resolve() even absolute href URLs, so that /./ and /../ get resolved
f14c41b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Jan 6, 2013
  1. @deitch
Commits on Jan 21, 2013
  1. @deitch

    Use latest urlmaster with proper testing for valid location URLs. Upd…

    deitch committed
    …ate url_resolution() tests automated to use better urlmaster.
Commits on Jan 26, 2013
  1. @deitch
  2. @deitch
Commits on Jan 31, 2013
  1. @deitch
Commits on Feb 5, 2013
  1. @deitch
Commits on Feb 13, 2013
  1. @deitch
Commits on Feb 17, 2013
  1. @deitch

    Require urlmaster 0.2.12

    deitch committed
Commits on Feb 19, 2013
  1. @deitch

    urlmaster v0.2.13 includes proper handling of reference query string …

    deitch committed
    …when no authority present
  2. @deitch
  3. @deitch
Commits on May 9, 2013
  1. @deitch

    Merge branch 'upstream' into url-testing

    deitch committed
    Conflicts:
    	package.json
  2. @deitch
  3. @deitch
  4. @deitch

    URL parsing properly handles all urlmaster test cases.

    deitch committed
    1) add changes recommended by @domenic at tmpvar#550 (comment) to properly handle all URL resolution
    2) Make sure to URL.resolve() even absolute href URLs, so that /./ and /../ get resolved
  5. @deitch

    url_testing should standardize resultant paths for absolute URL testi…

    deitch committed
    …ng via um.addPathEmpty()
    
    Different platforms resolve absolute root paths, e.g. http://www.example.com - some add trailing /, some do not. Either way, we add to empty path in test results so we can test that it is correct.
Showing with 59 additions and 20 deletions.
  1. +15 −16 lib/jsdom/level2/html.js
  2. +3 −0 package.json
  3. +41 −4 test/jsdom/index.js
View
31 lib/jsdom/level2/html.js
@@ -67,10 +67,13 @@ core.resourceLoader = {
baseUrl: function(document) {
var baseElements = document.getElementsByTagName('base'),
- baseUrl = document.URL;
+ baseUrl = document.URL, baseHref;
if (baseElements.length > 0) {
- baseUrl = baseElements.item(0).href || baseUrl;
+ baseHref = baseElements.item(0).href;
+ if (baseHref) {
+ baseUrl = URL.resolve(baseUrl, baseHref);
+ }
}
return baseUrl;
@@ -78,25 +81,21 @@ core.resourceLoader = {
resolve: function(document, href) {
// if getAttribute returns null, there is no href
// lets resolve to an empty string (nulls are not expected farther up)
- if(href === null)
+ if(href === null){
return '';
-
- if (href.match(/^\w+:\/\//)) {
- return href;
- }
+ }
var baseUrl = this.baseUrl(document);
- // See RFC 2396 section 3 for this weirdness. URLs without protocol
- // have their protocol default to the current one.
- // http://www.ietf.org/rfc/rfc2396.txt
- if (href.match(/^\/\//)) {
- return baseUrl ? baseUrl.match(/^(\w+:)\/\//)[1] + href : null;
- } else if (!href.match(/^\/[^\/]/)) {
- href = href.replace(/^\//, "");
- }
+ // When switching protocols, the path doesn't get canonicalized (i.e. .s and ..s are still left)
+ var intermediate = URL.resolve(this.baseUrl(document), href);
+
+ // This canonicalizes the path, at the cost of overwriting the hash.
+ var nextStep = URL.resolve(intermediate, '#');
- return URL.resolve(baseUrl, href);
+ // So, insert the hash back in, if there was one.
+ var parsed = URL.parse(intermediate);
+ return nextStep.slice(0, -1) + (parsed.hash || '');
},
download: function(url, referrer, callback) {
var path = url.pathname + (url.search || ''),
View
3 package.json
@@ -70,6 +70,9 @@
"contextify": "~0.1.5"
},
"devDependencies" : {
+ "console.log" : "*",
+ "html5" : ">=0.3.8",
+ "urlmaster" : ">=0.2.15",
"nodeunit": "~0.8.0",
"optimist": "*"
},
View
45 test/jsdom/index.js
@@ -2,6 +2,7 @@ var path = require("path");
var fs = require("fs");
var jsdom = require('../../lib/jsdom');
var toFileUrl = require('../util').toFileUrl(__dirname);
+var URL = require('url');
exports.tests = {
build_window: function(test) {
@@ -678,7 +679,7 @@ exports.tests = {
},
url_resolution: function(test) {
- var html = '\
+ var um = require('urlmaster'), html = '\
<html>\
<head></head>\
<body>\
@@ -690,11 +691,12 @@ exports.tests = {
<a href="//example.com/protocol/avoidance.html" id="link6">protocol</a>\
</body>\
</html>';
+
function testLocal() {
var url = '/path/to/docroot/index.html';
var doc = jsdom.jsdom(html, null, {url: url});
- test.equal(doc.getElementById("link1").href, 'http://example.com', 'Absolute URL should be left alone');
+ test.equal(um.addPathEmpty(doc.getElementById("link1").href), 'http://example.com/', 'Absolute URL should be left alone except for possible trailing slash');
test.equal(doc.getElementById("link2").href, '/local.html', 'Relative URL should be resolved');
test.equal(doc.getElementById("link3").href, '/path/to/docroot/local.html', 'Relative URL should be resolved');
test.equal(doc.getElementById("link4").href, '/path/local.html', 'Relative URL should be resolved');
@@ -705,7 +707,7 @@ exports.tests = {
function testRemote() {
var url = 'http://example.com/path/to/docroot/index.html';
var doc = jsdom.jsdom(html, null, {url: url});
- test.equal(doc.getElementById("link1").href, 'http://example.com', 'Absolute URL should be left alone');
+ test.equal(um.addPathEmpty(doc.getElementById("link1").href), 'http://example.com/', 'Absolute URL should be left alone except for possible trailing slash');
test.equal(doc.getElementById("link2").href, 'http://example.com/local.html', 'Relative URL should be resolved');
test.equal(doc.getElementById("link3").href, 'http://example.com/path/to/docroot/local.html', 'Relative URL should be resolved');
test.equal(doc.getElementById("link4").href, 'http://example.com/path/local.html', 'Relative URL should be resolved');
@@ -719,17 +721,52 @@ exports.tests = {
base = doc.createElement("base");
base.href = 'http://example.com/path/to/docroot/index.html';
doc.getElementsByTagName("head").item(0).appendChild(base);
- test.equal(doc.getElementById("link1").href, 'http://example.com', 'Absolute URL should be left alone');
+ test.equal(um.addPathEmpty(doc.getElementById("link1").href), 'http://example.com/', 'Absolute URL should be left alone except for possible trailing slash');
test.equal(doc.getElementById("link2").href, 'http://example.com/local.html', 'Relative URL should be resolved');
test.equal(doc.getElementById("link3").href, 'http://example.com/path/to/docroot/local.html', 'Relative URL should be resolved');
test.equal(doc.getElementById("link4").href, 'http://example.com/path/local.html', 'Relative URL should be resolved');
test.equal(doc.getElementById("link5").href, 'http://example.com/path/to/docroot/index.html#here', 'Relative URL should be resolved');
test.equal(doc.getElementById("link6").href, 'http://example.com/protocol/avoidance.html', 'Relative URL should be resolved');
}
+
+ function testAutomated() {
+ /*
+ RFC resolution cases from http://tools.ietf.org/html/rfc3986#section-5.2
+ urlmaster builds them all for us
+ */
+ // create a doc with all of the possible bases and all of the possible refs
+ var bases = um.generateAll({scheme:"http",auth:'www.why.com',path:'/a/b',query:'?foo=bar',frag:'#abc'},true),
+ refs = um.generateAll({scheme:"https",auth:'www.not.com',path:'/q/r',query:'?when=now',frag:'#xyz'},true);
+
+ // build html with every possible link
+ var html = '\
+ <html>\
+ <head><base href=""></base></head>\
+ <body>';
+ refs.forEach(function(ref,i){
+ html += '<a href="'+ref+'" id="link'+i+'">link'+i+'</a>\n';
+ });
+ html += "</body></html>";
+ var locn = toFileUrl(__filename);
+
+ // now check each base case
+ bases.forEach(function(base,i){
+ var doc = jsdom.jsdom(html, null, {url: locn}), expected = um.resolveTrack(locn,base,refs);
+ // set up the base
+ doc.getElementsByTagName("base")[0].setAttribute("href",base);
+ refs.forEach(function(ref,j){
+ var href = doc.getElementById("link"+j).href, result = expected[j][3];
+ // empty path must accept href with or without a slash; see http://tools.ietf.org/html/rfc3986#section-3.3
+ // so we consistently push them towards having
+ test.equal(um.addPathEmpty(href), um.addPathEmpty(result), 'locn \''+locn+'\' base \''+base+'\' with ref \''+ref+'\' should resolve to \''+result+'\' ('+um.addPathEmpty(result)+') instead of \''+href+'\' ('+um.addPathEmpty(href)+')');
+ });
+ });
+ }
testLocal();
testRemote();
testBase();
+ testAutomated();
test.done();
},
Something went wrong with that request. Please try again.