Support relative URLs #531

Open
sholladay opened this issue Jul 17, 2020 · 67 comments
Labels
addition/proposal (New features or enhancements); needs implementer interest (Moving the issue forward requires implementers to express interest); topic: api; topic: model (For issues with the abstract-but-normative bits)

Comments

@sholladay

sholladay commented Jul 17, 2020

The new URL() constructor currently requires at least one of its arguments to be an absolute URL.

new URL('./page.html', 'https://site.com/help/');  // OK
new URL('./page.html', '/help/');  // Uncaught TypeError: URL constructor: /help/ is not a valid URL.

That requirement is painful because determining which absolute URL to use as a base can be difficult or impossible in many circumstances. In a regular browser context, document.baseURI should be used. In Web Workers, self.location should be used. In Deno, window.location should be used but only if the --location command line option was used. In Node, there is no absolute URL to use. Trying to write isomorphic code that satisfies this requirement is quite error prone.
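A minimal sketch of the environment detection described above, to show how much guesswork it involves. The helper names `detectBaseURL` and `resolveIfPossible` are hypothetical, and the `try`/`catch` is a defensive assumption for runtimes that may throw when `location` is accessed without being configured.

```javascript
// Hypothetical helper: returns an absolute base URL string, or undefined
// when the environment does not provide one (for example, plain Node).
function detectBaseURL() {
  try {
    // Regular browser documents: document.baseURI honours <base href>.
    if (typeof document !== 'undefined' && typeof document.baseURI === 'string') {
      return document.baseURI;
    }
    // Web Workers expose self.location; Deno exposes location only with --location.
    const loc = globalThis.location;
    if (loc && typeof loc.href === 'string') {
      return loc.href;
    }
  } catch {
    // Assumption: treat a throwing accessor the same as "no base available".
  }
  return undefined;
}

// Hypothetical helper: resolves input if a base can be found, otherwise
// returns null instead of throwing.
function resolveIfPossible(input) {
  try {
    return new URL(input, detectBaseURL()).href;
  } catch {
    return null;
  }
}

console.log(resolveIfPossible('https://site.com/help/page.html'));
// Relative input only resolves when the environment supplies a base.
console.log(resolveIfPossible('./page.html'));
```

Even with such a helper, the result for relative input depends on where the code runs, which is the isomorphism problem this issue is about.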

Additionally, in many cases it would be useful to parse and resolve relative URLs against each other without knowing an absolute base URL ahead of time.

// Desired output - these currently do not work
new URL('/to', '/from').toString();  // '/to'
new URL('/to', '//from.com/').toString();  // '//from.com/to'

The lack of support for use cases involving only relative URLs is causing me to remove WHATWG URL from Ky, a popular HTTP request library, in favor of our own string replacement. See: sindresorhus/ky#271

Desired API and whether to update the existing new URL() API or create a new API?

From my perspective, updating the new URL() constructor so it can handle a relative URL in the baseUrl argument would be ideal, i.e. remove the requirement for an absolute base in favor of simply parsing any missing URL parts as empty strings (as is currently done when a URL lacks a query, for example). But I understand that changing new URL() at this point may be difficult and it may be more practical to instead create a new API; perhaps new PartialURL() or split out the validation, parsing, and resolution algorithms into individual methods.

For my purposes, I need to at least be able to parse and serialize a relative URL, without having to provide an absolute base URL. A method that resolves two relative URLs against each other and returns the resulting relative URL would also be useful, e.g. URL.resolve('./from/index.html', './to') -> ./from/to.

@annevk
Member

annevk commented Jul 17, 2020

Well, its purpose is to create a URL and those are by definition not relative. I could see wanting something specialized for path/query/fragment manipulation though. Are there any popular libraries that handle that we could draw inspiration from?

@sholladay
Author

Where is it defined that a URL must contain a scheme and a host in order to be a valid URL?

Even if such a definition exists, new URL() is the first API in the web ecosystem that I have encountered that has this limitation, making it quite surprising.

Beyond that, the WHATWG URL spec itself defines relative URLs...

https://url.spec.whatwg.org/#relative-url-string

As for existing implementations, see Node's url.parse() and url.resolve(), among others. I've used these extensively to manipulate URLs where the scheme and/or host is not known ahead of time and will be determined later by the end-user or browser, depending on where the URL is ultimately used.

@annevk
Member

annevk commented Jul 17, 2020

It defines them as input (though only in the context of a base URL, which at least browsers always use), it doesn't define them as data structures. The data structure is defined at https://url.spec.whatwg.org/#url-representation (though it's fair to say that does make it seem like more is optional than in reality is optional; something to improve).

@sholladay
Author

sholladay commented Jul 17, 2020

I get that browsers need an absolute base URL to actually perform a request. And thus it makes sense for the URL specification to define what an absolute base URL is and discuss resolving relative URLs in the context of an absolute base URL, etc.

What doesn't make sense to me is why new URL() imposes this limitation. I cannot think of anything else on the web platform that does this. Even HTML's <base> tag supports relative URLs, despite the fact that it is specifically meant for defining the base URL.

I can see some value in an API that tests whether a URL is absolute. So perhaps part of the problem here is that new URL() actually does a lot of things: parsing, resolving, and validating. These could be broken down into separate methods. I don't think that is strictly necessary, though it would be one way to solve this.

@annevk
Member

annevk commented Jul 17, 2020

Browsers only have a single URL parser that works as new URL() does (and as defined at https://url.spec.whatwg.org/#url-parsing). E.g., when parsing <base href> the location of the document is used. And in fact, the entirety of the web platform does this as it all builds upon this standard and its primitives.

@sholladay
Author

Browsers only have a single URL parser that works as new URL() does

Sure, as I said, it's completely reasonable that a browser needs to resolve to an absolute URL. But I'm not building a browser and I have a suspicion that most new URL() users aren't, either. I'm building software for the web platform that is environment agnostic and needs the same functionality as new URL() even if the scheme or host is not yet known. Use cases and relevant code linked to above.

@mgiuca
Collaborator

mgiuca commented Jul 20, 2020

To try and clarify this issue: it seems that you're not asking for a definitional change but an actual behavioural change to the Web-facing URL API.

Specifically, the changes you seem to be asking for are:

  1. If the base argument is not supplied, it defaults to document.location (the current page's URL), rather than the current behaviour which requires the url argument to be absolute if base is omitted.
  2. If the base argument is not absolute, it is first resolved against document.location (the current page's URL), rather than the current behaviour which unconditionally requires the base argument to be absolute.

So for example, if you executed these on https://github.com/whatwg/url/issues/531, all of the following are currently errors, and they would change to work as follows:

// Proposed API.
> new URL('to');
"https://github.com/whatwg/url/issues/to"

> new URL('to', '/from/');
"https://github.com/from/to"

> new URL('to', '//from.com/');
"https://from.com/to"

Technically, this is all feasible, but I don't think it's necessary or desirable. It's rather trivial to write code using the current API that behaves like this if you want it to:

// Current API.
> new URL('to', document.location);
"https://github.com/whatwg/url/issues/to"

> new URL('to', new URL('/from/', document.location));
"https://github.com/from/to"

> new URL('to', new URL('//from.com/', document.location));
"https://from.com/to"

I personally prefer not to change this. The current API forces you to be explicit about incorporating the current document's location, so it's clear to anyone reading the code that the current page's URL might leak into the result. When you don't use document.location as a base, it's a pure mathematical function of the inputs, and will produce the same output on any web page. That's a good property which I don't think we should break.

@sholladay
Author

No. I want to be able to parse and resolve relative URLs in an environment-agnostic way, for example on the server. It's completely unacceptable to rely on the DOM. The point of this issue is new functionality, which would behave exactly like new URL() does now, except it would support relative URLs in both arguments and it would return the resolved and parsed relative URL. That's it. I'm not asking for magical implicit resolution to an absolute URL. Just allow baseUrl to be relative and if it is relative, then return a relative URL.

I don't care if this is a change to the constructor or exposed as some new method.

@mgiuca
Collaborator

mgiuca commented Jul 20, 2020

Ohh, I see what you want now. (Tip: When filing a bug asking for a change to API behaviour, please give sample input and output so it's clear what you want.)

So am I right in thinking that this is what you want for my three examples:

// Proposed API.
> new URL('to');
"to"

> new URL('to', '/from/');
"/from/to"

> new URL('to', '//from.com/');
"//from.com/to"

(Noting that I'm using strings to represent the output above, but it would actually be a URL object.)

OK that makes sense. It does mean changing the URL object to allow representation of all kinds of relative URLs (scheme-relative, host-relative, path-relative, query-relative and fragment-relative). Though maybe that's helpful in explaining in general all of those different kinds of relative, which currently are not captured in the spec other than as details of the parser algorithm.
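The five kinds of relative URL named above can be told apart by their leading characters. The following is a rough sketch only: `classifyReference` is a hypothetical name, and the check deliberately ignores the parser's quirks (backslashes treated as slashes, leading whitespace, and same-scheme forms such as `http:foo`).

```javascript
// Hypothetical classifier for URL strings, based only on their prefix.
function classifyReference(input) {
  // A leading scheme such as "https:" means the string is not relative here.
  if (/^[a-zA-Z][a-zA-Z0-9+.-]*:/.test(input)) return 'absolute';
  if (input.startsWith('//')) return 'scheme-relative';
  if (input.startsWith('/')) return 'host-relative';
  if (input.startsWith('?')) return 'query-relative';
  if (input.startsWith('#')) return 'fragment-relative';
  return 'path-relative';
}

for (const input of ['//from.com/', '/from/', 'to', '?page=2', '#top', 'https://site.com/']) {
  console.log(input, '->', classifyReference(input));
}
```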

@sholladay
Author

To be fair, I referenced Node's url.resolve() as an example of an existing implementation that produces the expected output (approximately). But point taken. Yes, you are correct about the desired output.

This would be a massive help to a lot of libraries and tools, especially those that aim to be isomorphic.

@masinter

For multipart/related, we invented a scheme "thismessage:". You could use "thismessage:/" as the base if you didn't have one, and remove it if it was there when done. https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml#thismessage

@sholladay
Author

Interesting. I did actually consider something exactly like that using invalid: as a scheme, but it's a hack and we'd like to avoid it. In Ky, we were able to use a regex string replacement for the query part of the URL, which also isn't great, but that was sufficient for the one place we still used new URL() - we removed all other usage of new URL() due to the aforementioned problems. There are other situations I've encountered, though, where something more complicated is needed. Parsing and resolving relative URLs is really something that should be built into the standard web APIs.

@brainkim

brainkim commented Aug 27, 2020

Hi, I’m in a similar situation. I’m prototyping a bundler and I keep running into issues using the WHATWG URL class, specifically because it does not parse origin-relative URLs. The use-case is that I want to specify a common prefix for the public distribution of static files; for instance, the prefix can be the string "/static/", implying that the origin is the same origin as the server, but it can also be an absolute URL on a different origin ("https://mycdn.com/"). Some common operations I need include resolving relative and absolute URLs against this base, detecting if another URL is “outside” the base, and getting the relative path of a URL relative to the base, all of which could be done if an origin relative URL could be passed to the URL constructor, something like new URL("main.js", "/static/").

If anyone has any solutions, I’d love to hear about it. I’m loath to abandon the URL class completely because of all the work it does in parsing URLs, but right now I have a Frankenstein system with URLs, the path/posix module, and regexes that I’d like to abstract.

@annevk
Member

annevk commented Aug 27, 2020

@brainkim for that specific case it seems you could work around this by using a fake origin such as https://fakehost.invalid and removing it later on.

Also, if we did something here it would not be by changing new URL(). The output of that has to be "complete" and useful in a wide variety of contexts that expect a scheme and such.
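A minimal sketch of the fake-origin workaround, assuming the placeholder `https://fakehost.invalid` suggested above. The function name `resolveRelative` is hypothetical. It only round-trips origin-relative results: a path-relative base loses its relative-ness, and a scheme-relative base picks up the placeholder's scheme.

```javascript
// Placeholder origin that is attached before resolving and stripped afterwards.
const FAKE_ORIGIN = 'https://fakehost.invalid';

// Hypothetical helper: resolves input against base, where base may be
// origin-relative (for example "/static/") or absolute.
function resolveRelative(input, base) {
  const absoluteBase = new URL(base, FAKE_ORIGIN);
  const resolved = new URL(input, absoluteBase);
  if (resolved.origin !== FAKE_ORIGIN) {
    // A real origin was supplied somewhere. Caveat: a scheme-relative base
    // such as "//from.com/" ends up here with the placeholder's "https:" scheme.
    return resolved.href;
  }
  // Strip the placeholder origin again.
  return resolved.pathname + resolved.search + resolved.hash;
}

console.log(resolveRelative('to', '/from/'));                // "/from/to"
console.log(resolveRelative('main.js?v=1#x', '/static/'));   // "/static/main.js?v=1#x"
console.log(resolveRelative('main.js', 'https://mycdn.com/')); // "https://mycdn.com/main.js"
```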

@brainkim

@annevk

I’m currently experimenting with using a custom protocol for the base (currently local:///) and it actually seems to be working out. It seems like it’s important to use 1 or 3 slashes so that the constructor does not interpret the first path part as a host. I still need posix path helpers to deal with pathname, and I have lots of code I’m not sure about like url.pathname.startsWith(publicPrefix.pathname) but this slowly seems to be turning into an acceptable solution.
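The slash-count behaviour and the prefix check mentioned above can be sketched as follows. `local:` is the made-up scheme from this comment, and `isInsideBase` is a hypothetical helper that assumes the prefix pathname ends with a slash.

```javascript
// With a non-special scheme, two slashes start an authority,
// so the first path part becomes the host.
const twoSlashes = new URL('local://static/main.js');
const threeSlashes = new URL('local:///static/main.js');
const oneSlash = new URL('local:/static/main.js');

console.log(twoSlashes.host, twoSlashes.pathname);     // host "static", pathname "/main.js"
console.log(threeSlashes.host, threeSlashes.pathname); // host "", pathname "/static/main.js"
console.log(oneSlash.host, oneSlash.pathname);         // host "", pathname "/static/main.js"

// Hypothetical check for whether url lies under publicPrefix.
// Assumes publicPrefix.pathname ends with "/", otherwise "/static" would
// also match "/static-files/...".
function isInsideBase(url, publicPrefix) {
  return url.protocol === publicPrefix.protocol
    && url.host === publicPrefix.host
    && url.pathname.startsWith(publicPrefix.pathname);
}

const publicPrefix = new URL('local:///static/');
console.log(isInsideBase(new URL('main.js', publicPrefix), publicPrefix));       // true
console.log(isInsideBase(new URL('../secret.txt', publicPrefix), publicPrefix)); // false
```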

Are there any thoughts on the fake protocol to use? I’m checking against https://en.wikipedia.org/wiki/List_of_URI_schemes to make sure I’m not stepping on well-known protocols. Maybe there is a very good reason not to use local:///? I’ve also considered internal:///, self:///, and relative:///? I want some name which indicates that the URL should be relative to the origin assigned to the server.

@masinter

You could use thismessage:/ which was set up exactly for this purpose when defining multipart/related

@brainkim

@masinter Looks good. From https://www.w3.org/wiki/UriSchemes/thismessage:

defined for the sole purpose of resolving relative references within a multipart/related structure when no other base URI is specified

The “multipart form” part threw me off earlier but I think this is acceptable.

@zamfofex

I hope @alwinb doesn’t mind me advertising their library here (nor anyone else, for that matter), but I recently found it through #405 (comment), and it allows manipulating relative URLs and resolving them against other (relative or absolute) URLs in a way that complies with this specification.

It’s really simple, actually!

let url = new Url("../messages/goodbye.txt")
url = url.set({file: "hello.txt"})
console.log(url.host, [...url.dirs], url.file) // null, ["..", "messages"], "hello.txt"

console.log(new Url("https://example.com/things/index.html").goto(url).force().normalize().href) // "https://example.com/messages/hello.txt"

A couple notes:

  • .normalize() will collapse . and .. appropriately.
  • .force() will ensure special URLs have a host. (In this example, it’s unnecessary).
  • URL objects appear to be immutable. (From what I was able to check.)
  • When parsing a relative URL, you can specify the parsing mode (“special” vs. “file” vs. “regular”) with an argument to the Url constructor. (It defaults to non‐file special, i.e. similar to http[s] and ws[s].)
  • You can construct a URL object from “parts” instead of from a string. (Relevant to Consider adding a constructor to make a URL from parts #354.)
  • By default, .toString() will produce a string that can contain non‐ASCII characters. .toASCII() (or equivalently, .toJSON() or .href) will produce an ASCII string, using percent‐encodings and punycode as appropriate.

Maybe this library can serve as inspiration of some kind for an API for the spec.

@alwinb
Contributor

alwinb commented Oct 1, 2020

@zamfofex thank you, that is a nice summary!

I think that the most important part is not the API though, but the model of URLs underneath.

The parser that is used in the standard at the moment, simply cannot support relative URLs (without major changes, at least). And after having worked on my library, I can understand why, because it was a really complicated and frustrating process to come up with something compliant that could! I'd forgive people for thinking that it cannot be done at all.

I'll sketch part of my solution, for the discussion here.


The force operation is one key part of the solution.
Consider the issue of repeated slashes:

  1. http:foo/bar
  2. http:/foo/bar
  3. http://foo/bar
  4. http:///foo/bar

According to the standard all of these 'parse' (ie. parse-and-resolve) to the same URL. However, when 'parsed against a base URL' they behave differently. So you cannot just use:

  • special-url := [special-scheme :] [(/|\)* authority] [path-root] [relative-path] [? query] [# hash]

or something like that, as a grammar, because then you'd fail to resolve correctly when a base URL is supplied. (I'm using square brackets for optional rules here). So you need to start off with a classic rule that has two slashes before the authority.

My first parser phase is very simple and parses them as such:

  1. (scheme"http") (dir"foo") (file"bar")
  2. (scheme"http") (path-root"/") (dir"foo") (file"bar")
  3. (scheme"http") (auth-string"foo") (path-root"/") (file"bar")
  4. (scheme"http") (auth-string "") (path-root"/") (dir"foo") (file"bar")

From there,

  • It detects drive letters, via an operation on this structure, and it parses the authority from the auth-string.
  • Then, the goto operation, is quite like the 'non-strict merge' of RFC 3986. So this is nice, it is just a classic algorithm, and it is very simple.
  • Finally, force, solves the problem of the multiple slashes. If the (special) URL does not have an authority, or if its authority is empty, then it 'steals' an authority-string from the first non-empty dir-or-file, and it invokes the authority parser on that.
    I like this solution, because it matches the standard, but it also respects the RFC. This is indeed a 'force' that is only applied as an error-recovery strategy.
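The repeated-slash behaviour described above can be observed with the existing `new URL()` API. This sketch assumes a same-scheme base, `http://example.com/dir/file`, chosen here only for illustration.

```javascript
const inputs = ['http:foo/bar', 'http:/foo/bar', 'http://foo/bar', 'http:///foo/bar'];
const sameSchemeBase = 'http://example.com/dir/file';

// Without a base, all four inputs parse to the same URL.
const withoutBase = inputs.map((input) => new URL(input).href);
console.log(withoutBase);
// [ "http://foo/bar", "http://foo/bar", "http://foo/bar", "http://foo/bar" ]

// Against a base with the same scheme, the first two are treated as relative.
const withBase = inputs.map((input) => new URL(input, sameSchemeBase).href);
console.log(withBase);
// [ "http://example.com/dir/foo/bar", "http://example.com/foo/bar",
//   "http://foo/bar", "http://foo/bar" ]
```

The first input behaves as path-relative and the second as host-relative, which is why a single grammar with optional slashes before the authority cannot describe both modes.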

@alwinb
Contributor

alwinb commented Oct 21, 2020

I did a branch of jsdom/whatwg-url a while ago that uses a modular parsing/resolving algorithm, passes all of the tests (well, except 5/1305 that I was looking to get some help with) and has everything in place to start supporting relative URLs.

I did not post it because the changes are so large, as-is, that it would not be feasible to adopt them in the standard. I was thinking about a way to provide the same benefits incrementally and with less intrusive changes, so that it could be merged into the spec gracefully. However, I have the impression that even if I'd manage to do that, the changes will be resisted for reasons that are not technical but social and emotional. So I am leaving it here as is. I am disappointed by the situation, I hope it will work out eventually, because support for relative URLs would be very useful to people, and also because a modular/ compositional approach enables you to talk with precision about the constituents that URLs are made of, improving the spec itself and all the discussions around it.

There have been good reasons why this has not been done before. It is a messy problem especially in combination with the different browser behaviours. I've built on that work and solved the issue, but as usual, there's more to it than solving the technical challenges.

Part of the discussion around this was in #479.

The branch, as-is, is here: https://github.com/alwinb/whatwg-url/tree/relative-urls. The readme is no longer accurate; sorry for that.

@annevk
Member

annevk commented Oct 22, 2020

I think the main reason we have not made a lot of progress here is lack of browser-related use cases. Apart from browsers the API is only supported by Node.js. That's not enough for https://whatwg.org/working-mode#changes. Perhaps https://github.com/WICG/urlpattern brings some change to this, but it's a bit too early to say. Now I might well be wrong and there is in fact a lot of demand for this inside the browser or by web developers using a library to solve this in browsers today. If someone knows that to be the case it would be great if they could relay that.

@sholladay
Author

Our use case is in the browser, I only mentioned other environments as an example of how it could benefit the larger community. Ky targets browsers primarily. We just don't want to specifically rely on the DOM or window. So we try to avoid referencing document.baseURI or window.location. That makes it difficult for us to use new URL() because it doesn't support relative URLs, which we are sometimes given as input because we are operating in a browser and relative URLs are a common occurrence in browser land.

@annevk
Member

annevk commented Oct 23, 2020

Thanks for your reply Seth, could you perhaps go into some more detail as to why you want to avoid window.location and where these relative URLs are common?

@masinter

you might check with @jyasskin for another use of relative URLs for browsers. Relative URLs were an important part of multipart/related capture of relationship of components in a saved web page. It was the reason for the invention of the "thismessage" scheme (for supplying a base when none was present.)

@jyasskin
Member

Re @masinter, web packages don't currently have any fields that allow relative URLs. If we change that, I don't think we'd need to expose the relative-ness to JavaScript; we'd just resolve them against the package's base URL, like we do for the relative URLs in HTML.

@alwinb
Contributor

alwinb commented Oct 25, 2020

I'm not completely sure I accurately understand the last comment, but I think that what @jyasskin calls 'exposing relative-ness' is just what this issue is asking for. It is asking for an addition to the API that exposes a parsed version of what is called a "relative reference" in the parlance of RFC 3986 (I usually call it a relative URL).

I'm arguing in favour of it because I would like the standard to define an analogue of "relative reference". This is not currently the case, so in places where relative references are useful or needed, people cannot refer to the standard for guidance.

@annevk points out that for such a change to be considered, they need examples where relative references are useful in a browser context, so we're looking for such use cases.

@ti1024

ti1024 commented May 4, 2021

Thanks for your reply Seth, could you perhaps go into some more detail as to why you want to avoid window.location and where these relative URLs are common?

@annevk points out that for such a change to be considered, they need examples where relative references are useful in a browser context, so we're looking for such use cases.

I think that there are natural cases where generating relative URLs is useful in a web app.

Suppose that some component A generates a link to another component B which takes a query parameter. For example, component A is at http://example.com/inbox and component B is at http://example.com/message?id=<the ID of a message>.

One approach is to generate an absolute URL, so that the DOM will be like <a href="http://example.com/message?id=abcde">Open message</a>. But this introduces unnecessary dependency on the domain name. This causes inconveniences such as that the domain name has to be faked in unit tests.

Another approach is to generate a relative URL, so that the DOM will be like <a href="/message?id=abcde">Open message</a>, and leave the relative-to-absolute conversion to the browser. To do so, it would be useful to write code like

const url = new URL('/message');
url.searchParams.set('id', messageId);
const link = createElement('a');
link.href = url.href;
...

but this code does not currently work because new URL('/message') throws.
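Until something like that works, one way to get the same link today is to build the query with `URLSearchParams` and concatenate. This is a sketch only: `buildRelativeLink` is a hypothetical helper, and it assumes `path` carries no query or fragment of its own.

```javascript
// Hypothetical helper: builds an origin-relative link from a path and parameters.
function buildRelativeLink(path, params) {
  // URLSearchParams takes care of percent-encoding names and values.
  const query = new URLSearchParams(params).toString();
  return query ? `${path}?${query}` : path;
}

console.log(buildRelativeLink('/message', { id: 'abcde' })); // "/message?id=abcde"
console.log(buildRelativeLink('/inbox', {}));                // "/inbox"
```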

@masinter

masinter commented Feb 3, 2022

alwinb/url-specification#16

looks like good progress @alwinb

@karwa
Contributor

karwa commented Feb 4, 2022

I'm not a big fan of combining that with the existing API for absolute URLs. A lot of legacy libraries went that way, and it ends up having all kinds of problems - from poor performance to non-obvious semantics. For example, in a strongly-typed language, you can have a function which accepts a parameter of type URL; but if that single URL type supports both absolute URLs and relative references, pretty much anything (including "foo") counts as a URL, which is generally not what developers expect (at least for the sorts of applications in my domain, perhaps expectations on the web are different). I think the use-cases are distinct enough that they warrant a separate API.

Also, I'm not sure it's obvious that we need all of the quirky web-compatibility behaviour that the relative-string parser does, treating back-slashes like forward-slashes and such. For a lot of use-cases, you have better control over the inputs and can do perfectly fine with only a sanitised subset of that behaviour.

One thing that's worth noting though: currently you don't only need the scheme to know how to interpret a URL. For example, in this case, a correct interpretation also requires you to know the base URL's path:

// "C|" is not interpreted as a drive letter if the base path has a drive letter
(input: "/a/../C|/Windows", base: "file:///D:/Music")  --> "file:///D:/C|/Windows"

// Same input string, but base doesn't have a drive letter.
// Now "C|" is considered a drive letter.
(input: "/a/../C|/Windows", base: "file:///Dx/Music")  --> "file:///C:/Windows"

(This is #574, and hopefully fixable)

@alwinb
Contributor

alwinb commented Feb 5, 2022

Passing a fallback protocol to the parser to select certain behaviour for the otherwise ambiguous, scheme-less URLs does work, and this is what I have done so far.

But having to pass options around, becomes cumbersome and I can see how that would cause confusing problems. So I’ve taken on the challenge to structure things in a way that avoids that, as much as possible. And there are some interesting things to note about this.

What works well for most issues, is to loosen the constraints on URLs somewhat whilst modifying or combining them, and to enforce them later by calling a separate method to convert the (possibly) relative URL to an absolute/resolved URL, something that was suggested to me by @zamfofex.

  • An example: The URL standard requires that special-scheme URLs have an authority with a host that is either an IP address or a valid domain. If the domain is not valid, then the parser fails. Now, first of all, http:foo is actually a usable relative URL. It is host-relative; if it is resolved, then it will take the host from the base URL if that is an http URL too. (RFC 3986 calls this non-strict resolution). Thus, a relative http URL need not have a host, and if you do enforce that too soon, then you cannot express reference resolution in a way that matches the standard.

  • Second, if the API allows modifying eg. the scheme, then it is possible to create an http URL with an opaque host that maybe cannot be parsed as a domain (it could have encoded forbidden-domain-codepoints, for example). Here, rather than throwing an error right away, it is again useful to allow (for relative/ non-resolved) http URLs to temporarily have an opaque host. The host might be changed later in code, or it may be possible to parse it as a domain just in time before resolving it.

The IETF standards have made this distinction between such more- and less-constrained URLs before. The name for such a more tolerant and/or relative URL is a URIReference, or an IRIReference. The WHATWG equivalent to that would be slightly more tolerant still, as it would allow a few more codepoints, and invalid-percent-escape sequences in various components so as to remain consistent with WHATWG URLs.

@alwinb
Contributor

alwinb commented Feb 8, 2022

Alright. I think I’m getting there. I am trying to get an implementation together that can serve as an API proposal. It may take a bit of time still, but I’ll do my best.

@karwa
Contributor

karwa commented Feb 8, 2022

I'm starting to think that would work best as a separate document. It would need a reference implementation and its own comprehensive test-suite; essentially being another URL standard. What's more is that I'm not sure how many implementations would actually want this. I'm not convinced the use-cases are entirely clear.

I've seen HTTP routing mentioned as a major use-case, but HTTP origin-form request targets (of the form: /foo?bar) are not relative URL references. They are combined path-and-queries, and the HTTP standard says that you should use string concatenation (not relative reference resolution) to reconstruct the effective request URL. If you don't want to use string concatenation, you could split the request target and use the existing pathname and search setters. In any case, relative URLs don't come in to HTTP routing whatsoever.

To put it another way, what do you think GET //foo/bar?baz should resolve to? As a relative reference, this would be a hostname of foo and path of bar. Following the HTTP standard, it would be a path of //foo/bar.
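The two interpretations can be compared with the existing API. The origin `http://example.com` is an assumption made for illustration.

```javascript
const requestTarget = '//foo/bar?baz';
const serverOrigin = 'http://example.com';

// Interpreted as a relative reference: "foo" becomes the host.
const asReference = new URL(requestTarget, serverOrigin + '/');
console.log(asReference.hostname, asReference.pathname);
// "foo" "/bar"

// Interpreted as an HTTP request target: concatenated onto the origin.
const asRequestTarget = new URL(serverOrigin + requestTarget);
console.log(asRequestTarget.hostname, asRequestTarget.pathname);
// "example.com" "//foo/bar"
```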

By the way, I'm not the only one to bring this up - it was also mentioned in this comment on the NodeJS issue, and was positively received but ultimately appears to have been ignored.

There may be use-cases which require manipulating a relative reference in a scheme, host and base path-independent context, but I haven't seen the long list of convincing use-cases such a large change to the standard would require. Right now it seems to be based on a misconception, using Github reactions to put pressure on the standard.

We are nowhere near the proposal stage, IMO. Personally I'd be -1 if it was proposed today without multiple clear and convincing use-cases, backed up by experience using a reference implementation, explanations of why it needs the behaviour that it is being proposed and cannot be simplified, and as mentioned, a very extensive suite of tests.

@alwinb
Contributor

alwinb commented Feb 8, 2022

I don't understand why you would reply in such a way. I'll try to get something out there so that there is something concrete to try out and play with and discuss further.

@karwa
Contributor

karwa commented Feb 8, 2022

@alwinb So far, the most convincing use-case that I've seen is that somebody on the NodeJS issue wanted to construct relative references. OK, that's a thing.

But does that require the full complexity of the URL parser and all the work that you've done to establish a formal grammar and theory of URLs (which is certainly very interesting, don't get me wrong)? I'm not sure that it does. I think we could solve that with a simple data structure containing the broken-down URL components -- e.g. a list of path segments you could directly append to/remove from, and some simple methods to define how we serialise that structure as a relative reference.
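One possible shape for such a structure is sketched below. It is purely illustrative: the class name `RelativeReference` and its fields are hypothetical, it covers only path, query and fragment, and it uses `encodeURIComponent`, which escapes more characters than the URL standard's path encoding would.

```javascript
// Hypothetical container for the broken-down parts of a relative reference.
class RelativeReference {
  constructor({ absolutePath = false, segments = [], query = null, fragment = null } = {}) {
    this.absolutePath = absolutePath; // true for "/a/b", false for "a/b"
    this.segments = [...segments];    // path segments; append or remove directly
    this.query = query;               // plain object of parameters, or null
    this.fragment = fragment;         // string, or null
  }

  // Serialises the parts as a relative reference string.
  toString() {
    const path = (this.absolutePath ? '/' : '')
      + this.segments.map(encodeURIComponent).join('/');
    const query = this.query === null
      ? ''
      : '?' + new URLSearchParams(this.query).toString();
    const fragment = this.fragment === null
      ? ''
      : '#' + encodeURIComponent(this.fragment);
    return path + query + fragment;
  }
}

const ref = new RelativeReference({ segments: ['..', 'messages'] });
ref.segments.push('hello world.txt');
console.log(String(ref)); // "../messages/hello%20world.txt"

const link = new RelativeReference({ absolutePath: true, segments: ['message'], query: { id: 'abcde' } });
console.log(String(link)); // "/message?id=abcde"
```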

I think it's important to remember that the URL parser in this standard does not represent the cleanest definition of URLs. The thing that gives it value is that it incorporates all of the weird compatibility hacks actors on the web platform need to consider when interpreting URLs. With every change, we have to consider what browser X, Y, and Z does, and whether we can call it a reasonable interpretation of how people expect URLs on the web to work.

For new APIs which don't have any of those compatibility concerns, I think we should be striving for the simplest design that solves the problems we actually have. Propagating those hacks beyond what is needed for compatibility should not be a goal IMO. And those APIs should ideally be used in production for a while before proposing standardisation.

But instead, you're just coming and saying you've solved all the problems on your own, in a purely academic exercise that has not been used in production and is apparently not even driven by specific problems encountered in a real application (or was it?). I'm not sure what you expect other than scepticism. I would really love to offer more support, but I just can't consider this as a realistic proposition.

I would suggest following an approach similar to URLPattern - another new API that is closely related to this standard. Produce a focussed document and API for a specific problem, cut everything that isn't needed, produce an implementation, let users work through the issues, and build up a test suite as edge-cases are discovered, etc.

Anyway, that's my opinion. I'm not an editor of this standard so it's "non-normative" 😅, but I'm offering it because you have clearly studied the standard, and I'd rather help you to not waste your time developing overly-broad, overly-abstract specifications.

@alwinb
Contributor

alwinb commented Feb 9, 2022

I think we could solve that with a simple data structure containing the broken-down URL components -- e.g. a list of path segments you could directly append to/remove from, and some simple methods to define how we serialise that structure as a relative reference.

URLs are that simple datastructure that contain the broken down components. And this is about coming up with an API for that.

I think it's important to remember that the URL parser in this standard does not represent the cleanest definition of URLs.

I know the standard well enough to say that the hacks can be described in a clean way. The differences with RFC3987 are painfully small.

For new APIs which don't have any of those compatibility concerns, I think we should be striving for the simplest design that solves the problems we actually have.

And I was hoping to come up with a very simple API.

But instead, you're just coming and saying you've solved all the problems on your own, in a purely academic exercise that has not been used in production and is apparently not even driven by specific problems encountered in a real application (or was it?).

I reject that story line. I was pushed into this position and had no other choice than to respond. Yes, I have written implementations. And do not dismiss academic exercise as useless.

@alwinb
Contributor

alwinb commented Feb 9, 2022

I am no longer interested in sharing my expertise with the WHATWG.

@zamfofex

zamfofex commented Feb 9, 2022

I wanted to note that Alwin Blok’s implementation is fairly concrete and does pass most (if not all) of the WPT tests. I feel like the spec is really cleanly written, and in my opinion is much clearer to follow than the WHATWG spec. (Now, this is not to dismiss the WHATWG’s spec and their efforts, but I feel like the way Alwin’s spec is laid out makes it much easier to follow for me.)

I do not want to get into reasons as to why trying to at least formally explain discrepancies between the WHATWG spec and the RFCs might be useful, as that is off‐topic for this issue (though I believe there are at least some).

Now, I will note: I do understand that it might be the case that relative URLs are not practically useful. To be honest, I don’t know for sure either way. I feel like there at least should be an API to convert a URL record into a relative URL string.

From what I can see, people want to manipulate URLs in the back‐end (and sometimes in the front‐end too) in a way that allows for them to produce a relative URL string. So that e.g. they could change something like <img src="flower.png"> into <img src="/assets/flower.png"> by resolving "flower.png" against "/assets/".
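The resolution described above can be approximated today with a placeholder origin, as a minimal TypeScript sketch. The function name `resolvePathOnly` and the constant `PLACEHOLDER_ORIGIN` are hypothetical. The sketch assumes the base is a path-absolute string such as "/assets/" and does not attempt to handle scheme-relative bases like "//from.com/".

```typescript
// Hypothetical placeholder; ".invalid" is reserved and never resolves to a real host.
const PLACEHOLDER_ORIGIN = "http://placeholder.invalid";

// Resolve a reference against a path-only base and return a path-only result.
function resolvePathOnly(reference: string, basePath: string): string {
  if (!basePath.startsWith("/") || basePath.startsWith("//")) {
    throw new TypeError("basePath must start with a single '/'");
  }
  const base = new URL(basePath, PLACEHOLDER_ORIGIN);
  const resolved = new URL(reference, base);
  // If the reference carried its own scheme or host, return it unchanged.
  if (resolved.origin !== base.origin) {
    return resolved.href;
  }
  return resolved.pathname + resolved.search + resolved.hash;
}

const flowerSrc = resolvePathOnly("flower.png", "/assets/");
```

This is a workaround rather than a relative-URL API: the placeholder origin still goes through the full parser, so the result is always a normalised, path-absolute string.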

Perhaps a more narrowed‐scope API could be to allow people to relativize a URL based on a URL component name. So that e.g. for url = new URL("https://example.org/hello?world#test"), then url.relativize("path") would effectively strip away everything before the path and return "/hello?world#test", likewise for e.g. url.relativize("query") === "?world#test".
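A minimal TypeScript sketch of that idea as a free function rather than a method. `relativize` and the `RelativeFrom` type are hypothetical names taken from the suggestion above; nothing like this exists on `URL` today. The sketch only concatenates the existing `pathname`, `search` and `hash` getters.

```typescript
// Hypothetical component names, following the suggestion in the comment above.
type RelativeFrom = "path" | "query" | "hash";

// Strip everything before the named component and return the remainder.
function relativize(url: URL, from: RelativeFrom): string {
  switch (from) {
    case "path":
      return url.pathname + url.search + url.hash;
    case "query":
      return url.search + url.hash;
    case "hash":
      return url.hash;
  }
}

const exampleUrl = new URL("https://example.org/hello?world#test");
const fromPath = relativize(exampleUrl, "path");
const fromQuery = relativize(exampleUrl, "query");
```

One caveat of building on the getters: `search` and `hash` return an empty string for an empty query or fragment, so a trailing "?" or "#" in the original URL is not preserved.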

(On an unrelated note: I feel like y’all are just being dismissive of each other’s efforts and interest in helping out, and I feel like that is really counter‐productive. If you truly have the intention of helping, I feel like you should be encouraging each other to both pursue investigations and researching new approaches as well as focusing on the current WHATWG spec and fixing the issues within it by actively trying to gather concrete use‐cases.)

@karwa
Contributor

karwa commented Feb 9, 2022

I feel like there at least should be an API to convert a URL record into a relative URL string.

Right, so the question I was raising is whether that data structure even needs to be a URL record. Unlike what Alwin says, URLs are not simply a data structure of broken-down components which users can manipulate willy-nilly. There are internal invariants which must be upheld to ensure that, for example, serialising that URL record and parsing it again results in an equivalent URL record.

That means, for example, you wouldn't be able to just insert some "." or ".." components in the record's list of path components. Yet those are the sort of components users of a relative URL type are likely most interested in.
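The invariant can be observed through the existing `URL` API, as in this small TypeScript sketch. The helper name `roundTrips` is hypothetical, and the check compares serialisations, which is only a proxy for comparing URL records.

```typescript
// Parse, serialise, parse again, and compare the two serialisations.
function roundTrips(input: string): boolean {
  const first = new URL(input);
  const second = new URL(first.href);
  return first.href === second.href;
}

// Dot segments never survive parsing, so they cannot appear in the record's path.
const parsed = new URL("https://example.org/a/./b/../c");
const parsedPath = parsed.pathname;

// The pathname setter re-parses its input, so dot segments are removed there too.
parsed.pathname = "/x/../y";
const pathAfterSetter = parsed.pathname;
```

A relative URL type that wants to keep "." and ".." segments would therefore need a different invariant, or a different structure, than the existing URL record.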

In general, I feel it is good practice in software engineering to design and define as little as you can get away with, to solve the problems you actually have, and only add complexity as it becomes necessary, and only if it is worth the cost. Maybe we don't even need to define how you parse a relative URL string in to broken-down components with no base URL as context? Maybe 90% of use-cases can be served by just serialising a programmatically-constructed list of components.

I don't know; I'm just speculating. But I think that's the position we should begin from. Start simple.

I feel like y’all are just being dismissive of each other’s efforts and interest in helping out, and I feel like that is really counter‐productive. If you truly have the intention of helping, I feel like you should be encouraging each other to both pursue investigations and researching new approaches

Nobody's questioning each other's intentions. But at the same time, if you want to propose a significant expansion of a high-impact industry standard, you should be prepared for some scrutiny. What sort of standard would not thoroughly scrutinise every proposed change or addition?

Throwing a temper tantrum and storming off in a huff as soon as your proposal hits basic questions like: "are your ideas appropriately scoped for the problem?", or "do you have practical evidence of your proposal solving an issue in a production environment?" is not okay. It's an attempt to shut down debate. Obviously I feel bad that Alwin appears to be leaving as a result of my questions, but what am I supposed to do? Am I supposed to not ask questions out of fear that he is going to pack up and withdraw his proposals at any moment?

As well as the point about dismissing others' work (which didn't happen, by the way; I did mention that his alternative standard was interesting), I think it's important to stress that that sort of behaviour cannot be acceptable. It's a sort of emotional extortion of people who are asking questions - which is exactly what they should be doing, and exactly what we need them to do.

My understanding is that, whilst this is an open and welcoming community, we also have basic standards of conduct and professionalism, designed to allow a fair and healthy debate of the issues. IMHO, if a contributor is unable to meet those standards, it is probably better that they not participate.

@karwa
Contributor

karwa commented Feb 9, 2022

Oh, and one more thing: the WPT tests are known to have some significant gaps. Passing them is certainly an encouraging sign, but far from definitive. It is an ongoing process to improve their coverage of the standard.

@zamfofex

zamfofex commented Feb 9, 2022

That is all fair enough.

Though I will note that a lot of what is described in Alwin’s spec doesn’t need to be exposed by an API. It can serve only as a mechanism for the spec itself to talk about URLs and describe them.

Personally, I feel like the way URLs were modeled and described in Alwin’s spec makes it clearer to follow it. You can read the description of the operations therein, and it’s immediately obvious what they do and how they work. You don’t need to mentally try to follow a state machine algorithm, you can just understand what each (very tersely described) operation does individually at a glance. In my opinion, the definitions are each succinct and simple, and they all come together to describe URLs succinctly.

I definitely agree that, if relative URL manipulation is incorporated, it should not be overloaded into the existing new URL(...) APIs, and I also understand that it is awkward to introduce a completely new API for it. There would definitely be an issue with two similar APIs doing similar things that you have to choose between. I don’t have a good solution for that, and it might be too difficult to avoid it.

serialising that URL record and parsing it again results in an equivalent URL record.

About this specifically: This is true in Alwin’s spec and implementation too. It just turns out that the way he modeled it, it allows for relative URLs to be represented too.


I also want to note that I was also criticising Alwin’s behavior. I don’t think the way he was acting was appropriate, it does really seem like he just gave up on arguing as soon as people criticised his work. I think this should be an effort to come up with something that works well for everyone, instead of assuming people are working against each other.

@alwinb
Contributor

alwinb commented Feb 12, 2022

Alright, I am going to do my best to answer all the questions, so that I don't leave you hanging like this.

I'm starting to think that would work best as a separate document. It would need a reference implementation and its own comprehensive test-suite; essentially being another URL standard.

Apart from the separate document, I agree with this. Therefore, I have written a new specification very carefully, so that it agrees with the current standard, and I have created a reference implementation. The reference implementation currently passes all of the wpt tests except for 6 IDNA related ones, because I have not implemented domainToASCII properly.

There is room for improvement. The wpt test suite is indeed lacking. Me and others have found differences that were not caught by the tests, though they have been easy to fix so far.

There are no tests yet for relative references, nor an API, because the API design is not done yet.

Fuzz testing would be great, but I am just one man, and I have not prioritised this.

What's more is that I'm not sure how many implementations would actually want this. I'm not convinced the use-cases are entirely clear.

I think this has already been partly addressed. Note though that this is not just about relative URLs, but about recovering the proper structural framework that the WHATWG has abandoned. This is one and the same problem. It helps you solve the political one too.

In any case, relative URLs don't come in to HTTP routing whatsoever.

Semantically these are a subclass of relative URLs that do not have a scheme, nor a host. It will be possible to represent them with the API. They are prefixed with a . when serialising to a relative URL. Similar things are currently done in the standard.

This brings up more arguments. HTTP uses a specific subclass of URI. The WHATWG standard can generate URL strings that are invalid URIs, thus browsers presumably are able to make invalid HTTP requests. In addition, the HTTP spec requires percent decoding of unreserved characters, which is not covered by the WHATWG. For this it is important to accurately describe the differences between the WHATWG and the RFCs - which was one of my main motivations - and to provide more advanced tools for normalising URLs in different ways.
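To illustrate the percent-decoding point, here is a minimal TypeScript sketch of the normalisation step described in RFC 3986 (decoding percent-encoded octets that correspond to unreserved characters). The function name `decodeUnreserved` is hypothetical, and this is not taken from the specification or implementation discussed in this comment.

```typescript
// Unreserved characters per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~".
const UNRESERVED = /^[A-Za-z0-9\-._~]$/;

// Decode only those percent-encoded octets that represent unreserved characters.
function decodeUnreserved(input: string): string {
  return input.replace(/%([0-9A-Fa-f]{2})/g, (match: string, hex: string) => {
    const char = String.fromCharCode(parseInt(hex, 16));
    // Reserved characters such as "/" (%2F) must stay encoded.
    return UNRESERVED.test(char) ? char : match;
  });
}

// The WHATWG parser leaves such sequences as they are.
const whatwgPath = new URL("https://example.org/%7Euser").pathname;
const normalisedPath = decodeUnreserved(whatwgPath);
```

The sketch treats its input as a plain string, so it can be applied to a path, a query, or a whole URL string alike; whether that is appropriate for every component is a design decision it does not address.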

There may be use-cases which require manipulating a relative reference in a scheme, host and base path-independent context, but I haven't seen the long list of convincing use-cases such a large change to the standard would require.

I can agree with this in the sense that relative URLs require a massive change that is very difficult to pull off. However, relative URLs alone are only an expression of much bigger problems with the standard.

We are nowhere near the proposal stage, IMO.

Again, I agree with this. I used the word proposal in a confusing way. What I meant was that I wanted to make a first pass at an API around my more low level implementation, and show it here so that we could together investigate ideas. I did not mean to suggest that we would already start changing the standard text, far from it.

But does that require the full complexity of the URL parser and all the work that you've done to establish a formal grammar and theory of URLs

Not the WHATWG parser! Otherwise, yes, it does. URLs are too complex to manipulate without a theory to back it up. It is important to do it well, so that relative references that are created by software are aligned, to prevent fragmentation. It helps avoid bugs with percent coding. As a bonus, using an API for this is more ergonomic than cutting and pasting strings. We have (hopefully) stopped doing that with SQL and we should stop doing that with URLs too. And again, relative URLs are related to solving much larger problems with the standard.

I think it's important to remember that the URL parser in this standard does not represent the cleanest definition of URLs.

I replied to this already. My specification shows that whatwg URLs and their hacks can be specified very cleanly.

For new APIs which don't have any of those compatibility concerns, I think we should be striving for the simplest design that solves the problems we actually have. Propagating those hacks beyond what is needed for compatibility should not be a goal IMO.

I agree somewhat, except it turns out that the hacks are not so bad -- apart from possibly the strange behaviour of setters, but those are not hard to characterise.

But instead, you're just coming and saying you've solved all the problems on your own, in a purely academic exercise

I have worked on this for five years and yes, I solved most all of the major problems on my own. I very carefully and foolishly considered every possible use case and edge case of your standard. Not an academic exercise. The 'theory' is a by-product of the library.

A few people have read my specification and their comments have helped me a lot. @zamfofex has pretty much solved the last problems for me.

that has not been used in production and is apparently not even driven by specific problems encountered in a real application (or was it?).

Not used in production, I guess. Used in every day programming and private projects. The whatwg API never addresses my use case.

There is at least one API wrapper around my implementation already. It is here: astro-community. This may be something to be proud of maybe, because the author appears to be an influential person. It shows that I anticipated the use cases correctly.

Produce a focussed document and API for a specific problem

This is a new step that I was hoping to do together with you here.

produce an implementation

Done, except for the API wrapper. Unless you count my reurl library (but I don't find that API appropriate for a standard, I think).

let users work through the issues, and build up a test suite as edge-cases are discovered, etc.

Yes, this was the idea, looking for feedback.


There are internal invariants which must be upheld to ensure that, for example, serialising that URL record and parsing it again results in an equivalent URL record.

All taken into consideration. Be careful with the word equivalent here, equal is more fitting.

See this thread and this comment especially.

That means, for example, you wouldn't be able to just insert some "." or ".." components in the record's list of path components.

A URI, and a URIReference can contain such components. There are subclasses of URI, specifically, path-normalised URIs that cannot contain such components. I am doing the same in my work (which really is not that different from the RFCs). The WHATWG is in trouble because it threw these distinctions out.

The (non-whatwg) resolution operator agrees with normalisation as follows.

normalise (strict-resolve (url1, url2)) == strict-resolve (normalise (url1), normalise (url2)).

Actually it is this:
normalise (strict-resolve (url1, url2)) == normalise (strict-resolve (normalise (url1), normalise (url2))).

(!) This is a property that should be maintained as much as possible, it is very powerful.
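The property can be tried out on a deliberately simplified, path-only model, sketched below in TypeScript. The functions `strictResolve` and `normalise` here are hypothetical stand-ins that operate on plain path strings; they are not the operators from the specification referred to in this comment. The sketch assumes that strict resolution merges paths without removing dot segments, and that normalisation keeps leading ".." segments on relative paths.

```typescript
// Merge a reference path onto a base path without removing dot segments.
function strictResolve(reference: string, base: string): string {
  if (reference.startsWith("/")) return reference;
  // Keep the base up to and including its last "/", then append the reference.
  return base.slice(0, base.lastIndexOf("/") + 1) + reference;
}

// Remove "." and ".." segments; relative paths keep unresolvable leading "..".
function normalise(path: string): string {
  const absolute = path.startsWith("/");
  const segments = path.split("/");
  if (absolute) segments.shift(); // drop the empty string before the leading "/"
  const out: string[] = [];
  segments.forEach((segment, index) => {
    const isLast = index === segments.length - 1;
    if (segment === ".") {
      if (isLast) out.push(""); // keep the trailing slash
    } else if (segment === "..") {
      if (out.length > 0 && out[out.length - 1] !== "..") out.pop();
      else if (!absolute) out.push("..");
      if (isLast) out.push(""); // keep the trailing slash
    } else {
      out.push(segment);
    }
  });
  return (absolute ? "/" : "") + out.join("/");
}

// Check the second (corrected) form of the property for one pair of inputs.
function propertyHolds(reference: string, base: string): boolean {
  const left = normalise(strictResolve(reference, base));
  const right = normalise(strictResolve(normalise(reference), normalise(base)));
  return left === right;
}

const sampleReference = "../x/./y";
const sampleBase = "/a/b/../c/d";
const withoutOuterNormalise = strictResolve(normalise(sampleReference), normalise(sampleBase));
```

For these sample inputs the inner resolution still contains a ".." segment, which is why the outer `normalise` is needed on the right-hand side. This is a check on sample inputs, not a proof; in this simplified model a base whose last segment is ".." already needs more care.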

In general, I feel it is good practice in software engineering to design and define as little as you can get away with, to solve the problems you actually have, and only add complexity as it becomes necessary, and only if it is worth the cost.

Mostly agree, but I'm a bit more nuanced about that. Some things cannot be done without the theory, and searching for a general theory often exposes symmetries that you can use to simplify your code. Also something like, say TypeScript, I think you cannot create in this way.


The remaining conversation was about me leaving.

Throwing a temper tantrum and storming off in a huff … is not okay. It's an attempt to shut down debate.

I know what I am talking about and have been offering my work for free. I don't have to do that.

@zamfofex

@alwinb: Just one more question: More concretely, how would you propose a change to the existing APIs in the WHATWG spec?

You have come up with a really nice specification, and your implementation works well, but when it comes down to proposing a direct change (or addition) to the existing shipped new URL(...) API from browsers, what do you think would be a succinct way to incorporate relative URLs?

Because I think it is currently expected that an instance of the URL class will be absolute, and it’s really unfortunate (and disruptive) to have to break that assumption. As I said, I don’t think it is good to introduce a different class either, as that would likely introduce an unfortunate choice to people wanting to manipulate URLs.

Personally, I feel like it is difficult to come up with a way to introduce this to the API without big issues appearing. It is unfortunate, I feel, but it means we might be stuck with an absolute‐only API in browsers.

@alwinb
Contributor

alwinb commented Feb 12, 2022

As I said, I don’t think it is good to introduce a different class either, as that would likely introduce an unfortunate choice to people wanting to manipulate URLs.

A new class may be unavoidable unless serious trade-offs are made. On the other hand it will be possible to use the new API (with additional methods) to exactly replicate the behaviour of the existing API.

@lonr

lonr commented Mar 31, 2022

I vote for RelativeURL (or named URLPathAndHash, or URLAbsolutePath).
I made an implementation. It uses code from jsdom/whatwg-url but parses URLs from the path state.

@d9k

d9k commented Dec 5, 2022

Example hack:

const FAKE_HOST = 'https://fake-host';

export default function urlAddParams(url: string, urlParams: object = {}): string {
    let urlObj: URL;

    try {
        urlObj = new URL(url);
    } catch {
        /** FIXME remove hack when https://github.com/whatwg/url/issues/531 is ready */
        urlObj = new URL(url, FAKE_HOST);
    }

    Object.entries(urlParams).forEach(([paramName, paramValue]) =>
        urlObj.searchParams.append(paramName, paramValue)
    );

    return `${urlObj}`.replace(FAKE_HOST, '');
}

@jimmywarting

my example of coming up with something like a relative URL constructor (b/c NodeJS does not have an origin or location like deno)

const URLfrom = ((_URL => (origin = 'file://') => {
  return /** @type {typeof URL} */ (Object.assign(function URL(url, base) {
    return new _URL(url, new _URL(base, origin))
  }, _URL))
})(globalThis.URL))

// Example usage:

const RelativeURL = URLfrom('https://httpbin.org')

new RelativeURL('/get').toString() // https://httpbin.org/get
RelativeURL.createObjectURL(new Blob()) 

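// More examples (Node.js):
// process.cwd() returns a filesystem path, not a URL, so convert it first.
// Assumes: import { pathToFileURL } from 'node:url'
// The trailing '/' makes the working directory itself the base directory.

const LocalURL = URLfrom(pathToFileURL(process.cwd() + '/').href)

new LocalURL('./readme.md').toString() // file:///Users/username/Projects/relative-url/readme.md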

or if you just want to have something minimalistic

const url = (url, base) => new URL(url, new URL(base, import.meta.url))
url('./readme.md').toString() // file:///Users/username/Projects/relative-url/readme.md

@alwinb
Contributor

alwinb commented Nov 17, 2023

I am entirely preoccupied with other things, have been for more than a year. But this issue is on my stack and I do intend to finish my work.

There are no hard technical problems left, only some superficial design decisions and the question of where, how and what to publish.

I am still very angry.
