Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parse() does not handle custom protocol with root slash, like whatever:///whatever #11

Closed
felixfbecker opened this issue Nov 20, 2016 · 14 comments
Labels

Comments

@felixfbecker
Copy link

felixfbecker commented Nov 20, 2016

As this package is called sabre/uri, and not sabre/url, I expected it to handle any valid URI. For example, it should handle any URI scheme. However, that is not the case, because it uses parse_url internally. The docs clearly say that

This function is intended specifically for the purpose of parsing URLs and not URIs. However, to comply with PHP's backwards compatibility requirements it makes an exception for the file:// scheme where triple slashes (file:///...) are allowed. For any other scheme this is invalid.

http:// and file:// work, any other URI results in an Errror: Unsupported operand types.

Examples:

Uri\parse('ftp:///whatever')
Uri\parse('ssh://git@github.com/fruux/sabre-uri.git')

This really limits the usefulness of this function.

@evert
Copy link
Member

evert commented Nov 21, 2016

Hi!

My goal is definitely to support as many schemes as possible. I was under the impression that parse_url worked fine for a bunch of different schemes. At the very least mailto: works.

I question the validity of both those urls. As far as I can tell a triple-slash in ftp uri's is not allowed. The ssh uri is not officially registered anywhere, and I also question if it's correct. The colon in particular is weird. The only draft spec for the ssh uri I could find actually uses the format ssh://host:port/path. A path after a colon is weird and would be different than any other uri.

@felixfbecker
Copy link
Author

felixfbecker commented Nov 21, 2016

There is a difference between URL and URI:

A URI can be further classified as a locator, a name, or both. The
term "Uniform Resource Locator" (URL) refers to the subset of URIs
that, in addition to identifying a resource, provide a means of
locating the resource by describing its primary access mechanism
(e.g., its network "location"). The term "Uniform Resource Name"
(URN) has been used historically to refer to both URIs under the
"urn" scheme [RFC2141], which are required to remain globally unique
and persistent even when the resource ceases to exist or becomes
unavailable, and to any other URI with the properties of a name.

An individual scheme does not have to be classified as being just one
of "name" or "locator". Instances of URIs from any given scheme may
have the characteristics of names or locators or both, often
depending on the persistence and care in the assignment of
identifiers by the naming authority, rather than on any quality of
the scheme. Future specifications and related documentation should
use the general term "URI" rather than the more restrictive terms
"URL" and "URN" [RFC3305].

A valid URI is anything that follows has this form:


      URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

      hier-part   = "//" authority path-abempty
                  / path-absolute
                  / path-rootless
                  / path-empty

There is no constraint on what schemes are allowed. What I gave were just examples, they don't have to be syntactically correct according to the scheme, but they are syntactically correct according to the definition of URIs. For example, I could invent my own URI protocol:

felixfbecker://yepitsawesome?isitreally#yesitis

and that would be a valid URI, but parse_url, and it turn Sabre\Uri\parse will fail to parse it.

You might think this is not a big use case but my language server gets URIs from the outside and the only thing I know is that they are valid URIs, not the protocol. VS Code for example sends new files with a URI like inmemory://1. Other clients have a virtual file system and send vfs:///somefile. I need a way to parse these, without imposing any constraints on the scheme - I just want to split it up into components.

@evert
Copy link
Member

evert commented Nov 21, 2016

That last example you gave actually parses just fine again, because it's valid.

Can you give an example of a URI that's valid and does not parse?

@felixfbecker
Copy link
Author

felixfbecker commented Nov 21, 2016

\Sabre\Uri\parse('whatever:///whatever');

Seems to be related to the root slash - which works fine for file:// URIs, but not anything else

@evert
Copy link
Member

evert commented Nov 21, 2016

I had to do a bunch of digging but eventually found this:

If the URI scheme defines a default for host, then that default applies when the host subcomponent is undefined or when the registered name is empty (zero length). For example, the "file" URI scheme is defined so that no authority, an empty host, and "localhost" all mean the end-user's machine, whereas the "http" scheme considers a missing authority or empty host invalid.

So the triple slash issue is a legit one. I wouldn't go as far as saying that the library only parsers URLs and not URIs though, but rather that it can't parse a URI with an empty hostname if it's not a file: scheme.

@felixfbecker
Copy link
Author

Yep, sorry. I just saw that comment on parse_url which states that it only parses URLs and file:// is only there for legacy reasons, so I too quickly assumed that no custom protocol is supported.

@felixfbecker felixfbecker changed the title URI parser cannot handle URIs.... parse() does not handle custom protocol with root slash, like whatever:///whatever Nov 21, 2016
@felixfbecker
Copy link
Author

For my use case it would be enough if it doesn't parse, but at least doesn't generate an Error - I think it is a PHP4-style error, so we could silence it with @.

@evert evert added the bug label Nov 24, 2016
@evert
Copy link
Member

evert commented Nov 24, 2016

Marking this as a bug, but I want to say that given your situation Sabre\Uri simply not be the best choice. One reason for this is that this:

inmemory://1

Could technically pass as a valid uri.. maybe? I'm not even sure. 1 here will be interpreted as the host name. If I understand this URI correctly, this really should have been something like this:

inmemory:1

But ideally even this:

urn:x-vendorname:inmemory:1

If you run into URIs of the first kind, and you don't expect just any uri, the thing you are actually probably interested in is:

  1. The scheme
  2. Everything that's not the scheme

In that case you might simply be better off doing a split on :. It just doesn't feel like these are real uris, but rather just something that's made to look like one.

Just a thought! I still want to fix this though, even though it's a bit depressing because the solution to this means a custom parser which is gonna be slower for the 99.99% use-case.

@felixfbecker
Copy link
Author

@evert I agree with you - I think a URN would be a better fit for this. But I get these from the client and can't control them. What I am actually interested in is only the path, I need to match it against a few conditions. It's okay if the condition doesn't match because path is actually null because the URI is one of the weird inmemory://n URIs, but getting a PHP4-style error (that gets turned into an exception) should be avoided at all cost. Splitting by : will not be sufficient, because other URIs I get (those that are not the weird inmemory scheme) may have a query and/or hash fragment (I think they could also be relative). So I would like to rely on the parse() function to give me just the path fragment, if there is one. If not, or the URI is weird, just set all the values to null like it is done already, but please avoid triggering an error. Would that be a compromise you can live with? Using parse_url, but putting an @ in front of it?

@evert
Copy link
Member

evert commented Nov 24, 2016

I would ideally like the API to parse any valid URI you throw at it, but always throw an exception for those that aren't.

What's stopping you from catching the exception and handling it that way?

@felixfbecker
Copy link
Author

Seems a bit weird to use @ on such a high-level third-party function, but I could do it that way of course

@evert
Copy link
Member

evert commented Nov 24, 2016

What about a try..catch block? Don't use @.

@felixfbecker
Copy link
Author

It's a PHP4-style error iirc. You can only catch it if you have an error handler registered, and that is an assumption I don't want to make. Maybe Sabre could @ and then throw an error depending on the return and error_get_last?

@evert
Copy link
Member

evert commented Nov 24, 2016

I see, yes that would be perfect! Totally agree with that idea

evert added a commit that referenced this issue Dec 6, 2016
This is a new implementation of PHP's parse_url. It's likely not
perfect, but it catches specific cases that the built-in function does
not support.

Fixes Issue #9
Fixes Issue #11

@felixfbecker / @sii / @PVince81 Does this sufficiently solve your
problem?
@evert evert closed this as completed Dec 7, 2016
evert added a commit that referenced this issue Dec 7, 2016
This is a new implementation of PHP's parse_url. It's likely not
perfect, but it catches specific cases that the built-in function does
not support.

Fixes Issue #9
Fixes Issue #11

@felixfbecker / @sii / @PVince81 Does this sufficiently solve your
problem?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants