
Grammar specification for URLs #24

Closed
EricEdens opened this issue May 27, 2015 · 24 comments


@EricEdens

Any thoughts on providing a formal grammar? The current spec defines URLs in terms of algorithms - while that's helpful to quickly write an implementation in an arbitrary language, it makes it tough to write code generators that will update parsers as the spec evolves.

@EricEdens
Author

To clarify, this isn't a nagging feature request that's asking you to do more work -- I'm happy to help write a grammar :)

More specifically, I was surprised to not see a grammar, and I'm wondering whether that was an intentional omission, or whether it's just something that's low priority.

@domenic
Member

domenic commented Jun 5, 2015

I don't remember the details exactly but I believe there are several large parts of the spec that cannot be captured by any kind of grammar.

This spec https://specs.webplatform.org/url/webspecs/develop/#parsing-rules attempts to codify URLs as a grammar (via railroad diagrams), but almost every grammar production also needs some Turing-complete post-processing steps: every step that doesn't say "Parse input according to the above railroad diagram" is outside the grammar.
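To make that concrete, a minimal sketch (mine, assuming Node's global WHATWG-compliant URL class): dot-segment removal is one such post-processing step, rewriting the path after the productions have already matched it.

```ts
// Dot-segment removal happens after the path is recognized; no context-free
// production by itself expresses this rewriting.
console.log(new URL("https://example.com/a/../b").pathname); // "/b"
```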

@rubys
Member

rubys commented Jun 5, 2015

Also note the 'note' at the end of the section that @domenic pointed to.

Note: extracting the railroad diagrams from this specification and interpreting them in isolation will produce incorrect results. In particular, the definitions provided in the prose modify these parsing algorithms in important ways, including returning failure and early termination.

@annevk
Member

annevk commented Jun 9, 2015

@EricEdens here are some reasons to avoid a grammar:

  1. Encourages independently written code. This catches bugs and leads to a better overall ecosystem. (Note how many grammar-based RFCs have not resulted in many compliant implementations.)
  2. It cannot actually be done.

Feel free to suggest a grammar, but I don't think we'll ever have one that's considered normative.

@EricEdens
Author

@annevk, a problem that I've seen with other specs is that they provide a single grammar. If the grammar is strict enough to produce only valid URLs, then it won't accept real-world URLs. If it accepts real-world URLs, then it will also produce non-compliant URLs.

A better approach would be to have two grammars. One that generates compliant URLs, and one that accepts real-world URLs.

A good example is '\' vs '/'. The strict grammar would only allow '/'; the looser grammar would allow either. A parser that used the loose grammar would be responsible for converting '\' to '/' after the parse tree has been created.
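As a quick illustration of the lenient side (a sketch of mine, assuming Node's global WHATWG-compliant URL class, which implements exactly this fixup for special schemes):

```ts
// Each "\\" in the source literal is one backslash in the parsed input.
// The lenient parser accepts the backslashes and normalizes them to "/".
const lenient = new URL("https:\\\\example.com\\a\\b");
console.log(lenient.href); // "https://example.com/a/b"
```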

  2. It cannot actually be done.

Sounds like a challenge to me :)

@domenic
Member

domenic commented Jun 10, 2015

Unless you come up with a Turing-complete grammar specification language, no matter how many grammars you have (two, five, whatever), it will not be enough. And we usually call such languages "programming languages".

@rubys
Member

rubys commented Jun 10, 2015

@EricEdens: feel free to use https://github.com/webspecs/url/blob/develop/url.pegjs as a starting point.

@annevk
Member

annevk commented Jun 11, 2015

@EricEdens actually, after parsing I think most URLs will be "valid". Do you have some examples?

@EricEdens
Author

@annevk, I haven't seen any such examples in the WHATWG spec. My previous comment was brainstorming on why previous grammar-based RFCs have led to non-compliant implementations. If the RFC supplies a single grammar, then either the grammar will fail to accept all real-world URLs, or it will accept real-world URLs but also generate those same warty URLs that are outside the recommended syntax.

In any case, I think we can close this issue since you guys have answered my original question :) I'm really liking the language productions in https://specs.webplatform.org/url/webspecs/develop/#parsing-rules. I'll try writing an implementation against that!

@rubys
Member

rubys commented Jun 11, 2015

I'm really liking the language productions in https://specs.webplatform.org/url/webspecs/develop/#parsing-rules. I'll try writing an implementation against that!

FYI, there already is one:

https://github.com/webspecs/url/tree/develop/reference-implementation#readme

@EricEdens
Author

@rubys, I saw that! I was thinking of writing one in Java. I know there's already galimatias, so this would be mostly for fun. Also, I can file feedback while working through the spec.

@rubys
Member

rubys commented Jun 11, 2015

@EricEdens: have you considered submitting patches to galimatias? I've looked at a lot of implementations, and I consider that one particularly well written. It isn't completely current with respect to the spec. Also, while it has support for a robust error handling strategy, it doesn't currently make as much use of it as it should. By that I mean that while it will parse most invalid URLs correctly, it will often do so silently.

@EricEdens
Author

@rubys - I'll look into submitting patches to galimatias. That would probably have a greater benefit to the community.

What version of the spec do you recommend following? I've seen https://specs.webplatform.org/url/webspecs/develop/ and https://url.spec.whatwg.org/.

@rubys
Member

rubys commented Jun 12, 2015

What version of the spec do you recommend following?

TL;DR: https://url.spec.whatwg.org/

Longer answer: https://github.com/webspecs/url#the-url-standard

The short version of the longer answer is that if/when the spec at webspecs gets enough attention, it could replace the version at the whatwg. So again, following what is at the whatwg is the right answer.

@masinter

@domenic For new specifications of new languages with draconian error handling, a design can try to define a single syntax with a single grammar as a means of reducing complexity. But for something as widely deployed as URLs, grammars won't suffice to describe how parsers should be written so that they accept what producers will send.

But it still might be possible to write a grammar describing a safe subset that future producers can follow such that the URLs produced will be parsed correctly by most deployed consumers. A parser couldn't use such a grammar but it would guide future URL producers such as forms. Right?

@sjamaan

sjamaan commented Sep 11, 2018

I would like to offer my opinion from an implementor's perspective and hopefully convince the WG to restore a formal grammar. Let me start by providing some background on where I'm coming from. Feel free to skip this next section.

My background

I am the co-maintainer of the uri-generic egg for CHICKEN Scheme. This implementation attempts to follow RFC 3986 to the letter, and this has resulted in what is, IMO, a very high-quality implementation (at least as far as parsing is concerned; URL construction still has some known issues). Often, when we ran into issues, we compared it with other implementations. It turns out that many of these are lacking in some way or another. I think the main reason is that they're not really attempting to implement the formal grammar (even if they claim to be RFC compliant), while we do. We even have a growing repository of alternative implementations using different parser generators, all of which pass the same test suite! (Feel free to now call me a smug Lisp/Scheme weenie 😃)

I wasn't aware of the WHATWG spec until I saw it mentioned in a libcurl post.
It piqued my interest because I'm always looking for more test cases. The web platform test suite looks like a big, juicy set to start using in our egg's tests. I'd also consider implementing the WHATWG spec if this increases compatibility with other implementations.

What I expect from a spec

As an implementor, I routinely check the RFC's ABNF as a guide to determine what a valid URL should look like. If someone finds a certain URL our implementation doesn't parse, or if it parses a URL that it shouldn't, the first thing I do is go back to the ABNF in the RFC to verify the behaviour. It is compact and to the point, and, for a trained eye, it is trivial to quickly determine whether a parser should accept a given (sub)string.
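For reference (my addition, quoting a well-known piece of the RFC), RFC 3986 Appendix B even condenses the component structure into a single regular expression. Note that it only splits a string into components; validating each component against the ABNF is a separate step.

```ts
// The component-splitting regex from RFC 3986, Appendix B.
const rfc3986 = /^(([^:/?#]+):)?(\/\/([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;
const m = "https://example.com/a?b#c".match(rfc3986)!;
console.log(m[2], m[4], m[5], m[7], m[9]);
// "https" "example.com" "/a" "b" "c"
```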

The collected ABNF of RFC 3986 is a brief three screenfuls. In contrast, the algorithm in the WHATWG spec is roughly eighteen screenfuls. It is an overly detailed and nonstandard way of defining a grammar. This makes it harder to determine which language is accepted by this algorithm. It also makes it hard for me to determine what the changes are compared to the RFC. Implementing the WHATWG spec would (for me) involve a complete rewrite.

The specification is so focused on the mechanics of a specific manual parsing technique that it almost precludes parser generators or other implementation strategies. Parser generators have a long tradition in theory and practice, and can generate efficient language recognisers. Even today, this is an active research field; PEG grammars, for example, were "discovered" as recently as 2004.

The way I think about it is that the purpose of this spec is to define what a URL "officially" looks like. So, as an implementor, I don't understand the hesitation to supply a formal grammar. Not having one will likely result in different people interpreting the spec differently. This results in less interoperability, which defeats the point of a spec.

Other reasons why I think a formal grammar is important

Finally, I would like to emphasise the importance of parsers based on formal grammars over ad hoc ones for security reasons. Let's say you have a pipeline of multiple processors which use different URL parsers. For example, you might have an HTML parser on a comment form which cleans URLs by dropping JavaScript and data URLs, among other things, or a mail client which blocks intranet or file-system-local URLs before invoking an HTML viewer. If these are all ad hoc "informal" parsers that try to "fix" syntactically invalid URLs, it is nigh-impossible to verify that filtering them for "safe" URLs is correct. That's because it's impossible to decide which language is really accepted by an ad hoc implementation. An implementation further down the stack might interpret a URL (radically) differently from one further up the stack, and you have a nice little exploit in the making.
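A deliberately naive, hypothetical sketch of such a differential (the filter function below is illustrative, not taken from any real codebase): a string-splitting "sanitizer" and a WHATWG-compliant consumer disagree about where the host is, so the filter is bypassed.

```ts
// Hypothetical sanitizer that assumes forward slashes only.
function naiveHostOf(raw: string): string {
  return raw.split("/")[2] ?? ""; // assumes "scheme://host/..."
}

const raw = "https:\\\\evil.example\\x"; // one backslash per "\\"
console.log(naiveHostOf(raw));      // ""             -- the filter sees no host
console.log(new URL(raw).hostname); // "evil.example" -- the consumer does
```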

If you're not convinced by my measly attempts at explaining this idea, please watch the talk "The Science of Insecurity". Meredith Patterson states the case much more eloquently than I ever could. This talk was an absolute eye-opener for me.

With this context, it baffled me to read the statement that "there are several large parts of the spec that cannot be captured by any kind of grammar". This is literally equivalent to saying "we can't know if a URL will be valid without evaluating the algorithm". This means you cheerfully drag the halting problem into what should be a simple, straightforward notation (come on, URLs aren't that ill-defined!). As far as I can tell, the RFC defines a regular grammar. The decision to go from a regular to an unrestricted grammar should not be taken lightly!

@TimothyGu
Member

@sjamaan

If an overview of what a URL looks like is desired, https://url.spec.whatwg.org/#url-writing should be more than enough. With the exception of the definition of a valid domain, it could possibly even be translated into a formal grammar for valid URLs! (Though I will admit I haven't tried doing so.) It would certainly make comparing with the RFC much easier.

The main reason why this spec exists is the fact that it defines error handling rigorously. This is precisely what a formal grammar that defines what a valid URL is cannot do. Web content is unfortunately overwhelmingly erroneous, and part of the mission of the WHATWG is to make the web interoperable, which includes handling errors in a uniform fashion. For URL parsers outside of a browser, failing on an invalid URL may be an option. For web browsers, however, it mostly is not.

With regard to the security concerns, a solution would be for everyone to adopt this standard's error handling behavior :D Indeed, we are seeing more and more adoption of this standard in the standard libraries of languages and runtimes, such as Rust's url crate and Node.js's url module.


I hope this has answered your question about:

it baffled me to read the statement that "there are several large parts of the spec that cannot be captured by any kind of grammar". This is literally equivalent to saying "we can't know if a URL will be valid without evaluating the algorithm".

Unlike the RFC, we intend to fully define the behavior when an invalid URL is encountered. This leads to a well-defined difference between "valid" and "parsable": the former is generally easy to tell, the latter unfortunately possibly not so much.
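To give that distinction a concrete shape (my sketch, assuming Node's global WHATWG URL class):

```ts
// Parsable but invalid: a "%" not followed by two hex digits is only a
// validation error, so the parse still succeeds.
console.log(new URL("https://example.com/%").pathname); // "/%"

// Unparsable: a space is a forbidden host code point, so the parser
// returns failure, surfaced as a TypeError.
try {
  new URL("https://exa mple.com/");
} catch (e) {
  console.log("failure:", (e as Error).message);
}
```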

@GPHemsley
Member

GPHemsley commented Sep 12, 2018

I concur with @TimothyGu's response.

I also wonder whether it's truly impossible and/or unhelpful to take the "basic URL parser", throw out everything that causes a validation error or returns failure, and create a formal grammar from that. The spec is written from the point of view of attempting to avoid being hindered by recoverable issues, for the sake of interoperability on the Web—but just because you can parse something into a URL doesn't mean you want to encourage any old thing to be considered a URL.

I think a case could be made that a formal grammar from which a finite-state acceptor can be generated would be useful to certain audiences. ("Hey you! Person who entered this supposed URL! I won't let you continue unless you do it right.")

If nothing else, it would help those who might endeavor to write a regular expression to detect URLs and auto-link them, for example. You're not about to go implement the URL parser and run it on any arbitrary text in order to do that.
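For instance, something like this hypothetical auto-linker, whose pattern is a rough approximation of a valid absolute http(s) URL rather than anything derived from the spec:

```ts
// Conservative, illustrative pattern: it rejects odd input instead of
// repairing it, which is all an auto-linker needs.
const linkable = /\bhttps?:\/\/[a-z0-9.-]+(?::\d+)?(?:\/[^\s<>"]*)?/gi;

function autoLink(text: string): string {
  return text.replace(linkable, (url) => `<a href="${url}">${url}</a>`);
}

console.log(autoLink("see https://url.spec.whatwg.org/ for details"));
// see <a href="https://url.spec.whatwg.org/">https://url.spec.whatwg.org/</a>
```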

@sjamaan

sjamaan commented Sep 12, 2018

@sjamaan

If an overview of what a URL looks like is desired, https://url.spec.whatwg.org/#url-writing should be more than enough. With the exception of the definition of a valid domain, it could possibly even be translated into a formal grammar for valid URLs!

That sounds interesting, I might give that a try!

The main reason why this spec exists is the fact that it defines error handling rigorously. This is precisely what a formal grammar that defines what a valid URL is cannot do.

I see there are two error states: "return failure" and "validation error". I think "return failure" is not specific enough to indicate what the failure is, so this can be handled simply by rejecting the language. "Validation error" could simply be handled by a production that's annotated with "validation error" (like a semantic action in a parser generator). I also see "validation error, return failure", which to me seems a bit inconsistent/unnecessary.
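Sketched in code rather than BNF (hypothetical; the names and the two example checks are mine, not the spec's), the two error channels could look like this:

```ts
// "validation error" (recoverable, recorded) vs. "return failure" (fatal).
type ParseResult =
  | { ok: true; url: URL; validationErrors: string[] }
  | { ok: false };

function parseWithWarnings(input: string): ParseResult {
  const validationErrors: string[] = [];
  if (input.includes("\\")) validationErrors.push("backslash used as slash");
  if (/%(?![0-9a-fA-F]{2})/.test(input)) validationErrors.push("bare '%'");
  try {
    return { ok: true, url: new URL(input), validationErrors };
  } catch {
    return { ok: false }; // "return failure" in spec terms
  }
}
```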

The only problem I see so far is the encoding override stuff, but apparently that's considered legacy anyway, so maybe it can be discussed separately somehow.

Web content is unfortunately overwhelmingly erroneous, and part of the mission of the WHATWG is to make the web interoperable, which includes handling errors in a uniform fashion. For URL parsers outside of a browser, failing on an invalid URL may be an option. For web browsers, however, it mostly is not.

See above. And even if adding error states to BNF is deemed impossible, this could be handled by a piece of prose. This doesn't reduce the usefulness of a formal grammar, which describes what syntax is acceptable and lets every implementation verify that it correctly accepts valid URLs. If a URL is not accepted, one would want to check whether the spec intended it to be invalid. Just eyeballing a BNF for this takes a couple of seconds, minutes at most. Mentally running the algorithm would also work, but it's more involved, as you need to keep track of state and so on.

With regard to the security concerns, a solution would be for everyone to adopt this standard's error handling behavior :D Indeed, we are seeing more and more adoption of this standard in the standard libraries of languages and runtimes, such as Rust's url crate and Node.js's url module.

I'll study that documentation, this sounds interesting and a useful addition to uri-generic too.

I hope this has answered your question about:

it baffled me to read the statement that "there are several large parts of the spec that cannot be captured by any kind of grammar". This is literally equivalent to saying "we can't know if a URL will be valid without evaluating the algorithm".

Unlike the RFC, we intend to fully define the behavior when an invalid URL is encountered. This leads to a well-defined difference between "valid" and "parsable": the former is generally easy to tell, the latter unfortunately possibly not so much.

Parser generators are notoriously bad at error handling, but that's mostly because you'd need to extend the grammar with explicit failure states. I think this would actually be a place where the WHATWG spec could add value: by providing an officially sanctioned BNF augmented with error states!

Writing down such a BNF would be a bit more involved, but might still be worthwhile.

Thank you for taking the time to write out a detailed response. I'll think this through a bit more and hope to come up with a grammar soonish. Would that be doable via pull request?

@annevk
Member

annevk commented Sep 12, 2018

@sjamaan I think we should have a separate discussion about such a grammar, preferably in a new issue dedicated to it. I could see us including it in the specification as a non-normative guide, but I'm less sure about making it normative, especially if it doesn't handle all the edge cases, such as encodings.

There has been an attempt in the past to describe the formal processing model through railroad diagrams, but it ultimately wasn't successful: http://intertwingly.net/blog/2014/10/21/pegurl-js#c1414151932.

@sjamaan

sjamaan commented Sep 12, 2018

@annevk I can't really make out why the previous approach failed. If I am to put time into this, I want to make sure I avoid the pitfalls of earlier attempts, so it isn't a waste of time.

@annevk
Member

annevk commented Sep 12, 2018

I think the main reason was that only one person really cared about that approach and most implementers were fine with an imperative definition. It also was incomplete, harder to read for some, and didn't match the existing requirements in all cases. Another oft-overlooked issue with declarative approaches is that the language used to describe the feature also needs a complete definition from first principles.

For a non-normative supplement some of that can be overlooked (though I try hard to make non-normative text fully accurate as well), but if your goal is to replace the existing parser specification, just realize that the bar is (justifiably) high.

@sjamaan

sjamaan commented Sep 12, 2018

The language used to describe the feature also needs a complete definition from first principles.

Could you unpack that a bit? Does it mean I need to explain BNF? Is merely linking to RFC 5234 / STD 68 not enough (assuming ABNF suffices here)?

@annevk
Member

annevk commented Sep 12, 2018

If ABNF were to suffice, then that would be sufficient. (Though one thing that's often unclear to me with ABNF is how the parsed construct is translated into an object.)
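A hypothetical sketch of that gap (the record shape and mapping below are illustrative, loosely modeled on the spec's URL record, using the RFC 3986 Appendix B regex as the "grammar"):

```ts
// Every rule in this mapping (lowercasing, path splitting) lives outside
// the grammar itself; a grammar alone only says that the string matched.
interface UrlRecord {
  scheme: string;
  host: string | null;
  path: string[];
  query: string | null;
  fragment: string | null;
}

function toRecord(input: string): UrlRecord {
  const m = input.match(
    /^(([^:/?#]+):)?(\/\/([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/,
  )!;
  return {
    scheme: (m[2] ?? "").toLowerCase(),
    host: m[4] ?? null,
    path: (m[5] ?? "").split("/").filter(Boolean),
    query: m[7] ?? null,
    fragment: m[9] ?? null,
  };
}
```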
