Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid URIs are not detected. #161

Closed
gotascii opened this issue Jun 10, 2014 · 10 comments
Closed

Invalid URIs are not detected. #161

gotascii opened this issue Jun 10, 2014 · 10 comments
Labels

Comments

@gotascii
Copy link

I was comparing the behavior of URI.parse and Addressable::URI.parse and noticed that URI.parse will strip leading and trailing whitespace from the uri. Addressable::URI did not appear to strip any space and, upon further inspection, seems to allow any manner of invalid uri to be provided:

Addressable::URI.parse("\c$/BADCATZ ").to_s
Addressable::URI.parse(" http: ").to_s
Addressable::URI.parse("**4 \c\a\t ~").to_s

I am probably doing something wrong but URI.parse detects all of the examples above as invalid, which is what I would expect. Is this expected behavior from Addressable::URI.parse?

@bitboxer
Copy link

This is what I am wondering, too. I stumbled upon this because validates_url is using the addressable gem to find out if an url is correct and things like http://hello this is not an url are totally valid for addressable.

@geku
Copy link

geku commented Feb 26, 2015

From my point of view such URLs should be invalid. According to RFC3986:

A host identified by a registered name is a sequence of characters
usually intended for lookup within a locally defined host or service
name registry, though the URI's scheme-specific semantics may require
that a specific registry (or fixed name table) be used instead. The
most common name registry mechanism is the Domain Name System (DNS).
A registered name intended for lookup in the DNS uses the syntax
defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123].
Such a name consists of a sequence of domain labels separated by ".",
each domain label starting and ending with an alphanumeric character
and possibly also containing "-" characters. The rightmost domain
label of a fully qualified domain name in DNS may be followed by a
single "." and should be if it is necessary to distinguish between
the complete domain name and some local domain.

reg-name = *( unreserved / pct-encoded / sub-delims )

whereas

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

Hence, whitespace is not an allowed character at least for the host.

@sporkmonger
Copy link
Owner

Addressable, for the most part, takes a very liberal view towards validating URLs. It was originally written because the Ruby standard library's URI implementation took the opposite approach and raised exceptions for things that were either weird, but actually valid URIs or invalid, but close enough that just parsing the URI without complaint would have resulted in being able to do something useful instead of being forced to just handle the exception and fail. For example, in older versions of Ruby, trying to parse 'http://user_name.example.com' would result in an invalid URI error. The '_' character is explicitly part of the unreserved character class, and should be allowed in a hostname. However, it's invalid according to DNS, and so the Ruby URI implementation rejected it as an invalid URI. But you can open science_boy.blogspot.com in your browser just fine. The meant, for example, that you could read that blog in your browser, but you couldn't parse the RSS feed with Ruby.

So this kind of thing tends to inform my philosophy on when to raise exceptions and when not to – namely that I raise an exception only when a URI is invalid according to RFC 3986 and I ignore what related specifications have to say on the matter. And even then I tend to err on the side of not raising exceptions to allow those following Postel's Law to do something useful.

In this case however, I completely agree on your reading of what should and shouldn't be allowed in the host component and I do think it would be reasonable to raise an exception for whitespace in the hostname, so long as the heuristic_parse method did not.

@ptoomey3
Copy link
Contributor

Just to add some additional context for this issue. Even if you decide that throwing an exception is not the correct approach, it would be nice if Addressable::URI was at least was consistent with how most (all?) browsers act with whitespace. A core use case of parsing URLs is to ensure they are well formed and can be parsed into something expected. This can backfire if the parser doesn't work the way most browsers do. Here is a quick example I just ran into:

uri = Addressable::URI.parse(" //google.com/foo/bar")
=> #<Addressable::URI:0x884d1e70 URI: //google.com/foo/bar>
irb(main):002:0> uri.relative?
=> true
irb(main):003:0> uri.host
=> nil

# vs. 

``` ruby
uri = Addressable::URI.parse("//google.com/foo/bar")
=> #<Addressable::URI:0x884ba9dc URI://google.com/foo/bar>
irb(main):005:0> uri.relative?
=> true
irb(main):006:0> uri.host
=> "google.com"

Browsers seem to treat both situations the same (they ignore leading/trailing whitespace). So, in order to reliably model how a browser will act I need to perform a url.strip before passing a URL to Addressable::URI#parse. That feels relatively unexpected and puts a fair degree of URL nuance on the caller.

@sporkmonger
Copy link
Owner

sporkmonger commented Jun 15, 2016

Historically, I've advocated for developers using heuristic_parse rather than parse when they want browser-like parsing. That's what that method is for. Since this issue comes up a lot I think maybe the core problem is documentation rather than implementation.

@dvogel
Copy link

dvogel commented Jul 28, 2016

I definitely want an indication of error when the host part violates RFC3986 because I'm trying to provide feedback to the user that the URL they entered was invalid. Given that we have #heuristic_parse to handle situations where you want to avoid failure for "fixable" URLs, I think it's reasonable to raise InvalidURIError (or possibly a new sub-class of that InvalidURIHostError) if the hostname doesn't conform to RFC3986.

@vamsi-krishna-E0029
Copy link

Even heuristic_parse fails in some cases

uri = Addressable::URI.heuristic_parse("google.")
=> #<Addressable::URI:0x884d1e70 URI: http://google.>
irb(main):002:0> uri.host
=> "google."

The "google." is not a valid host.

I think the host validation should be changed here. I do a heuristic_parse and again do the validation with a regex as of now.

HOST_REGEX = /^(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])$/

I think we can use this regex in host= method so that it gives an InvalidURIError

@zastruga
Copy link

zastruga commented Sep 8, 2016

An even simpler case I ran into that fails with both Addressable::URI.heuristic_parse(" http://google.com") as well as Addressable::URI.parse(" http://google.com"), with a leading space. As above, I "fixed" this in my code by just stripping the URL that is passed to Addressable::URI, but it feels like this is a basic thing that should at least be handled by heuristic_parse.

@sporkmonger
Copy link
Owner

@ovamsikrishna What are you expecting to have happen there? Browsers don't resolve google. either, but that is a valid hostname. In fact, google.com. and google.com are equivalent. Similarly, localhost. and localhost. Addressable is 100% doing the right thing there.

@sporkmonger
Copy link
Owner

sporkmonger commented Nov 3, 2016

@zastruga I just put in 8d040f6 to strip whitespace before parsing in heuristic_parse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants