Invalid URIs are not detected. #161

gotascii · 2014-06-10T00:09:56Z

I was comparing the behavior of URI.parse and Addressable::URI.parse and noticed that URI.parse will strip leading and trailing whitespace from the uri. Addressable::URI did not appear to strip any space and, upon further inspection, seems to allow any manner of invalid uri to be provided:

Addressable::URI.parse("\c$/BADCATZ ").to_s
Addressable::URI.parse(" http: ").to_s
Addressable::URI.parse("**4 \c\a\t ~").to_s

I am probably doing something wrong but URI.parse detects all of the examples above as invalid, which is what I would expect. Is this expected behavior from Addressable::URI.parse?

The text was updated successfully, but these errors were encountered:

bitboxer · 2014-12-19T11:11:18Z

This is what I am wondering, too. I stumbled upon this because validates_url is using the addressable gem to find out if an url is correct and things like http://hello this is not an url are totally valid for addressable.

geku · 2015-02-26T13:56:33Z

From my point of view such URLs should be invalid. According to RFC3986:

A host identified by a registered name is a sequence of characters
usually intended for lookup within a locally defined host or service
name registry, though the URI's scheme-specific semantics may require
that a specific registry (or fixed name table) be used instead. The
most common name registry mechanism is the Domain Name System (DNS).
A registered name intended for lookup in the DNS uses the syntax
defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123].
Such a name consists of a sequence of domain labels separated by ".",
each domain label starting and ending with an alphanumeric character
and possibly also containing "-" characters. The rightmost domain
label of a fully qualified domain name in DNS may be followed by a
single "." and should be if it is necessary to distinguish between
the complete domain name and some local domain.

reg-name = *( unreserved / pct-encoded / sub-delims )

whereas

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

Hence, whitespace is not an allowed character at least for the host.

sporkmonger · 2015-03-27T16:34:23Z

Addressable, for the most part, takes a very liberal view towards validating URLs. It was originally written because the Ruby standard library's URI implementation took the opposite approach and raised exceptions for things that were either weird, but actually valid URIs or invalid, but close enough that just parsing the URI without complaint would have resulted in being able to do something useful instead of being forced to just handle the exception and fail. For example, in older versions of Ruby, trying to parse 'http://user_name.example.com' would result in an invalid URI error. The '_' character is explicitly part of the unreserved character class, and should be allowed in a hostname. However, it's invalid according to DNS, and so the Ruby URI implementation rejected it as an invalid URI. But you can open science_boy.blogspot.com in your browser just fine. The meant, for example, that you could read that blog in your browser, but you couldn't parse the RSS feed with Ruby.

So this kind of thing tends to inform my philosophy on when to raise exceptions and when not to – namely that I raise an exception only when a URI is invalid according to RFC 3986 and I ignore what related specifications have to say on the matter. And even then I tend to err on the side of not raising exceptions to allow those following Postel's Law to do something useful.

In this case however, I completely agree on your reading of what should and shouldn't be allowed in the host component and I do think it would be reasonable to raise an exception for whitespace in the hostname, so long as the heuristic_parse method did not.

ptoomey3 · 2015-09-30T19:45:56Z

Just to add some additional context for this issue. Even if you decide that throwing an exception is not the correct approach, it would be nice if Addressable::URI was at least was consistent with how most (all?) browsers act with whitespace. A core use case of parsing URLs is to ensure they are well formed and can be parsed into something expected. This can backfire if the parser doesn't work the way most browsers do. Here is a quick example I just ran into:

uri = Addressable::URI.parse(" //google.com/foo/bar")
=> #<Addressable::URI:0x884d1e70 URI: //google.com/foo/bar>
irb(main):002:0> uri.relative?
=> true
irb(main):003:0> uri.host
=> nil

# vs. 

``` ruby
uri = Addressable::URI.parse("//google.com/foo/bar")
=> #<Addressable::URI:0x884ba9dc URI://google.com/foo/bar>
irb(main):005:0> uri.relative?
=> true
irb(main):006:0> uri.host
=> "google.com"

Browsers seem to treat both situations the same (they ignore leading/trailing whitespace). So, in order to reliably model how a browser will act I need to perform a url.strip before passing a URL to Addressable::URI#parse. That feels relatively unexpected and puts a fair degree of URL nuance on the caller.

sporkmonger · 2016-06-15T21:58:52Z

Historically, I've advocated for developers using heuristic_parse rather than parse when they want browser-like parsing. That's what that method is for. Since this issue comes up a lot I think maybe the core problem is documentation rather than implementation.

dvogel · 2016-07-28T17:21:26Z

I definitely want an indication of error when the host part violates RFC3986 because I'm trying to provide feedback to the user that the URL they entered was invalid. Given that we have #heuristic_parse to handle situations where you want to avoid failure for "fixable" URLs, I think it's reasonable to raise InvalidURIError (or possibly a new sub-class of that InvalidURIHostError) if the hostname doesn't conform to RFC3986.

vamsi-krishna-E0029 · 2016-08-24T06:01:22Z

Even heuristic_parse fails in some cases

uri = Addressable::URI.heuristic_parse("google.")
=> #<Addressable::URI:0x884d1e70 URI: http://google.>
irb(main):002:0> uri.host
=> "google."

The "google." is not a valid host.

I think the host validation should be changed here. I do a heuristic_parse and again do the validation with a regex as of now.

HOST_REGEX = /^(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])$/

I think we can use this regex in host= method so that it gives an InvalidURIError

zastruga · 2016-09-08T15:32:59Z

An even simpler case I ran into that fails with both Addressable::URI.heuristic_parse(" http://google.com") as well as Addressable::URI.parse(" http://google.com"), with a leading space. As above, I "fixed" this in my code by just stripping the URL that is passed to Addressable::URI, but it feels like this is a basic thing that should at least be handled by heuristic_parse.

sporkmonger · 2016-11-03T23:49:50Z

@ovamsikrishna What are you expecting to have happen there? Browsers don't resolve google. either, but that is a valid hostname. In fact, google.com. and google.com are equivalent. Similarly, localhost. and localhost. Addressable is 100% doing the right thing there.

sporkmonger · 2016-11-03T23:54:21Z

@zastruga I just put in 8d040f6 to strip whitespace before parsing in heuristic_parse.

sporkmonger added the Accepted label Mar 27, 2015

dentarg mentioned this issue Jan 13, 2016

Only RFC 3986 compliant URLs should be valid? twingly/twingly-url#74

Closed

sporkmonger mentioned this issue Jun 15, 2016

Underscore in #normalized_host is replaced with %5F #233

Closed

sporkmonger closed this as completed Nov 4, 2016

dentarg mentioned this issue Aug 18, 2018

Examples of how addressable more closely conforms to RFC... #313

Closed

jherdman mentioned this issue Sep 5, 2018

String "uri" format validation doesn't work voxpupuli/json-schema#353

Closed

golgote mentioned this issue Oct 12, 2018

Whitespace on the host name golgote/neturl#11

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Invalid URIs are not detected. #161

Invalid URIs are not detected. #161

gotascii commented Jun 10, 2014

bitboxer commented Dec 19, 2014

geku commented Feb 26, 2015

sporkmonger commented Mar 27, 2015

ptoomey3 commented Sep 30, 2015

sporkmonger commented Jun 15, 2016 •

edited

Loading

dvogel commented Jul 28, 2016

vamsi-krishna-E0029 commented Aug 24, 2016

zastruga commented Sep 8, 2016

sporkmonger commented Nov 3, 2016

sporkmonger commented Nov 3, 2016 •

edited

Loading

Invalid URIs are not detected. #161

Invalid URIs are not detected. #161

Comments

gotascii commented Jun 10, 2014

bitboxer commented Dec 19, 2014

geku commented Feb 26, 2015

sporkmonger commented Mar 27, 2015

ptoomey3 commented Sep 30, 2015

sporkmonger commented Jun 15, 2016 • edited Loading

dvogel commented Jul 28, 2016

vamsi-krishna-E0029 commented Aug 24, 2016

zastruga commented Sep 8, 2016

sporkmonger commented Nov 3, 2016

sporkmonger commented Nov 3, 2016 • edited Loading

sporkmonger commented Jun 15, 2016 •

edited

Loading

sporkmonger commented Nov 3, 2016 •

edited

Loading