Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only RFC 3986 compliant URLs should be valid? #74

Closed
walro opened this issue Jan 12, 2016 · 14 comments · Fixed by #159
Closed

Only RFC 3986 compliant URLs should be valid? #74

walro opened this issue Jan 12, 2016 · 14 comments · Fixed by #159
Labels

Comments

@walro
Copy link
Contributor

walro commented Jan 12, 2016

For example we, currently, deem the following url as valid:

irb(main):002:0> Twingly::URL.parse("http://club].no/").valid?
=> true

But the RFC has ] listed as a reserved character.

stdlib's URI is a bit more restrictive here:

irb(main):006:0> URI.parse("http://club].no")
URI::InvalidURIError: bad URI(is not URI?): http://club].no
    from /Users/robin/.rubies/ruby-2.2.3/lib/ruby/2.2.0/uri/rfc3986_parser.rb:66:in `split'
    from /Users/robin/.rubies/ruby-2.2.3/lib/ruby/2.2.0/uri/rfc3986_parser.rb:72:in `parse'
    from /Users/robin/.rubies/ruby-2.2.3/lib/ruby/2.2.0/uri/common.rb:226:in `parse'
    from (irb):6
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/cli/console.rb:14:in `run'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/cli.rb:308:in `console'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/vendor/thor/lib/thor/invocation.rb:126:in `invoke_command'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/vendor/thor/lib/thor.rb:359:in `dispatch'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/vendor/thor/lib/thor/base.rb:440:in `start'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/cli.rb:10:in `start'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/bin/bundle:20:in `block in <top (required)>'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/friendly_errors.rb:7:in `with_friendly_errors'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/bin/bundle:18:in `<top (required)>'
    from /Users/robin/.gem/ruby/2.2.3/bin/bundle:23:in `load'
    from /Users/robin/.gem/ruby/2.2.3/bin/bundle:23:in `<main>'
@walro walro added the question label Jan 12, 2016
@dentarg
Copy link
Collaborator

dentarg commented Jan 12, 2016

Another one, from sporkmonger/addressable#198

[5] pry(main)> Twingly::URL.parse("http://www,google.com").valid?
=> true

@dentarg dentarg changed the title Only RFC3986 compliant urls should be valid? Only RFC3986 compliant URLs should be valid? Jan 12, 2016
@dentarg dentarg changed the title Only RFC3986 compliant URLs should be valid? Only RFC 3986 compliant URLs should be valid? Jan 12, 2016
@dentarg
Copy link
Collaborator

dentarg commented Jan 13, 2016

This would help the case where we get an exception from an underlaying library using the stricter URI module. Would spare us from having to rescue URI::InvalidURIError in all the projects.

https://app.getsentry.com/twingly/klondike/issues/106202251/

URI::InvalidURIError: bad URI(is not URI?): http://[app-id].appspot.com/feeds/posts/default
  from bunny/consumer_work_pool.rb:88:in `run_loop'
  from bunny/consumer_work_pool.rb:88:in `catch'
  from bunny/consumer_work_pool.rb:89:in `block in run_loop'
  from bunny/consumer_work_pool.rb:89:in `loop'
  from bunny/consumer_work_pool.rb:94:in `block (2 levels) in run_loop'
  from bunny/consumer_work_pool.rb:94:in `call'
  from bunny/channel.rb:1722:in `block in handle_frameset'
  from bunny/consumer.rb:56:in `call'
  from bunny/consumer.rb:56:in `call'
  from twingly/amqp/subscription.rb:40:in `block in each_message'
  from twingly/amqp/subscription.rb:40:in `call'
  from /app/lib/work_queue.rb:19:in `block in each_ping'
  from /app/lib/work_queue.rb:19:in `call'
  from /app/lib/client.rb:29:in `block in start'
  from /app/lib/client.rb:46:in `handle_incoming_ping'
  from /app/lib/custom_domain_detector/blogger.rb:14:in `blog?'
  from /app/lib/custom_domain_detector/blogger.rb:34:in `feed_is_blogger?'
  from /app/lib/http_client.rb:34:in `head'
  from faraday/connection.rb:140:in `head'
  from faraday/connection.rb:377:in `run_request'
  from faraday/rack_builder.rb:139:in `build_response'
  from faraday/rack_builder.rb:192:in `build_env'
  from faraday/connection.rb:406:in `build_exclusive_url'
  from uri/generic.rb:1098:in `merge'
  from uri/generic.rb:1140:in `merge0'
  from uri/rfc3986_parser.rb:116:in `convert_to_uri'
  from uri/rfc3986_parser.rb:72:in `parse'
  from uri/rfc3986_parser.rb:66:in `split'

@dentarg
Copy link
Collaborator

dentarg commented Jan 13, 2016

Related Addressable issue: sporkmonger/addressable#161 which has been "Accepted"

@dentarg
Copy link
Collaborator

dentarg commented Jan 13, 2016

Perhaps we should look into why we can't use URI.parse.

https://bugs.ruby-lang.org/projects/ruby-trunk/repository/revisions/46491 "support RFC3986"

@dentarg
Copy link
Collaborator

dentarg commented Jan 13, 2016

Perhaps we should look into why we can't use URI.parse.

Perhaps this is the reason:

From https://bugs.ruby-lang.org/issues/9990

this issue is demonstrate that there exists no way to properly encode a URI so that URI.parse will accept it.

@sporkmonger
Copy link

Hey all, hopefully you've read my comments on the issue you linked to. I've been thinking about adding a second method to Addressable, sane? which would "validate" based on whether a URI made sense as opposed to whether it was valid according to RFC 3986. There's a lot of misinformation about URI validation, in large part because if you were to strictly apply the EBNF in RFC 3986, virtually all strings would be considered "valid". Addressable takes the stance that very few input strings to the parse method should raise exceptions because it makes a great number of very legitimate things that you might want to do w/ iffy URIs extremely hard to write. A possible solution would be a sane? method that doesn't result in raised exceptions, doesn't attempt to follow RFCs and instead just checks to make sure a URI makes sense according to normal usage.

@dentarg
Copy link
Collaborator

dentarg commented Jun 15, 2016

@sporkmonger it sounds like sane? would help a lot when passing on the URI to other libraries not using twingly-url or Addressable (i.e. they are using URI.parse)

...but someday I hope that we can write our own HTTP wrapper-library that uses twingly-url, so we could try to visit even the iffy URIs :-)

@dentarg
Copy link
Collaborator

dentarg commented Sep 12, 2016

Perhaps something to discuss here, should we make URLs with too long hostnames invalid?

References: #66 (comment), http://stackoverflow.com/questions/14402407/maximum-length-of-a-domain-name-without-the-http-www-com-parts

@sporkmonger
Copy link

That's unfortunate. My inclination would normally be to not enforce this
limit, but it's probably more important to be consistent here.

On Mon, Sep 12, 2016, 8:47 AM Patrik Ragnarsson notifications@github.com
wrote:

Perhaps something to discuss here, should we make URLs with too long
hostnames invalid?

References: #66 (comment)
#66 (comment),

http://stackoverflow.com/questions/14402407/maximum-length-of-a-domain-name-without-the-http-www-com-parts


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#74 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAAG8pzS2lIjgsZwVDvFuIVWnz6xmauuks5qpXI-gaJpZM4HDPcV
.

@smileart
Copy link

smileart commented Aug 18, 2017

The other day I was looking for a way to validate URLs before actually requesting them and ended up here coming from this issue.

I used this table to check all the libs/methods available. And for twingly-url got the following results:

  VALID_URLS = [
	'http://foo.com/blah_blah',
	'http://foo.com/blah_blah/',
	'http://foo.com/blah_blah_(wikipedia)',
	'http://foo.com/blah_blah_(wikipedia)_(again)',
	'http://www.example.com/wpstyle/?p=364',
	'https://www.example.com/foo/?bar=baz&inga=42&quux',
	'http://✪df.ws/123',
	'http://userid:password@example.com:8080',
	'http://userid:password@example.com:8080/',
	'http://userid@example.com',
	'http://userid@example.com/',
	'http://userid@example.com:8080',
	'http://userid@example.com:8080/',
	'http://userid:password@example.com',
	'http://userid:password@example.com/',
	# 'http://142.42.1.1/',
	# 'http://142.42.1.1:8080/',
	'http://➡.ws/䨹',
	"http://⌘.ws\\"
	'http://⌘.ws/',
	'http://foo.com/blah_(wikipedia)#cite-1',
	'http://foo.com/blah_(wikipedia)_blah#cite-1',
	'http://foo.com/unicode_(✪)_in_parens',
	'http://foo.com/(something)?after=parens',
	'http://☺.damowmow.com/',
	'http://code.google.com/events/#&product=browser',
	'http://j.mp',
	# 'ftp://foo.bar/baz',
	'http://foo.bar/?q=Test%20URL-encoded%20stuff',
	# 'http://مثال.إختبار',
	# 'http://例子.测试',
	# 'http://उदाहरण.परीक्षा',
	'http://-.~_!$&\'()*+,;=:%40:80%2f::::::@example.com',
	'http://1337.net',
	'http://a.b-c.de',
	# 'http://223.255.255.254'
  ]

  INVALID_URLS = [
	'http://',
	'http://.',
	'http://..',
	'http://../',
	'http://?',
	'http://??',
	'http://??/',
	'http://#',
	'http://##',
	'http://##/',
	# 'http://foo.bar?q=Spaces should be encoded',
	'//',
	'//a',
	'///a',
	'///',
	'http:///a',
	# 'foo.com',
	'rdar://1234',
	'h://test',
	# 'http:// shouldfail.com',
	':// should fail',
	# 'http://foo.bar/foo(bar)baz quux',
	'ftps://foo.bar/',
	'http://-error-.invalid/',
	# 'http://a.b--c.de/',
	# 'http://-a.b.co',
	# 'http://a.b-.co',
	'http://0.0.0.0',
	'http://10.1.1.0',
	'http://10.1.1.255',
	'http://224.1.1.1',
	'http://1.1.1.1.1',
	'http://123.123.123',
	'http://3628126748',
	'http://.www.foo.bar/',
	# 'http://www.foo.bar./',
	# 'http://.www.foo.bar./',
	'http://10.1.1.1',
	'http://10.1.1.254'
  ]

In the both arrays I’ve commented false-negatives and false-positives accordingly.

In the end of the day I chose to use @dperini’s famous Regexp wrapped into a gem by @amogil which shows much better results on the test samples provided (1 and 3 “errors” on each array accordingly). So I wonder if it the list of URL samples which is inadequate and incorrect or is it actual twingly-url’s validation limits? Thanks in advance.

@dentarg
Copy link
Collaborator

dentarg commented Aug 18, 2017

@smileart A quick comment on the general purpose of twingly-url: we want to reject URLs that causes other libraries, typically HTTP libraries, to raise exceptions. We could of course try harder to fix what we think are broken URLs, e.g. converting foo.com to http://foo.com instead of treating it as an invalid URL and just discard it. But that hasn't been important enough for us.

Thanks for the links, URLs sure are interesting stuff! :-)

@dentarg
Copy link
Collaborator

dentarg commented Aug 18, 2017

Oops, sorry @smileart, I didn't check https://mathiasbynens.be/demo/url-regex so I thought the URLs come from our tests, and was checked with another tool. I think we could take a closer look here and fix most of the cases accordingly. Some cases are already covered in other issues, e.g. #76. Thanks for taking the time to comment. :)

@smileart
Copy link

smileart commented Aug 18, 2017

@dentarg Sure, not a problem. Thanks for the lib! In fact, with some cases fixed it surely could be utterly useful. Cheers! 👍

@jarthod
Copy link

jarthod commented Oct 21, 2021

Hey all, hopefully you've read my comments on the issue you linked to. I've been thinking about adding a second method to Addressable, sane? which would "validate" based on whether a URI made sense as opposed to whether it was valid according to RFC 3986. There's a lot of misinformation about URI validation, in large part because if you were to strictly apply the EBNF in RFC 3986, virtually all strings would be considered "valid". Addressable takes the stance that very few input strings to the parse method should raise exceptions because it makes a great number of very legitimate things that you might want to do w/ iffy URIs extremely hard to write. A possible solution would be a sane? method that doesn't result in raised exceptions, doesn't attempt to follow RFCs and instead just checks to make sure a URI makes sense according to normal usage.

@sporkmonger hello 👋

I've been hitting this issue too recently, noticing the validation was a bit too permissive in Addressable for my use case (preventing people from entering invalid URLs in https://updown.io). I totally understand the rational for keeping the parsing liberal and I just want to vote in favor of this sane? method because it would suit my use-case perfectly.

I already had to do this for years (to valide updown 100k+ URLs) so I can contribute a starting implementation I'm currently using to validate the hostname looks valid (or is an IP):

HostnameRegex = /(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-_]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-_]*[A-Za-z0-9])/
PortRegex = /\:[0-9]{2,5}/
HostRegex = /^(#{HostnameRegex})(#{PortRegex})?$/

uri = Addressable::URI.parse(...)
valid_ip = IPAddr.new(uri.host) rescue nil
(valid_ip or HostRegex.match?(uri.normalized_host))

I am matching against normalized_host here to allow for unicode hostname (which is IMO sane as long as you normalize it before passing it to a library which don't support it), but if you prefer to validate against host to forbid them by default that is acceptable too.

I don't really like the IPAddr parsing with the rescue nil code but that's an easy and robust way I found to do it, making sure it's a valid IPv4 or IPv6 (much harder to do with Regexp)

Let me know what you think, if you have other starting code, and if you're interested in a PR for this.

goulvench added a commit to goulvench/activevalidators that referenced this issue Mar 19, 2022
dentarg added a commit to dentarg/twingly-url that referenced this issue Oct 8, 2022
These are now invalid after twingly#158. I think we can close twingly#74.

Close twingly#74
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants