Only RFC 3986 compliant URLs should be valid? #74

walro · 2016-01-12T15:35:17Z

For example we, currently, deem the following url as valid:

irb(main):002:0> Twingly::URL.parse("http://club].no/").valid?
=> true

But the RFC has ] listed as a reserved character.

stdlib's URI is a bit more restrictive here:

irb(main):006:0> URI.parse("http://club].no")
URI::InvalidURIError: bad URI(is not URI?): http://club].no
    from /Users/robin/.rubies/ruby-2.2.3/lib/ruby/2.2.0/uri/rfc3986_parser.rb:66:in `split'
    from /Users/robin/.rubies/ruby-2.2.3/lib/ruby/2.2.0/uri/rfc3986_parser.rb:72:in `parse'
    from /Users/robin/.rubies/ruby-2.2.3/lib/ruby/2.2.0/uri/common.rb:226:in `parse'
    from (irb):6
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/cli/console.rb:14:in `run'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/cli.rb:308:in `console'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/vendor/thor/lib/thor/invocation.rb:126:in `invoke_command'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/vendor/thor/lib/thor.rb:359:in `dispatch'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/vendor/thor/lib/thor/base.rb:440:in `start'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/cli.rb:10:in `start'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/bin/bundle:20:in `block in <top (required)>'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/friendly_errors.rb:7:in `with_friendly_errors'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/bin/bundle:18:in `<top (required)>'
    from /Users/robin/.gem/ruby/2.2.3/bin/bundle:23:in `load'
    from /Users/robin/.gem/ruby/2.2.3/bin/bundle:23:in `<main>'

The text was updated successfully, but these errors were encountered:

dentarg · 2016-01-12T15:46:59Z

Another one, from sporkmonger/addressable#198

[5] pry(main)> Twingly::URL.parse("http://www,google.com").valid?
=> true

dentarg · 2016-01-13T10:53:21Z

This would help the case where we get an exception from an underlaying library using the stricter URI module. Would spare us from having to rescue URI::InvalidURIError in all the projects.

https://app.getsentry.com/twingly/klondike/issues/106202251/

URI::InvalidURIError: bad URI(is not URI?): http://[app-id].appspot.com/feeds/posts/default
  from bunny/consumer_work_pool.rb:88:in `run_loop'
  from bunny/consumer_work_pool.rb:88:in `catch'
  from bunny/consumer_work_pool.rb:89:in `block in run_loop'
  from bunny/consumer_work_pool.rb:89:in `loop'
  from bunny/consumer_work_pool.rb:94:in `block (2 levels) in run_loop'
  from bunny/consumer_work_pool.rb:94:in `call'
  from bunny/channel.rb:1722:in `block in handle_frameset'
  from bunny/consumer.rb:56:in `call'
  from bunny/consumer.rb:56:in `call'
  from twingly/amqp/subscription.rb:40:in `block in each_message'
  from twingly/amqp/subscription.rb:40:in `call'
  from /app/lib/work_queue.rb:19:in `block in each_ping'
  from /app/lib/work_queue.rb:19:in `call'
  from /app/lib/client.rb:29:in `block in start'
  from /app/lib/client.rb:46:in `handle_incoming_ping'
  from /app/lib/custom_domain_detector/blogger.rb:14:in `blog?'
  from /app/lib/custom_domain_detector/blogger.rb:34:in `feed_is_blogger?'
  from /app/lib/http_client.rb:34:in `head'
  from faraday/connection.rb:140:in `head'
  from faraday/connection.rb:377:in `run_request'
  from faraday/rack_builder.rb:139:in `build_response'
  from faraday/rack_builder.rb:192:in `build_env'
  from faraday/connection.rb:406:in `build_exclusive_url'
  from uri/generic.rb:1098:in `merge'
  from uri/generic.rb:1140:in `merge0'
  from uri/rfc3986_parser.rb:116:in `convert_to_uri'
  from uri/rfc3986_parser.rb:72:in `parse'
  from uri/rfc3986_parser.rb:66:in `split'

dentarg · 2016-01-13T10:58:27Z

Related Addressable issue: sporkmonger/addressable#161 which has been "Accepted"

dentarg · 2016-01-13T11:12:36Z

Perhaps we should look into why we can't use URI.parse.

https://bugs.ruby-lang.org/projects/ruby-trunk/repository/revisions/46491 "support RFC3986"

dentarg · 2016-01-13T11:14:12Z

Perhaps we should look into why we can't use URI.parse.

Perhaps this is the reason:

From https://bugs.ruby-lang.org/issues/9990

this issue is demonstrate that there exists no way to properly encode a URI so that URI.parse will accept it.

sporkmonger · 2016-06-15T22:07:55Z

Hey all, hopefully you've read my comments on the issue you linked to. I've been thinking about adding a second method to Addressable, sane? which would "validate" based on whether a URI made sense as opposed to whether it was valid according to RFC 3986. There's a lot of misinformation about URI validation, in large part because if you were to strictly apply the EBNF in RFC 3986, virtually all strings would be considered "valid". Addressable takes the stance that very few input strings to the parse method should raise exceptions because it makes a great number of very legitimate things that you might want to do w/ iffy URIs extremely hard to write. A possible solution would be a sane? method that doesn't result in raised exceptions, doesn't attempt to follow RFCs and instead just checks to make sure a URI makes sense according to normal usage.

dentarg · 2016-06-15T23:02:58Z

@sporkmonger it sounds like sane? would help a lot when passing on the URI to other libraries not using twingly-url or Addressable (i.e. they are using URI.parse)

...but someday I hope that we can write our own HTTP wrapper-library that uses twingly-url, so we could try to visit even the iffy URIs :-)

dentarg · 2016-09-12T15:39:41Z

Perhaps something to discuss here, should we make URLs with too long hostnames invalid?

References: #66 (comment), http://stackoverflow.com/questions/14402407/maximum-length-of-a-domain-name-without-the-http-www-com-parts

sporkmonger · 2016-09-12T15:58:04Z

That's unfortunate. My inclination would normally be to not enforce this
limit, but it's probably more important to be consistent here.

On Mon, Sep 12, 2016, 8:47 AM Patrik Ragnarsson notifications@github.com
wrote:

Perhaps something to discuss here, should we make URLs with too long
hostnames invalid?

References: #66 (comment)
#66 (comment),

http://stackoverflow.com/questions/14402407/maximum-length-of-a-domain-name-without-the-http-www-com-parts

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#74 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAAG8pzS2lIjgsZwVDvFuIVWnz6xmauuks5qpXI-gaJpZM4HDPcV
.

smileart · 2017-08-18T13:27:17Z

The other day I was looking for a way to validate URLs before actually requesting them and ended up here coming from this issue.

I used this table to check all the libs/methods available. And for twingly-url got the following results:

  VALID_URLS = [
	'http://foo.com/blah_blah',
	'http://foo.com/blah_blah/',
	'http://foo.com/blah_blah_(wikipedia)',
	'http://foo.com/blah_blah_(wikipedia)_(again)',
	'http://www.example.com/wpstyle/?p=364',
	'https://www.example.com/foo/?bar=baz&inga=42&quux',
	'http://✪df.ws/123',
	'http://userid:password@example.com:8080',
	'http://userid:password@example.com:8080/',
	'http://userid@example.com',
	'http://userid@example.com/',
	'http://userid@example.com:8080',
	'http://userid@example.com:8080/',
	'http://userid:password@example.com',
	'http://userid:password@example.com/',
	# 'http://142.42.1.1/',
	# 'http://142.42.1.1:8080/',
	'http://➡.ws/䨹',
	"http://⌘.ws\\"
	'http://⌘.ws/',
	'http://foo.com/blah_(wikipedia)#cite-1',
	'http://foo.com/blah_(wikipedia)_blah#cite-1',
	'http://foo.com/unicode_(✪)_in_parens',
	'http://foo.com/(something)?after=parens',
	'http://☺.damowmow.com/',
	'http://code.google.com/events/#&product=browser',
	'http://j.mp',
	# 'ftp://foo.bar/baz',
	'http://foo.bar/?q=Test%20URL-encoded%20stuff',
	# 'http://مثال.إختبار',
	# 'http://例子.测试',
	# 'http://उदाहरण.परीक्षा',
	'http://-.~_!$&\'()*+,;=:%40:80%2f::::::@example.com',
	'http://1337.net',
	'http://a.b-c.de',
	# 'http://223.255.255.254'
  ]

  INVALID_URLS = [
	'http://',
	'http://.',
	'http://..',
	'http://../',
	'http://?',
	'http://??',
	'http://??/',
	'http://#',
	'http://##',
	'http://##/',
	# 'http://foo.bar?q=Spaces should be encoded',
	'//',
	'//a',
	'///a',
	'///',
	'http:///a',
	# 'foo.com',
	'rdar://1234',
	'h://test',
	# 'http:// shouldfail.com',
	':// should fail',
	# 'http://foo.bar/foo(bar)baz quux',
	'ftps://foo.bar/',
	'http://-error-.invalid/',
	# 'http://a.b--c.de/',
	# 'http://-a.b.co',
	# 'http://a.b-.co',
	'http://0.0.0.0',
	'http://10.1.1.0',
	'http://10.1.1.255',
	'http://224.1.1.1',
	'http://1.1.1.1.1',
	'http://123.123.123',
	'http://3628126748',
	'http://.www.foo.bar/',
	# 'http://www.foo.bar./',
	# 'http://.www.foo.bar./',
	'http://10.1.1.1',
	'http://10.1.1.254'
  ]

In the both arrays I’ve commented false-negatives and false-positives accordingly.

In the end of the day I chose to use @dperini’s famous Regexp wrapped into a gem by @amogil which shows much better results on the test samples provided (1 and 3 “errors” on each array accordingly). So I wonder if it the list of URL samples which is inadequate and incorrect or is it actual twingly-url’s validation limits? Thanks in advance.

dentarg · 2017-08-18T13:34:08Z

@smileart A quick comment on the general purpose of twingly-url: we want to reject URLs that causes other libraries, typically HTTP libraries, to raise exceptions. We could of course try harder to fix what we think are broken URLs, e.g. converting foo.com to http://foo.com instead of treating it as an invalid URL and just discard it. But that hasn't been important enough for us.

Thanks for the links, URLs sure are interesting stuff! :-)

dentarg · 2017-08-18T13:44:05Z

Oops, sorry @smileart, I didn't check https://mathiasbynens.be/demo/url-regex so I thought the URLs come from our tests, and was checked with another tool. I think we could take a closer look here and fix most of the cases accordingly. Some cases are already covered in other issues, e.g. #76. Thanks for taking the time to comment. :)

smileart · 2017-08-18T13:48:19Z

@dentarg Sure, not a problem. Thanks for the lib! In fact, with some cases fixed it surely could be utterly useful. Cheers! 👍

jarthod · 2021-10-21T07:09:54Z

Hey all, hopefully you've read my comments on the issue you linked to. I've been thinking about adding a second method to Addressable, sane? which would "validate" based on whether a URI made sense as opposed to whether it was valid according to RFC 3986. There's a lot of misinformation about URI validation, in large part because if you were to strictly apply the EBNF in RFC 3986, virtually all strings would be considered "valid". Addressable takes the stance that very few input strings to the parse method should raise exceptions because it makes a great number of very legitimate things that you might want to do w/ iffy URIs extremely hard to write. A possible solution would be a sane? method that doesn't result in raised exceptions, doesn't attempt to follow RFCs and instead just checks to make sure a URI makes sense according to normal usage.

@sporkmonger hello 👋

I've been hitting this issue too recently, noticing the validation was a bit too permissive in Addressable for my use case (preventing people from entering invalid URLs in https://updown.io). I totally understand the rational for keeping the parsing liberal and I just want to vote in favor of this sane? method because it would suit my use-case perfectly.

I already had to do this for years (to valide updown 100k+ URLs) so I can contribute a starting implementation I'm currently using to validate the hostname looks valid (or is an IP):

HostnameRegex = /(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-_]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-_]*[A-Za-z0-9])/
PortRegex = /\:[0-9]{2,5}/
HostRegex = /^(#{HostnameRegex})(#{PortRegex})?$/

uri = Addressable::URI.parse(...)
valid_ip = IPAddr.new(uri.host) rescue nil
(valid_ip or HostRegex.match?(uri.normalized_host))

I am matching against normalized_host here to allow for unicode hostname (which is IMO sane as long as you normalize it before passing it to a library which don't support it), but if you prefer to validate against host to forbid them by default that is acceptable too.

I don't really like the IPAddr parsing with the rescue nil code but that's an easy and robust way I found to do it, making sure it's a valid IPv4 or IPv6 (much harder to do with Regexp)

Let me know what you think, if you have other starting code, and if you're interested in a PR for this.

Warning: this allows a lot more URLs. See twingly/twingly-url#74 (comment)

These are now invalid after twingly#158. I think we can close twingly#74. Close twingly#74

walro added the question label Jan 12, 2016

dentarg changed the title ~~Only RFC3986 compliant urls should be valid?~~ Only RFC3986 compliant URLs should be valid? Jan 12, 2016

dentarg changed the title ~~Only RFC3986 compliant URLs should be valid?~~ Only RFC 3986 compliant URLs should be valid? Jan 12, 2016

walro mentioned this issue Jan 29, 2016

domain, sld, trd (maybe others) can be nil #52

Closed

dentarg mentioned this issue Jun 15, 2016

HOST accepts comma chars sporkmonger/addressable#198

Closed

dentarg mentioned this issue Sep 12, 2016

Another mismatch between libidn and pure Ruby punycode implementation sporkmonger/addressable#239

Closed

dentarg mentioned this issue Jan 23, 2020

URI Parse Error sporkmonger/addressable#376

Closed

goulvench added a commit to goulvench/activevalidators that referenced this issue Mar 19, 2022

Validate URLs using Addressable gem to allow non-ascii characters

3d14141

Warning: this allows a lot more URLs. See twingly/twingly-url#74 (comment)

goulvench mentioned this issue Mar 19, 2022

Validate URLs using Addressable gem to allow non-ascii characters (closes #111) franckverrot/activevalidators#112

Open

dentarg added a commit to dentarg/twingly-url that referenced this issue Oct 8, 2022

Add more invalid URLs to specs

c148bed

These are now invalid after twingly#158. I think we can close twingly#74. Close twingly#74

dentarg mentioned this issue Oct 8, 2022

Add more invalid URLs to specs #159

Merged

Chrizpy closed this as completed in #159 Oct 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only RFC 3986 compliant URLs should be valid? #74

Only RFC 3986 compliant URLs should be valid? #74

walro commented Jan 12, 2016

dentarg commented Jan 12, 2016

dentarg commented Jan 13, 2016

dentarg commented Jan 13, 2016

dentarg commented Jan 13, 2016

dentarg commented Jan 13, 2016

sporkmonger commented Jun 15, 2016

dentarg commented Jun 15, 2016

dentarg commented Sep 12, 2016

sporkmonger commented Sep 12, 2016

smileart commented Aug 18, 2017 •

edited

Loading

dentarg commented Aug 18, 2017

dentarg commented Aug 18, 2017

smileart commented Aug 18, 2017 •

edited

Loading

jarthod commented Oct 21, 2021

Only RFC 3986 compliant URLs should be valid? #74

Only RFC 3986 compliant URLs should be valid? #74

Comments

walro commented Jan 12, 2016

dentarg commented Jan 12, 2016

dentarg commented Jan 13, 2016

dentarg commented Jan 13, 2016

dentarg commented Jan 13, 2016

dentarg commented Jan 13, 2016

sporkmonger commented Jun 15, 2016

dentarg commented Jun 15, 2016

dentarg commented Sep 12, 2016

sporkmonger commented Sep 12, 2016

smileart commented Aug 18, 2017 • edited Loading

dentarg commented Aug 18, 2017

dentarg commented Aug 18, 2017

smileart commented Aug 18, 2017 • edited Loading

jarthod commented Oct 21, 2021

smileart commented Aug 18, 2017 •

edited

Loading

smileart commented Aug 18, 2017 •

edited

Loading