Only RFC 3986 compliant URLs should be valid? #74
Comments
Another one, from sporkmonger/addressable#198
[5] pry(main)> Twingly::URL.parse("http://www,google.com").valid?
=> true |
This would help the case where we get an exception from an underlying library using the stricter https://app.getsentry.com/twingly/klondike/issues/106202251/
URI::InvalidURIError: bad URI(is not URI?): http://[app-id].appspot.com/feeds/posts/default
from bunny/consumer_work_pool.rb:88:in `run_loop'
from bunny/consumer_work_pool.rb:88:in `catch'
from bunny/consumer_work_pool.rb:89:in `block in run_loop'
from bunny/consumer_work_pool.rb:89:in `loop'
from bunny/consumer_work_pool.rb:94:in `block (2 levels) in run_loop'
from bunny/consumer_work_pool.rb:94:in `call'
from bunny/channel.rb:1722:in `block in handle_frameset'
from bunny/consumer.rb:56:in `call'
from bunny/consumer.rb:56:in `call'
from twingly/amqp/subscription.rb:40:in `block in each_message'
from twingly/amqp/subscription.rb:40:in `call'
from /app/lib/work_queue.rb:19:in `block in each_ping'
from /app/lib/work_queue.rb:19:in `call'
from /app/lib/client.rb:29:in `block in start'
from /app/lib/client.rb:46:in `handle_incoming_ping'
from /app/lib/custom_domain_detector/blogger.rb:14:in `blog?'
from /app/lib/custom_domain_detector/blogger.rb:34:in `feed_is_blogger?'
from /app/lib/http_client.rb:34:in `head'
from faraday/connection.rb:140:in `head'
from faraday/connection.rb:377:in `run_request'
from faraday/rack_builder.rb:139:in `build_response'
from faraday/rack_builder.rb:192:in `build_env'
from faraday/connection.rb:406:in `build_exclusive_url'
from uri/generic.rb:1098:in `merge'
from uri/generic.rb:1140:in `merge0'
from uri/rfc3986_parser.rb:116:in `convert_to_uri'
from uri/rfc3986_parser.rb:72:in `parse'
from uri/rfc3986_parser.rb:66:in `split' |
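The trace above ends in stdlib's RFC 3986 parser, which is where the bracketed host is rejected. A minimal sketch of checking for that up front, using only Ruby's stdlib; the helper name `stdlib_parseable?` is hypothetical and not part of twingly-url:

```ruby
require "uri"

# Hypothetical helper (not part of twingly-url): returns true if Ruby's
# stdlib URI can parse the string, false if it raises URI::InvalidURIError.
def stdlib_parseable?(url)
  URI.parse(url)
  true
rescue URI::InvalidURIError
  false
end

# The URL from the trace above is rejected by the stdlib parser.
stdlib_parseable?("http://[app-id].appspot.com/feeds/posts/default") # => false
stdlib_parseable?("http://example.com/feeds/posts/default")          # => true
```

A check like this only tells us whether stdlib-based HTTP libraries would raise; it says nothing about whether the host actually resolves.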
Related Addressable issue: sporkmonger/addressable#161 which has been "Accepted" |
Perhaps we should look into why we can't use https://bugs.ruby-lang.org/projects/ruby-trunk/repository/revisions/46491 "support RFC3986" |
Perhaps this is the reason: From https://bugs.ruby-lang.org/issues/9990
|
Hey all, hopefully you've read my comments on the issue you linked to. I've been thinking about adding a second method to Addressable, |
@sporkmonger it sounds like ...but someday I hope that we can write our own HTTP wrapper-library that uses twingly-url, so we could try to visit even the iffy URIs :-) |
Perhaps something to discuss here, should we make URLs with too long hostnames invalid? References: #66 (comment), http://stackoverflow.com/questions/14402407/maximum-length-of-a-domain-name-without-the-http-www-com-parts |
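For reference, a length check could look roughly like the sketch below. It assumes the host is already in its ASCII (punycode) form, and the method and constant names are hypothetical; the limits themselves (63 octets per label, 253 characters for the whole name in textual form) come from the DNS specification.

```ruby
# DNS limits: at most 63 octets per label and at most 253 characters
# for the full name in its textual (dotted) form.
MAX_LABEL_LENGTH = 63
MAX_HOSTNAME_LENGTH = 253

# Hypothetical helper: assumes `host` is already ASCII/punycode.
def hostname_length_ok?(host)
  name = host.to_s.chomp(".") # ignore a single trailing root dot
  return false if name.empty? || name.bytesize > MAX_HOSTNAME_LENGTH

  # Keep empty labels (limit -1) so "a..b" is rejected as well.
  name.split(".", -1).all? { |label| label.bytesize.between?(1, MAX_LABEL_LENGTH) }
end
```

Whether twingly-url should enforce this at all is the open question; the sketch only shows that the check itself is cheap.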
That's unfortunate. My inclination would normally be to not enforce this On Mon, Sep 12, 2016, 8:47 AM Patrik Ragnarsson notifications@github.com
|
The other day I was looking for a way to validate URLs before actually requesting them and ended up here coming from this issue. I used this table to check all the libs/methods available. And for
VALID_URLS = [
'http://foo.com/blah_blah',
'http://foo.com/blah_blah/',
'http://foo.com/blah_blah_(wikipedia)',
'http://foo.com/blah_blah_(wikipedia)_(again)',
'http://www.example.com/wpstyle/?p=364',
'https://www.example.com/foo/?bar=baz&inga=42&quux',
'http://✪df.ws/123',
'http://userid:password@example.com:8080',
'http://userid:password@example.com:8080/',
'http://userid@example.com',
'http://userid@example.com/',
'http://userid@example.com:8080',
'http://userid@example.com:8080/',
'http://userid:password@example.com',
'http://userid:password@example.com/',
# 'http://142.42.1.1/',
# 'http://142.42.1.1:8080/',
'http://➡.ws/䨹',
"http://⌘.ws\\",
'http://⌘.ws/',
'http://foo.com/blah_(wikipedia)#cite-1',
'http://foo.com/blah_(wikipedia)_blah#cite-1',
'http://foo.com/unicode_(✪)_in_parens',
'http://foo.com/(something)?after=parens',
'http://☺.damowmow.com/',
'http://code.google.com/events/#&product=browser',
'http://j.mp',
# 'ftp://foo.bar/baz',
'http://foo.bar/?q=Test%20URL-encoded%20stuff',
# 'http://مثال.إختبار',
# 'http://例子.测试',
# 'http://उदाहरण.परीक्षा',
'http://-.~_!$&\'()*+,;=:%40:80%2f::::::@example.com',
'http://1337.net',
'http://a.b-c.de',
# 'http://223.255.255.254'
]
INVALID_URLS = [
'http://',
'http://.',
'http://..',
'http://../',
'http://?',
'http://??',
'http://??/',
'http://#',
'http://##',
'http://##/',
# 'http://foo.bar?q=Spaces should be encoded',
'//',
'//a',
'///a',
'///',
'http:///a',
# 'foo.com',
'rdar://1234',
'h://test',
# 'http:// shouldfail.com',
':// should fail',
# 'http://foo.bar/foo(bar)baz quux',
'ftps://foo.bar/',
'http://-error-.invalid/',
# 'http://a.b--c.de/',
# 'http://-a.b.co',
# 'http://a.b-.co',
'http://0.0.0.0',
'http://10.1.1.0',
'http://10.1.1.255',
'http://224.1.1.1',
'http://1.1.1.1.1',
'http://123.123.123',
'http://3628126748',
'http://.www.foo.bar/',
# 'http://www.foo.bar./',
# 'http://.www.foo.bar./',
'http://10.1.1.1',
'http://10.1.1.254'
]
In both arrays I’ve commented out the false negatives and false positives accordingly. At the end of the day I chose to use @dperini’s famous Regexp, wrapped into a gem by @amogil, which shows much better results on the test samples provided (1 and 3 “errors” on each array respectively). So I wonder whether it is the list of URL samples that is inadequate and incorrect, or is it actual |
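A comparison like the one described above can be scripted. The sketch below is a hypothetical harness, not what was actually used: `score` and the stand-in validator are made-up names, and any callable that returns true or false for a URL string can be plugged in.

```ruby
# Hypothetical harness: `validator` is any callable that takes a URL string
# and returns true (considered valid) or false (considered invalid).
def score(validator, valid_urls, invalid_urls)
  {
    # URLs that should be valid but were rejected.
    false_negatives: valid_urls.reject { |url| validator.call(url) },
    # URLs that should be invalid but were accepted.
    false_positives: invalid_urls.select { |url| validator.call(url) },
  }
end

# Deliberately naive stand-in validator, for illustration only.
naive = ->(url) { url.start_with?("http://") }

result = score(
  naive,
  ["http://foo.com/blah_blah", "https://www.example.com/"],
  ["http://", "//a"]
)
puts result.inspect
```

Passing the full VALID_URLS and INVALID_URLS arrays and a lambda wrapping the library under test gives the per-library error counts directly.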
@smileart A quick comment on the general purpose of twingly-url: we want to reject URLs that cause other libraries, typically HTTP libraries, to raise exceptions. We could of course try harder to fix what we think are broken URLs, e.g. converting Thanks for the links, URLs sure are interesting stuff! :-) |
Oops, sorry @smileart, I didn't check https://mathiasbynens.be/demo/url-regex so I thought the URLs came from our tests and were checked with another tool. I think we could take a closer look here and fix most of the cases accordingly. Some cases are already covered in other issues, e.g. #76. Thanks for taking the time to comment. :) |
@dentarg Sure, not a problem. Thanks for the lib! In fact, with some cases fixed it surely could be utterly useful. Cheers! 👍 |
@sporkmonger hello 👋 I've been hitting this issue too recently, noticing the validation was a bit too permissive in I already had to do this for years (to validate updown 100k+ URLs) so I can contribute a starting implementation I'm currently using to validate the hostname looks valid (or is an IP):
HostnameRegex = /(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-_]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-_]*[A-Za-z0-9])/
PortRegex = /\:[0-9]{2,5}/
HostRegex = /^(#{HostnameRegex})(#{PortRegex})?$/
uri = Addressable::URI.parse(...)
valid_ip = IPAddr.new(uri.host) rescue nil
(valid_ip or HostRegex.match?(uri.normalized_host))
I am matching against I don't really like the IPAddr parsing with the Let me know what you think, if you have other starting code, and if you're interested in a PR for this. |
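A stdlib-only variant of the host check above, as a sketch rather than a drop-in replacement. It works on an already-extracted host string instead of an Addressable::URI, leaves the port out (a parsed host does not include it), anchors with `\A` and `\z` because `^` and `$` match at line boundaries in Ruby, and rescues the specific IPAddr error instead of using an inline `rescue`. All constant and method names are hypothetical.

```ruby
require "ipaddr"

# Hypothetical names, adapted from the regex in the comment above.
# A label starts and ends with an alphanumeric; hyphens and underscores
# are only allowed in between.
HOST_LABEL = /[a-zA-Z0-9](?:[a-zA-Z0-9\-_]*[a-zA-Z0-9])?/
HOSTNAME_PATTERN = /\A(?:#{HOST_LABEL}\.)*#{HOST_LABEL}\z/

# True if the string is something IPAddr accepts (IPv4 or IPv6).
def ip_address?(host)
  IPAddr.new(host)
  true
rescue IPAddr::Error
  false
end

# True if the host is an IP address or looks like a dotted hostname.
def plausible_host?(host)
  return false if host.nil? || host.empty?

  ip_address?(host) || HOSTNAME_PATTERN.match?(host)
end

puts plausible_host?("www.google.com")
puts plausible_host?("www,google.com")
```

With Addressable in the picture, the argument would be something like `uri.normalized_host`, as in the original snippet.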
Warning: this allows a lot more URLs. See twingly/twingly-url#74 (comment)
These are now invalid after twingly#158. I think we can close twingly#74. Close twingly#74
For example, we currently deem the following URL as valid:
But the RFC has ] listed as a reserved character.
stdlib's URI is a bit more restrictive here: