Modified sanitize_url to accept IPv6 addresses #235

Closed
k21 wants to merge 3 commits

4 participants

@k21

Also removed mailto from valid_schemes, because the mailto: scheme could never be accepted anyway: it cannot pass the later checks in the function (according to urlparse, a mailto: URL does not contain a hostname).
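For reference, a quick check of that claim with the standard-library urlparse (the address below is just a placeholder):

>>> from urlparse import urlparse
>>> urlparse('mailto:someone@example.com').hostname is None
True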

@spladug
Owner

Github says this pull was updated, but the commit looks the same as before. Am I missing something? :X

@k21

In this version the urlparse() function is used to extract the IPv6 address from the link, while in the previous version urlparse was not used at all. Some functions from the previous version are still used because I know of no other way to validate an IPv6 address in Python. I remember reading somewhere that urlparse expects its input to be a valid URL, so I suppose I have to check myself that the address is really valid.
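One alternative for validating an IPv6 literal, not used in this patch and shown here only as a sketch, is socket.inet_pton; it is available only where the platform provides inet_pton (not on Windows with Python 2):

>>> import socket
>>> def is_valid_ipv6(addr):
...     try:
...         socket.inet_pton(socket.AF_INET6, addr)  # raises socket.error on bad input
...         return True
...     except socket.error:
...         return False
...
>>> is_valid_ipv6('3ffe:2a00:100:7031::1')
True
>>> is_valid_ipv6('nonsense')
False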

@k21

I did some testing and found a bug in the current version of sanitize_url: it accepts any address enclosed in brackets (e.g. "http://[nonsense]/abcd"), which is in fact invalid and should not be accepted. I pushed a new commit which should fix this.

@k21

I am sorry, I accidentally included some code that had nothing to do with this problem in the previous version of the pull request. I removed those commits and also cleaned up the code a little. I hope that this version is correct.

@k21

However, I still think that using urlparse is not a very good idea, because it expects its input to be a valid URL. Because of this, it will incorrectly parse things like "http://hello:world!/", treating "world!" as a port number, and sanitize_url will not notice, because it never checks whether the port is really a number.
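A quick illustration of that parse (the .port attribute itself behaves differently across Python 2.7 point releases, so only netloc and hostname are shown here):

>>> from urlparse import urlparse
>>> u = urlparse('http://hello:world!/')
>>> u.netloc
'hello:world!'
>>> u.hostname
'hello'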

@k21 k21 Modify sanitize_url to accept IPv6 addresses
 * After the change the function also accepts some invalid links, but
   after discussion with spladug this should not be a problem, since
   many invalid addresses were already accepted before the change
b713eca
@spladug

I believe [ and ] need to be allowed by the regex as well, right?

spladug: they do not have to, because if the link contains an IPv6 address, urlparse reports the address as the hostname and strips the brackets automatically

Owner

That doesn't seem to be the case when I test it, am I doing it wrong?

Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) 
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urlparse
>>> urlparse.urlparse("http://[3ffe:2a00:100:7031::1]")
ParseResult(scheme='http', netloc='[3ffe:2a00:100:7031::1]', path='', params='', query='', fragment='')

sanitize_url uses hostname, not netloc. Demo:

Python 2.7.2+ (default, Aug 16 2011, 09:23:59) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> u = urlparse('http://[3ffe:2a00:100:7031::1]')
>>> u
ParseResult(scheme='http', netloc='[3ffe:2a00:100:7031::1]', path='', params='', query='', fragment='')
>>> u.hostname
'3ffe:2a00:100:7031::1'
Owner

Yay!

@spladug
Owner

This works perfectly for allowing the URL to be submitted, but there are downstream pieces of code that break. For example:

reddit app natbook:2111 started 748d0d6 at 2011-12-01 10:18:49.275693
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) 

In [1]: from r2.lib.utils import domain

In [2]: domain("http://[3ffe:2a00:100:7031::1]")
Out[2]: '[3ffe'

I think we'll need to figure out those issues before this can go live.
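For reference, the extended r_domain pattern from the utils.py change further down this page keeps the bracketed literal intact; a standalone check of just the regex, outside reddit code:

>>> import re
>>> r_domain = re.compile("(?i)(?:.+?://)?(?:www[\d]*\.)?(\[[0-9a-fA-F:]+\]|[^/:#?]*)")
>>> r_domain.match("http://[3ffe:2a00:100:7031::1]").group(1)
'[3ffe:2a00:100:7031::1]'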

@k21
k21 commented

Sorry, I did not notice that. I will go through the helper functions, find those that can be affected (there will probably be quite a lot of them) and fix them.

@spladug
Owner

No worries, this code has a lot of assumptions in it :)

@k21
k21 commented

It is still not complete; there seem to be problems with listing links for a domain, e.g. /domain/[::1]/ returns a not-found error.
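For reference, the widened domain_pattern from the middleware.py change shown in the diff below does match a bracketed literal in /domain/ paths; a standalone check of just the regex:

>>> import re
>>> domain_pattern = re.compile(r'\A/domain/(([-\w]+\.)+[\w]+|\[[0-9a-fA-F:]+\])')
>>> domain_pattern.match('/domain/[::1]/').group(1)
'[::1]'
>>> domain_pattern.match('/domain/example.com/').group(1)
'example.com'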

@k21

I think that now everything should work as expected even with IPv6 addresses.

@danry25

Looks like this patch breaks submissions in the current iteration of Reddit from Source :(

Yes, it has been a long time since this patch was written. I do not have reddit installed locally right now, and because the patch applied cleanly, I do not know what it breaks. Could you please be a bit more specific? Thanks.

Well, let me try and set it up on our latest reddit install; we migrated from reddit from source on Ubuntu 11.10 to reddit from source on Ubuntu 12.04.1 since I last tried this patch. Which files do I need to change, by the way?

If you are going to try applying this patch, it might be a good idea to use commit @2cfd44f instead of this one; that commit also fixes the problems with /domain listings.

Only the following files were modified in this patch:
r2/r2/config/middleware.py
r2/r2/lib/utils/utils.py

Thanks for the info, I'll try applying the patch from the commit you recommended here in a minute and see how it goes.

Interesting, so it appears to have no effect if I insert the changes you made into the latest reddit build's r2/r2/config/middleware.py
and r2/r2/lib/utils/utils.py files, although it doesn't break normal link submission. Submitting an IPv6 URL gets reddit to reply with "you should check that url". Maybe reddit is relying on more values to screen URLs now?

I will try to get my local reddit installation running again and find out what the problem is, but I do not have a lot of free time, so I cannot guarantee when (or whether) it will be fixed.

Oh, don't worry about it too much, I'm not particularly pressed to get this added into my reddit install. I'll clone your repository though on a different VPS & see if it works with IPv6 links.

@mikemol

Is there a set of unit tests for exercising sanitize_url with various inputs?

@spladug
Owner

@mikemol: Not currently ;)
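A minimal sketch of what such tests could look like, assuming sanitize_url is importable from r2.lib.utils like domain above and returns the URL when it accepts it and None when it rejects it; both the import path and that return convention are assumptions here, not verified reddit behaviour:

import unittest
from r2.lib.utils import sanitize_url  # assumed import path

class SanitizeUrlTest(unittest.TestCase):
    def test_accepts_plain_hosts(self):
        # ordinary hostnames and IPv4 literals should pass through unchanged
        self.assertEqual(sanitize_url('http://example.com/'), 'http://example.com/')
        self.assertEqual(sanitize_url('http://127.0.0.1/'), 'http://127.0.0.1/')

    def test_accepts_ipv6_literal(self):
        url = 'http://[3ffe:2a00:100:7031::1]/'
        self.assertEqual(sanitize_url(url), url)

    def test_rejects_bracketed_garbage(self):
        # assumed convention: rejected URLs come back as None
        self.assertTrue(sanitize_url('http://[nonsense]/abcd') is None)

if __name__ == '__main__':
    unittest.main()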

@mikemol
@danry25

Hey, I just tested this patch from a reddit user named BasementTrix on Uppit.us, a lightly modified install of Reddit from source. It works for allowing IPv6 links, both long and short, to be submitted to Uppit.us just like you would submit a normal URL or an IPv4 address.

@spladug
Owner

Unfortunately, due to our latency in merging this, it's become unmergeable and out of date. I'm going to close this. Thanks a tonne for your work and we should definitely still do something like this.

@spladug spladug closed this
@danry25

Meh, it really didn't make sense to add it since the pull request for snudown was rejected; reddit would just end up with half-working IPv6 support.

Commits on Nov 29, 2011
  1. @k21

    Modify sanitize_url to accept IPv6 addresses

    k21 authored
     * After the change the function also accepts some invalid links, but
       after discussion with spladug this should not be a problem, since
       many invalid addresses were already accepted before the change
Commits on Dec 22, 2011
  1. @k21
  2. @k21
Showing with 27 additions and 9 deletions.
  1. +1 −1  r2/r2/config/middleware.py
  2. +26 −8 r2/r2/lib/utils/utils.py
2  r2/r2/config/middleware.py
@@ -356,7 +356,7 @@ def __call__(self, environ, start_response):
        return self.app(environ, start_response)
class DomainListingMiddleware(object):
-    domain_pattern = re.compile(r'\A/domain/(([-\w]+\.)+[\w]+)')
+    domain_pattern = re.compile(r'\A/domain/(([-\w]+\.)+[\w]+|\[[0-9a-fA-F:]+\])')
    def __init__(self, app):
        self.app = app
34 r2/r2/lib/utils/utils.py
@@ -220,7 +220,7 @@ def base_url(url):
    res = r_base_url.findall(url)
    return (res and res[0]) or url
-r_domain = re.compile("(?i)(?:.+?://)?(?:www[\d]*\.)?([^/:#?]*)")
+r_domain = re.compile("(?i)(?:.+?://)?(?:www[\d]*\.)?(\[[0-9a-fA-F:]+\]|[^/:#?]*)")
def domain(s):
    """
    Takes a URL and returns the domain part, minus www., if
@@ -266,7 +266,7 @@ def get_title(url):
        return None
valid_schemes = ('http', 'https', 'ftp', 'mailto')
-valid_dns = re.compile('\A[-a-zA-Z0-9]+\Z')
+valid_dns = re.compile('\A[-a-zA-Z0-9:]+\Z')
def sanitize_url(url, require_scheme = False):
    """Validates that the url is of the form
@@ -379,7 +379,8 @@ class UrlParser(object):
    __slots__ = ['scheme', 'path', 'params', 'query',
                 'fragment', 'username', 'password', 'hostname',
-                'port', '_url_updates', '_orig_url', '_query_dict']
+                'port', '_url_updates', '_orig_url', '_query_dict',
+                'is_ipv6']
    valid_schemes = ('http', 'https', 'ftp', 'mailto')
    cname_get = "cnameframe"
@@ -389,6 +390,9 @@ def __init__(self, url):
        for s in self.__slots__:
            if hasattr(u, s):
                setattr(self, s, getattr(u, s))
+        self.is_ipv6 = False
+        if getattr(u, 'netloc', '').startswith('['):
+            self.is_ipv6 = True
        self._url_updates = {}
        self._orig_url = url
        self._query_dict = None
@@ -459,8 +463,19 @@ def unparse(self):
        q.update(self._url_updates)
        q = query_string(q).lstrip('?')
-        # make sure the port is not doubly specified
-        if self.port and ":" in self.hostname:
+        # if this is ipv6 address, remove brackets from hostname
+        if self.hostname and self.hostname.startswith('[') and ']' in self.hostname:
+            self.is_ipv6 = True
+            self.hostname = self.hostname[1:]
+            self.hostname = self.hostname[:self.hostname.index(']')]
+
+        # if this is marked as ipv6 address but it is not, remove the mark
+        if self.hostname and self.is_ipv6:
+            if not all(c in '0123456789abcdefABCDEF:' for c in self.hostname):
+                self.is_ipv6 = False
+
+        # make sure the port is not doubly specified
+        if self.hostname and ':' in self.hostname and self.port and not self.is_ipv6:
            self.hostname = self.hostname.split(':')[0]
        # if there is a netloc, there had better be a scheme
@@ -539,7 +554,10 @@ def netloc(self):
        if not self.hostname:
            return ""
        elif getattr(self, "port", None):
-            return self.hostname + ":" + str(self.port)
+            if self.is_ipv6:
+                return "[" + self.hostname + "]:" + str(self.port)
+            else:
+                return self.hostname + ":" + str(self.port)
        return self.hostname
    def mk_cname(self, require_frame = True, subreddit = None, port = None):
@@ -948,8 +966,8 @@ def new_fn(*a,**kw):
def common_subdomain(domain1, domain2):
    if not domain1 or not domain2:
        return ""
-    domain1 = domain1.split(":")[0]
-    domain2 = domain2.split(":")[0]
+    domain1 = urlparse(domain1).hostname
+    domain2 = urlparse(domain2).hostname
    if len(domain1) > len(domain2):
        domain1, domain2 = domain2, domain1
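For reference, the relaxed valid_dns pattern above (which sanitize_url uses to vet hostname characters) is the piece that lets a bare IPv6 literal through; a standalone check of just the regex:

>>> import re
>>> valid_dns = re.compile('\A[-a-zA-Z0-9:]+\Z')
>>> bool(valid_dns.match('3ffe:2a00:100:7031::1'))
True
>>> bool(valid_dns.match('bad host'))
False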