Modified sanitize_url to accept IPv6 addresses #235

Closed
wants to merge 3 commits into from

4 participants

k21 Neil Williams Michael Mol danry25
k21 commented October 18, 2011

Also removed mailto from valid_schemes, because mailto: scheme would never get accepted as it could not pass the checks further in the function (according to urlparse, it does not contain a hostname).
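The mailto: point can be checked with a minimal sketch. It assumes Python 3's urllib.parse rather than the Python 2 urlparse module used in this thread, and the address is made up for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical address, used only for illustration.
parts = urlsplit("mailto:someone@example.com")

print(parts.scheme)    # the scheme itself is parsed fine
print(repr(parts.netloc))  # empty: a mailto: URL has no "//" authority part
print(parts.hostname)  # None, so any check that requires a hostname fails
```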

Neil Williams
Owner

Github says this pull was updated, but the commit looks the same as before. Am I missing something? :X

k21

In this version the urlparse() function is used to get the IPv6 address from the link, while the previous version did not use urlparse at all. Some functions from the previous version are still used, because I know of no other way to validate an IPv6 address in Python. I remember reading somewhere that urlparse expects its input to be a valid URL, so I suppose I have to check myself that the address is really valid.

k21

I did some testing and I found a bug in the current version of sanitize_url. It currently accepts any address enclosed in brackets (e.g. "http://[nonsense]/abcd"), which is in fact invalid and should not be accepted. I pushed a new commit which should fix this.
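One possible way to reject bracketed nonsense is sketched below. This is not the code from the pull request: it assumes Python 3, where the standard ipaddress module is available, and the helper name has_valid_ipv6_host is hypothetical. Depending on the Python version, urlsplit either returns the bracketed text as the hostname or raises ValueError itself, so both cases are handled.

```python
import ipaddress
from urllib.parse import urlsplit


def has_valid_ipv6_host(url):
    """Return True only if the URL's host parses as an IPv6 address.

    Hypothetical helper for illustration; not part of sanitize_url.
    """
    try:
        # Some Python versions raise ValueError here for malformed
        # bracketed hosts, others only return the text between brackets.
        host = urlsplit(url).hostname
    except ValueError:
        return False
    if not host:
        return False
    try:
        # AddressValueError is a subclass of ValueError.
        ipaddress.IPv6Address(host)
    except ValueError:
        return False
    return True


for candidate in ("http://[3ffe:2a00:100:7031::1]/", "http://[nonsense]/abcd"):
    print(candidate, has_valid_ipv6_host(candidate))
```

On Python 2, socket.inet_pton(socket.AF_INET6, host) could play a similar role where the platform supports it.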

k21

I am sorry, I accidentally included some code that had nothing to do with this problem in the previous version of the pull request. I removed those commits and also cleaned up the code a little. I hope that this version is correct.

k21

However, I still think that using urlparse is not a very good idea, because it does expect its input to be a valid URL. Because of this, it will incorrectly parse things like "http://hello:world!/" thinking that "world!" is a port number and sanitize_url will not find out, because it never checks whether the port number is really a number.
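The missing port check could look roughly like the sketch below. It assumes a current Python 3 release, where reading the port attribute of a parsed URL raises ValueError for a non-numeric or out-of-range port; the helper name port_is_valid is hypothetical and not taken from the patch.

```python
from urllib.parse import urlsplit


def port_is_valid(url):
    """Return True if the URL has no port or a numeric port in range.

    Hypothetical helper for illustration only.
    """
    try:
        # Reading .port triggers the integer conversion and range check.
        urlsplit(url).port
    except ValueError:
        return False
    return True


for candidate in ("http://hello:world!/", "http://hello:8080/", "http://hello/"):
    print(candidate, port_is_valid(candidate))
```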

k21 Modify sanitize_url to accept IPv6 addresses
 * After the change the function also accepts some invalid links, but
   after discussion with spladug, it should not be a problem, since
   many invalid addresses that were accepted before the change exist
b713eca
Neil Williams

I believe [ and ] need to be allowed by the regex as well, right?

spladug: they do not have to be, because if the link contains an IPv6 address, urlparse reports the address as the hostname and removes the brackets automatically

Owner

That doesn't seem to be the case when I test it, am I doing it wrong?

Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) 
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> urlparse.urlparse("http://[3ffe:2a00:100:7031::1]")
ParseResult(scheme='http', netloc='[3ffe:2a00:100:7031::1]', path='', params='', query='', fragment='')

sanitize_url uses hostname, not netloc. Demo:

Python 2.7.2+ (default, Aug 16 2011, 09:23:59) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> u = urlparse('http://[3ffe:2a00:100:7031::1]')
>>> u
ParseResult(scheme='http', netloc='[3ffe:2a00:100:7031::1]', path='', params='', query='', fragment='')
>>> u.hostname
'3ffe:2a00:100:7031::1'
Owner

Yay!

Neil Williams
Owner

This works perfectly for allowing the URL to be submitted, but there are downstream pieces of code that break. For example:

reddit app natbook:2111 started 748d0d6 at 2011-12-01 10:18:49.275693
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) 

In [1]: from r2.lib.utils import domain

In [2]: domain("http://[3ffe:2a00:100:7031::1]")
Out[2]: '[3ffe'

I think we'll need to figure out those issues before this can go live.
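The '[3ffe' result can be reproduced outside reddit with the r_domain pattern shown in the diff for this pull request. The sketch below compares the old and the patched pattern; the first_group helper is a hypothetical stand-in, since the real domain() function may do more than take the first capture group.

```python
import re

# Both patterns are copied from the diff, written here as raw strings.
OLD_R_DOMAIN = re.compile(r"(?i)(?:.+?://)?(?:www[\d]*\.)?([^/:#?]*)")
NEW_R_DOMAIN = re.compile(
    r"(?i)(?:.+?://)?(?:www[\d]*\.)?(\[[0-9a-fA-F:]+\]|[^/:#?]*)"
)


def first_group(pattern, url):
    """Hypothetical stand-in for domain(): return the captured host part."""
    match = pattern.match(url)
    return match.group(1) if match else None


url = "http://[3ffe:2a00:100:7031::1]"
print(first_group(OLD_R_DOMAIN, url))  # stops at the first colon
print(first_group(NEW_R_DOMAIN, url))  # keeps the whole bracketed literal
```

The old character class excludes ":", so the match ends inside the address; the patched pattern tries the bracketed alternative first.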

k21

Sorry, I did not notice that. I will go through the helper functions, find those that can be affected (there will probably be quite a lot of them) and fix them.

Neil Williams
Owner

No worries, this code has a lot of assumptions in it :)

k21

It is still not complete; there seem to be problems with listing links for a domain, e.g. /domain/[::1]/ returns a not-found error.
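The /domain/ failure is consistent with the domain_pattern regex in DomainListingMiddleware, as the following sketch suggests. Both patterns are copied from the diff for this pull request; the sample paths are assumptions made for illustration.

```python
import re

# Original pattern: dotted names only.
OLD_PATTERN = re.compile(r'\A/domain/(([-\w]+\.)+[\w]+)')
# Patched pattern: additionally accepts a bracketed run of hex digits and colons.
NEW_PATTERN = re.compile(r'\A/domain/(([-\w]+\.)+[\w]+|\[[0-9a-fA-F:]+\])')

path = "/domain/[::1]/"
print(OLD_PATTERN.match(path))           # no match, hence the not-found page
print(NEW_PATTERN.match(path).group(1))  # the bracketed literal is captured
```

Note that the bracketed alternative only checks which characters appear, so a string such as "[:::::]" also matches; stricter validation would have to happen elsewhere.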

k21

I think that now everything should work as expected even with IPv6 addresses.

danry25

Looks like this patch breaks submissions in the current iteration of Reddit from Source :(

Yes, it has been a long time since this patch was written. I do not have a reddit locally installed right now and because the patch applied cleanly, I do not know what it breaks. Could you please be a bit more specific? Thanks.

Well, let me try to set it up on our latest reddit install; we migrated from reddit from source on Ubuntu 11.10 to reddit from source on Ubuntu 12.04.1 since I last tried this patch. Which files do I need to change, by the way?

If you are going to try applying this patch, it might be a good idea to use commit @2cfd44f instead of this one, which also fixes problems with /domain listings.

Only the following files were modified in this patch:
r2/r2/config/middleware.py
r2/r2/lib/utils/utils.py

Thanks for the info, I'll try applying the patch from the commit you recommended in a minute and see how it goes.

Interesting: inserting the changes you made into the latest reddit build's r2/r2/config/middleware.py and r2/r2/lib/utils/utils.py files appears to have no effect, although it doesn't break normal link submission. Submitting an IPv6 URL gets reddit to reply with "you should check that url". Maybe reddit is relying on more values to screen URLs now?

I will try to get my local reddit installation running again and find out what the problem is, but I do not have a lot of free time, so I cannot guarantee when (whether) it will be fixed.

Oh, don't worry about it too much, I'm not particularly pressed to get this added into my reddit install. I'll clone your repository though on a different VPS & see if it works with IPv6 links.

Michael Mol

Is there a set of unit tests for exercising sanitize_url with various inputs?

Neil Williams
Owner

@mikemol: Not currently ;)

Michael Mol
danry25

Hey, I just tested this patch from a reddit user named BasementTrix on Uppit.us, a slightly modified install of Reddit from source. It allows IPv6 links, long and short, to be submitted to Uppit.us just as you would submit a normal URL or an IPv4 address.

Neil Williams
Owner

Unfortunately, due to our latency in merging this, it's become unmergeable and out of date. I'm going to close this. Thanks a tonne for your work and we should definitely still do something like this.

Neil Williams spladug closed this July 15, 2013
danry25

Meh, it really didn't make sense to add it since the pull request for snudown was rejected; reddit would just end up with half-working IPv6 support.


Showing 3 unique commits by 1 author.

Nov 29, 2011
k21 Modify sanitize_url to accept IPv6 addresses
 * After the change the function also accepts some invalid links, but
   after discussion with spladug, it should not be a problem, since
   many invalid addresses that were accepted before the change exist
b713eca
Dec 22, 2011
k21 Fix some problems with IPv6 addresses in utils 6ea6a6d
k21 Fix /domain listings with IPv6 address 2cfd44f
2  r2/r2/config/middleware.py
@@ -356,7 +356,7 @@ def __call__(self, environ, start_response):
         return self.app(environ, start_response)
 
 class DomainListingMiddleware(object):
-    domain_pattern = re.compile(r'\A/domain/(([-\w]+\.)+[\w]+)')
+    domain_pattern = re.compile(r'\A/domain/(([-\w]+\.)+[\w]+|\[[0-9a-fA-F:]+\])')
 
     def __init__(self, app):
         self.app = app
34  r2/r2/lib/utils/utils.py
@@ -220,7 +220,7 @@ def base_url(url):
     res = r_base_url.findall(url)
     return (res and res[0]) or url
 
-r_domain = re.compile("(?i)(?:.+?://)?(?:www[\d]*\.)?([^/:#?]*)")
+r_domain = re.compile("(?i)(?:.+?://)?(?:www[\d]*\.)?(\[[0-9a-fA-F:]+\]|[^/:#?]*)")
 def domain(s):
     """
         Takes a URL and returns the domain part, minus www., if
@@ -266,7 +266,7 @@ def get_title(url):
         return None
 
 valid_schemes = ('http', 'https', 'ftp', 'mailto')
-valid_dns = re.compile('\A[-a-zA-Z0-9]+\Z')
+valid_dns = re.compile('\A[-a-zA-Z0-9:]+\Z')
 def sanitize_url(url, require_scheme = False):
     """Validates that the url is of the form
 
@@ -379,7 +379,8 @@ class UrlParser(object):
 
     __slots__ = ['scheme', 'path', 'params', 'query',
                  'fragment', 'username', 'password', 'hostname',
-                 'port', '_url_updates', '_orig_url', '_query_dict']
+                 'port', '_url_updates', '_orig_url', '_query_dict',
+                 'is_ipv6']
 
     valid_schemes = ('http', 'https', 'ftp', 'mailto')
     cname_get = "cnameframe"
@@ -389,6 +390,9 @@ def __init__(self, url):
         for s in self.__slots__:
             if hasattr(u, s):
                 setattr(self, s, getattr(u, s))
+        self.is_ipv6 = False
+        if getattr(u, 'netloc', '').startswith('['):
+            self.is_ipv6 = True
         self._url_updates = {}
         self._orig_url    = url
         self._query_dict  = None
@@ -459,8 +463,19 @@ def unparse(self):
             q.update(self._url_updates)
             q = query_string(q).lstrip('?')
 
-        # make sure the port is not doubly specified 
-        if self.port and ":" in self.hostname:
+        # if this is ipv6 address, remove brackets from hostname
+        if self.hostname and self.hostname.startswith('[') and ']' in self.hostname:
+            self.is_ipv6 = True
+            self.hostname = self.hostname[1:]
+            self.hostname = self.hostname[:self.hostname.index(']')]
+
+        # if this is marked as ipv6 address but it is not, remove the mark
+        if self.hostname and self.is_ipv6:
+            if not all(c in '0123456789abcdefABCDEF:' for c in self.hostname):
+                self.is_ipv6 = False
+
+        # make sure the port is not doubly specified
+        if self.hostname and ':' in self.hostname and self.port and not self.is_ipv6:
             self.hostname = self.hostname.split(':')[0]
 
         # if there is a netloc, there had better be a scheme
@@ -539,7 +554,10 @@ def netloc(self):
         if not self.hostname:
             return ""
         elif getattr(self, "port", None):
-            return self.hostname + ":" + str(self.port)
+            if self.is_ipv6:
+                return "[" + self.hostname + "]:" + str(self.port)
+            else:
+                return self.hostname + ":" + str(self.port)
         return self.hostname
 
     def mk_cname(self, require_frame = True, subreddit = None, port = None):
@@ -948,8 +966,8 @@ def new_fn(*a,**kw):
 def common_subdomain(domain1, domain2):
     if not domain1 or not domain2:
         return ""
-    domain1 = domain1.split(":")[0]
-    domain2 = domain2.split(":")[0]
+    domain1 = urlparse(domain1).hostname
+    domain2 = urlparse(domain2).hostname
     if len(domain1) > len(domain2):
         domain1, domain2 = domain2, domain1
 
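A note on the netloc and common_subdomain hunks, offered as a hedged sketch rather than as part of the patch. urlparse only fills in hostname when its input contains a "//" authority part, so if common_subdomain is ever called with a bare "host:port" string, the hostname would come back as None. The netloc change also appears to add brackets only when a port is present. The helpers below, strip_port and join_host_port, are hypothetical names that show one way to handle both cases; they assume Python 3 and do not validate the address itself.

```python
from urllib.parse import urlsplit


def strip_port(host):
    """Drop an optional ":port" suffix from a bare host string.

    Accepts "example.com:8080" and bracketed IPv6 literals such as
    "[::1]:8080"; brackets are removed from the result. An unbracketed
    IPv6 literal is not supported by this sketch.
    """
    if host.startswith("["):
        end = host.find("]")
        if end == -1:
            raise ValueError("unterminated IPv6 literal: %r" % host)
        return host[1:end]
    return host.split(":", 1)[0]


def join_host_port(hostname, port=None):
    """Rebuild a netloc, bracketing the host whenever it contains a colon."""
    host = "[%s]" % hostname if ":" in hostname else hostname
    return host if port is None else "%s:%d" % (host, port)


# A bare "host:port" string has no "//", so urlsplit finds no hostname.
print(urlsplit("example.com:8080").hostname)
print(strip_port("[::1]:8080"), join_host_port("::1", 8080))
```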