Modified sanitize_url to accept IPv6 addresses #235

Closed
wants to merge 3 commits into from

4 participants

@k21

Also removed mailto from valid_schemes, because mailto: scheme would never get accepted as it could not pass the checks further in the function (according to urlparse, it does not contain a hostname).

@spladug
reddit member

Github says this pull was updated, but the commit looks the same as before. Am I missing something? :X

@k21

In this version the urlparse() function is used to get the IPv6 address from the link while in the previous version urlparse was not used at all. Some functions from the previous version are still used, because I know of no other way to validate IPv6 address in Python. I remember reading somewhere that urlparse expects that its input to be a valid url, so I suppose I have to check myself that the address is really valid.

@k21

I did some testing and I found a bug in the current version of sanitize_url. It currently accepts any address enclosed in brackets (e.g. "http://[nonsense]/abcd"), which is in fact invalid and should not be accepted. I pushed a new commit which should fix this.

@k21

I am sorry, I accidentally included some code that had nothing to do with this problem in the previous version of the pull request. I removed those commits and also cleaned up the code a little. I hope that this version is correct.

@k21

However, I still think that using urlparse is not a very good idea, because it does expect its input to be a valid URL. Because of this, it will incorrectly parse things like "http://hello:world!/" thinking that "world!" is a port number and sanitize_url will not find out, because it never checks whether the port number is really a number.

@k21 k21 Modify sanitize_url to accept IPv6 addresses
 * After the change the function also accepts some invalid links, but
   after discussion with spladug, it should not be a problem, since
   many invalid addresses that were accepted before the change exist
b713eca
@spladug

I believe [ and ] need to be allowed by the regex as well, right?

spladug: they do not have to, because if the link contains IPv6 address, urlparse says that the hostname is the address and it removes the brackets automatically

reddit member

That doesn't seem to be the case when I test it, am I doing it wrong?

Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) 
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> urlparse.urlparse("http://[3ffe:2a00:100:7031::1]")
ParseResult(scheme='http', netloc='[3ffe:2a00:100:7031::1]', path='', params='', query='', fragment='')

sanitize_url uses hostname, not netloc. Demo:

Python 2.7.2+ (default, Aug 16 2011, 09:23:59) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> u = urlparse('http://[3ffe:2a00:100:7031::1]')
>>> u
ParseResult(scheme='http', netloc='[3ffe:2a00:100:7031::1]', path='', params='', query='', fragment='')
>>> u.hostname
'3ffe:2a00:100:7031::1'
reddit member

Yay!

@spladug
reddit member

This works perfectly for allowing the URL to be submitted, but there are downstream pieces of code that break. For example:

reddit app natbook:2111 started 748d0d6 at 2011-12-01 10:18:49.275693
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) 

In [1]: from r2.lib.utils import domain

In [2]: domain("http://[3ffe:2a00:100:7031::1]")
Out[2]: '[3ffe'

I think we'll need to figure out those issues before this can go live.

@k21
k21 commented Dec 1, 2011

Sorry, I did not notice that. I will go through the helper functions, find those that can be affected (there will probably be quite a lot of them) and fix them.

@spladug
reddit member

No worries, this code has a lot of assumptions in it :)

@k21
k21 commented Dec 4, 2011

It is still not complete, there seem to be problems with listing links for a domain, e.g. /domain/[::1]/ returns not found error.

@k21

I think that now everything should work as expected even with IPv6 addresses.

@danry25

Looks like this patch breaks submissions in the current iteration of Reddit from Source :(

Yes, it has been a long time since this patch was written. I do not have a reddit locally installed right now and because the patch applied cleanly, I do not know what it breaks. Could you please be a bit more specific? Thanks.

Well, let me try & set it up on our latest reddit install, we migrated from reddit from source on Ubuntu 11.10 to reddit from source on ubuntu 12.04.1 since I last tried this patch. What all do I need to change out file wise by the way?

If you are going to try applying this patch, it might be a good idea to use commit @2cfd44f instead of this one, which also fixes problems with /domain listings.

Only the following files were modified in this patch:
r2/r2/config/middleware.py
r2/r2/lib/utils/utils.py

Thanks for the info, I'll try applying the patch from the commit you recomended here in a minute & see how it goes.

interesting, so it appears to have no effect if I go & insert the changes you made into the latest reddit builds r2/r2/config/middleware.py
& r2/r2/lib/utils/utils.py files, although it doesn't break normal link submission. Submitting an IPv6 url gets reddit to reply with "you should check that url". Maybe reddit is relying on more values to screen urls now?

I will try to get my local reddit installation running again and find out what the problem is, but I do not have a lot of free time, so I cannot guarantee when (whether) it will be fixed.

Oh, don't worry about it too much, I'm not particularly pressed to get this added into my reddit install. I'll clone your repository though on a different VPS & see if it works with IPv6 links.

@mikemol

Is there a set of unit tests for exercising sanitize_url with various inputs?

@spladug
reddit member

@mikemol: Not currently ;)

@mikemol
@danry25

Hey, I just tested this patch from a reddit user named BasementTrix on Uppit.us, a minorly modified install of Reddit from source. It works for allowing IPv6 links, long & short to be submitted to Uppit.us like you would submit a normal URL or an IPv4 address.

@spladug
reddit member

Unfortunately, due to our latency in merging this, it's become unmergeable and out of date. I'm going to close this. Thanks a tonne for your work and we should definitely still do something like this.

@spladug spladug closed this Jul 16, 2013
@danry25

Meh, it really didn't make sense to add it since the pull request for snudown was rejected, reddit would just end up with half working IPv6 support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment