I've hit a URL that httpclient fails on but that works with urllib. Below is a snippet of code with the URL, showing that it produces a 400 Bad Request from the httpclient side.
import urllib
from tornado import httpclient

url = "https://blogs.msdn.com/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed: jmeier (J.D. Meier's Blog)&Redirected=true"

# This will load up the content just peachy...
fh = urllib.urlopen(url)
content = fh.read()

# This will get me a 400 bad request response.
http = httpclient.HTTPClient()
try:
    response = http.fetch(url)
    print "Content should be in here."
except Exception, e:
    print "but it goes BOOM!"
Technically that URL is invalid because URLs are not supposed to contain spaces. Browsers and other HTTP clients tend to "helpfully" rewrite the invalid URL, although I'm not sure if there are any rules as to the right way to do it. (It's not as simple as urllib.quote with its default arguments, since you don't want to encode the ampersands and other delimiter characters that it would normally percent-encode.)
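To illustrate why plain quoting is not enough, here is a minimal sketch using a made-up example URL (`http://example.com/...` is an assumption, not the URL from the report). The import fallback is only there so it also runs on Python 3, where `quote` lives in `urllib.parse`.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

# Hypothetical URL with a space in the path, for illustration only.
example = "http://example.com/a b?x=1&y=2"

# With the default safe set ("/"), quote() escapes the space but also the
# scheme colon and the query delimiters, so the result is no longer a
# usable URL.
mangled = quote(example)
print(mangled)  # http%3A//example.com/a%20b%3Fx%3D1%26y%3D2
```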
Yeah, I was trying to find some way to break it apart and manually escape it, but when it worked in urllib I wondered if maybe this is something that could be picked up and ported over.
Under the covers, urllib.urlopen() calls urllib.quote() with a safe character set of "%/:=&?~#+!$,;'@()*|". Is it worth adding that "helpfulness" to Tornado's httpclient? If that type of magic is better handled outside Tornado, perhaps this issue can be closed.
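As a rough sketch of what that quoting step does, assuming the safe set quoted above is accurate for the urllib version in question: the helper name `quote_like_urlopen` is made up for illustration and is not part of urllib or Tornado. The import fallback lets it run on Python 3 as well, where `quote` lives in `urllib.parse`.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

# Safe character set as quoted in this thread (an assumption; it may differ
# between Python versions).
SAFE_CHARS = "%/:=&?~#+!$,;'@()*|"


def quote_like_urlopen(url):
    """Percent-encode unsafe characters such as spaces, leaving URL
    delimiters and existing %XX escapes untouched. Hypothetical helper."""
    return quote(url, safe=SAFE_CHARS)


url = ("https://blogs.msdn.com/b/jmeier/archive/2012/05/13/"
       "the-rapid-research-method.aspx?utm_source=feedburner"
       "&utm_medium=feed&utm_campaign=Feed: jmeier (J.D. Meier's Blog)"
       "&Redirected=true")

fixed = quote_like_urlopen(url)
# Only the spaces change; "?", "&", "=", ":" and the parentheses survive.
print(fixed)
```

The result could then be passed to `httpclient.HTTPClient().fetch()` in place of the raw URL. Whether this matches what browsers do is a separate question.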
Is there any documentation of what browsers do (or should do) here? (maybe in html5?) If there's a standard to follow then I'm OK with adding that to Tornado, but I'd rather not add a bit of copy/pasted magic that may or may not be the same as what's used elsewhere.