Web page 404 #77

telking · 2012-06-16T07:39:00Z

urllib3 cannot access this web page:

import urllib3
http = urllib3.PoolManager()
url = 'http://waptt.com/'
r = http.request('GET', url, retries = 5)
print r.status
404

But curl gets a 200 status for the same site:
curl -I http://waptt.com
HTTP/1.1 200 OK

shazow · 2012-06-16T18:56:43Z

This is because of #8 (urllib3 sends the entire url in the GET line, instead of just the path). Seems the servers running waptt.com don't like that.

I would suggest using https://github.com/kennethreitz/requests for now, which is built on top of urllib3 but does a lot of extra things for you like strip out the scheme/host from the GET line before sending it.
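To make the difference concrete, here is a minimal Python 3 sketch of the two request lines in question. The helper name `request_lines` is hypothetical and is not part of urllib3 or Requests; it only builds the strings and sends nothing over the network.

```python
from urllib.parse import urlsplit


def request_lines(url, method="GET"):
    """Return (absolute_form, origin_form) HTTP/1.1 request lines for url.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # An empty path is sent as "/" in the origin form.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    absolute = "%s %s HTTP/1.1" % (method, url)
    origin = "%s %s HTTP/1.1" % (method, target)
    return absolute, origin


absolute, origin = request_lines("http://waptt.com/")
print(absolute)  # GET http://waptt.com/ HTTP/1.1
print(origin)    # GET / HTTP/1.1
```

The first line is the form described above as being sent by urllib3 at the time; the second is the path-only form that some servers apparently expect.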

telking · 2012-06-17T00:21:52Z

I compared the performance of grequests (Requests + Gevent, https://github.com/kennethreitz/grequests) against gevent + urllib3 and found that gevent + urllib3 performs much better than grequests, so I am using gevent + urllib3.

shazow · 2012-06-17T00:24:01Z

Interesting. Could you share your benchmark methodology and numbers? I'm curious to see where Requests is slow; I'm sure we can speed it up.

My other suggestion, for now, would be to implement your own PoolManager which removes the scheme+host from the request url before passing it on.
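A rough Python 3 sketch of that idea follows. It is not a PoolManager subclass: it swaps in the simpler approach of splitting the URL by hand and requesting only the path from a per-host pool obtained via `urllib3.connection_from_url`. The names `split_url` and `get` are hypothetical, and whether this avoids the 404 for this particular server is an untested assumption.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split url into (base, target).

    base is 'scheme://host[:port]'; target is the path plus any query string.
    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    base = "%s://%s" % (parts.scheme, parts.netloc)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return base, target


def get(url, **kw):
    """Hypothetical helper: send only the path in the request line."""
    import urllib3  # imported lazily so split_url works without urllib3

    base, target = split_url(url)
    # Creates a new pool on every call; a real workaround would cache
    # pools per host, which is what PoolManager does.
    pool = urllib3.connection_from_url(base)
    return pool.request("GET", target, **kw)
```

Usage would look like `r = get('http://waptt.com/', retries=5)` followed by a check of `r.status`.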

telking · 2012-06-17T00:36:15Z

Requests is slower and produces more errors:

python gtest.py
by requests: 25.7797329426 seconds
by urllib3: 1.32646393776 seconds

Test code:

import sys

import gevent
from gevent import monkey

gevent.monkey.patch_all(thread=False)

import grequests
import urllib3
http = urllib3.PoolManager()

def call_back(resp):
    content = resp.content

def worker(url, use_urllib2=False):
    if use_urllib2:
        # Despite the flag name, this branch uses the urllib3 PoolManager.
        content = http.request('GET', url)

    else:
        rs = [grequests.get(u) for u in url]
        resps = grequests.map(rs)
        for resp in resps:
            call_back(resp)

urls = ['http://www.baidu.com/']*50

def by_requests():
    worker(urls)
def by_urllib2():
    jobs = [gevent.spawn(worker, url, True) for url in urls]
    gevent.joinall(jobs)

if __name__=='__main__':
    from timeit import Timer
    t = Timer(stmt="by_requests()", setup="from __main__ import by_requests")
    print 'by requests: %s seconds'%t.timeit(number=3)
    t = Timer(stmt="by_urllib2()", setup="from __main__ import by_urllib2")
    print 'by urllib3: %s seconds'%t.timeit(number=3)

shazow · 2012-06-17T01:24:54Z

@kennethreitz, thoughts?

shazow closed this as completed Jun 16, 2012

shazow mentioned this issue Jun 23, 2012

301 problems #82

Closed
