Web page 404 #77

Closed
telking opened this issue Jun 16, 2012 · 5 comments

Comments

@telking

telking commented Jun 16, 2012

urllib3 cannot access this web page:

import urllib3
http = urllib3.PoolManager()
url = 'http://waptt.com/'
r = http.request('GET', url, retries = 5)
print r.status
404

But curl gets a 200 status:
curl -I http://waptt.com
HTTP/1.1 200 OK

@shazow
Member

shazow commented Jun 16, 2012

This is because of #8 (urllib3 sends the entire url in the GET line, instead of just the path). Seems the servers running waptt.com don't like that.
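To make the difference concrete, here is a minimal sketch of the two request forms, written for Python 3. `build_request` is a hypothetical helper for illustration only, not part of urllib3; it just formats the text that would go over the wire.

```python
def build_request(method, target, host):
    """Format a bare HTTP/1.1 request with only a Host header (illustrative)."""
    lines = [
        "%s %s HTTP/1.1" % (method, target),  # the request line (the "GET line")
        "Host: %s" % host,
        "",  # blank line that ends the headers
        "",
    ]
    return "\r\n".join(lines)


# Full URL in the request line, as described above for urllib3.
absolute_form = build_request("GET", "http://waptt.com/", "waptt.com")

# Path only in the request line; the host travels in the Host header.
origin_form = build_request("GET", "/", "waptt.com")

print(absolute_form.splitlines()[0])
print(origin_form.splitlines()[0])
```

The only difference is the request target on the first line, which appears to be what this server objects to.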

I would suggest using https://github.com/kennethreitz/requests for now, which is built on top of urllib3 but does a lot of extra things for you like strip out the scheme/host from the GET line before sending it.

@shazow shazow closed this as completed Jun 16, 2012
@telking
Author

telking commented Jun 17, 2012

I compared the performance of grequests (Requests + gevent, https://github.com/kennethreitz/grequests) with urllib3 and found that gevent + urllib3 performs much better than grequests, so I am using gevent + urllib3.

@shazow
Member

shazow commented Jun 17, 2012

Interesting. Could you share your benchmark methodology and numbers? I'm curious to see where Requests is slow; I'm sure we can speed it up.

My other suggestion, for now, would be to implement your own PoolManager which removes the scheme+host from the request url before passing it on.
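A rough sketch of the stripping step, assuming Python 3 and only the standard library. `request_target` is a hypothetical helper name, and how it would be wired into a PoolManager subclass is left open here.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the path (plus query string, if any) of an absolute URL."""
    parts = urlsplit(url)
    # An empty path, as in "http://waptt.com", is requested as "/".
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


print(request_target("http://waptt.com/"))
print(request_target("http://example.com/a/b?x=1"))
```

One possible way to use it, as an assumption rather than a tested fix: obtain a pool for the host with `urllib3.connection_from_url(url)` and call `pool.request('GET', request_target(url))`, so that only the path reaches the request line.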

@telking
Author

telking commented Jun 17, 2012

Requests is slower and produces more errors:

python gtest.py
by requests: 25.7797329426 seconds
by urllib3: 1.32646393776 seconds

Test code:

import gevent
from gevent import monkey

# Patch the standard library before importing the HTTP clients.
monkey.patch_all(thread=False)

import grequests
import urllib3

http = urllib3.PoolManager()

urls = ['http://www.baidu.com/'] * 50

def call_back(resp):
    # Read the response body.
    content = resp.content

def worker(url, use_urllib3=False):
    if use_urllib3:
        # `url` is a single URL; one greenlet is spawned per URL.
        content = http.request('GET', url)
    else:
        # `url` is the whole list; grequests sends the requests concurrently.
        rs = [grequests.get(u) for u in url]
        resps = grequests.map(rs)
        for resp in resps:
            call_back(resp)

def by_requests():
    worker(urls)

def by_urllib3():
    jobs = [gevent.spawn(worker, url, True) for url in urls]
    gevent.joinall(jobs)

if __name__ == '__main__':
    from timeit import Timer
    # Each timing is the total for three runs of 50 requests.
    t = Timer(stmt="by_requests()", setup="from __main__ import by_requests")
    print('by requests: %s seconds' % t.timeit(number=3))
    t = Timer(stmt="by_urllib3()", setup="from __main__ import by_urllib3")
    print('by urllib3: %s seconds' % t.timeit(number=3))

@shazow
Member

shazow commented Jun 17, 2012

@kennethreitz, thoughts?

@shazow shazow mentioned this issue Jun 23, 2012