PY3: use six for robotparser and urlparse #800
Conversation
Yep, URLs are going to be one of the trickiest parts of the Python 3 port, because sometimes it makes more sense to have them as bytes and sometimes it makes more sense to have them as unicode. We may have to make sure all our functions accept both, at least in Python 3.x (in 2.x some stdlib functions may break with unicode input). This PR looks good to me: it shouldn't break 2.x code, and it will make further changes in 3.x easier. Most likely these imports will remain as-is regardless of the bytes/unicode handling changes.
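For reference, six.moves resolves these imports to the right stdlib location on each interpreter; a dependency-free shim equivalent to what the PR gets from six (a sketch, not the PR's actual code) looks like this:

```python
# Compatibility shim equivalent to six.moves.urllib.parse and
# six.moves.urllib.robotparser (a sketch; the PR itself uses six).
try:
    # Python 3 locations
    from urllib.parse import urlparse, urljoin
    from urllib import robotparser
except ImportError:
    # Python 2 locations
    from urlparse import urlparse, urljoin
    import robotparser

# Either way, `urlparse` is now bound to the function, not a module
parts = urlparse("http://example.com/path?q=1")
```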
@kmike This change broke this function: https://github.com/scrapy/scrapy/blob/master/scrapy/utils/url.py#L71

Traceback (most recent call last):
File "/home/scrapinghub/Devel/scrapy/scrapy/crawler.py", line 93, in start
self.start_reactor()
File "/home/scrapinghub/Devel/scrapy/scrapy/crawler.py", line 130, in start_reactor
reactor.run(installSignalHandlers=False) # blocking call
File "/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1192, in run
self.mainLoop()
File "/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1201, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/home/scrapinghub/Devel/scrapy/scrapy/utils/reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "/home/scrapinghub/Devel/scrapy/scrapy/core/engine.py", line 120, in _next_request
self.crawl(request, spider)
File "/home/scrapinghub/Devel/scrapy/scrapy/core/engine.py", line 176, in crawl
self.schedule(request, spider)
File "/home/scrapinghub/Devel/scrapy/scrapy/core/engine.py", line 182, in schedule
return self.slot.scheduler.enqueue_request(request)
File "/home/scrapinghub/Devel/scrapy/scrapy/core/scheduler.py", line 48, in enqueue_request
if not request.dont_filter and self.df.request_seen(request):
File "/home/scrapinghub/Devel/scrapy/scrapy/dupefilter.py", line 46, in request_seen
fp = self.request_fingerprint(request)
File "/home/scrapinghub/Devel/scrapy/scrapy/dupefilter.py", line 54, in request_fingerprint
return request_fingerprint(request)
File "/home/scrapinghub/Devel/scrapy/scrapy/utils/request.py", line 52, in request_fingerprint
fp.update(canonicalize_url(request.url))
File "/home/scrapinghub/Devel/scrapy/scrapy/utils/url.py", line 56, in canonicalize_url
scheme, netloc, path, params, query, fragment = parse_url(url)
File "/home/scrapinghub/Devel/scrapy/scrapy/utils/url.py", line 76, in parse_url
urlparse(unicode_to_str(url, encoding))
exceptions.TypeError: 'module' object is not callable
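This TypeError happens when the name `urlparse` ends up bound to a module rather than the function of the same name. A minimal Python 3 reproduction of the same failure mode (illustrative only, not the actual scrapy code path):

```python
# Bind the name "urlparse" to a *module* -- this mirrors what happens
# when a plain `import urlparse` (or a star import) shadows the function.
import urllib.parse as urlparse

try:
    urlparse("http://example.com/")  # calling a module object fails
except TypeError as exc:
    message = str(exc)  # "'module' object is not callable"

# The fix: bind the function itself, not the module
from urllib.parse import urlparse
netloc = urlparse("http://example.com/").netloc
```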
@nramirezuy I can't reproduce this, and the tests pass (both on Travis and locally via tox). Maybe it is an issue with some older version of six? If so, we need to update the six version listed in setup.py.
@kmike I think I have a newer one:
>>> import six
>>> six
<module 'six' from '/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/six.py'>
>>> six.__version__
'1.7.3'
It must be caused by an old w3lib: import * from old w3lib may shadow urlparse.
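To illustrate the shadowing mechanism being suspected here, a self-contained sketch (the `legacy` module below is hypothetical, standing in for an old w3lib that re-exported a `urlparse` submodule):

```python
# Sketch: how `from somemodule import *` can silently rebind an
# earlier function import to a module object.
import types
from urllib.parse import urlparse

assert callable(urlparse)  # the function, as intended

# Hypothetical old library that exposes a *module* named urlparse
legacy = types.ModuleType("legacy")
legacy.urlparse = types.ModuleType("urlparse")
legacy.__all__ = ["urlparse"]

# Simulate `from legacy import *` at module level
globals().update({name: getattr(legacy, name) for name in legacy.__all__})

# The function is now shadowed by a module; calling it would raise
# TypeError: 'module' object is not callable
shadowed = not callable(urlparse)
```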
@kmike Nope.
>>> import w3lib
>>> w3lib
<module 'w3lib' from '/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/w3lib/__init__.pyc'>
I can't think of a reason other than old w3lib either :(
That's mysterious. PyCharm also shows me that the imports at the top of scrapy.utils.url are all shadowed by the star import. @nramirezuy could you try moving 'import *' to the top of the file to see if it helps?
I fixed it; I had an old version installed via pip.
Not very sure about the text encoding, but it should at least work for current Python 2.x environments.
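The encoding concern is about normalizing a URL to a known text type before parsing it. A hedged sketch of that kind of normalization (the helper name here is illustrative; it is not scrapy's actual unicode_to_str):

```python
from urllib.parse import urlparse

def to_text(url, encoding="utf-8"):
    """Decode bytes URLs with the given encoding; pass text through unchanged."""
    if isinstance(url, bytes):
        return url.decode(encoding)
    return url

# Works for both bytes and text input; bytes are decoded first
parts = urlparse(to_text(b"http://example.com/caf\xc3\xa9"))
```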