PY3: use six for robotparser and urlparse #800
Conversation
Yep, URLs are going to be one of the trickiest parts of the Python 3 port, because sometimes it makes more sense to have them as bytes and sometimes it makes more sense to have them as unicode. We may have to make sure all our functions accept both, at least in Python 3.x (in 2.x some stdlib functions may break with unicode input). This PR looks good to me: it shouldn't break 2.x code, and it will make further changes in 3.x easier. Most likely these imports will remain as-is regardless of the bytes/unicode handling changes.
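For reference, six.moves resolves these imports to the right stdlib location on each interpreter; a dependency-free shim equivalent to what the PR gets from six (a sketch, not the PR's actual code) looks like this:

```python
# Compatibility shim equivalent to six.moves.urllib.parse and
# six.moves.urllib.robotparser (a sketch; the PR itself uses six).
try:
    # Python 3 locations
    from urllib.parse import urlparse, urljoin
    from urllib import robotparser
except ImportError:
    # Python 2 locations
    from urlparse import urlparse, urljoin
    import robotparser

# Either way, `urlparse` is now bound to the function, not a module
parts = urlparse("http://example.com/path?q=1")
```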
@kmike This change broke this function: https://github.com/scrapy/scrapy/blob/master/scrapy/utils/url.py#L71

Traceback (most recent call last):
File "/home/scrapinghub/Devel/scrapy/scrapy/crawler.py", line 93, in start
self.start_reactor()
File "/home/scrapinghub/Devel/scrapy/scrapy/crawler.py", line 130, in start_reactor
reactor.run(installSignalHandlers=False) # blocking call
File "/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1192, in run
self.mainLoop()
File "/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1201, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/home/scrapinghub/Devel/scrapy/scrapy/utils/reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "/home/scrapinghub/Devel/scrapy/scrapy/core/engine.py", line 120, in _next_request
self.crawl(request, spider)
File "/home/scrapinghub/Devel/scrapy/scrapy/core/engine.py", line 176, in crawl
self.schedule(request, spider)
File "/home/scrapinghub/Devel/scrapy/scrapy/core/engine.py", line 182, in schedule
return self.slot.scheduler.enqueue_request(request)
File "/home/scrapinghub/Devel/scrapy/scrapy/core/scheduler.py", line 48, in enqueue_request
if not request.dont_filter and self.df.request_seen(request):
File "/home/scrapinghub/Devel/scrapy/scrapy/dupefilter.py", line 46, in request_seen
fp = self.request_fingerprint(request)
File "/home/scrapinghub/Devel/scrapy/scrapy/dupefilter.py", line 54, in request_fingerprint
return request_fingerprint(request)
File "/home/scrapinghub/Devel/scrapy/scrapy/utils/request.py", line 52, in request_fingerprint
fp.update(canonicalize_url(request.url))
File "/home/scrapinghub/Devel/scrapy/scrapy/utils/url.py", line 56, in canonicalize_url
scheme, netloc, path, params, query, fragment = parse_url(url)
File "/home/scrapinghub/Devel/scrapy/scrapy/utils/url.py", line 76, in parse_url
urlparse(unicode_to_str(url, encoding))
exceptions.TypeError: 'module' object is not callable
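This TypeError happens when the name `urlparse` ends up bound to a module rather than the function of the same name. A minimal Python 3 reproduction of the same failure mode (illustrative only, not the actual scrapy code path):

```python
# Bind the name "urlparse" to a *module* -- this mirrors what happens
# when a plain `import urlparse` (or a star import) shadows the function.
import urllib.parse as urlparse

try:
    urlparse("http://example.com/")  # calling a module object fails
except TypeError as exc:
    message = str(exc)  # "'module' object is not callable"

# The fix: bind the function itself, not the module
from urllib.parse import urlparse
netloc = urlparse("http://example.com/").netloc
```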
@nramirezuy I can't reproduce this, and the tests pass (both on Travis and locally via tox). Maybe it is an issue with some older version of six? If so, we need to update the six version listed in setup.py.
@kmike I think I have a newer one:
>>> import six
>>> six
<module 'six' from '/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/six.py'>
>>> six.__version__
'1.7.3'
It must be caused by an old w3lib: import * from old w3lib may shadow urlparse.
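To illustrate the shadowing mechanism being suspected here, a self-contained sketch (the `legacy` module below is hypothetical, standing in for an old w3lib that re-exported a `urlparse` submodule):

```python
# Sketch: how `from somemodule import *` can silently rebind an
# earlier function import to a module object.
import types
from urllib.parse import urlparse

assert callable(urlparse)  # the function, as intended

# Hypothetical old library that exposes a *module* named urlparse
legacy = types.ModuleType("legacy")
legacy.urlparse = types.ModuleType("urlparse")
legacy.__all__ = ["urlparse"]

# Simulate `from legacy import *` at module level
globals().update({name: getattr(legacy, name) for name in legacy.__all__})

# The function is now shadowed by a module; calling it would raise
# TypeError: 'module' object is not callable
shadowed = not callable(urlparse)
```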
@kmike Nope.
>>> import w3lib
>>> w3lib
<module 'w3lib' from '/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/w3lib/__init__.pyc'>
I can't think of a reason other than old w3lib either :(
That's mysterious. PyCharm also shows me that the imports at the top of scrapy.utils.url are all shadowed by the star import. @nramirezuy could you try moving 'import *' to the top of the file to see if it helps?
I fixed it; I had an old version installed via pip.
Not very sure about the text encoding, but it should at least work for current Python 2.x environments.
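The encoding concern is about normalizing a URL to a known text type before parsing it. A hedged sketch of that kind of normalization (the helper name here is illustrative; it is not scrapy's actual unicode_to_str):

```python
from urllib.parse import urlparse

def to_text(url, encoding="utf-8"):
    """Decode bytes URLs with the given encoding; pass text through unchanged."""
    if isinstance(url, bytes):
        return url.decode(encoding)
    return url

# Works for both bytes and text input; bytes are decoded first
parts = urlparse(to_text(b"http://example.com/caf\xc3\xa9"))
```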