Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PY3: use six for robotparser and urlparse #800

Merged
merged 1 commit into from Jul 14, 2014
Merged

Conversation

felixonmars
Copy link
Contributor

Not very sure about the text encoding, but it should at least work for current python 2.x environments.

@kmike
Copy link
Member

kmike commented Jul 14, 2014

Yep, URLs are going to be one of the trickiest part of the Python 3 port because sometimes it makes more sense to have them as bytes and sometimes it makes more sense to have them as unicode. We may have to make sure all our functions accept both, at least in in Python 3.x (in 2.x some stdlib functions may break with unicode input).

This PR looks good to me as it shouldn't break 2.x code and will make it easier to make further changes in 3.x. Most likely regardless of the bytes/unicode handling changes these imports will remain as is.

kmike added a commit that referenced this pull request Jul 14, 2014
PY3: use six for robotparser and urlparse
@kmike kmike merged commit 5a2f738 into scrapy:master Jul 14, 2014
@nramirezuy
Copy link
Contributor

@kmike This change broke this function https://github.com/scrapy/scrapy/blob/master/scrapy/utils/url.py#L71 which broke Scrapy completely.

    Traceback (most recent call last):
      File "/home/scrapinghub/Devel/scrapy/scrapy/crawler.py", line 93, in start
        self.start_reactor()
      File "/home/scrapinghub/Devel/scrapy/scrapy/crawler.py", line 130, in start_reactor
        reactor.run(installSignalHandlers=False)  # blocking call
      File "/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1192, in run
        self.mainLoop()
      File "/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1201, in mainLoop
        self.runUntilCurrent()
    --- <exception caught here> ---
      File "/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/home/scrapinghub/Devel/scrapy/scrapy/utils/reactor.py", line 41, in __call__
        return self._func(*self._a, **self._kw)
      File "/home/scrapinghub/Devel/scrapy/scrapy/core/engine.py", line 120, in _next_request
        self.crawl(request, spider)
      File "/home/scrapinghub/Devel/scrapy/scrapy/core/engine.py", line 176, in crawl
        self.schedule(request, spider)
      File "/home/scrapinghub/Devel/scrapy/scrapy/core/engine.py", line 182, in schedule
        return self.slot.scheduler.enqueue_request(request)
      File "/home/scrapinghub/Devel/scrapy/scrapy/core/scheduler.py", line 48, in enqueue_request
        if not request.dont_filter and self.df.request_seen(request):
      File "/home/scrapinghub/Devel/scrapy/scrapy/dupefilter.py", line 46, in request_seen
        fp = self.request_fingerprint(request)
      File "/home/scrapinghub/Devel/scrapy/scrapy/dupefilter.py", line 54, in request_fingerprint
        return request_fingerprint(request)
      File "/home/scrapinghub/Devel/scrapy/scrapy/utils/request.py", line 52, in request_fingerprint
        fp.update(canonicalize_url(request.url))
      File "/home/scrapinghub/Devel/scrapy/scrapy/utils/url.py", line 56, in canonicalize_url
        scheme, netloc, path, params, query, fragment = parse_url(url)
      File "/home/scrapinghub/Devel/scrapy/scrapy/utils/url.py", line 76, in parse_url
        urlparse(unicode_to_str(url, encoding))
    exceptions.TypeError: 'module' object is not callable

@kmike
Copy link
Member

kmike commented Jul 22, 2014

@nramirezuy I can't reproduce this, and the tests work (both on Travis and locally via tox).

Maybe it is an issue with some older versions of six? If so, we need to update six version listed in setup.py.

@nramirezuy
Copy link
Contributor

@kmike I think I have a newer one

>>> import six
>>> six
<module 'six' from '/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/six.py'>
>>> six.__version__
'1.7.3'

@kmike
Copy link
Member

kmike commented Jul 22, 2014

It must be caused by old w3lib - import * from old w3lib may shadow urlparse.

@nramirezuy
Copy link
Contributor

@kmike Nope.

commit 6aa32e30435b85301b9df2b24aef8b93fccd02f4
Merge: 9604f80 11779e4
Author: Pablo Hoffman <pablo@scrapinghub.com>
Date:   Mon Jul 21 16:27:22 2014 -0300

    Merge pull request #23 from scrapy/w3lib-http-py3

    PY3 port headers_raw_to_dict and headers_dict_to_raw to Python 3
>>> import w3lib
>>> w3lib
<module 'w3lib' from '/home/scrapinghub/.virtualenvs/testspiders/local/lib/python2.7/site-packages/w3lib/__init__.pyc'>

@felixonmars
Copy link
Contributor Author

I can't think of a reason other than old w3lib either :(
Looks like the fix was applied quite long ago: scrapy/w3lib@3e156ac

@kmike
Copy link
Member

kmike commented Jul 22, 2014

That's misterious. PyCharm also shows me that imports at top of scrapy.utils.url are all shadowed by from w3lib.url import *, but I don't understand how could it happen.

@nramirezuy could you try moving 'import *' to the top of the file to see if it helps?

@nramirezuy
Copy link
Contributor

I fixed it, I had an old version installed via pip.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants