
Command Shell Error, Read URLs from Redis #50

Closed
rafaelcapucho opened this issue Apr 26, 2016 · 8 comments

Comments

@rafaelcapucho

Hello,

When executing scrapy shell against a URL, it reads the Redis list, as you can see in the output:
https://paste.ee/r/PPkg2

Note this debug line, which appears just before the error:
2016-04-26 02:14:01 [epocacosmeticos.com.br] -> DEBUG: Reading URLs from redis list 'epocacosmeticos.com.br:start_urls'

Before installing scrapy-redis, scrapy shell worked fine.

Scrapy 1.1.0rc3 and Python 3.5.1
Thank you

@rafaelcapucho
Author

Commenting out SCHEDULER = "scrapy_redis.scheduler.Scheduler" in settings.py makes the scrapy shell command work again.
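For reference, the change is a one-line comment in the project settings (a sketch, assuming an otherwise standard scrapy-redis setup):

    # settings.py
    # Commenting this out restores `scrapy shell <url>`:
    # SCHEDULER = "scrapy_redis.scheduler.Scheduler"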

The note above about the "Reading URLs from redis list" debug message doesn't make a difference; it is still present when scrapy shell works: https://paste.ee/r/rAL1V

The question is: why is your scrapy_redis.scheduler.Scheduler not compatible with scrapy shell <url>?

Thank you

@rafaelcapucho
Author

Scrapy 1.1.0 is finally released on PyPI... it would be awesome if scrapy-redis supported Python 3 and Scrapy 1.1.0.

@rafaelcapucho
Author

The problem happens when the engine calls enqueue_request, which in turn calls self.queue.push(request).

I'm using the SpiderQueue, and its push method raises the error when calling self._encode_request(request):

    def push(self, request):
        """Push a request"""
        self.server.lpush(self.key, self._encode_request(request))
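As a sanity check, the pushed payloads can be inspected directly with redis-py (a sketch; the key name assumes scrapy-redis's default '%(spider)s:requests' queue key):

    import redis

    r = redis.StrictRedis()  # assumes Redis on localhost:6379
    # scrapy-redis lpushes pickled request dicts onto the spider's queue key;
    # llen shows how many requests were queued before the error.
    print(r.llen('epocacosmeticos.com.br:requests'))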

The _encode_request method, defined on queue.Base, uses Scrapy's request_to_dict to serialize the request:

    def _encode_request(self, request):
        """Encode a request object"""
        return pickle.dumps(request_to_dict(request, self.spider), protocol=-1)
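For context, the decoding counterpart on queue.Base mirrors this; roughly (request_from_dict also lives in scrapy.utils.reqser in this Scrapy version):

    def _decode_request(self, encoded_request):
        """Decode a request previously encoded by _encode_request."""
        # pickle.loads reverses pickle.dumps; request_from_dict rebuilds the
        # Request and resolves the stored callback name back to a spider method.
        return request_from_dict(pickle.loads(encoded_request), self.spider)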

I added three prints to request_to_dict to understand the inputs:

def request_to_dict(request, spider=None):
    """Convert Request object to a dict.

    If a spider is given, it will try to find out the name of the spider method
    used in the callback and store that as the callback.
    """
    cb = request.callback
    print('callback: ', cb)
    print('request: ', request)
    print('spider: ', spider)

    if callable(cb):
        cb = _find_method(spider, cb)
    eb = request.errback
    if callable(eb):
        eb = _find_method(spider, eb)
    d = {
        'url': to_unicode(request.url),  # urls should be safe (safe_string_url)
        'callback': cb,
        'errback': eb,
        'method': request.method,
        'headers': dict(request.headers),
        'body': request.body,
        'cookies': request.cookies,
        'meta': request.meta,
        '_encoding': request._encoding,
        'priority': request.priority,
        'dont_filter': request.dont_filter,
    }
    return d

Running the shell command, it prints:

callback: <bound method Deferred.callback of <Deferred at 0x7f3e13b69358>>
request: <GET http://www.epocacosmeticos.com.br/any-url-goes-here>
spider: <EpocaCosmeticosSpider 'epocacosmeticos.com.br' at 0x7f3e13032d30>

The previous error:
ValueError: Function <bound method Deferred.callback of <Deferred at 0x7f3e13b69358>> is not a method of: <EpocaCosmeticosSpider 'epocacosmeticos.com.br' at 0x7f3e13032d30>

happens when _find_method is executed, since the callback is bound to a Deferred rather than to the spider:

def _find_method(obj, func):
    if obj:
        try:
            func_self = six.get_method_self(func)
        except AttributeError:  # func has no __self__
            pass
        else:
            if func_self is obj:
                return six.get_method_function(func).__name__
    raise ValueError("Function %s is not a method of: %s" % (func, obj))
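The failing branch is easy to reproduce in isolation; a minimal sketch with stand-in classes (FakeSpider and FakeDeferred are hypothetical, not project code):

    import six

    class FakeSpider(object):
        pass

    class FakeDeferred(object):
        def callback(self):
            pass

    spider = FakeSpider()
    cb = FakeDeferred().callback  # bound to the deferred, not the spider
    # six.get_method_self(cb) returns the FakeDeferred instance, so the
    # `func_self is obj` check in _find_method fails and it raises ValueError.
    assert six.get_method_self(cb) is not spider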

I don't know what is wrong yet.

@rmax
Owner

rmax commented Jun 23, 2016

Sorry for the late response. Py3k support is on its way (see #53). And your issue seems related to either your spider or a middleware yielding a request with a deferred object as the callback (a media middleware, perhaps?).

I have created #54 to follow up on this deserialization issue.

@rmax rmax closed this as completed Jun 23, 2016
@rmax
Owner

rmax commented Jun 25, 2016

The latest release supports Python 3.x. Besides that, it's a Scrapy limitation that callbacks must be spider methods. However, I have added a TODO to avoid serializing those requests.
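One possible shape for that TODO is to test whether a request survives serialization before pushing it (a sketch only; is_serializable is a hypothetical helper, not part of the scrapy-redis API):

    import pickle
    from scrapy.utils.reqser import request_to_dict

    def is_serializable(request, spider):
        """Hypothetical helper: True if the request can be safely queued,
        i.e. its callback/errback are spider methods (or None)."""
        try:
            pickle.dumps(request_to_dict(request, spider), protocol=-1)
        except (ValueError, pickle.PicklingError):
            return False
        return True

A scheduler could then skip such requests, or handle them in memory, instead of crashing.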

@Congee

Congee commented Jun 2, 2018

Any updates? It's still an issue in 2018. :/

@rmax
Owner

rmax commented Jun 2, 2018

@Congee a workaround is to open the shell first (i.e. scrapy shell) and then fetch the URL (fetch('https://...')).
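In session form (any URL works; this one is just an example):

    $ scrapy shell
    ...
    >>> fetch('https://www.example.com/')
    >>> response.status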

@Congee

Congee commented Jun 2, 2018

@rmax Yeah, that's a workaround. Thanks anyway.
