
Command Shell Error, Read URLs from Redis #50

Closed
rafaelcapucho opened this issue Apr 26, 2016 · 8 comments

Comments

@rafaelcapucho

Hello,

When executing scrapy shell against a URL, it reads the Redis list, as you can see in the output:
https://paste.ee/r/PPkg2

Note this debug line, which appears just before the error:
2016-04-26 02:14:01 [epocacosmeticos.com.br] -> DEBUG: Reading URLs from redis list 'epocacosmeticos.com.br:start_urls'

Before installing scrapy-redis, scrapy shell worked fine.

Scrapy 1.1.0rc3 and Python 3.5.1
Thank you

@rafaelcapucho
Author

Commenting out SCHEDULER = "scrapy_redis.scheduler.Scheduler" in settings.py makes the scrapy shell command work again.
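For reference, the change is a one-line comment in the project settings (a sketch, assuming an otherwise standard scrapy-redis setup):

    # settings.py
    # Commenting this out restores `scrapy shell <url>`:
    # SCHEDULER = "scrapy_redis.scheduler.Scheduler"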

The note above about the "Reading URLs from redis list" debug message doesn't make a difference; it is still present when scrapy shell works: https://paste.ee/r/rAL1V

The question is: why is your scrapy_redis.scheduler.Scheduler not compatible with scrapy shell <url>?

Thank you

@rafaelcapucho
Author

Scrapy 1.1.0 is finally released on PyPI... it would be awesome if scrapy-redis supported Python 3 and Scrapy 1.1.0.

@rafaelcapucho
Author

The problem happens when the engine calls enqueue_request, which in turn calls self.queue.push(request).

I'm using the SpiderQueue, and its push method raises the error when calling self._encode_request(request):

    def push(self, request):
        """Push a request"""
        self.server.lpush(self.key, self._encode_request(request))
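As a sanity check, the pushed payloads can be inspected directly with redis-py (a sketch; the key name assumes scrapy-redis's default '%(spider)s:requests' queue key):

    import redis

    r = redis.StrictRedis()  # assumes Redis on localhost:6379
    # scrapy-redis lpushes pickled request dicts onto the spider's queue key;
    # llen shows how many requests were queued before the error.
    print(r.llen('epocacosmeticos.com.br:requests'))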

The _encode_request method, defined on queue.Base, uses Scrapy's request_to_dict to serialize the request:

    def _encode_request(self, request):
        """Encode a request object"""
        return pickle.dumps(request_to_dict(request, self.spider), protocol=-1)
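For context, the decoding counterpart on queue.Base mirrors this; roughly (request_from_dict also lives in scrapy.utils.reqser in this Scrapy version):

    def _decode_request(self, encoded_request):
        """Decode a request previously encoded by _encode_request."""
        # pickle.loads reverses pickle.dumps; request_from_dict rebuilds the
        # Request and resolves the stored callback name back to a spider method.
        return request_from_dict(pickle.loads(encoded_request), self.spider)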

I added three prints to request_to_dict to understand the inputs:

def request_to_dict(request, spider=None):
    """Convert Request object to a dict.

    If a spider is given, it will try to find out the name of the spider method
    used in the callback and store that as the callback.
    """
    cb = request.callback
    print('callback: ', cb)
    print('request: ', request)
    print('spider: ', spider)

    if callable(cb):
        cb = _find_method(spider, cb)
    eb = request.errback
    if callable(eb):
        eb = _find_method(spider, eb)
    d = {
        'url': to_unicode(request.url),  # urls should be safe (safe_string_url)
        'callback': cb,
        'errback': eb,
        'method': request.method,
        'headers': dict(request.headers),
        'body': request.body,
        'cookies': request.cookies,
        'meta': request.meta,
        '_encoding': request._encoding,
        'priority': request.priority,
        'dont_filter': request.dont_filter,
    }
    return d

Running the shell command, it prints:

callback: <bound method Deferred.callback of <Deferred at 0x7f3e13b69358>>
request: <GET http://www.epocacosmeticos.com.br/any-url-goes-here>
spider: <EpocaCosmeticosSpider 'epocacosmeticos.com.br' at 0x7f3e13032d30>

The previous error:
ValueError: Function <bound method Deferred.callback of <Deferred at 0x7f3e13b69358>> is not a method of: <EpocaCosmeticosSpider 'epocacosmeticos.com.br' at 0x7f3e13032d30>

happens when _find_method is executed, since the callback is bound to a Deferred rather than to the spider:

def _find_method(obj, func):
    if obj:
        try:
            func_self = six.get_method_self(func)
        except AttributeError:  # func has no __self__
            pass
        else:
            if func_self is obj:
                return six.get_method_function(func).__name__
    raise ValueError("Function %s is not a method of: %s" % (func, obj))
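The failing branch is easy to reproduce in isolation; a minimal sketch with stand-in classes (FakeSpider and FakeDeferred are hypothetical, not project code):

    import six

    class FakeSpider(object):
        pass

    class FakeDeferred(object):
        def callback(self):
            pass

    spider = FakeSpider()
    cb = FakeDeferred().callback  # bound to the deferred, not the spider
    # six.get_method_self(cb) returns the FakeDeferred instance, so the
    # `func_self is obj` check in _find_method fails and it raises ValueError.
    assert six.get_method_self(cb) is not spider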

I don't know what is wrong yet.

@rmax
Owner

rmax commented Jun 23, 2016

Sorry for the late response. Py3k support is on its way (see #53). And your issue seems related to either your spider or a middleware yielding a request with a deferred object as the callback (a media middleware, perhaps?).

I have created #54 to follow up on this deserialization issue.

@rmax rmax closed this as completed Jun 23, 2016
@rmax
Owner

rmax commented Jun 25, 2016

The latest release supports Python 3.x. Besides that, it's a Scrapy limitation that callbacks must be spider methods. However, I have added a TODO to avoid serializing those requests.
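One possible shape for that TODO is to test whether a request survives serialization before pushing it (a sketch only; is_serializable is a hypothetical helper, not part of the scrapy-redis API):

    import pickle
    from scrapy.utils.reqser import request_to_dict

    def is_serializable(request, spider):
        """Hypothetical helper: True if the request can be safely queued,
        i.e. its callback/errback are spider methods (or None)."""
        try:
            pickle.dumps(request_to_dict(request, spider), protocol=-1)
        except (ValueError, pickle.PicklingError):
            return False
        return True

A scheduler could then skip such requests, or handle them in memory, instead of crashing.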

@Congee

Congee commented Jun 2, 2018

Any updates? It's still an issue in 2018. :/

@rmax
Owner

rmax commented Jun 2, 2018

@Congee a workaround is to open the shell first (i.e. scrapy shell) and then fetch the URL (fetch('https://...')).
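In session form (any URL works; this one is just an example):

    $ scrapy shell
    ...
    >>> fetch('https://www.example.com/')
    >>> response.status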

@Congee

Congee commented Jun 2, 2018

@rmax Yeah, that's a workaround. Thanks anyway.
