[MRG+1] Callback kwargs #3563

Merged (23 commits, Jun 26, 2019)

Changes from 9 commits

Commits
50a0d87
Passing keyword arguments to callbacks
elacuesta Jan 3, 2019
a2b509a
Pass callback kwargs with response.follow
elacuesta Jan 3, 2019
69a1ee7
Copy request.kwargs
elacuesta Jan 3, 2019
a67f1ce
Serialize Request kwargs
elacuesta Jan 3, 2019
770a501
Test request kwargs (copy, serialization)
elacuesta Jan 9, 2019
57e7c76
Test callback kwargs
elacuesta Jan 9, 2019
bddfeab
Add Request.kwargs docs
elacuesta Jan 15, 2019
645e8d1
Count keyword argument checks
elacuesta Mar 15, 2019
6760bca
Rename Request.kwargs to Request.cb_kwargs
elacuesta Mar 15, 2019
8528f50
[Doc] Update cb_kwargs example
elacuesta Mar 27, 2019
c43a231
Merge remote-tracking branch 'upstream/master' into callback_kwargs
elacuesta Mar 27, 2019
70a4d93
Callback kwargs: more tests
elacuesta Mar 28, 2019
3efe3be
Update docs about cb_kwargs and meta
elacuesta Mar 28, 2019
e8af633
Add cb_kwargs option to the parse command
elacuesta Mar 28, 2019
8fb0776
Request.cb_kwargs: Update docs
elacuesta Mar 28, 2019
f5e0b6b
parse command: rename cb_kwargs option to cbkwargs
elacuesta Mar 29, 2019
ccb56a3
Update docs about cb_kwargs and meta
elacuesta Mar 29, 2019
294ef51
parse command: update docs about passing callback keyword arguments
elacuesta Mar 29, 2019
0522fe3
parse command: improve option description
elacuesta Mar 29, 2019
428309b
Merge remote-tracking branch 'origin/master' into callback_kwargs
elacuesta Jun 26, 2019
1f9f41b
Move request.cb_kwargs tests to their own test file
elacuesta Jun 26, 2019
d4d68cf
Request.cb_kwargs: update in downloader middleware
elacuesta Jun 26, 2019
312e573
Request.cb_kwargs: update in spider middleware
elacuesta Jun 26, 2019
40 changes: 34 additions & 6 deletions docs/topics/request-response.rst
@@ -24,7 +24,7 @@ below in :ref:`topics-request-response-ref-request-subclasses` and
Request objects
===============

-.. class:: Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])
+.. class:: Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags, cb_kwargs])

A :class:`Request` object represents an HTTP request, which is usually
generated in the Spider and executed by the Downloader, thus generating
a :class:`Response`.
@@ -126,6 +126,9 @@ Request objects
:param flags: Flags sent to the request, can be used for logging or similar purposes.
:type flags: list

:param cb_kwargs: A dict with arbitrary data that will be passed as keyword arguments to the Request's callback.
:type cb_kwargs: dict

.. attribute:: Request.url

A string containing the URL of this request. Keep in mind that this
Expand Down Expand Up @@ -165,6 +168,17 @@ Request objects
``copy()`` or ``replace()`` methods, and can also be accessed, in your
spider, from the ``response.meta`` attribute.

.. attribute:: Request.cb_kwargs

A dictionary that contains arbitrary metadata for this request. Its contents
will be passed to the Request's callback as keyword arguments. It is empty
for new Requests, which means by default callbacks only get a :class:`Response`
object as an argument.

This dict is `shallow copied`_ when the request is cloned using the
``copy()`` or ``replace()`` methods, and can also be accessed, in your
spider, from the ``response.cb_kwargs`` attribute.

.. _shallow copied: https://docs.python.org/2/library/copy.html
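
As an illustration of the shallow-copy behavior documented above (an editorial sketch, not part of the PR text), cloning a request produces a new ``cb_kwargs`` dict whose values are shared with the original::

    import scrapy

    r1 = scrapy.Request('http://www.example.com', cb_kwargs={'item': {}})
    r2 = r1.replace(priority=10)
    assert r2.cb_kwargs is not r1.cb_kwargs              # a new dict...
    assert r2.cb_kwargs['item'] is r1.cb_kwargs['item']  # ...sharing the values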

.. method:: Request.copy()
@@ -200,11 +214,9 @@ Example::
        self.logger.info("Visited %s", response.url)

In some cases you may be interested in passing arguments to those callback
-functions so you can receive the arguments later, in the second callback. You
-can use the :attr:`Request.meta` attribute for that.
-
-Here's an example of how to pass an item using this mechanism, to populate
-different fields from different pages::
+functions so you can receive the arguments later, in the second callback.
+The following two examples show how to achieve this by using the
+:attr:`Request.meta` and :attr:`Request.cb_kwargs` attributes respectively::

    def parse_page1(self, response):
        item = MyItem()
        item['main_url'] = response.url
        request = scrapy.Request("http://www.example.com/some_page.html",
                                 callback=self.parse_page2)
        request.meta['item'] = item
        yield request

    def parse_page2(self, response):
        item = response.meta['item']
        item['other_url'] = response.url
        yield item

::

    def parse_page1(self, response):
        item = MyItem()
        item['main_url'] = response.url
        request = scrapy.Request("http://www.example.com/some_page.html",
                                 callback=self.parse_page2)
        request.cb_kwargs['item'] = item
        request.cb_kwargs['foo'] = 'bar'
        yield request

    def parse_page2(self, response, item, foo):
        item['other_url'] = response.url
        item['foo'] = foo
        yield item
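
The ``cb_kwargs`` dict can equivalently be passed straight to the :class:`Request` constructor; a brief editorial sketch based on the signature added in this pull request::

    def parse_page1(self, response):
        item = MyItem()
        item['main_url'] = response.url
        yield scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2,
                             cb_kwargs={'item': item, 'foo': 'bar'})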


.. _topics-request-response-ref-errbacks:

4 changes: 3 additions & 1 deletion scrapy/core/scraper.py
@@ -143,7 +143,9 @@ def _scrape2(self, request_result, request, spider):
    def call_spider(self, result, request, spider):
        result.request = request
        dfd = defer_result(result)
-        dfd.addCallbacks(request.callback or spider.parse, request.errback)
+        dfd.addCallbacks(callback=request.callback or spider.parse,
+                         errback=request.errback,
+                         callbackKeywords=request.cb_kwargs)
        return dfd.addCallback(iterate_spider_output)

    def handle_spider_error(self, _failure, request, response, spider):
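
The change above leans on Twisted's ``Deferred.addCallbacks``, whose ``callbackKeywords`` argument applies the keyword arguments to the callback only, never to the errback. A minimal standalone sketch of that behavior (not Scrapy code):

    from twisted.internet.defer import Deferred

    def callback(result, key=None):
        # receives the Deferred's result plus the callbackKeywords entries
        print('callback got %r with key=%r' % (result, key))

    d = Deferred()
    d.addCallbacks(callback=callback, errback=lambda failure: None,
                   callbackKeywords={'key': 'value'})
    d.callback('response')  # prints: callback got 'response' with key='value'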
11 changes: 9 additions & 2 deletions scrapy/http/request/__init__.py
@@ -18,7 +18,7 @@ class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
-                 dont_filter=False, errback=None, flags=None):
+                 dont_filter=False, errback=None, flags=None, cb_kwargs=None):

        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
@@ -40,8 +40,15 @@ def __init__(self, url, callback=None, method='GET', headers=None, body=None,
        self.dont_filter = dont_filter

        self._meta = dict(meta) if meta else None
        self._cb_kwargs = dict(cb_kwargs) if cb_kwargs else None
        self.flags = [] if flags is None else list(flags)

    @property
    def cb_kwargs(self):
        if self._cb_kwargs is None:
            self._cb_kwargs = {}
        return self._cb_kwargs

    @property
    def meta(self):
        if self._meta is None:
@@ -92,7 +99,7 @@ def replace(self, *args, **kwargs):
        given new values.
        """
        for x in ['url', 'method', 'headers', 'body', 'cookies', 'meta', 'flags',
-                  'encoding', 'priority', 'dont_filter', 'callback', 'errback']:
+                  'encoding', 'priority', 'dont_filter', 'callback', 'errback', 'cb_kwargs']:
            kwargs.setdefault(x, getattr(self, x))
        cls = kwargs.pop('cls', self.__class__)
        return cls(*args, **kwargs)
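
As a hedged illustration of the two changes above (behavior inferred from this diff, not separately documented): the ``cb_kwargs`` dict is created lazily on first access, and ``replace()`` now carries it over to the clone unless explicitly overridden:

    from scrapy import Request

    r1 = Request('http://www.example.com')
    assert r1.cb_kwargs == {}                # created lazily by the property

    r2 = Request('http://www.example.com', cb_kwargs={'key': 'value'})
    r3 = r2.replace(priority=10)             # 'cb_kwargs' is in the setdefault list
    assert r3.cb_kwargs == {'key': 'value'}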
5 changes: 3 additions & 2 deletions scrapy/http/response/__init__.py
@@ -106,7 +106,7 @@ def xpath(self, *a, **kw):

    def follow(self, url, callback=None, method='GET', headers=None, body=None,
               cookies=None, meta=None, encoding='utf-8', priority=0,
-               dont_filter=False, errback=None):
+               dont_filter=False, errback=None, cb_kwargs=None):
        # type: (...) -> Request
        """
        Return a :class:`~.Request` instance to follow a link ``url``.
@@ -132,4 +132,5 @@ def follow(self, url, callback=None, method='GET', headers=None, body=None,
                       encoding=encoding,
                       priority=priority,
                       dont_filter=dont_filter,
-                       errback=errback)
+                       errback=errback,
+                       cb_kwargs=cb_kwargs)
5 changes: 3 additions & 2 deletions scrapy/http/response/text.py
@@ -123,7 +123,7 @@ def css(self, query):

    def follow(self, url, callback=None, method='GET', headers=None, body=None,
               cookies=None, meta=None, encoding=None, priority=0,
-               dont_filter=False, errback=None):
+               dont_filter=False, errback=None, cb_kwargs=None):
        # type: (...) -> Request
        """
        Return a :class:`~.Request` instance to follow a link ``url``.
@@ -154,7 +154,8 @@ def follow(self, url, callback=None, method='GET', headers=None, body=None,
            encoding=encoding,
            priority=priority,
            dont_filter=dont_filter,
-            errback=errback
+            errback=errback,
+            cb_kwargs=cb_kwargs,
        )


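With both ``follow()`` overloads forwarding the new argument, a spider can attach callback keyword arguments without building a Request by hand. A usage sketch (the spider methods and URLs are invented for illustration):

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse_item,
                                  cb_kwargs={'referrer': response.url})

    def parse_item(self, response, referrer):
        self.logger.info('Reached %s from %s', response.url, referrer)
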
7 changes: 5 additions & 2 deletions scrapy/utils/reqser.py
@@ -32,7 +32,8 @@ def request_to_dict(request, spider=None):
        '_encoding': request._encoding,
        'priority': request.priority,
        'dont_filter': request.dont_filter,
-        'flags': request.flags
+        'flags': request.flags,
+        'cb_kwargs': request.cb_kwargs,
    }
    if type(request) is not Request:
        d['_class'] = request.__module__ + '.' + request.__class__.__name__
@@ -64,7 +65,9 @@ def request_from_dict(d, spider=None):
               encoding=d['_encoding'],
               priority=d['priority'],
               dont_filter=d['dont_filter'],
-               flags=d.get('flags'))
+               flags=d.get('flags'),
+               cb_kwargs=d.get('cb_kwargs'),
+               )


def _find_method(obj, func):
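
These two additions let ``cb_kwargs`` survive a serialization round trip, e.g. for disk-based request queues; the values themselves must of course be serializable. A quick sketch of the round trip (a callback-less request, for simplicity):

    from scrapy import Request
    from scrapy.utils.reqser import request_to_dict, request_from_dict

    r1 = Request('http://www.example.com', cb_kwargs={'key': 'value'})
    d = request_to_dict(r1)    # a plain dict, safe to pickle
    r2 = request_from_dict(d)
    assert r2.cb_kwargs == {'key': 'value'}
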
39 changes: 39 additions & 0 deletions tests/spiders.py
@@ -28,6 +28,45 @@ def closed(self, reason):
        self.meta['close_reason'] = reason


class KeywordArgumentsSpider(MockServerSpider):
@kmike (Member) commented on Mar 27, 2019:

Could you please add a test which checks how Scrapy behaves if there is a mismatch between the parameters that parse() accepts and the parameters passed via callback kwargs, e.g. some required argument is missing, or an extra argument is passed? Is an exception raised? Does the traceback look good (no need to write a test for it)? Is the errback called?

Another case that would be nice to check explicitly is how default values are handled.

@elacuesta (Member, Author) replied:

Added tests for defaults and argument mismatch.
The errback is not called with the current implementation; it would be called if we added the callback and the errback to the Deferred in two separate steps, i.e.

dfd.addCallback(request.callback or spider.parse, **request.cb_kwargs)
dfd.addErrback(request.errback)

instead of

dfd.addCallbacks(
    callback=request.callback or spider.parse,
    errback=request.errback,
    callbackKeywords=request.cb_kwargs)

In any case, the Request object is not bound to the Failure received by the errback. Personally, I don't think calling the errback would be appropriate here, since it's not an error with the request/response itself, but with the code that handles it. The logged error is very descriptive, and similar to the one that currently appears when, for instance, a callback does not take a second positional argument (TypeError: parse() takes 1 positional argument but 2 were given).
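
To make the failure mode discussed above concrete (an editorial sketch, not part of the PR): a mismatch between the callback signature and cb_kwargs surfaces as a plain TypeError when the callback is invoked:

    def parse(self, response, key):
        pass

    parse('response')
    # TypeError: parse() missing 1 required positional argument: 'key'

    parse('response', key=1, extra=2)
    # TypeError: parse() got an unexpected keyword argument 'extra'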


    name = 'kwargs'
    checks = list()

    def start_requests(self):
        data = {'key': 'value', 'number': 123}
        yield Request(self.mockserver.url('/first'), self.parse_first, cb_kwargs=data)
        yield Request(self.mockserver.url('/general_with'), self.parse_general, cb_kwargs=data)
        yield Request(self.mockserver.url('/general_without'), self.parse_general)
        yield Request(self.mockserver.url('/no_kwargs'), self.parse_no_kwargs)

    def parse_first(self, response, key, number):
        self.checks.append(key == 'value')
        self.checks.append(number == 123)
        self.crawler.stats.inc_value('boolean_checks', 2)
        yield response.follow(
            self.mockserver.url('/two'),
            self.parse_second,
            cb_kwargs={'new_key': 'new_value'})

    def parse_second(self, response, new_key):
        self.checks.append(new_key == 'new_value')
        self.crawler.stats.inc_value('boolean_checks')

    def parse_general(self, response, **kwargs):
        if response.url.endswith('/general_with'):
            self.checks.append(kwargs['key'] == 'value')
            self.checks.append(kwargs['number'] == 123)
            self.crawler.stats.inc_value('boolean_checks', 2)
        elif response.url.endswith('/general_without'):
            self.checks.append(kwargs == {})
            self.crawler.stats.inc_value('boolean_checks')

    def parse_no_kwargs(self, response):
        self.checks.append(response.url.endswith('/no_kwargs'))
        self.crawler.stats.inc_value('boolean_checks')


class FollowAllSpider(MetaSpider):

    name = 'follow'
9 changes: 8 additions & 1 deletion tests/test_crawl.py
@@ -8,7 +8,7 @@
from scrapy.http import Request
from scrapy.crawler import CrawlerRunner
from scrapy.utils.python import to_unicode
-from tests.spiders import FollowAllSpider, DelaySpider, SimpleSpider, \
+from tests.spiders import FollowAllSpider, DelaySpider, SimpleSpider, KeywordArgumentsSpider, \
    BrokenStartRequestsSpider, SingleRequestSpider, DuplicateStartRequestsSpider
from tests.mockserver import MockServer

@@ -23,6 +23,13 @@ def setUp(self):
    def tearDown(self):
        self.mockserver.__exit__(None, None, None)

    @defer.inlineCallbacks
    def test_callback_kwargs(self):
        crawler = self.runner.create_crawler(KeywordArgumentsSpider)
        yield crawler.crawl(mockserver=self.mockserver)
        self.assertTrue(all(crawler.spider.checks))
        self.assertEqual(len(crawler.spider.checks), crawler.stats.get_value('boolean_checks'))

    @defer.inlineCallbacks
    def test_follow_all(self):
        crawler = self.runner.create_crawler(FollowAllSpider)
5 changes: 5 additions & 0 deletions tests/test_http_request.py
@@ -177,6 +177,7 @@ def somecallback():
        r1 = self.request_class("http://www.example.com", flags=['f1', 'f2'],
                                callback=somecallback, errback=somecallback)
        r1.meta['foo'] = 'bar'
        r1.cb_kwargs['key'] = 'value'
        r2 = r1.copy()

        # make sure copy does not propagate callbacks
@@ -189,6 +190,10 @@ def somecallback():
        assert r1.flags is not r2.flags, "flags must be a shallow copy, not identical"
        self.assertEqual(r1.flags, r2.flags)

        # make sure cb_kwargs dict is shallow copied
        assert r1.cb_kwargs is not r2.cb_kwargs, "cb_kwargs must be a shallow copy, not identical"
        self.assertEqual(r1.cb_kwargs, r2.cb_kwargs)

        # make sure meta dict is shallow copied
        assert r1.meta is not r2.meta, "meta must be a shallow copy, not identical"
        self.assertEqual(r1.meta, r2.meta)
2 changes: 2 additions & 0 deletions tests/test_utils_reqser.py
@@ -26,6 +26,7 @@ def test_all_attributes(self):
            encoding='latin-1',
            priority=20,
            meta={'a': 'b'},
            cb_kwargs={'k': 'v'},
            flags=['testFlag'])
        self._assert_serializes_ok(r, spider=self.spider)
@@ -52,6 +53,7 @@ def _assert_same_request(self, r1, r2):
        self.assertEqual(r1.headers, r2.headers)
        self.assertEqual(r1.cookies, r2.cookies)
        self.assertEqual(r1.meta, r2.meta)
        self.assertEqual(r1.cb_kwargs, r2.cb_kwargs)
        self.assertEqual(r1._encoding, r2._encoding)
        self.assertEqual(r1.priority, r2.priority)
        self.assertEqual(r1.dont_filter, r2.dont_filter)