[MRG+1] Callback kwargs #3563

Merged
merged 23 commits into from Jun 26, 2019

Conversation

Member

@elacuesta elacuesta commented Jan 3, 2019

Fixes #1138

This is just a first approach. It's currently lacking docs and tests; I'll add those if the implementation looks good. Update: added tests and docs.

I see (at least) two points for discussion:

  1. Should we also pass the same arguments to the errbacks? Or maybe add a different parameter, errback_kwargs or something like that? On the other hand, the request object is available in the failure received by the errback, and failure.request.cb_kwargs gives access to the arguments, so I think it shouldn't be necessary.
  2. I'm not a fan of the kwargs name. I think it could easily be confused with Python's own "kwargs" naming convention, i.e., people could assume that any remaining keyword argument passed to the Request constructor will be passed to the callbacks. Is "callback_kwargs" too verbose? Maybe it's not compatible with the previous point. Update: renamed to cb_kwargs.
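
The first point can be illustrated with an errback that reads the arguments back from the failed request. This is only a minimal sketch: `SimpleNamespace` objects stand in for Scrapy's `Request` and Twisted's `Failure` so the snippet runs without Scrapy installed, and `handle_error` is a hypothetical name.

```python
from types import SimpleNamespace


def handle_error(failure):
    """Errback sketch: recover the callback arguments from the failed request."""
    # getattr guards against failures that carry no request attribute.
    request = getattr(failure, "request", None)
    cb_kwargs = request.cb_kwargs if request is not None else {}
    return {"url": getattr(request, "url", None), "cb_kwargs": dict(cb_kwargs)}


# Stand-ins for scrapy.Request and twisted.python.failure.Failure (assumptions).
request = SimpleNamespace(url="https://example.org", cb_kwargs={"a": 123, "b": 456})
failure = SimpleNamespace(request=request, value=TimeoutError("simulated"))

recovered = handle_error(failure)
```

In a real spider the same function would be passed as `errback=` to the request; whether `failure.request` is set depends on where the failure originated.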

Sample spider:

import scrapy

class TestCallbackKwargsSpider(scrapy.Spider):
    name = 'callback_kwargs'

    def start_requests(self):
        data = {'a': 123, 'b': 456}
        yield scrapy.Request('https://example.org', cb_kwargs=data)

    def parse(self, response, a, b):
        yield {'url': response.url, 'a': a, 'b': b}
        yield response.follow(
            response.css('a::attr(href)').get(),
            self.parse_other,
            cb_kwargs={'source': response.url})

    def parse_other(self, response, source):
        yield {'url': response.url, 'source': source}
        yield response.follow(
            response.css('a::attr(href)').get(),
            self.parse_regular)

    def parse_regular(self, response):
        yield {'url': response.url}

Output:

(...)
2019-01-03 17:40:38 [scrapy.core.engine] INFO: Spider opened
2019-01-03 17:40:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-03 17:40:38 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-01-03 17:40:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
2019-01-03 17:40:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://example.org>
{'url': 'https://example.org', 'a': 123, 'b': 456}
2019-01-03 17:40:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.iana.org/domains/reserved> from <GET http://www.iana.org/domains/example>
2019-01-03 17:40:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.iana.org/domains/reserved> (referer: None)
2019-01-03 17:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.iana.org/domains/reserved>
{'url': 'https://www.iana.org/domains/reserved', 'source': 'https://example.org'}
2019-01-03 17:40:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.iana.org/> (referer: https://www.iana.org/domains/reserved)
2019-01-03 17:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.iana.org/>
{'url': 'https://www.iana.org/'}
2019-01-03 17:40:40 [scrapy.core.engine] INFO: Closing spider (finished)
(...)

Tasks

  • Request changes
  • Response.follow changes
  • Scraper changes
  • Request serialization
  • Tests
  • Docs

/cc @kmike @dangra

@elacuesta elacuesta force-pushed the callback_kwargs branch from 8b9dee0 to a67f1ce Jan 3, 2019

@codecov codecov bot commented Jan 3, 2019

Codecov Report

Merging #3563 into master will decrease coverage by 0.05%.
The diff coverage is 34.78%.

@@            Coverage Diff            @@
##           master   #3563      +/-   ##
=========================================
- Coverage   85.46%   85.4%   -0.06%     
=========================================
  Files         169     169              
  Lines        9666    9682      +16     
  Branches     1440    1443       +3     
=========================================
+ Hits         8261    8269       +8     
- Misses       1157    1166       +9     
+ Partials      248     247       -1

| Impacted Files | Coverage Δ |
|---|---|
| scrapy/utils/reqser.py | 94.11% <ø> (ø) ⬆️ |
| scrapy/http/response/__init__.py | 93.44% <ø> (ø) ⬆️ |
| scrapy/http/response/text.py | 97.84% <ø> (ø) ⬆️ |
| scrapy/core/scraper.py | 88.51% <100%> (ø) ⬆️ |
| scrapy/http/request/__init__.py | 100% <100%> (ø) ⬆️ |
| scrapy/commands/parse.py | 20.32% <11.76%> (-0.73%) ⬇️ |
| scrapy/core/spidermw.py | 100% <0%> (+2.46%) ⬆️ |

tests/test_crawl.py (outdated, resolved)
@elacuesta elacuesta changed the title [WIP] Callback kwargs Callback kwargs Jan 15, 2019
Member Author

@elacuesta elacuesta commented Jan 22, 2019

@dangra @kmike I'm sorry to bother you guys, but do you have any comments on this?

@kmike kmike added this to the v1.7 milestone Jan 29, 2019
Contributor

@ejulio ejulio left a comment

@elacuesta great work here 👏
Looks good to me, but I left some comments.

Besides those:

  • kwargs does not sound good; it reads as the Request's kwargs and not the callback's
  • I like callback_kwargs
  • The same kwargs for the callback should be used for the failure one (keep compatible with meta)

Maybe, the best solution would be using partial functions, but IMHO, they'd look verbose and not pythonic. Unless you know some library to help with partial functions 😄

scrapy/http/request/__init__.py (outdated, resolved)
tests/spiders.py (outdated, resolved)
tests/test_crawl.py (outdated, resolved)
Member Author

@elacuesta elacuesta commented Mar 15, 2019

Yeah, I like callback_kwargs more than just kwargs. I'd better change it now to save time before we get close to 1.7 :-)
There were some comments about partial in the original issue, but it didn't seem to work. I think using the Deferred's API for this is the most natural choice: passing keyword arguments to a callback is precisely what we need.

Member Author

@elacuesta elacuesta commented Mar 15, 2019

I decided to go with cb_kwargs instead of callback_kwargs. It's less verbose, and it's also already in use within Scrapy (https://docs.scrapy.org/en/latest/topics/spiders.html?highlight=cb_kwargs#scrapy.spiders.Rule)

docs/topics/request-response.rst (outdated, resolved)
docs/topics/request-response.rst (outdated, resolved)
docs/topics/request-response.rst (outdated, resolved)
tests/spiders.py Outdated
@@ -28,6 +28,45 @@ def closed(self, reason):
self.meta['close_reason'] = reason


class KeywordArgumentsSpider(MockServerSpider):
Member

@kmike kmike Mar 27, 2019

Could you please add a test which checks how Scrapy behaves if there is a mismatch between the parameters parse accepts and the parameters passed via callback kwargs, e.g. some required argument is missing, or an extra argument is passed? Is an exception raised? Does the traceback look good (no need to write a test for it)? Is the errback called?

Another case which could be nice to check explicitly is how default values are handled.
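
The three cases mentioned here (a missing required argument, an extra argument, and a default value) can be sketched in plain Python. This is not how Scrapy checks anything; it only uses `inspect.signature(...).bind(...)` to show which combinations would raise a `TypeError` when the callback is eventually called. `check_cb_kwargs` is a hypothetical helper, and the string `"response"` stands in for a real response object.

```python
import inspect


def parse(response, a, b=0):
    return {"a": a, "b": b}


def check_cb_kwargs(callback, cb_kwargs):
    """Return None if cb_kwargs fit the callback, else the TypeError message."""
    try:
        # bind() applies the same matching rules as an actual call.
        inspect.signature(callback).bind("response", **cb_kwargs)
    except TypeError as exc:
        return str(exc)
    return None


ok = check_cb_kwargs(parse, {"a": 1})               # default used for b
missing = check_cb_kwargs(parse, {"b": 2})          # required a is missing
extra = check_cb_kwargs(parse, {"a": 1, "c": 3})    # c is not accepted
```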

Member Author

@elacuesta elacuesta Mar 28, 2019

Added tests for defaults and argument mismatch.
The errback is not called with the current implementation; it is called if we add the callback and the errback to the Deferred in two separate steps, i.e.

dfd.addCallback(request.callback or spider.parse, **request.cb_kwargs)
dfd.addErrback(request.errback)

instead of

dfd.addCallbacks(
    callback=request.callback or spider.parse,
    errback=request.errback,
    callbackKeywords=request.cb_kwargs)

In any case, the Request object is not bound to the Failure received by the errback. Personally, I don't think calling the errback would be appropriate here, since it's not an error with the request/response itself, but with the code that handles it. The logged error is very descriptive, and similar to the one that currently appears when, for instance, a callback does not take a second positional argument (TypeError: parse() takes 1 positional argument but 2 were given).
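
The difference between the two variants above comes down to whether the errback sits on the same level of the chain as the callback or on the next one. The following is a toy model of that behaviour, not Twisted itself: `run_chain` is a hypothetical helper that walks a list of (callback, errback) pairs, and plain exceptions stand in for `Failure` objects.

```python
def run_chain(pairs, result):
    """Run (callback, errback) pairs in order, loosely imitating a Deferred."""
    failed = False
    for callback, errback in pairs:
        handler = errback if failed else callback
        if handler is None:
            continue  # nothing registered on this side: pass the value through
        try:
            result = handler(result)
            failed = False
        except Exception as exc:
            result = exc
            failed = True
    return result, failed


errback_calls = []
cb_kwargs = {"a": 1, "unexpected": 2}  # does not match parse's signature


def parse(response, a):
    return {"a": a}


def callback(response):
    return parse(response, **cb_kwargs)  # raises TypeError


def errback(error):
    errback_calls.append(type(error).__name__)
    return "handled"


# Like addCallbacks(callback, errback): both live on the same level.
one_step = run_chain([(callback, errback)], "response")
calls_after_one_step = list(errback_calls)

# Like addCallback(callback) followed by addErrback(errback): two levels.
two_steps = run_chain([(callback, None), (None, errback)], "response")
```

In the one-level case the `TypeError` raised by the callback has no later errback to land on, so it is left unhandled; in the two-level case the errback receives it.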

Member

@kmike kmike commented Mar 27, 2019

By searching our docs for "meta" it is possible to find a few other places where docs may need an update:

I may have missed something; it'd be great to check every place where "meta" is mentioned.

Member

@kmike kmike commented Mar 27, 2019

All other things equal, I'd prefer either kwargs or callback_kwargs over cb_kwargs, but the fact we use this name already in Rule is a good argument to go with cb_kwargs name.

Member Author

@elacuesta elacuesta commented Mar 29, 2019

Updated/replaced several occurrences of Request.meta across the docs.
I'm not sure why the Codecov check is failing; I added a test for the --cb_kwargs option in the parse command 🤔

docs/topics/commands.rst (outdated, resolved)
docs/topics/request-response.rst (outdated, resolved)
docs/topics/request-response.rst (outdated, resolved)
docs/topics/request-response.rst (resolved)
docs/topics/request-response.rst (outdated, resolved)
docs/topics/request-response.rst (resolved)
Member

@kmike kmike commented Jun 24, 2019

Thanks for the work @elacuesta, and thanks for a careful review @Gallaecio and @ejulio! The PR looks great. I think we can merge it without errback kwargs support, and discuss this feature separately.

May I ask for one additional test though? It would be awesome if a middleware could change request.cb_kwargs and have the changes take effect; it is not clear whether this is supported.

This feature would enable some cool use cases, mostly related to dependency injection, similar to how pytest fixtures work. For example, it'd be possible to implement this:

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response):
        # ... business as usual

    def parse_page(self, response, cookiejar: scrapy.CookieJar):
        # we ask for a current cookiejar object;
        # cookie middleware inspects `callback.__annotations__`, figures out 
        # user wants to read / write current cookies, and provides 
        # (injects) cookiejar object.

or similar features, e.g. it may be useful to integrate more tightly with browsers (ask for a browser in addition to the response), etc. It is not clear that such an API is the way to go, but it'd be awesome if we made sure it is possible to implement, so that we're not closing the door on this.
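
The injection idea described above can be sketched without Scrapy. Everything here is an assumption made for illustration: `CookieJar` is a stand-in class, `inject_by_annotation` is a hypothetical helper, and `providers` is an invented mapping from annotation type to a factory. It is not an existing Scrapy API.

```python
class CookieJar:
    """Hypothetical stand-in for the cookiejar type in the comment above."""

    def __init__(self):
        self.cookies = {}


def inject_by_annotation(callback, cb_kwargs, providers):
    """Add a value to cb_kwargs for each parameter annotated with a known type."""
    # getattr keeps this safe for callables that have no __annotations__.
    annotations = getattr(callback, "__annotations__", {})
    for name, annotation in annotations.items():
        if name != "return" and annotation in providers and name not in cb_kwargs:
            cb_kwargs[name] = providers[annotation]()
    return cb_kwargs


def parse(response):
    return response  # no annotations, so nothing is injected


def parse_page(response, cookiejar: CookieJar):
    cookiejar.cookies["seen"] = response
    return cookiejar.cookies


jar = CookieJar()
providers = {CookieJar: lambda: jar}

plain_kwargs = inject_by_annotation(parse, {}, providers)
page_kwargs = inject_by_annotation(parse_page, {}, providers)
result = parse_page("page-1", **page_kwargs)
```

A middleware would call something like `inject_by_annotation(request.callback, request.cb_kwargs, providers)` before the callback runs; where exactly that hook belongs is left open here, as it is in the discussion.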

@kmike kmike mentioned this pull request Jun 25, 2019
Member Author

@elacuesta elacuesta commented Jun 26, 2019

@kmike Added tests for downloader/spider middlewares.
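
The tests themselves are not reproduced in this thread. As a rough sketch of the behaviour they cover, a downloader middleware can mutate `request.cb_kwargs` in `process_request`. The middleware name and the `injected_by` key are hypothetical, and a `SimpleNamespace` stands in for `scrapy.Request` so the snippet runs on its own.

```python
from types import SimpleNamespace


class InjectArgumentsMiddleware:
    """Hypothetical downloader middleware that adds a callback argument."""

    def process_request(self, request, spider):
        # Mutate the dict in place so the callback later receives the new key.
        request.cb_kwargs.setdefault("injected_by", type(self).__name__)
        return None  # None tells Scrapy to keep processing this request


def parse(response, page, injected_by=None):
    return {"page": page, "injected_by": injected_by}


# Stand-in for scrapy.Request (assumption); only cb_kwargs matters here.
request = SimpleNamespace(url="https://example.org", cb_kwargs={"page": 1})
InjectArgumentsMiddleware().process_request(request, spider=None)

# Roughly what happens when the callback is finally invoked.
item = parse("response", **request.cb_kwargs)
```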
That CookieJar injection idea is awesome, and not too hard to implement (https://gist.github.com/elacuesta/edfb297fdb0eaa0e5e415835c148c564), but it does require overriding the default Cookies middleware because it is the only one that knows about cookies. Would you consider a PR to add the contents of the above gist to Scrapy itself? Seems like that would be a good way to address #1878. There should be no version-specific problems if we use getattr to get the annotations.

Member

@kmike kmike commented Jun 26, 2019

That's awesome it works @elacuesta, thanks for checking it and adding more tests!

Regarding actually implementing the cookiejar feature, it may need a bit of thought. For example, @pawelmhm raised a good point in #1878 (comment): the cookiejar API is not adequate, it is hard to use. So if we make this feature built-in, we may need some wrapper to make working with cookies straightforward. We may also consider whether it's possible to fix some other issues using the same API, e.g. how to "fork" a cookiejar (copy current cookies, but update cookiejars separately afterwards).

It seems this all needs a proposal and a separate discussion. Some starter implementation can aid discussion, but overall it looks more complex than making a PR out of gist.

It'd probably be easiest to release this middleware as a gist snippet, or a small Python package in the meantime, so that people can start using it while waiting for a solution in Scrapy itself.

@kmike kmike changed the title Callback kwargs [MRG+1] Callback kwargs Jun 26, 2019
kmike approved these changes Jun 26, 2019
@dangra dangra merged commit 3adf09b into scrapy:master Jun 26, 2019
2 of 3 checks passed
Member Author

@elacuesta elacuesta commented Jun 26, 2019

Thanks! I thought about making a small package with that gist, I'll do that 🚀

@elacuesta elacuesta deleted the callback_kwargs branch Jun 26, 2019

@mauliadi1990 mauliadi1990 left a comment

Duplicate of #

@kmike kmike mentioned this pull request Jul 10, 2019
@elacuesta elacuesta mentioned this pull request May 27, 2020
6 participants