
CrawlSpider: pass cb_kwargs from process_request #5699

Merged

Conversation

@elacuesta (Member) commented Oct 30, 2022

CrawlSpider._callback is not passing any keyword arguments it might receive on to CrawlSpider._parse_response.
I'm currently giving priority to the arguments set in process_request over the ones set in the rule; this could be open to discussion.
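The precedence described above can be sketched independently of Scrapy internals. The wrapper below (make_callback and its parameter names are illustrative, not the actual patch) merges the Rule-level keyword arguments with whatever the request carries, letting the request-level values win:

```python
# Hypothetical sketch of the cb_kwargs precedence, not the actual Scrapy patch:
# keyword arguments set on the Request (e.g. via process_request) override
# the ones configured on the Rule, because they are merged last.
def make_callback(parse_item, rule_cb_kwargs):
    def _callback(response, **request_cb_kwargs):
        # Request-level kwargs come last, so they take precedence.
        merged = {**rule_cb_kwargs, **request_cb_kwargs}
        return parse_item(response, **merged)
    return _callback

def parse_item(response, foo):
    return {"url": response, "foo": foo}

cb = make_callback(parse_item, {"foo": "rule-value"})
# A request-level value overrides the rule-level one:
print(cb("https://example.org", foo="bar"))  # {'url': 'https://example.org', 'foo': 'bar'}
```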

~~Missing tests, will add them soon.~~ Added.

Sample spider:

from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = "example"
    start_urls = ["https://example.org"]
    rules = (
        Rule(
            process_request="process_request",
            callback="parse_item",
        ),
    )

    def process_request(self, request, response):
        request.cb_kwargs["foo"] = "bar"
        return request

    def parse_item(self, response, foo):
        return {"url": response.url, "foo": foo}

Without this patch, at 9077d0f (latest master branch):

2022-10-30 13:15:04 [scrapy.core.engine] INFO: Spider opened
2022-10-30 13:15:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-10-30 13:15:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-10-30 13:15:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
2022-10-30 13:15:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.iana.org/domains/reserved> from <GET https://www.iana.org/domains/example>
2022-10-30 13:15:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.iana.org/domains/reserved> (referer: None)
2022-10-30 13:15:06 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.iana.org/domains/reserved> (referer: None)
Traceback (most recent call last):
  File "/home/eugenio/zyte/scrapy/venv-scrapy/lib/python3.9/site-packages/twisted/internet/defer.py", line 858, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
TypeError: _callback() got an unexpected keyword argument 'foo'
2022-10-30 13:15:06 [scrapy.core.engine] INFO: Closing spider (finished)

After this patch:

2022-10-30 13:16:00 [scrapy.core.engine] INFO: Spider opened
2022-10-30 13:16:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-10-30 13:16:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-10-30 13:16:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
2022-10-30 13:16:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.iana.org/domains/reserved> from <GET https://www.iana.org/domains/example>
2022-10-30 13:16:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.iana.org/domains/reserved> (referer: None)
2022-10-30 13:16:01 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.iana.org/domains/reserved>
{'url': 'http://www.iana.org/domains/reserved', 'foo': 'bar'}
2022-10-30 13:16:01 [scrapy.core.engine] INFO: Closing spider (finished)
$ scrapy version -v
Scrapy       : 2.7.0
lxml         : 4.7.1.0
libxml2      : 2.9.12
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.7.0
Python       : 3.9.6 (default, Sep  6 2021, 10:09:19) - [GCC 7.5.0]
pyOpenSSL    : 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021)
cryptography : 36.0.1
Platform     : Linux-5.4.0-125-generic-x86_64-with-glibc2.31

@codecov codecov bot commented Oct 30, 2022

Codecov Report

Merging #5699 (b185603) into master (9077d0f) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5699      +/-   ##
==========================================
+ Coverage   88.83%   88.87%   +0.04%     
==========================================
  Files         162      162              
  Lines       11005    11058      +53     
  Branches     1801     1825      +24     
==========================================
+ Hits         9776     9828      +52     
- Misses        948      949       +1     
  Partials      281      281              
Impacted Files         | Coverage Δ
scrapy/spiders/crawl.py | 94.31% <100.00%> (ø)
scrapy/utils/misc.py    | 97.89% <0.00%> (+0.08%) ⬆️

@Gallaecio (Member) left a comment
I think the priorities make sense.

@wRAR wRAR closed this Nov 1, 2022
@wRAR wRAR reopened this Nov 1, 2022
@wRAR wRAR merged commit 8004075 into scrapy:master Nov 2, 2022
@elacuesta elacuesta deleted the crawlspider-callback-keyword-arguments branch November 2, 2022 13:06
@elacuesta elacuesta restored the crawlspider-callback-keyword-arguments branch August 9, 2023 00:18
@elacuesta elacuesta deleted the crawlspider-callback-keyword-arguments branch August 9, 2023 00:19