Implement a NO_CALLBACK value for Request.callback #5798
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #5798      +/-   ##
==========================================
+ Coverage   88.93%   88.94%   +0.01%
==========================================
  Files         162      162
  Lines       10992    11002      +10
  Branches     1798     1798
==========================================
+ Hits         9776     9786      +10
  Misses        937      937
  Partials      279      279
```
Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>
```diff
@@ -93,7 +94,7 @@ def _process_request(self, request, info, item):
     fp = self._fingerprinter.fingerprint(request)
     cb = request.callback or (lambda _: _)
     eb = request.errback
     request.callback = None
```
We need to update the `FilesPipeline` as well: https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L520
Since tests were actually passing for files, I had a deeper look at it, and I think tests pass because those `Request(u)` objects are later processed by this very method, so the callback gets set before the request object leaves the middleware. So I think no further changes specific to files or images may be necessary.
Hm, it might still be cleaner to add the no-callback marker to these requests, as they're not supposed to use the `parse` callback.
So we set callback to `NO_CALLBACK` twice, in `get_media_requests` and in `_process_request` (both called from `process_item`, one after the other)?

I am not against it; I just want to be certain that I made it clear enough that the reason the callback is not set here is that these request objects are processed further before they leave the pipeline, so with the current code there is no risk of anything outside the pipeline itself receiving a request with `callback=None`.
> So we set callback to `NO_CALLBACK` twice, in `get_media_requests` and in `_process_request` (both called from `process_item`, one after the other)?

Yes. I think that if the reader sees `FilesPipeline.get_media_requests()` with `Request(u, callback=NO_CALLBACK)`, it helps reinforce the idea that the `parse()` method isn't supposed to be involved here. Although they could also further inspect `MediaPipeline._process_request()` and see that `NO_CALLBACK` is assigned, they won't have to if `FilesPipeline.get_media_requests()` already shows it.
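A minimal sketch of what that could look like — shown here as a hypothetical subclass for illustration; the `NO_CALLBACK` import path is assumed, not confirmed by this thread:

```python
from itemadapter import ItemAdapter
from scrapy import Request
from scrapy.http.request import NO_CALLBACK  # import path assumed
from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        urls = ItemAdapter(item).get(self.files_urls_field, [])
        # The explicit marker tells readers that the spider's parse()
        # callback is never meant to handle these responses.
        return [Request(u, callback=NO_CALLBACK) for u in urls]
```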
I’m making the change, then.

I wonder if we should go further, though, by changing `_process_request` to:

- Log a deprecation warning if `callback` is `None`.
- Raise an exception if `callback` is anything other than `None` or `NO_CALLBACK`. Or the same behavior as above, to avoid a backward-incompatible change. But I think it may be wise to actually break such code, to force users not to set a callback that is being reset in `_process_request` (see the sketch after this list).
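A sketch of what that validation could look like at the top of `MediaPipeline._process_request` — hypothetical code; the warning text, exception type, and import path are placeholders, not the PR's actual implementation:

```python
import warnings

from scrapy.http.request import NO_CALLBACK  # import path assumed


def _process_request(self, request, info, item):
    if request.callback is None:
        # Option 1: deprecate the implicit None.
        warnings.warn(
            "Passing media pipeline requests with callback=None is "
            "deprecated; use callback=NO_CALLBACK instead.",
            DeprecationWarning,
        )
    elif request.callback is not NO_CALLBACK:
        # Option 2: reject callbacks that would be reset anyway.
        raise ValueError(
            "Requests processed by a media pipeline must use "
            "callback=NO_CALLBACK; any other callback would be overridden."
        )
    ...  # rest of the method unchanged
```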
> Log a deprecation warning if `callback` is `None`.

+1

> Raise an exception if `callback` is anything other than `None` or `NO_CALLBACK`. Or the same behavior as above, to avoid a backward-incompatible change. But I think it may be wise to actually break such code, to force users to not set a callback that is being reset in `_process_request`.

I'm not quite sure about this, since there might be some Scrapy project out there that does things differently with their MediaPipeline/FilesPipeline. For example, they may have overridden `_process_request` to not use the downloader directly.
> But I think it may be wise to actually break such code, to force users to not set a callback that is being reset in `_process_request`.

The callback is actually not just reset, but stored and used. So maybe my point is void; we should continue to support callbacks on requests from `get_media_requests()` as usual. `_process_request` will make sure that the request leaves the pipeline with `callback=NO_CALLBACK`, but the original callback will nonetheless be called by the pipeline.
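That is the pattern visible in the diff near the top of this thread: the pipeline keeps the user's callback for its own use and only replaces the attribute on the outgoing request. A simplified sketch mirroring that diff, with `None` swapped for the new marker (not the PR's exact code; the `NO_CALLBACK` import path is assumed):

```python
from scrapy.http.request import NO_CALLBACK  # import path assumed


def _process_request(self, request, info, item):
    fp = self._fingerprinter.fingerprint(request)
    # Keep the original callback/errback: the pipeline itself invokes them
    # once the download result for this request is available.
    cb = request.callback or (lambda _: _)
    eb = request.errback
    # The outgoing request carries the explicit marker instead of None,
    # so nothing outside the pipeline falls back to parse().
    request.callback = NO_CALLBACK
    request.errback = None
    ...  # deduplication and download handling unchanged
```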
I probably missed previous related discussion.

no_callback.py:

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        yield scrapy.Request(url='http://quotes.toscrape.com/', callback=self.no_callback)

    def no_callback(self, response):
        pass


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(QuotesToScrapeSpider)
    process.start()
```

no_callback2.py:

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class CustomSpider(scrapy.Spider):
    name = 'custom'

    @classmethod
    def no_callback(cls, response):
        # print added for debugging purposes only:
        # in a @classmethod callback, self.logger is not available
        print(f'reached no_callback in {response.url}')


class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        yield scrapy.Request(url='http://quotes.toscrape.com/', callback=CustomSpider.no_callback)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(QuotesToScrapeSpider)
    process.start()
```
@GeorgeA92 that's an interesting idea. It seems such a callable shouldn't be a spider method, because it's not the user who should define it, and we're aiming to keep the Spider interface minimal. But if the callable is not a spider method, then I'm not sure request serialization (i.e. disk queues) works. But we probably don't need serialization of such requests, as they usually go directly to the downloader. So, maybe a callable could work.

I'm not sure what's better, though. With the current implementation, if someone attempts to invoke the request's callback, an exception is going to be raised, and at first sight that looks right. What issues do you see with the current implementation?

FTR, the change is motivated by scrapinghub/scrapy-poet#48.
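For reference, a sentinel with that behavior can be a module-level callable that simply raises when invoked; a minimal sketch (name and message are illustrative, not necessarily the PR's exact code):

```python
def NO_CALLBACK(*args, **kwargs):
    """Marker for Request.callback meaning "no callback, on purpose".

    Unlike None, it does not make downstream code fall back to the
    spider's parse() method, and calling it by mistake fails loudly.
    """
    raise RuntimeError(
        "The NO_CALLBACK callback has been called. This is a sentinel "
        "value and is never meant to be executed."
    )
```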
Yes. From this code (responsible for converting a request to a dict, which is used in serialisation), we can see that a callback that is not a spider method breaks serialization: scrapy/scrapy/http/request/__init__.py, lines 182 to 188 at 6ded3cf.

I hadn't thought about the option of disabling the `_find_method(spider, self.callback)` check to make it possible to serialize requests with an assigned plain function (not a spider method at all) as `Request.callback`, like this (not sure whether that would be a backward-compatible change):
no_callback3.py:

```python
import scrapy
from scrapy.crawler import CrawlerProcess


def no_callback(response):
    # print added for debugging purposes only
    print(f'reached no_callback in {response.url}')


class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        yield scrapy.Request(url='http://quotes.toscrape.com/', callback=no_callback)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(QuotesToScrapeSpider)
    process.start()
```

On the other hand, adding something like

```python
def _no_callback(self, response):
    pass
```

somewhere inside the base Spider class (https://github.com/scrapy/scrapy/blob/2.7.1/scrapy/spiders/__init__.py), and assigning that callback in the affected middlewares/pipelines just like any other callback, would already be enough for this.
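For illustration only, here is one hypothetical way the first idea (serializing a plain, module-level function) could work if the `_find_method` check were relaxed; none of these helper names exist in Scrapy:

```python
from importlib import import_module


def _callback_to_path(func):
    # Hypothetical helper: store a module-level function as a dotted path.
    return f"{func.__module__}.{func.__qualname__}"


def _callback_from_path(path):
    # Hypothetical helper: resolve the dotted path back to the function.
    module_path, _, name = path.rpartition(".")
    return getattr(import_module(module_path), name)
```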
When you say “minimal code changes”, what do you have in mind? As far as I can see, […]

I consider both having a separate type for the […]

So, while I do not love the proposed solution (I don’t like the typing workaround due to the limitations of […])

I think serialization should not affect the decision, though. The proposed alternative does not contemplate serialization either. I think supporting serialization of these requests would in both cases require changes, though I would prefer to change the serialization code to make an exception for NO_CALLBACK, rather than polluting the base Spider class.
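A sketch of what such an exception could look like in the serialization code — a hypothetical helper modeled on the callback branch of `Request.to_dict()`, not the actual change:

```python
from scrapy.http.request import NO_CALLBACK  # import path assumed


def _serialize_callback(request, spider):
    # _find_method is the private helper in scrapy/http/request/__init__.py
    # referenced earlier in this thread.
    if request.callback is None:
        return None
    if request.callback is NO_CALLBACK:
        # Special-case the marker instead of resolving a spider method name.
        return "NO_CALLBACK"
    return _find_method(spider, request.callback)
```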
OK. I agree that both approaches are roughly equal in terms of *spoiling* core parts of the Scrapy code […]. But I see a quite notable performance difference between these two approaches, according to this profiling script:

million_reqs.py:

```python
import cProfile

from scrapy import Request


def dummy_callable_to_pass_req_init(response):
    pass


def create_reqs():
    print('reached')
    for i in range(1_000_000):
        r = Request(f'http://quotes.toscrape.com/page={i}', callback=dummy_callable_to_pass_req_init)


if __name__ == '__main__':
    cProfile.run('create_reqs()', sort='cumulative')
```

(log_output omitted)
For the next profiler test I changed `_set_xback` to:

```python
def _set_xback(self, callback, errback) -> None:
    if callback is not None and not callable(callback):
        raise TypeError(
            f"callback must be a callable, got {type(callback).__name__}"
        )
    if errback is not None and not callable(errback):
        raise TypeError(f"errback must be a callable, got {type(errback).__name__}")
```

(log_output_2 omitted)

On my hardware, with the profiler enabled, this means that the changes from this pull request, with the additional callable checks inside […]. At first look this runtime impact doesn't seem significant enough. However, in Scrapy jobs that initiate a lot of requests — especially broad crawls, crawl spiders on specified domains, or other jobs where many duplicate requests are expected — on CPU-bound spiders and lower-grade hardware we can expect a more significant runtime impact, as these checks will also be executed for requests that are later filtered out by the dupefilter or by the offsite middleware.
The only necessary difference in that part of the code is the additional condition in the `if` check.

I tested with this code (notice I added back the attribute setting at the end; `NO_CALLBACK` and `NoCallbackType` are defined in the PR):

```python
from typing import Callable, Optional, Union


def _set_xback(
    self,
    callback: Union[None, NoCallbackType, Callable],
    errback: Optional[Callable],
) -> None:
    if not (
        callable(callback)
        or callback is None
        or callback is NO_CALLBACK  # Commented out on the first run.
    ):
        raise TypeError(
            f"callback must be a callable, got {type(callback).__name__}"
        )
    if not (
        callable(errback)
        or errback is None
    ):
        raise TypeError(f"errback must be a callable, got {type(errback).__name__}")
    self.callback = callback
    self.errback = errback
```

And the performance is really similar:

```
$ python million_reqs.py | grep xback
  1000000    0.465    0.000    0.633    0.000 __init__.py:107(_set_xback)
$ python million_reqs.py | grep xback
  1000000    0.485    0.000    0.658    0.000 __init__.py:107(_set_xback)
```

Which is as I would expect, because unless the callback is `NO_CALLBACK` or an invalid value, performance should be pretty much the same. I tested with `callback=NO_CALLBACK` as well:

```
$ python million_reqs.py | grep xback
  1000000    0.526    0.000    0.685    0.000 __init__.py:107(_set_xback)
```

So I don’t think performance is key here; I think both options can be implemented with a similar performance impact. We could even consider not checking the parameter type (we have type hints, after all), or checking specifically for expected bad values (e.g. […]).

That said, the only 2 cons I found for the callable approach are very minor, so I want to make it clear that I have no strong opinion on which approach to take. I am OK with any of them.
@wRAR and @kmike agree on going with the callable approach, and not to worry about serialization here. Thank you, @GeorgeA92! I’ll update the PR soon.
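With the callable approach, code such as the scrapy-poet integration mentioned above can tell the marker apart from a plain missing callback with an identity check; a small illustrative sketch (the helper name is made up, and the import path is assumed):

```python
from scrapy.http.request import NO_CALLBACK  # import path assumed


def falls_back_to_parse(request):
    # Made-up helper: callback=None historically means "use spider.parse",
    # while NO_CALLBACK explicitly opts out of callback handling.
    if request.callback is NO_CALLBACK:
        return False
    return request.callback is None
```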
For the record, #5798 (comment) is open, but I hope to address that in a few minutes.
Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>
Thanks @Gallaecio!