Implement a NO_CALLBACK value for Request.callback #5798
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #5798      +/-   ##
==========================================
+ Coverage   88.93%   88.94%   +0.01%
==========================================
  Files         162      162
  Lines       10992    11002      +10
  Branches     1798     1798
==========================================
+ Hits         9776     9786      +10
  Misses        937      937
  Partials      279      279
```
Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>
```diff
@@ -93,7 +94,7 @@ def _process_request(self, request, info, item):
     fp = self._fingerprinter.fingerprint(request)
     cb = request.callback or (lambda _: _)
     eb = request.errback
     request.callback = None
```
We need to update the `FilesPipeline` as well: https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L520
Since tests were actually passing for files, I had a deeper look at it, and I think tests pass because those `Request(u)` objects are later processed by this very method, so the callback gets set before the request object leaves the middleware. So I think no further changes specific to files or images may be necessary.
Hm, it might still be cleaner to add the no-callback marker to these requests, as they're not supposed to use the `parse` callback.
So we set callback to `NO_CALLBACK` twice, in `get_media_requests` and in `_process_request` (both called from `process_item`, one after the other)?

I am not against it; I just want to be certain that I made it clear enough that the reason the callback is not set here is that these request objects are processed further before they leave the pipeline, so with the current code there is no risk of anything outside the pipeline itself receiving a request with `callback=None`.
> So we set callback to `NO_CALLBACK` twice, in `get_media_requests` and in `_process_request` (both called from `process_item`, one after the other)?

Yes. I think that if the reader sees `FilesPipeline.get_media_requests()` with `Request(u, callback=NO_CALLBACK)`, it helps reinforce the idea that the `parse()` method isn't supposed to be involved here. Although they could also further inspect `MediaPipeline._process_request()` and see that `NO_CALLBACK` is assigned, they won't have to if `FilesPipeline.get_media_requests()` already shows it.
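A minimal sketch of what that could look like — shown here as a hypothetical subclass for illustration; the `NO_CALLBACK` import path is assumed, not confirmed by this thread:

```python
from itemadapter import ItemAdapter
from scrapy import Request
from scrapy.http.request import NO_CALLBACK  # import path assumed
from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        urls = ItemAdapter(item).get(self.files_urls_field, [])
        # The explicit marker tells readers that the spider's parse()
        # callback is never meant to handle these responses.
        return [Request(u, callback=NO_CALLBACK) for u in urls]
```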
I’m making the change, then.

I wonder if we should go further, though, by changing `_process_request` to:

- Log a deprecation warning if `callback` is `None`.
- Raise an exception if `callback` is anything other than `None` or `NO_CALLBACK`. Or the same behavior as above, to avoid a backward-incompatible change. But I think it may be wise to actually break such code, to force users not to set a callback that is being reset in `_process_request` (see the sketch after this list).
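A sketch of what that validation could look like at the top of `MediaPipeline._process_request` — hypothetical code; the warning text, exception type, and import path are placeholders, not the PR's actual implementation:

```python
import warnings

from scrapy.http.request import NO_CALLBACK  # import path assumed


def _process_request(self, request, info, item):
    if request.callback is None:
        # Option 1: deprecate the implicit None.
        warnings.warn(
            "Passing media pipeline requests with callback=None is "
            "deprecated; use callback=NO_CALLBACK instead.",
            DeprecationWarning,
        )
    elif request.callback is not NO_CALLBACK:
        # Option 2: reject callbacks that would be reset anyway.
        raise ValueError(
            "Requests processed by a media pipeline must use "
            "callback=NO_CALLBACK; any other callback would be overridden."
        )
    ...  # rest of the method unchanged
```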
> Log a deprecation warning if `callback` is `None`.

+1

> Raise an exception if `callback` is anything other than `None` or `NO_CALLBACK`. Or the same behavior as above, to avoid a backward-incompatible change. But I think it may be wise to actually break such code, to force users to not set a callback that is being reset in `_process_request`.

I'm not quite sure about this, since there might be some Scrapy project out there that does things differently with their MediaPipeline/FilesPipeline. For example, they may have overridden `_process_request` to not use the downloader directly.
> But I think it may be wise to actually break such code, to force users to not set a callback that is being reset in `_process_request`.

The callback is actually not just reset, but stored and used. So maybe my point is void; we should continue to support callbacks on requests from `get_media_requests()` as usual. `_process_request` will make sure that the request leaves the pipeline with `callback=NO_CALLBACK`, but the original callback will nonetheless be called by the pipeline.
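That is the pattern visible in the diff near the top of this thread: the pipeline keeps the user's callback for its own use and only replaces the attribute on the outgoing request. A simplified sketch mirroring that diff, with `None` swapped for the new marker (not the PR's exact code; the `NO_CALLBACK` import path is assumed):

```python
from scrapy.http.request import NO_CALLBACK  # import path assumed


def _process_request(self, request, info, item):
    fp = self._fingerprinter.fingerprint(request)
    # Keep the original callback/errback: the pipeline itself invokes them
    # once the download result for this request is available.
    cb = request.callback or (lambda _: _)
    eb = request.errback
    # The outgoing request carries the explicit marker instead of None,
    # so nothing outside the pipeline falls back to parse().
    request.callback = NO_CALLBACK
    request.errback = None
    ...  # deduplication and download handling unchanged
```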
I probably missed previous related discussion.

no_callback.py:

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        yield scrapy.Request(url='http://quotes.toscrape.com/', callback=self.no_callback)

    def no_callback(self, response):
        pass


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(QuotesToScrapeSpider)
    process.start()
```

no_callback2.py:

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class CustomSpider(scrapy.Spider):
    name = 'custom'

    @classmethod
    def no_callback(cls, response):
        # print added for debugging purposes only:
        # in a @classmethod callback, self.logger is not available
        print(f'reached no_callback in {response.url}')


class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        yield scrapy.Request(url='http://quotes.toscrape.com/', callback=CustomSpider.no_callback)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(QuotesToScrapeSpider)
    process.start()
```
@GeorgeA92 that's an interesting idea. It seems such a callable shouldn't be a spider method, because it's not the user who should define it, and we're aiming to keep the Spider interface minimal. But if the callable is not a spider method, then I'm not sure request serialization (i.e. disk queues) works. But we probably don't need serialization of such requests, as they usually go directly to the downloader. So, maybe a callable could work.

I'm not sure what's better, though. With the current implementation, if someone attempts to invoke the request's callback, an exception is going to be raised, and at first sight that looks right. What issues do you see with the current implementation?

FTR, the change is motivated by scrapinghub/scrapy-poet#48.
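For reference, a sentinel with that behavior can be a module-level callable that simply raises when invoked; a minimal sketch (name and message are illustrative, not necessarily the PR's exact code):

```python
def NO_CALLBACK(*args, **kwargs):
    """Marker for Request.callback meaning "no callback, on purpose".

    Unlike None, it does not make downstream code fall back to the
    spider's parse() method, and calling it by mistake fails loudly.
    """
    raise RuntimeError(
        "The NO_CALLBACK callback has been called. This is a sentinel "
        "value and is never meant to be executed."
    )
```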
Yes. From this code (responsible for converting a request to a dict, which is used in serialisation), we can see that a callback that is not a spider method breaks serialization: scrapy/scrapy/http/request/__init__.py, lines 182 to 188 at 6ded3cf.

I hadn't thought about the option of disabling the `_find_method(spider, self.callback)` check to make it possible to serialize requests with an assigned plain function (not a spider method at all) as `Request.callback`, like this (not sure whether that would be a backward-compatible change):
no_callback3.py:

```python
import scrapy
from scrapy.crawler import CrawlerProcess


def no_callback(response):
    # print added for debugging purposes only
    print(f'reached no_callback in {response.url}')


class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        yield scrapy.Request(url='http://quotes.toscrape.com/', callback=no_callback)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(QuotesToScrapeSpider)
    process.start()
```

On the other hand, adding something like

```python
def _no_callback(self, response):
    pass
```

somewhere inside the base Spider class (https://github.com/scrapy/scrapy/blob/2.7.1/scrapy/spiders/__init__.py), and assigning that callback in the affected middlewares/pipelines just like any other callback, would already be enough for this.
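For illustration only, here is one hypothetical way the first idea (serializing a plain, module-level function) could work if the `_find_method` check were relaxed; none of these helper names exist in Scrapy:

```python
from importlib import import_module


def _callback_to_path(func):
    # Hypothetical helper: store a module-level function as a dotted path.
    return f"{func.__module__}.{func.__qualname__}"


def _callback_from_path(path):
    # Hypothetical helper: resolve the dotted path back to the function.
    module_path, _, name = path.rpartition(".")
    return getattr(import_module(module_path), name)
```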
When you say “minimal code changes”, what do you have in mind? As far as I can see, […]

I consider both having a separate type for the […]

So, while I do not love the proposed solution (I don’t like the typing workaround due to the limitations of […])

I think serialization should not affect the decision, though. The proposed alternative does not contemplate serialization either. I think supporting serialization of these requests would in both cases require changes, though I would prefer to change the serialization code to make an exception for NO_CALLBACK, rather than polluting the base Spider class.
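A sketch of what such an exception could look like in the serialization code — a hypothetical helper modeled on the callback branch of `Request.to_dict()`, not the actual change:

```python
from scrapy.http.request import NO_CALLBACK  # import path assumed


def _serialize_callback(request, spider):
    # _find_method is the private helper in scrapy/http/request/__init__.py
    # referenced earlier in this thread.
    if request.callback is None:
        return None
    if request.callback is NO_CALLBACK:
        # Special-case the marker instead of resolving a spider method name.
        return "NO_CALLBACK"
    return _find_method(spider, request.callback)
```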
OK. I agree that both approaches are roughly equal in terms of *spoiling* core parts of the Scrapy code […]. But I see a quite notable performance difference between these two approaches, according to this profiling script:

million_reqs.py:

```python
import cProfile

from scrapy import Request


def dummy_callable_to_pass_req_init(response):
    pass


def create_reqs():
    print('reached')
    for i in range(1_000_000):
        r = Request(f'http://quotes.toscrape.com/page={i}', callback=dummy_callable_to_pass_req_init)


if __name__ == '__main__':
    cProfile.run('create_reqs()', sort='cumulative')
```

(log_output omitted)
For the next profiler test I changed `_set_xback` to:

```python
def _set_xback(self, callback, errback) -> None:
    if callback is not None and not callable(callback):
        raise TypeError(
            f"callback must be a callable, got {type(callback).__name__}"
        )
    if errback is not None and not callable(errback):
        raise TypeError(f"errback must be a callable, got {type(errback).__name__}")
```

(log_output_2 omitted)

On my hardware, with the profiler enabled, this means that the changes from this pull request, with the additional callable checks inside […]. At first look this runtime impact doesn't seem significant enough. However, in Scrapy jobs that initiate a lot of requests — especially broad crawls, crawl spiders on specified domains, or other jobs where many duplicate requests are expected — on CPU-bound spiders and lower-grade hardware we can expect a more significant runtime impact, as these checks will also be executed for requests that are later filtered out by the dupefilter or by the offsite middleware.
The only necessary difference in that part of the code is the additional condition in the `if` check.

I tested with this code (notice I added back the attribute setting at the end; `NO_CALLBACK` and `NoCallbackType` are defined in the PR):

```python
from typing import Callable, Optional, Union


def _set_xback(
    self,
    callback: Union[None, NoCallbackType, Callable],
    errback: Optional[Callable],
) -> None:
    if not (
        callable(callback)
        or callback is None
        or callback is NO_CALLBACK  # Commented out on the first run.
    ):
        raise TypeError(
            f"callback must be a callable, got {type(callback).__name__}"
        )
    if not (
        callable(errback)
        or errback is None
    ):
        raise TypeError(f"errback must be a callable, got {type(errback).__name__}")
    self.callback = callback
    self.errback = errback
```

And the performance is really similar:

```
$ python million_reqs.py | grep xback
  1000000    0.465    0.000    0.633    0.000 __init__.py:107(_set_xback)
$ python million_reqs.py | grep xback
  1000000    0.485    0.000    0.658    0.000 __init__.py:107(_set_xback)
```

Which is as I would expect, because unless the callback is `NO_CALLBACK` or an invalid value, performance should be pretty much the same. I tested with `callback=NO_CALLBACK` as well:

```
$ python million_reqs.py | grep xback
  1000000    0.526    0.000    0.685    0.000 __init__.py:107(_set_xback)
```

So I don’t think performance is key here; I think both options can be implemented with a similar performance impact. We could even consider not checking the parameter type (we have type hints, after all), or checking specifically for expected bad values (e.g. […]).

That said, the only 2 cons I found for the callable approach are very minor, so I want to make it clear that I have no strong opinion on which approach to take. I am OK with any of them.
@wRAR and @kmike agree on going with the callable approach, and not to worry about serialization here. Thank you, @GeorgeA92! I’ll update the PR soon.
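With the callable approach, code such as the scrapy-poet integration mentioned above can tell the marker apart from a plain missing callback with an identity check; a small illustrative sketch (the helper name is made up, and the import path is assumed):

```python
from scrapy.http.request import NO_CALLBACK  # import path assumed


def falls_back_to_parse(request):
    # Made-up helper: callback=None historically means "use spider.parse",
    # while NO_CALLBACK explicitly opts out of callback handling.
    if request.callback is NO_CALLBACK:
        return False
    return request.callback is None
```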
For the record, #5798 (comment) is open, but I hope to address that in a few minutes.
Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>
Thanks @Gallaecio!