Implement a NO_CALLBACK value for Request.callback #5798

Merged
merged 15 commits into from Jan 30, 2023

Conversation

Gallaecio
Member

No description provided.

@Gallaecio Gallaecio requested review from kmike and wRAR January 19, 2023 16:15
@Gallaecio Gallaecio marked this pull request as ready for review January 19, 2023 16:15
@Gallaecio Gallaecio added this to the Scrapy 2.8 milestone Jan 19, 2023
@codecov

codecov bot commented Jan 19, 2023

Codecov Report

Merging #5798 (78eaf06) into master (da15d93) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5798      +/-   ##
==========================================
+ Coverage   88.93%   88.94%   +0.01%     
==========================================
  Files         162      162              
  Lines       10992    11002      +10     
  Branches     1798     1798              
==========================================
+ Hits         9776     9786      +10     
  Misses        937      937              
  Partials      279      279              
Impacted Files                               Coverage Δ
scrapy/downloadermiddlewares/robotstxt.py    100.00% <100.00%> (ø)
scrapy/http/request/__init__.py              97.82% <100.00%> (+0.04%) ⬆️
scrapy/pipelines/files.py                    71.42% <100.00%> (+0.09%) ⬆️
scrapy/pipelines/images.py                   97.08% <100.00%> (+0.02%) ⬆️
scrapy/pipelines/media.py                    98.63% <100.00%> (+0.04%) ⬆️

tests/test_pipeline_images.py Outdated
Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>
@@ -93,7 +94,7 @@ def _process_request(self, request, info, item):
         fp = self._fingerprinter.fingerprint(request)
         cb = request.callback or (lambda _: _)
         eb = request.errback
         request.callback = None
Member

@BurnzZ BurnzZ Jan 25, 2023

Member Author

Since tests were actually passing for files, I had a deeper look, and I think they pass because those Request(u) objects are later processed by this very method, so the callback gets set before the request object leaves the pipeline. So I think no further changes specific to files or images are necessary.

Member

Hm, it might still be cleaner to add the no-callback marker to these requests, as they're not supposed to use the "parse" callback.

Member Author

So we set callback to NO_CALLBACK twice, in get_media_requests and in _process_request (both called from process_item, one after the other)?

I am not against it. I just want to be certain that I made it clear enough that the callback is not set here because these request objects are processed further before they leave the pipeline, so with the current code there is no risk of anything outside the pipeline itself receiving a request with callback=None.

Member

> So we set callback to NO_CALLBACK twice, in get_media_requests and in _process_request (both called from process_item, one after the other)?

Yes. I think if the reader sees FilesPipeline.get_media_requests() with Request(u, callback=NO_CALLBACK), it helps reinforce the idea that the parse() method isn't supposed to be involved here.

Although they could also further inspect MediaPipeline._process_request() and see that NO_CALLBACK is assigned, they won't have to if FilesPipeline.get_media_requests() already shows it.
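To make the motivation concrete, here is a minimal, self-contained sketch (not Scrapy code) of how a component could tell "no callback set, so the spider's parse() applies" apart from "this request must never reach a spider callback". `FakeRequest`, `FakeSpider`, `resolve_callback`, and the `NO_CALLBACK` object defined here are hypothetical stand-ins for illustration only.

```python
# Hypothetical stand-in for the marker discussed in this thread.
NO_CALLBACK = object()


class FakeRequest:
    """Tiny stand-in for a request object; only the fields used below."""

    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


class FakeSpider:
    def parse(self, response):
        return f"parsed {response}"


def resolve_callback(request, spider):
    """Return the callable that would handle the response, or None."""
    if request.callback is NO_CALLBACK:
        return None  # explicitly marked: no spider callback involved
    if request.callback is None:
        return spider.parse  # implicit default: the spider's parse()
    return request.callback


spider = FakeSpider()
implicit = FakeRequest("https://example.com/file.pdf")
explicit = FakeRequest("https://example.com/file.pdf", callback=NO_CALLBACK)
```

With the explicit marker, a reader (or a plugin inspecting the request) does not have to guess whether parse() is meant to be involved.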

Member Author

@Gallaecio Gallaecio Jan 30, 2023

I’m making the change, then.

I wonder if we should go further, though, by changing _process_request to:

  • Log a deprecation warning if callback is None.
  • Raise an exception if callback is anything other than None or NO_CALLBACK. Or the same behavior as above, to avoid a backward-incompatible change. But I think it may be wise to actually break such code, to force users to not set a callback that is being reset in _process_request.
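The two bullet points above could be sketched roughly as follows. This is only an illustration of the proposal as worded here, not code from the pull request; `check_media_request_callback` and the stand-in `NO_CALLBACK` object are hypothetical names.

```python
import warnings

NO_CALLBACK = object()  # hypothetical stand-in for the marker


def check_media_request_callback(callback):
    """Apply the proposed checks and return the callback value to keep."""
    if callback is None:
        # First bullet: keep working, but tell users to be explicit.
        warnings.warn(
            "Media requests should set callback=NO_CALLBACK explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
        return NO_CALLBACK
    if callback is not NO_CALLBACK:
        # Second bullet (the stricter variant): reject any other value.
        raise ValueError("media requests must not define a callback")
    return callback
```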

Member

> Log a deprecation warning if callback is None.

+1 

> Raise an exception if callback is anything other than None or NO_CALLBACK. Or the same behavior as above, to avoid a backward-incompatible change. But I think it may be wise to actually break such code, to force users to not set a callback that is being reset in _process_request.

I'm not quite sure about this, since there might be some Scrapy projects out there that do things differently with their MediaPipeline/FilesPipeline. For example, they may have overridden _process_request to not directly use the downloader.

Member Author

> But I think it may be wise to actually break such code, to force users to not set a callback that is being reset in _process_request.

The callback is actually not just reset, but stored and used. So maybe my point is void, and we should continue to support callbacks on requests from get_media_requests() as usual. _process_request will make sure that the request leaves the pipeline with callback=NO_CALLBACK, but the original callback will be called nonetheless by the pipeline.
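The "stored and used" behavior described above can be sketched as below. This is a simplified, self-contained model rather than the pipeline's real code: `FakeRequest`, `process_request`, the `download` argument, and the stand-in `NO_CALLBACK` object are all hypothetical.

```python
NO_CALLBACK = object()  # hypothetical stand-in for the marker


class FakeRequest:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


def _identity(result):
    return result


def process_request(request, download):
    """Keep the user's callback, hide it from the downloader, then call it."""
    user_cb = request.callback
    if user_cb is None or user_cb is NO_CALLBACK:
        user_cb = _identity  # nothing to post-process
    # The request leaves the pipeline carrying only the marker.
    request.callback = NO_CALLBACK
    result = download(request)
    # The pipeline itself invokes the original callback on the result.
    return user_cb(result)
```

In this model, code outside the pipeline only ever sees the marker, while a callback set in get_media_requests() still runs.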

@GeorgeA92
Contributor

I probably missed the previous related discussion, but it is not clear to me why we need to allow assigning a non-callable to Request.callback
when, in these situations, we can assign a valid callable that does nothing, with the same results (or not?) and without changes to the base Request class,
with something similar to this:

no_callback.py
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        yield scrapy.Request(url='http://quotes.toscrape.com/', callback=self.no_callback)

    def no_callback(self, response):
        pass

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(QuotesToScrapeSpider)
    process.start()
no_callback2.py
import scrapy
from scrapy.crawler import CrawlerProcess

class CustomSpider(scrapy.Spider):
    name = 'custom'

    @classmethod
    def no_callback(cls, response):
        # print added for debugging purposes only:
        # as the callback is a @classmethod, self.logger is not available here
        print(f'reached no_callback in {response.url}')


class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes"
    def start_requests(self):
        yield scrapy.Request(url='http://quotes.toscrape.com/', callback=CustomSpider.no_callback)

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(QuotesToScrapeSpider)
    process.start()
At least in the related/linked issues, I didn't see any comments or signs of a similar approach being mentioned or discussed.

@kmike
Member

kmike commented Jan 25, 2023

@GeorgeA92 that's an interesting idea. It seems such a callable shouldn't be a spider method, because it's not the user who should define it, and we're aiming to keep the Spider interface minimal. But if the callable is not a spider method, then I'm not sure request serialization (i.e. disk queues) works. But we probably don't need serialization of such requests, as they usually go directly to the downloader.

So, maybe a callable could work. I'm not sure what's better, though. With the current implementation, if someone attempts to invoke the request's callback, an exception is going to be raised, and at first sight that looks right.
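The difference between the two options when a component naively invokes the callback can be sketched like this. It is a standalone illustration; `NoCallbackType` here is a hypothetical minimal class (not the definition from the pull request), and `noop_callback` and `invoke` are made-up helper names.

```python
class NoCallbackType:
    """Hypothetical non-callable marker type (no __call__ defined)."""

    def __repr__(self):
        return "NO_CALLBACK"


NO_CALLBACK = NoCallbackType()


def noop_callback(response):
    """Callable alternative: silently does nothing."""
    return None


def invoke(callback, response):
    """Call a request callback the way an unaware component might."""
    try:
        return ("ok", callback(response))
    except TypeError as exc:
        return ("error", type(exc).__name__)
```

A non-callable marker fails loudly when called by mistake, whereas a no-op callable hides the mistake.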

What issues do you see with the current implementation?

FTR, the change is motivated by scrapinghub/scrapy-poet#48.

@GeorgeA92
Contributor

> It seems such a callable shouldn't be a spider method, because it's not the user who should define it, and we're aiming to keep the Spider interface minimal. But if the callable is not a spider method, then I'm not sure request serialization (i.e. disk queues) works.

Yes. From this code (responsible for converting a request to a dict, which is used in serialization) we see that if Request.callback is callable, it must be a spider method (a method that belongs to the specific spider instance), so an external method like the one in the no_callback2.py sample from my previous comment would be unserializable.

d = {
    "url": self.url,  # urls are safe (safe_string_url)
    "callback": _find_method(spider, self.callback) if callable(self.callback) else self.callback,
    "errback": _find_method(spider, self.errback) if callable(self.errback) else self.errback,
    "headers": dict(self.headers),
}
for attr in self.attributes:

I didn't think about the option of disabling the _find_method(spider, self.callback) check to make it possible to serialize requests whose Request.callback is a plain function (not a spider method at all), like this (not sure whether it is a backward-compatible change):

no_callback3.py
import scrapy
from scrapy.crawler import CrawlerProcess


def no_callback(response):
    # print added for debugging purposes only
    print(f'reached no_callback in {response.url}')

class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        yield scrapy.Request(url='http://quotes.toscrape.com/', callback=no_callback)

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(QuotesToScrapeSpider)
    process.start()
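The serialization constraint discussed above (a callback is stored by method name, so it has to be findable on the spider) can be sketched as follows. `find_method_name` is a simplified stand-in written for this illustration, not Scrapy's `_find_method`; `DemoSpider` and `module_level_callback` are hypothetical.

```python
import inspect


def find_method_name(obj, func):
    """Return the name of the bound method of obj that wraps the same function."""
    if obj is not None and hasattr(func, "__func__"):
        for name, member in inspect.getmembers(obj, predicate=inspect.ismethod):
            if member.__func__ is func.__func__:
                return name
    raise ValueError(f"{func!r} is not a method of {obj!r}")


class DemoSpider:
    def parse(self, response):
        return response


def module_level_callback(response):
    return response


spider = DemoSpider()
```

In this sketch `find_method_name(spider, spider.parse)` yields a name that can be written to disk, while a plain function such as `module_level_callback` raises `ValueError` because it cannot be looked up on the spider later.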

On the other hand, adding something like

def _no_callback(self, response):
    pass

somewhere inside the base Spider class in https://github.com/scrapy/scrapy/blob/2.7.1/scrapy/spiders/__init__.py (and assigning that callback in the affected middlewares/pipelines just like any other callback) would already be enough for this.
This way we would get the same functionality with minimal code changes (but with a change in the base Spider class).

@Gallaecio
Member Author

When you say “minimal code changes”, what do you have in mind?

As far as I can see, if NO_CALLBACK is a callable, you can save:

  • 3 lines that we are using to define NoCallbackType.
  • 1 line in the if statement that checks if the callback value is valid.

I consider both having a separate type for the NO_CALLBACK constant and having that extra line in the if statement good things from a discoverability standpoint, i.e. they make it more obvious that NO_CALLBACK exists.

So, while I do not love the proposed solution (I don’t like the typing workaround due to the limitations of typing.Literal), I am not sure using a callable would be better:

  • It feels somewhat wrong to me to define a callable that should never be called, and should probably raise an exception if called.
  • There is no discoverability from type hints, unless you define a specific type for this callable, going from saving 4 lines to saving 1.
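For readers who want to picture the trade-off, a dedicated marker type plus one extra condition in the validation check might look roughly like this. It is a sketch under assumptions: the class body, the `CallbackT` alias, and `validate_callback` are hypothetical and do not reproduce the pull request's code.

```python
from typing import Any, Callable, Optional, Union


class NoCallbackType:
    """Hypothetical marker type; its only intended instance is NO_CALLBACK."""


NO_CALLBACK = NoCallbackType()

# A dedicated type makes the marker visible in signatures and type hints.
CallbackT = Optional[Union[Callable[..., Any], NoCallbackType]]


def validate_callback(callback: CallbackT) -> CallbackT:
    """Accept None, the marker, or any callable; reject everything else."""
    if not (callback is None or callback is NO_CALLBACK or callable(callback)):
        raise TypeError(
            f"callback must be a callable, got {type(callback).__name__}"
        )
    return callback
```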

I think serialization should not affect the decision, though. The proposed alternative does not contemplate serialization either. I think supporting serialization of these requests would in both cases require changes, though I would prefer to change the serialization code to make an exception for NO_CALLBACK, rather than polluting the Spider class.
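One way to picture "an exception for NO_CALLBACK in the serialization code" is the round trip below. This is purely illustrative: the marker string, both helper functions, and `DemoSpider` are hypothetical, and the real serialization code is not shown in this thread beyond the snippet quoted earlier.

```python
NO_CALLBACK = object()  # hypothetical stand-in for the marker

# Not a valid Python identifier, so it cannot clash with a method name.
_NO_CALLBACK_MARKER = "<NO_CALLBACK>"


def callback_to_serializable(callback, spider):
    """Turn a callback into something that can be written to a disk queue."""
    if callback is None:
        return None
    if callback is NO_CALLBACK:
        return _NO_CALLBACK_MARKER  # the special case
    name = getattr(callback, "__name__", None)
    if name and getattr(spider, name, None) == callback:
        return name  # spider methods are stored by name
    raise ValueError(f"{callback!r} is not a method of {spider!r}")


def callback_from_serializable(value, spider):
    """Inverse of callback_to_serializable."""
    if value is None:
        return None
    if value == _NO_CALLBACK_MARKER:
        return NO_CALLBACK
    return getattr(spider, value)


class DemoSpider:
    def parse(self, response):
        return response


spider = DemoSpider()
```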

@GeorgeA92
Contributor

OK. I agree that both approaches are roughly equal in terms of *spoiling* core parts of the Scrapy code (Request.__init__ or Spider.__init__).

But I see a quite notable performance difference between these two approaches:
the callable check implemented in _set_xback in this pull request will be called twice for literally every new Request, while _no_callback in Spider.__init__ would be defined only once and would apply only where it (_no_callback) is explicitly set as Request.callback, with no additional actions or checks for every other request.
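A quick way to get a feel for the per-request cost of such a check in isolation is a `timeit` micro-benchmark like the sketch below. `check_callbacks` only approximates the kind of validation being discussed, and `noop` and `measure` are hypothetical helpers; absolute numbers depend entirely on the machine and Python version.

```python
import timeit


def check_callbacks(callback, errback):
    """Per-request validation, similar in spirit to the check discussed here."""
    if callback is not None and not callable(callback):
        raise TypeError(f"callback must be a callable, got {type(callback).__name__}")
    if errback is not None and not callable(errback):
        raise TypeError(f"errback must be a callable, got {type(errback).__name__}")


def noop(response):
    return None


def measure(number=200_000):
    """Return (seconds with the check, seconds for an empty call)."""
    with_check = timeit.timeit(lambda: check_callbacks(noop, None), number=number)
    baseline = timeit.timeit(lambda: None, number=number)
    return with_check, baseline


if __name__ == "__main__":
    with_check, baseline = measure()
    print(f"with check: {with_check:.3f}s, empty call: {baseline:.3f}s")
```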

According to cProfile results for a "create a million Request objects" test:

million_reqs.py
from scrapy import Request
import cProfile

def dummy_callable_to_pass_req_init(response):
    pass

def create_reqs():
    print('reached')
    for i in range(1_000_000):
        r = Request(f'http://quotes.toscrape.com/page={i}', callback=dummy_callable_to_pass_req_init)

if __name__ == '__main__':
    cProfile.run('create_reqs()', sort='cumulative')
log_output
         106150758 function calls (105150751 primitive calls) in 102.274 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      3/1    0.000    0.000  102.769  102.769 {built-in method builtins.exec}
        1    0.000    0.000  102.769  102.769 <string>:1(<module>)
        1    3.844    3.844  102.769  102.769 reqs_init.py:7(create_reqs)
  1000000    6.086    0.000   94.849    0.000 __init__.py:73(__init__)
  1000000    2.720    0.000   83.101    0.000 __init__.py:132(_set_url)
  1000000    7.607    0.000   73.930    0.000 url.py:39(safe_url_string)
  1000000   10.973    0.000   22.560    0.000 parse.py:437(urlsplit)
  3000000    2.854    0.000   19.890    0.000 parse.py:818(quote)
  3000000    6.224    0.000   16.478    0.000 parse.py:889(quote_from_bytes)
  1000000    8.940    0.000    8.940    0.000 parse.py:902(<listcomp>)
5000000/4000000    3.174    0.000    7.683    0.000 {method 'encode' of 'str' objects}
  1000000    1.393    0.000    6.000    0.000 url.py:49(escape_ajax)
  3000000    3.374    0.000    5.336    0.000 util.py:29(to_bytes)
  1000000    2.962    0.000    5.060    0.000 parse.py:411(_splitnetloc)
  1000000    3.367    0.000    4.935    0.000 idna.py:147(encode)
  1000000    2.414    0.000    4.923    0.000 parse.py:505(urlunsplit)
  3000000    4.014    0.000    4.663    0.000 parse.py:114(_coerce_args)
  2000000    1.316    0.000    4.377    0.000 util.py:41(to_native_str)
  1000000    1.975    0.000    4.082    0.000 parse.py:593(urldefrag)
 24000013    3.899    0.000    3.899    0.000 {built-in method builtins.isinstance}
  3000000    2.587    0.000    3.803    0.000 util.py:17(to_unicode)
  1000000    1.563    0.000    3.108    0.000 trackref.py:28(__new__)
  1000000    1.523    0.000    2.485    0.000 headers.py:11(__init__)
  2000000    1.483    0.000    2.332    0.000 __init__.py:108(_set_xback)
  4000000    1.917    0.000    1.917    0.000 {method 'find' of 'str' objects}
  6000003    1.482    0.000    1.482    0.000 {method 'replace' of 'str' objects}
  1000000    1.478    0.000    1.478    0.000 {method 'sub' of 're.Pattern' objects}
  3000001    1.197    0.000    1.197    0.000 {built-in method __new__ of type object at 0x00007FFA81BEAD30}
  6000060    1.023    0.000    1.023    0.000 {built-in method builtins.len}
  1000000    0.553    0.000    0.972    0.000 <string>:1(<lambda>)
  1000000    0.967    0.000    0.967    0.000 weakref.py:370(remove)
  1000000    0.962    0.000    0.962    0.000 datatypes.py:17(__init__)
  1000000    0.897    0.000    0.897    0.000 weakref.py:428(__setitem__)
  1000000    0.671    0.000    0.862    0.000 parse.py:419(_checknetloc)
  2000000    0.792    0.000    0.792    0.000 {method 'decode' of 'bytes' objects}
  1000000    0.563    0.000    0.563    0.000 {built-in method builtins.min}
  3000000    0.555    0.000    0.555    0.000 parse.py:103(_noop)
  1000057    0.526    0.000    0.526    0.000 {method 'startswith' of 'str' objects}
  1000000    0.521    0.000    0.521    0.000 {method 'split' of 'bytes' objects}
  2000001    0.486    0.000    0.486    0.000 {built-in method builtins.setattr}
  1000007    0.437    0.000    0.437    0.000 {method 'get' of 'dict' objects}
  1000000    0.399    0.000    0.399    0.000 {method 'rstrip' of 'bytes' objects}
  1000000    0.381    0.000    0.381    0.000 __init__.py:151(_set_body)
  2000000    0.363    0.000    0.363    0.000 {built-in method builtins.callable}
  1000000    0.324    0.000    0.324    0.000 {method 'upper' of 'str' objects}
  1000000    0.286    0.000    0.286    0.000 {built-in method time.time}
  1000079    0.272    0.000    0.272    0.000 {method 'rstrip' of 'str' objects}
  1000000    0.262    0.000    0.262    0.000 __init__.py:156(encoding)
  1000000    0.257    0.000    0.257    0.000 {method 'lower' of 'str' objects}
    49999    0.062    0.000    0.218    0.000 parse.py:88(clear_cache)
  1000004    0.191    0.000    0.191    0.000 {method 'isascii' of 'str' objects}
    99998    0.156    0.000    0.156    0.000 {method 'clear' of 'dict' objects}
        1    0.000    0.000    0.004    0.004 __init__.py:71(search_function)
        1    0.000    0.000    0.004    0.004 {built-in method builtins.__import__}
      2/1    0.000    0.000    0.004    0.004 <frozen importlib._bootstrap>:1022(_find_and_load)
      2/1    0.000    0.000    0.003    0.003 <frozen importlib._bootstrap>:987(_find_and_load_unlocked)
      2/1    0.000    0.000    0.003    0.003 <frozen importlib._bootstrap>:664(_load_unlocked)
      2/1    0.000    0.000    0.003    0.003 <frozen importlib._bootstrap_external>:877(exec_module)
      2/1    0.000    0.000    0.002    0.002 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
        1    0.000    0.000    0.002    0.002 idna.py:1(<module>)
        2    0.000    0.000    0.002    0.001 <frozen importlib._bootstrap_external>:950(get_code)
        2    0.000    0.000    0.001    0.001 <frozen importlib._bootstrap_external>:1070(get_data)
        2    0.000    0.000    0.001    0.001 <frozen importlib._bootstrap>:921(_find_spec)
        2    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:1431(find_spec)
        2    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:1399(_get_spec)
        5    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:1536(find_spec)
        2    0.001    0.000    0.001    0.000 {built-in method io.open_code}
        9    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:140(_path_stat)
        9    0.001    0.000    0.001    0.000 {built-in method nt.stat}
       25    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:96(_path_join)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:670(_compile_bytecode)
        2    0.000    0.000    0.000    0.000 {built-in method marshal.loads}
        1    0.000    0.000    0.000    0.000 stringprep.py:1(<module>)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:564(module_from_spec)
        5    0.000    0.000    0.000    0.000 {built-in method builtins.__build_class__}
        2    0.000    0.000    0.000    0.000 {method 'read' of '_io.BufferedReader' objects}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:492(_init_module_attrs)
        4    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:380(cache_from_source)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:159(_path_isfile)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:150(_path_is_mode_type)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1089(path_stats)
        4    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:391(cached)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:510(_get_cached)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:169(__enter__)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1531(_get_spec)
        4    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:132(_path_split)
       25    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:119(<listcomp>)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:721(spec_from_file_location)
        2    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:179(_get_module_lock)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.print}
       87    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.max}
       13    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:67(_relax_case)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:585(_classify_pyc)
        2    0.000    0.000    0.000    0.000 __init__.py:89(find_spec)
       12    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:134(<genexpr>)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:172(_path_isabs)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:71(__init__)
        1    0.000    0.000    0.000    0.000 __init__.py:43(normalize_encoding)
       27    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:244(_verbose_message)
        1    0.000    0.000    0.000    0.000 idna.py:300(getregentry)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:100(acquire)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:173(__exit__)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:84(_unpack_uint32)
       30    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:618(_validate_timestamp_pyc)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:125(release)
        4    0.000    0.000    0.000    0.000 {built-in method _thread.allocate_lock}
       33    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
       14    0.000    0.000    0.000    0.000 {method 'rpartition' of 'str' objects}
        1    0.000    0.000    0.000    0.000 re.py:249(compile)
        1    0.000    0.000    0.000    0.000 codecs.py:94(__new__)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:198(cb)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1356(_path_importer_cache)
        8    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:897(__exit__)
        8    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:893(__enter__)
        1    0.000    0.000    0.000    0.000 re.py:288(_compile)
        9    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
        1    0.000    0.000    0.000    0.000 weakref.py:368(__init__)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:746(find_spec)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:826(find_spec)
        8    0.000    0.000    0.000    0.000 {method 'rfind' of 'str' objects}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:542(_check_name_wrapper)
        2    0.000    0.000    0.000    0.000 {built-in method _imp._fix_co_filename}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:48(_new_module)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:357(__init__)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:404(parent)
        2    0.000    0.000    0.000    0.000 {built-in method nt._path_splitroot}
        6    0.000    0.000    0.000    0.000 {built-in method from_bytes}
       12    0.000    0.000    0.000    0.000 {built-in method _imp.acquire_lock}
       12    0.000    0.000    0.000    0.000 {built-in method _imp.release_lock}
        1    0.000    0.000    0.000    0.000 {built-in method _imp.is_builtin}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1040(__init__)
        1    0.000    0.000    0.000    0.000 {method 'format' of 'str' objects}
        1    0.000    0.000    0.000    0.000 idna.py:295(StreamReader)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:165(__init__)
        6    0.000    0.000    0.000    0.000 {built-in method nt.fspath}
        4    0.000    0.000    0.000    0.000 {method '__exit__' of '_thread.lock' objects}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.locals}
        2    0.000    0.000    0.000    0.000 {built-in method _imp.is_frozen}
        1    0.000    0.000    0.000    0.000 idna.py:146(Codec)
        4    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        4    0.000    0.000    0.000    0.000 {method 'isalnum' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 idna.py:253(IncrementalDecoder)
        1    0.000    0.000    0.000    0.000 idna.py:218(IncrementalEncoder)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:412(has_location)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:874(create_module)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1065(get_filename)
        1    0.000    0.000    0.000    0.000 idna.py:292(StreamWriter)
        1    0.000    0.000    0.000    0.000 __init__.py:96(<lambda>)



Process finished with exit code 0

We see that the callable/non-callable value checks in `_set_xback` are responsible for roughly an additional 2.3% of CPU time (2.332 seconds of 102.274 total).
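The share quoted above follows directly from the two figures in the log (the cumulative time of `_set_xback` and the total run time); as a quick check:

```python
set_xback_cumtime = 2.332  # seconds, cumtime of _set_xback in the log above
total_time = 102.274       # seconds, total reported at the top of the log

share = set_xback_cumtime / total_time * 100
print(f"{share:.2f}%")  # prints "2.28%"
```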

For the next profiler test I changed _set_xback to make it as close as possible to the current implementation (non-callables are not allowed, no f_strings, and the checks are still wrapped in a function to make them visible to the profiler):

    def _set_xback(self, callback, errback) -> None:
        if callback is not None and not callable(callback):
            raise TypeError(
                f"callback must be a callable, got {type(callback).__name__}"
            )
        if errback is not None and not callable(errback):
            raise TypeError(f"errback must be a callable, got {type(errback).__name__}")
log_output_2
         102150758 function calls (101150751 primitive calls) in 101.555 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      3/1    0.000    0.000  102.063  102.063 {built-in method builtins.exec}
        1    0.000    0.000  102.063  102.063 <string>:1(<module>)
        1    3.910    3.910  102.063  102.063 reqs_init.py:7(create_reqs)
  1000000    5.805    0.000   94.052    0.000 __init__.py:73(__init__)
  1000000    2.700    0.000   84.130    0.000 __init__.py:130(_set_url)
  1000000    7.762    0.000   74.946    0.000 url.py:39(safe_url_string)
  1000000   11.103    0.000   22.714    0.000 parse.py:437(urlsplit)
  3000000    2.898    0.000   20.136    0.000 parse.py:818(quote)
  3000000    6.314    0.000   16.670    0.000 parse.py:889(quote_from_bytes)
  1000000    9.040    0.000    9.040    0.000 parse.py:902(<listcomp>)
5000000/4000000    3.235    0.000    7.830    0.000 {method 'encode' of 'str' objects}
  1000000    1.395    0.000    6.041    0.000 url.py:49(escape_ajax)
  3000000    3.400    0.000    5.398    0.000 util.py:29(to_bytes)
  1000000    3.032    0.000    5.105    0.000 parse.py:411(_splitnetloc)
  1000000    3.432    0.000    5.031    0.000 idna.py:147(encode)
  1000000    2.464    0.000    4.951    0.000 parse.py:505(urlunsplit)
  3000000    4.036    0.000    4.674    0.000 parse.py:114(_coerce_args)
  2000000    1.501    0.000    4.596    0.000 util.py:41(to_native_str)
  1000000    1.963    0.000    4.109    0.000 parse.py:593(urldefrag)
 24000013    3.919    0.000    3.919    0.000 {built-in method builtins.isinstance}
  3000000    2.621    0.000    3.831    0.000 util.py:17(to_unicode)
  1000000    1.568    0.000    3.102    0.000 trackref.py:28(__new__)
  1000000    1.490    0.000    2.442    0.000 headers.py:11(__init__)
  4000000    1.943    0.000    1.943    0.000 {method 'find' of 'str' objects}
  1000000    1.493    0.000    1.493    0.000 {method 'sub' of 're.Pattern' objects}
  6000003    1.465    0.000    1.465    0.000 {method 'replace' of 'str' objects}
  3000001    1.185    0.000    1.185    0.000 {built-in method __new__ of type object at 0x00007FFA81BEAD30}
  6000060    1.022    0.000    1.022    0.000 {built-in method builtins.len}
  1000000    0.999    0.000    0.999    0.000 weakref.py:370(remove)
  1000000    0.952    0.000    0.952    0.000 datatypes.py:17(__init__)
  1000000    0.540    0.000    0.950    0.000 <string>:1(<lambda>)
  1000000    0.888    0.000    0.888    0.000 weakref.py:428(__setitem__)
  1000000    0.679    0.000    0.874    0.000 parse.py:419(_checknetloc)
  1000000    0.607    0.000    0.813    0.000 __init__.py:107(_set_xback)
  2000000    0.776    0.000    0.776    0.000 {method 'decode' of 'bytes' objects}
  3000000    0.544    0.000    0.544    0.000 parse.py:103(_noop)
  1000057    0.538    0.000    0.538    0.000 {method 'startswith' of 'str' objects}
  1000000    0.536    0.000    0.536    0.000 {built-in method builtins.min}
  1000000    0.535    0.000    0.535    0.000 {method 'split' of 'bytes' objects}
  1000007    0.428    0.000    0.428    0.000 {method 'get' of 'dict' objects}
  1000000    0.409    0.000    0.409    0.000 __init__.py:149(_set_body)
  1000000    0.404    0.000    0.404    0.000 {method 'rstrip' of 'bytes' objects}
  1000000    0.310    0.000    0.310    0.000 {method 'upper' of 'str' objects}
  1000000    0.286    0.000    0.286    0.000 {built-in method time.time}
  1000079    0.276    0.000    0.276    0.000 {method 'rstrip' of 'str' objects}
  1000000    0.271    0.000    0.271    0.000 __init__.py:154(encoding)
  1000000    0.256    0.000    0.256    0.000 {method 'lower' of 'str' objects}
    49999    0.063    0.000    0.221    0.000 parse.py:88(clear_cache)
  1000000    0.206    0.000    0.206    0.000 {built-in method builtins.callable}
  1000004    0.195    0.000    0.195    0.000 {method 'isascii' of 'str' objects}
    99998    0.158    0.000    0.158    0.000 {method 'clear' of 'dict' objects}
        1    0.000    0.000    0.004    0.004 __init__.py:71(search_function)
        1    0.000    0.000    0.004    0.004 {built-in method builtins.__import__}
      2/1    0.000    0.000    0.004    0.004 <frozen importlib._bootstrap>:1022(_find_and_load)
      2/1    0.000    0.000    0.004    0.004 <frozen importlib._bootstrap>:987(_find_and_load_unlocked)
      2/1    0.000    0.000    0.003    0.003 <frozen importlib._bootstrap>:664(_load_unlocked)
      2/1    0.000    0.000    0.003    0.003 <frozen importlib._bootstrap_external>:877(exec_module)
      2/1    0.000    0.000    0.002    0.002 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
        1    0.000    0.000    0.002    0.002 idna.py:1(<module>)
        2    0.000    0.000    0.002    0.001 <frozen importlib._bootstrap_external>:950(get_code)
        2    0.000    0.000    0.001    0.001 <frozen importlib._bootstrap_external>:1070(get_data)
        2    0.000    0.000    0.001    0.001 <frozen importlib._bootstrap>:921(_find_spec)
        2    0.000    0.000    0.001    0.001 <frozen importlib._bootstrap_external>:1431(find_spec)
        2    0.000    0.000    0.001    0.001 <frozen importlib._bootstrap_external>:1399(_get_spec)
        5    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:1536(find_spec)
        2    0.001    0.000    0.001    0.000 {built-in method io.open_code}
        9    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:140(_path_stat)
        9    0.001    0.000    0.001    0.000 {built-in method nt.stat}
       25    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:96(_path_join)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:670(_compile_bytecode)
        2    0.000    0.000    0.000    0.000 {built-in method marshal.loads}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:564(module_from_spec)
        1    0.000    0.000    0.000    0.000 stringprep.py:1(<module>)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:492(_init_module_attrs)
        4    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:380(cache_from_source)
        5    0.000    0.000    0.000    0.000 {built-in method builtins.__build_class__}
        2    0.000    0.000    0.000    0.000 {method 'read' of '_io.BufferedReader' objects}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1089(path_stats)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:159(_path_isfile)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:150(_path_is_mode_type)
        4    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:391(cached)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:510(_get_cached)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:169(__enter__)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1531(_get_spec)
        4    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:132(_path_split)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:721(spec_from_file_location)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:179(_get_module_lock)
        2    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
       25    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:119(<listcomp>)
       13    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:585(_classify_pyc)
       87    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.max}
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1356(_path_importer_cache)
        2    0.000    0.000    0.000    0.000 __init__.py:89(find_spec)
        1    0.000    0.000    0.000    0.000 __init__.py:43(normalize_encoding)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:71(__init__)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:172(_path_isabs)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:100(acquire)
       12    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:134(<genexpr>)
        1    0.000    0.000    0.000    0.000 idna.py:300(getregentry)
        6    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:84(_unpack_uint32)
       27    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:244(_verbose_message)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:173(__exit__)
       30    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        4    0.000    0.000    0.000    0.000 {built-in method _thread.allocate_lock}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:125(release)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:618(_validate_timestamp_pyc)
        1    0.000    0.000    0.000    0.000 re.py:249(compile)
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:67(_relax_case)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:746(find_spec)
       33    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
       14    0.000    0.000    0.000    0.000 {method 'rpartition' of 'str' objects}
        1    0.000    0.000    0.000    0.000 re.py:288(_compile)
        1    0.000    0.000    0.000    0.000 codecs.py:94(__new__)
        1    0.000    0.000    0.000    0.000 weakref.py:368(__init__)
        9    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
        8    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:897(__exit__)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:198(cb)
        8    0.000    0.000    0.000    0.000 {method 'rfind' of 'str' objects}
        8    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:893(__enter__)
        1    0.000    0.000    0.000    0.000 {built-in method _imp.is_builtin}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:542(_check_name_wrapper)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:48(_new_module)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:826(find_spec)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:404(parent)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:165(__init__)
        2    0.000    0.000    0.000    0.000 {built-in method _imp._fix_co_filename}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:357(__init__)
        6    0.000    0.000    0.000    0.000 {built-in method from_bytes}
        1    0.000    0.000    0.000    0.000 {method 'format' of 'str' objects}
       12    0.000    0.000    0.000    0.000 {built-in method _imp.release_lock}
        2    0.000    0.000    0.000    0.000 {built-in method nt._path_splitroot}
       12    0.000    0.000    0.000    0.000 {built-in method _imp.acquire_lock}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1040(__init__)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.locals}
        6    0.000    0.000    0.000    0.000 {built-in method nt.fspath}
        1    0.000    0.000    0.000    0.000 idna.py:146(Codec)
        4    0.000    0.000    0.000    0.000 {method '__exit__' of '_thread.lock' objects}
        1    0.000    0.000    0.000    0.000 idna.py:253(IncrementalDecoder)
        2    0.000    0.000    0.000    0.000 {built-in method _imp.is_frozen}
        4    0.000    0.000    0.000    0.000 {method 'isalnum' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.setattr}
        4    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:412(has_location)
        2    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:874(create_module)
        1    0.000    0.000    0.000    0.000 idna.py:218(IncrementalEncoder)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1065(get_filename)
        1    0.000    0.000    0.000    0.000 idna.py:292(StreamWriter)
        1    0.000    0.000    0.000    0.000 idna.py:295(StreamReader)
        1    0.000    0.000    0.000    0.000 __init__.py:96(<lambda>)



Process finished with exit code 0

In this test, the implementation close to the current one accounts for ~0.8% of CPU runtime (0.813 seconds of 101.555 total).

On my hardware, with the profiler enabled, this means that the changes from this pull request, with the additional callable checks inside Request.__init__ and a regular callable as the callback, add ~1.5 seconds of CPU runtime per 1 million Request.__init__ calls.

At first glance this runtime impact doesn't look significant. However, we can expect a more significant impact in Scrapy jobs that create a lot of requests: broad crawls, crawl spiders restricted to specific domains, or other cases where many duplicate requests are expected, especially with CPU-bound spiders and lower-grade hardware. These checks are also executed for requests that are later filtered out by the dupefilter or the offsite middleware.
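A measurement of this kind can be reproduced in miniature with cProfile. The sketch below is only an approximation: `FakeRequest` and `profile_constructor` are hypothetical stand-ins (not Scrapy's `Request` or the benchmark script used in this thread), and the call count is smaller, so absolute numbers will differ.

```python
import cProfile
import io
import pstats
from typing import Callable, Optional


class FakeRequest:
    """Hypothetical stand-in for a request class; this is not Scrapy's Request."""

    def __init__(
        self,
        url: str,
        callback: Optional[Callable] = None,
        errback: Optional[Callable] = None,
    ) -> None:
        self.url = url
        self._set_xback(callback, errback)

    def _set_xback(
        self, callback: Optional[Callable], errback: Optional[Callable]
    ) -> None:
        # Same kind of validation as discussed in this thread.
        if not (callable(callback) or callback is None):
            raise TypeError(
                f"callback must be a callable, got {type(callback).__name__}"
            )
        if not (callable(errback) or errback is None):
            raise TypeError(
                f"errback must be a callable, got {type(errback).__name__}"
            )
        self.callback = callback
        self.errback = errback


def parse(response):
    """Placeholder callback; it is never invoked by this sketch."""
    return None


def profile_constructor(n: int = 100_000) -> str:
    """Profile n constructor calls and return the stats rows for _set_xback."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n):
        FakeRequest("https://example.com", callback=parse)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
    stats.print_stats("_set_xback")  # restrict the report to matching rows
    return stream.getvalue()


if __name__ == "__main__":
    print(profile_constructor())
```

Because the profiler itself adds overhead to every call, the totals it reports are best read as relative shares rather than as real-world timings.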

@Gallaecio
Member Author

Gallaecio commented Jan 26, 2023

The only necessary difference in that part of the code is the additional condition in the if statement. We can revert my refactoring into _set_xback to keep the performance as good as possible.

I tested with this code (notice I added back attribute setting at the end):

    def _set_xback(
        self,
        callback: Union[None, NoCallbackType, Callable],
        errback: Optional[Callable],
    ) -> None:
        if not (
            callable(callback)
            or callback is None
            or callback is NO_CALLBACK  # Commented out on the first run.
        ):
            raise TypeError(
                f"callback must be a callable, got {type(callback).__name__}"
            )
        if not (
            callable(errback)
            or errback is None
        ):
            raise TypeError(f"errback must be a callable, got {type(errback).__name__}")
        self.callback = callback
        self.errback = errback

And the performance is really similar:

$ python million_reqs.py | grep xback
  1000000    0.465    0.000    0.633    0.000 __init__.py:107(_set_xback)
$ python million_reqs.py | grep xback
  1000000    0.485    0.000    0.658    0.000 __init__.py:107(_set_xback)

Which is as I would expect, because unless the callback is NO_CALLBACK or an invalid value, performance should be pretty much the same. I tested with callback=NO_CALLBACK in the benchmark script, and the performance did not change significantly either:

$ python million_reqs.py | grep xback
  1000000    0.526    0.000    0.685    0.000 __init__.py:107(_set_xback)

So I don’t think performance is key here; I think both options can be implemented with a similar performance impact. We could even consider not checking the parameter type (we have hints, after all), or checking specifically for expected bad values (e.g. isinstance(callback, str)), and that would probably have more of a performance impact than this change.
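The alternatives mentioned here can be compared directly with timeit. The following is a minimal sketch under stated assumptions: the three check functions and `time_check` are hypothetical helpers written for illustration, and `NO_CALLBACK` is modelled as a plain `object()` sentinel. With a regular callable, `callable(callback)` is true and the `or` chain short-circuits, so the extra sentinel comparison is never evaluated.

```python
import timeit
from typing import Callable

# Hypothetical non-callable sentinel, used only for this sketch.
NO_CALLBACK = object()


def parse(response):
    """Placeholder callback; it is never invoked by this sketch."""
    return None


def check_callable_only(callback) -> bool:
    """Validation without the sentinel condition."""
    return callable(callback) or callback is None


def check_with_sentinel(callback) -> bool:
    """Validation with the additional identity check for the sentinel."""
    return callable(callback) or callback is None or callback is NO_CALLBACK


def check_isinstance_str(callback) -> bool:
    """Alternative: reject only one expected bad value type (strings)."""
    return not isinstance(callback, str)


def time_check(check: Callable, callback, number: int = 1_000_000) -> float:
    """Return the seconds taken by `number` calls of check(callback)."""
    return timeit.timeit(lambda: check(callback), number=number)


if __name__ == "__main__":
    for check in (check_callable_only, check_with_sentinel, check_isinstance_str):
        print(f"{check.__name__}: {time_check(check, parse):.3f}s")
```

The lambda wrapper adds a constant cost to every variant, so only the differences between the printed timings are meaningful.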

That said, the only 2 cons I found for the callable approach are very minor, so I want to make it clear that I have no strong opinion on which approach to take. I am OK with either of them.

@Gallaecio
Member Author

@wRAR and @kmike agree on going with the callable approach, and not to worry about serialization here. Thank you, @GeorgeA92! I’ll update the PR soon.
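As an illustration of what the callable approach could look like, here is a minimal sketch. It assumes that the approach means defining NO_CALLBACK as a function that must never be called, instead of a non-callable sentinel. `validate_callback` is a hypothetical helper, and the error messages are made up for this example.

```python
from typing import Any, Callable, NoReturn, Optional


def NO_CALLBACK(*args: Any, **kwargs: Any) -> NoReturn:
    """Marker for requests that are not meant to have a callback.

    Being a function, it passes a plain callable() check; calling it is
    treated as a programming error.
    """
    raise RuntimeError("NO_CALLBACK is a marker value and must never be called")


def validate_callback(callback: Optional[Callable]) -> Optional[Callable]:
    """Hypothetical helper: no extra condition is needed for the marker."""
    if not (callable(callback) or callback is None):
        raise TypeError(
            f"callback must be a callable, got {type(callback).__name__}"
        )
    return callback


if __name__ == "__main__":
    # The marker is accepted like any other callable.
    assert validate_callback(NO_CALLBACK) is NO_CALLBACK
    # Code that needs to special-case the marker compares by identity.
    print(validate_callback(NO_CALLBACK) is NO_CALLBACK)
```

With this shape the validation code stays unchanged, and any component that needs to treat such requests differently can test `callback is NO_CALLBACK`.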

@Gallaecio
Member Author

For the record, #5798 (comment) is open, but I hope to address that in a few minutes.

@kmike kmike merged commit b337c98 into scrapy:master Jan 30, 2023
@kmike
Member

kmike commented Jan 30, 2023

Thanks @Gallaecio!
