Allow disabling the AutoThrottle extension for a given slot #6246

Gallaecio · 2024-02-26T14:34:24Z

No description provided.

codecov · 2024-02-26T14:45:11Z

Codecov Report

Merging #6246 (d357689) into master (2d46b4a) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #6246   +/-   ##
=======================================
  Coverage   88.90%   88.90%           
=======================================
  Files         161      161           
  Lines       11790    11792    +2     
  Branches     1913     1913           
=======================================
+ Hits        10482    10484    +2     
  Misses        980      980           
  Partials      328      328

Files	Coverage Δ
scrapy/core/downloader/__init__.py	`91.83% <100.00%> (+0.11%)`	⬆️
scrapy/extensions/throttle.py	`100.00% <100.00%> (ø)`

GeorgeA92 · 2024-03-13T17:34:39Z

@Gallaecio
I didn't notice this earlier.
As a developer who was responsible for preparing/testing actual implementation of DOWNLOAD_SLOTS I don't like approach to implement compatibility with AutoThrottle extension from this Pull request.

During preparation of #5328 compatibility with Autothrottle extenstion didn't mentioned as requirement. It tested on default settings (with disabled autothrottle extension)

Updating AutoThrottle extension to make it compatible with DOWNLOAD_SLOT setting values - makes sense only if all of its settings AUTOTHROTTLE_TARGET_CONCURRENCY, AUTOTHROTTLE_ENABLED, AUTOTHROTTLE_START_DELAY, AUTOTHROTTLE_MAX_DELAY will be possible to set on per download slot basis (for example by just adding related supported keys for exiting DOWNLOAD_SLOTS setting) and from scrapy user perspective most likely this will be expected behaviour for DOWNLOAD_SLOTS with enabled AutoThrottle extention.

While disabling throttle on specific slot +- looks like applying of AUTOHTROTTLE_ENABLED only.

And there is no need to.. make backward incompatible changes to scrapy/core downloader slot to implement this.

Technically nothing prevents us to update downloader.per_slot_settings dict used for slots initially listed on DOWNLOAD_SLOTS setting during runtime from Autothrottle extension - as this extension affect downloader variables directly already

Gallaecio · 2024-03-14T21:43:07Z

So, what you are suggesting is to implement an AutoThrottle-specific setting, e.g. AUTOTHROTTLE_SKIP_SLOTS? That sounds better to me indeed for the scenario mentioned in the docs change here.

However, my end goal here, the reason why I implemented this, is to be able to disable AutoThrottle on slots created by scrapy-zyte-api, so that we can implement latency reporting without triggering AutoThrottle. For that, I don’t think a setting would be a good choice.

I am completely open to suggestions, though. I really do not like modifying a core component to control an extension behavior, it is simply the cleanest thing I could come up with.

GeorgeA92 · 2024-03-15T00:37:18Z

@Gallaecio

So, what you are suggesting is to implement an AutoThrottle-specific setting, e.g. AUTOTHROTTLE_SKIP_SLOTS? That sounds better to me indeed for the scenario mentioned in the docs change here.

I mean to something like "quotes.toscrape.com": {"concurrency": 1, "delay": 2, "randomize_delay": False, "AUTOTHROTTLE_START_DELAY": 5 , "AUTOTHROTTLE_MAX_DELAY":25}, etc.

However, my end goal here, the reason why I implemented this, is to be able to disable AutoThrottle on slots created by scrapy-zyte-api, so that we can scrapy-plugins/scrapy-zyte-api#59 (comment). For that, I don’t think a setting would be a good choice.

This will require to.. update params for all download slots with _slot_prefix = "zyte-api@" which is not implemented in DOWNLOAD_SLOTS (will require to hardcode all expected slots with prefixed).

I am completely open to suggestions, though. I really do not like modifying a core component to control an extension behavior, it is simply the cleanest thing I could come up with.

My suggestion is to update.. method of AutoThrottle extension _adjust_delay or _response_downloaded to not make changes if slot has _slot_prefix .
I mean changing this during runtime on active AutoThrottle instance from ScrapyZyteAPIDownloaderMiddleware.open_spider it's spider_opened signal handler.
It can be:

monkeypathing of _adjust_delay method.
diconnect AutoThrottle._response_downloaded signal handler and connect it's updated method (I did similar tricks with signal handlers before)

This approach doesn't require to change both downloader and autothrottle ext (only zyte api middleware code required to update)

Gallaecio · 2024-03-15T09:26:23Z

What about exposing a AUTOTHROTTLE_SLOT_FILTER setting that accepts a Scrapy component for filtering, and set that from scrapy-zyte-api to a component that filters out our slots?

GeorgeA92 · 2024-03-18T11:57:16Z

@Gallaecio
If end goal of this to affect downloader slots from zyteapi middleware/download hanlder - then (if possible - I think it is possible) only zyteapi ext. related code needs to be updated or fixed (with no additional changes to scrapy itself) - this is logic behind approach I proposed in #6246 (comment)

What about exposing a AUTOTHROTTLE_SLOT_FILTER setting that accepts a Scrapy component for filtering, and set that from scrapy-zyte-api to a component that filters out our slots?

With approach I proposed earlier - for end user It will be possible to receive expected result only after updating version of zyteapi middleware. This approach (adding another setting on scrapy code level) as well as updates from this PR also require to update scrapy to the latest version(that include this change) that.. may not be possible for some projects.

Gallaecio · 2024-03-18T12:39:48Z

only zyteapi ext. related code needs to be updated

Yes, but if I understood correctly, you are suggesting to monkey-patch Scrapy code from the extension, which I would like to avoid.

require to update scrapy to the latest version

I think this is OK, since the only thing we need this for is to support latency reporting for responses coming for scrapy-zyte-api, and most users will probably not care about that.

GeorgeA92 · 2024-03-25T16:09:03Z

@Gallaecio

I think this is OK, since the only thing we need this for is to support latency reporting for responses coming for scrapy-zyte-api, and most users will probably not care about that.

As far as I know scrapy-zyte-api (handler) doesn't set download_latency as regular scrapy download handlers do - If yes then I'd prefer to not change that (or set download_latency values to.. some other meta key). Autothrottle should't make any changes to slot if download_latency meta value doesn't exist in request (which means that autothrottle extenstion shouldn't affect requests originated from scrapy-zyte-api)

scrapy/scrapy/extensions/throttle.py

Lines 64 to 66 in 02b97f9

    
           latency = request.meta.get("download_latency") 
        
           if latency is None or slot is None or slot.throttle is False: 
        
               return

Scrapy-zyte-api(handler) has it's own logic that duplicate logic of retrymiddleware and autothrottle extension (as mentioned on scrapy-plugins/scrapy-zyte-api#99 (comment).
What if for single scrapy request - zyteapi download handler will make 4 retries and 1 200s (5 requests) with latencies like [2.5, 6, 0.24, 1.5, 60] ? Autothrottle extension can accept download_latency as single number only

In this case I propose to.. update scrapy-zyte-api.. to write (or to not write) something in download_latency request meta key as trigger to disable/enable autothrottle for that requests - this approach looks more conventional and I still don't see a reason to make backward incompatible changes to scrapy core parts

Gallaecio · 2024-04-05T10:30:54Z

As far as I know scrapy-zyte-api (handler) doesn't set download_latency as regular scrapy download handlers do

This was not done on purpose, though, it was something we accidentally forgot to implement, i.e. a bug. And by the time we realized it, we were already exploiting this bug to prevent AutoThrottle from affecting Zyte API traffic. But we do want to fix the bug, i.e. implement download_latency, only we need to first figure out a way to handle AutoThrottle that does not depend on download_latency not being defined.

What if for single scrapy request - zyteapi download handler will make 4 retries and 1 200s (5 requests) with latencies like [2.5, 6, 0.24, 1.5, 60] ? Autothrottle extension can accept download_latency as single number only

I think this is fine. download_latency is the time the request spends in the download handler. If a request takes longer because of 429 or 521 retries, so be it, and the download_latency would in practice be the sum of the time every attempt took until success or exceeding retries.

I propose to.. update scrapy-zyte-api.. to write (or to not write) something in download_latency request meta key as trigger to disable/enable autothrottle for that requests

I think putting a value in download_latency other than the time the request spent in the download handler would not be a good choice. It would break the expectations of user code using download_latency about what download_latency stands for.

Gallaecio added 2 commits February 26, 2024 15:27

Complete coverage for the AutoThrottle extension

7bd1af9

Allow disabling the AutoThrottle extension for a given slot

4f0d21f

Gallaecio requested a review from kmike February 26, 2024 14:34

Gallaecio added 9 commits February 26, 2024 15:50

Cover Slot.throttle in the docs

6a3866b

Add throttle support to DOWNLOAD_SLOTS

72ad899

Clear caplog before use

3c96cb1

Implement target concurrency checking

f77cecd

Document the lower limit of AUTOTHROTTLE_TARGET_CONCURRENCY

e019c18

Merge branch 'autothrottle-coverage' into slot-autothrottle

3a20705

Fix repr test

654e981

Fix unclean reactor issue

b4e1132

Merge branch 'master' into slot-autothrottle

d357689

wRAR approved these changes Feb 27, 2024

View reviewed changes

VMRuiz approved these changes Feb 27, 2024

View reviewed changes

Gallaecio merged commit d7581c6 into scrapy:master Mar 12, 2024
26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow disabling the AutoThrottle extension for a given slot #6246

Allow disabling the AutoThrottle extension for a given slot #6246

Gallaecio commented Feb 26, 2024 •

edited

codecov bot commented Feb 26, 2024 •

edited

GeorgeA92 commented Mar 13, 2024

Gallaecio commented Mar 14, 2024

GeorgeA92 commented Mar 15, 2024

Gallaecio commented Mar 15, 2024

GeorgeA92 commented Mar 18, 2024

Gallaecio commented Mar 18, 2024

GeorgeA92 commented Mar 25, 2024

Gallaecio commented Apr 5, 2024

Allow disabling the AutoThrottle extension for a given slot #6246

Allow disabling the AutoThrottle extension for a given slot #6246

Conversation

Gallaecio commented Feb 26, 2024 • edited

codecov bot commented Feb 26, 2024 • edited

Codecov Report

GeorgeA92 commented Mar 13, 2024

Gallaecio commented Mar 14, 2024

GeorgeA92 commented Mar 15, 2024

Gallaecio commented Mar 15, 2024

GeorgeA92 commented Mar 18, 2024

Gallaecio commented Mar 18, 2024

GeorgeA92 commented Mar 25, 2024

Gallaecio commented Apr 5, 2024

Gallaecio commented Feb 26, 2024 •

edited

codecov bot commented Feb 26, 2024 •

edited