Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow disabling the AutoThrottle extension for a given slot #6246

Merged
merged 11 commits into from
Mar 12, 2024

Conversation

Gallaecio
Copy link
Member

@Gallaecio Gallaecio commented Feb 26, 2024

No description provided.

@Gallaecio Gallaecio requested a review from kmike February 26, 2024 14:34
Copy link

codecov bot commented Feb 26, 2024

Codecov Report

Merging #6246 (d357689) into master (2d46b4a) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6246   +/-   ##
=======================================
  Coverage   88.90%   88.90%           
=======================================
  Files         161      161           
  Lines       11790    11792    +2     
  Branches     1913     1913           
=======================================
+ Hits        10482    10484    +2     
  Misses        980      980           
  Partials      328      328           
Files Coverage Δ
scrapy/core/downloader/__init__.py 91.83% <100.00%> (+0.11%) ⬆️
scrapy/extensions/throttle.py 100.00% <100.00%> (ø)

@Gallaecio Gallaecio merged commit d7581c6 into scrapy:master Mar 12, 2024
26 checks passed
@GeorgeA92
Copy link
Contributor

@Gallaecio
I didn't notice this earlier.
As a developer who was responsible for preparing/testing actual implementation of DOWNLOAD_SLOTS I don't like approach to implement compatibility with AutoThrottle extension from this Pull request.

During preparation of #5328 compatibility with Autothrottle extenstion didn't mentioned as requirement. It tested on default settings (with disabled autothrottle extension)

Updating AutoThrottle extension to make it compatible with DOWNLOAD_SLOT setting values - makes sense only if all of its settings AUTOTHROTTLE_TARGET_CONCURRENCY, AUTOTHROTTLE_ENABLED, AUTOTHROTTLE_START_DELAY, AUTOTHROTTLE_MAX_DELAY will be possible to set on per download slot basis (for example by just adding related supported keys for exiting DOWNLOAD_SLOTS setting) and from scrapy user perspective most likely this will be expected behaviour for DOWNLOAD_SLOTS with enabled AutoThrottle extention.

While disabling throttle on specific slot +- looks like applying of AUTOHTROTTLE_ENABLED only.

And there is no need to.. make backward incompatible changes to scrapy/core downloader slot to implement this.

Technically nothing prevents us to update downloader.per_slot_settings dict used for slots initially listed on DOWNLOAD_SLOTS setting during runtime from Autothrottle extension - as this extension affect downloader variables directly already

@Gallaecio
Copy link
Member Author

So, what you are suggesting is to implement an AutoThrottle-specific setting, e.g. AUTOTHROTTLE_SKIP_SLOTS? That sounds better to me indeed for the scenario mentioned in the docs change here.

However, my end goal here, the reason why I implemented this, is to be able to disable AutoThrottle on slots created by scrapy-zyte-api, so that we can implement latency reporting without triggering AutoThrottle. For that, I don’t think a setting would be a good choice.

I am completely open to suggestions, though. I really do not like modifying a core component to control an extension behavior, it is simply the cleanest thing I could come up with.

@GeorgeA92
Copy link
Contributor

@Gallaecio

So, what you are suggesting is to implement an AutoThrottle-specific setting, e.g. AUTOTHROTTLE_SKIP_SLOTS? That sounds better to me indeed for the scenario mentioned in the docs change here.

I mean to something like "quotes.toscrape.com": {"concurrency": 1, "delay": 2, "randomize_delay": False, "AUTOTHROTTLE_START_DELAY": 5 , "AUTOTHROTTLE_MAX_DELAY":25}, etc.

However, my end goal here, the reason why I implemented this, is to be able to disable AutoThrottle on slots created by scrapy-zyte-api, so that we can scrapy-plugins/scrapy-zyte-api#59 (comment). For that, I don’t think a setting would be a good choice.

This will require to.. update params for all download slots with _slot_prefix = "zyte-api@" which is not implemented in DOWNLOAD_SLOTS (will require to hardcode all expected slots with prefixed).

I am completely open to suggestions, though. I really do not like modifying a core component to control an extension behavior, it is simply the cleanest thing I could come up with.

My suggestion is to update.. method of AutoThrottle extension _adjust_delay or _response_downloaded to not make changes if slot has _slot_prefix .
I mean changing this during runtime on active AutoThrottle instance from ScrapyZyteAPIDownloaderMiddleware.open_spider it's spider_opened signal handler.
It can be:

  1. monkeypathing of _adjust_delay method.
  2. diconnect AutoThrottle._response_downloaded signal handler and connect it's updated method (I did similar tricks with signal handlers before)

This approach doesn't require to change both downloader and autothrottle ext (only zyte api middleware code required to update)

@Gallaecio
Copy link
Member Author

What about exposing a AUTOTHROTTLE_SLOT_FILTER setting that accepts a Scrapy component for filtering, and set that from scrapy-zyte-api to a component that filters out our slots?

@GeorgeA92
Copy link
Contributor

@Gallaecio
If end goal of this to affect downloader slots from zyteapi middleware/download hanlder - then (if possible - I think it is possible) only zyteapi ext. related code needs to be updated or fixed (with no additional changes to scrapy itself) - this is logic behind approach I proposed in #6246 (comment)

What about exposing a AUTOTHROTTLE_SLOT_FILTER setting that accepts a Scrapy component for filtering, and set that from scrapy-zyte-api to a component that filters out our slots?

With approach I proposed earlier - for end user It will be possible to receive expected result only after updating version of zyteapi middleware. This approach (adding another setting on scrapy code level) as well as updates from this PR also require to update scrapy to the latest version(that include this change) that.. may not be possible for some projects.

@Gallaecio
Copy link
Member Author

only zyteapi ext. related code needs to be updated

Yes, but if I understood correctly, you are suggesting to monkey-patch Scrapy code from the extension, which I would like to avoid.

require to update scrapy to the latest version

I think this is OK, since the only thing we need this for is to support latency reporting for responses coming for scrapy-zyte-api, and most users will probably not care about that.

@GeorgeA92
Copy link
Contributor

@Gallaecio

I think this is OK, since the only thing we need this for is to support latency reporting for responses coming for scrapy-zyte-api, and most users will probably not care about that.

As far as I know scrapy-zyte-api (handler) doesn't set download_latency as regular scrapy download handlers do - If yes then I'd prefer to not change that (or set download_latency values to.. some other meta key). Autothrottle should't make any changes to slot if download_latency meta value doesn't exist in request (which means that autothrottle extenstion shouldn't affect requests originated from scrapy-zyte-api)

latency = request.meta.get("download_latency")
if latency is None or slot is None or slot.throttle is False:
return

Scrapy-zyte-api(handler) has it's own logic that duplicate logic of retrymiddleware and autothrottle extension (as mentioned on scrapy-plugins/scrapy-zyte-api#99 (comment).
What if for single scrapy request - zyteapi download handler will make 4 retries and 1 200s (5 requests) with latencies like [2.5, 6, 0.24, 1.5, 60] ? Autothrottle extension can accept download_latency as single number only

In this case I propose to.. update scrapy-zyte-api.. to write (or to not write) something in download_latency request meta key as trigger to disable/enable autothrottle for that requests - this approach looks more conventional and I still don't see a reason to make backward incompatible changes to scrapy core parts

@Gallaecio
Copy link
Member Author

As far as I know scrapy-zyte-api (handler) doesn't set download_latency as regular scrapy download handlers do

This was not done on purpose, though, it was something we accidentally forgot to implement, i.e. a bug. And by the time we realized it, we were already exploiting this bug to prevent AutoThrottle from affecting Zyte API traffic. But we do want to fix the bug, i.e. implement download_latency, only we need to first figure out a way to handle AutoThrottle that does not depend on download_latency not being defined.

What if for single scrapy request - zyteapi download handler will make 4 retries and 1 200s (5 requests) with latencies like [2.5, 6, 0.24, 1.5, 60] ? Autothrottle extension can accept download_latency as single number only

I think this is fine. download_latency is the time the request spends in the download handler. If a request takes longer because of 429 or 521 retries, so be it, and the download_latency would in practice be the sum of the time every attempt took until success or exceeding retries.

I propose to.. update scrapy-zyte-api.. to write (or to not write) something in download_latency request meta key as trigger to disable/enable autothrottle for that requests

I think putting a value in download_latency other than the time the request spent in the download handler would not be a good choice. It would break the expectations of user code using download_latency about what download_latency stands for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants