-
Notifications
You must be signed in to change notification settings - Fork 10.9k
Per slot settings #5328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Per slot settings #5328
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5328 +/- ##
==========================================
+ Coverage 88.95% 88.96% +0.01%
==========================================
Files 162 162
Lines 11007 11021 +14
Branches 1798 1796 -2
==========================================
+ Hits 9791 9805 +14
- Misses 937 938 +1
+ Partials 279 278 -1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, nice job! Tests and documentation should go next, before we merge.
Re-running failing jobs… |
This PR is ready for review. |
@GeorgeA92 #5328 (comment) is still unaddressed. |
tolerance = 0.3 | ||
|
||
delays_real = {k: v[1] - v[0] for k, v in times.items()} | ||
error_delta = {k: 1 - delays_real[k] / v.delay for k, v in slots.items()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, for one slot in the tests the delay is 2s, for another it's 1.5s, and it's 1s for the default. Let's say implementation is incorrect, and in all cases delay of 2s is used. error_delta values would be 1 - 2/1.5 = -0.33
and (1-2/1 = 01)
. The assertion below will pass: max(-0.33, -1) = -0.33 < 0.3
is True.
If the 1.5s delay is used instead of 2s delay, then error_delta values would be 1 - 1.5/2 = 0.25
. 0.25 < 0.3
is True again. It may catch the default 1s download delay though.
Maybe there are reasons these tests may detect an issue (e.g. default DOWNLOAD_DELAY), but I think it could make sense to try improving them. Have you encountered issues with a simpler implementation, something like abs(delays_real[k] - v.delay) < tolerance
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test implemented directly in the same way as existing test for DOWNLOAD_DELAY
setting (counting that we need to test multiple downloader slots now)
Lines 85 to 95 in 8004075
tolerance = (1 - (0.6 if randomize else 0.2)) | |
settings = {"DOWNLOAD_DELAY": delay, | |
'RANDOMIZE_DOWNLOAD_DELAY': randomize} | |
crawler = get_crawler(FollowAllSpider, settings) | |
yield crawler.crawl(**crawl_kwargs) | |
times = crawler.spider.times | |
total_time = times[-1] - times[0] | |
average = total_time / (len(times) - 1) | |
self.assertTrue(average > delay * tolerance, | |
f"download delay too small: {average}") |
On my local tests delays_real
calculated from time.time
calls in spider by some unknown reason always a bit lower that delays from settings:
for delays [1, 1.5, 2] - I received [~0.90, ~1.43, ~ 1.90] in delays_real
. So on this condition I don't expect issues from value that became negative.
Anyway error calculation method updated to be sure that it will not happen.
I also thought about increasing delay for test from [1, 1.5, 2] to for example [3, 5, 7] in this case with increased delays in test we can safely reduce tolerance
.
The main question what is acceptable total time for this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main question what is acceptable total time for this test?
The minimum value that still makes the test reliable, both in the sense that it does not break randomly, and that it indeed validates what it is meant to validate. If we need 30 seconds for that, so be it. But if it can be done in 5 seconds, that would be better.
Hey! The feature makes sense, and the implementation looks good 👍 I think to merge it, we need to fix a few small issues and add the docs. |
# Conflicts: # scrapy/core/downloader/__init__.py
Updated pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/scrapy/scrapy/pull/5328/files#r933163347 is still not addressed, as far as I can tell. I think we need to apply per-slot settings conc
after the call to _get_concurrency_delay
, to make sure conc
from per-slot settings takes precedence.
How does this solve #3529? |
This pull request doesn't solve #3529 itself I suppose that @Gallaecio misunderstood my mention of #3529 when we discussed this pull request |
Thanks! |
Implementation of per slot(downloader slot) settings.
On general case scrapy create new slot for each domain using
DOWNLOAD_DELAY
,CONCURRENT_REQUESTS_PER_DOMAIN
(orCONCURRENT_REQUESTS_PER_IP
) andRANDOMIZE_DOWNLOAD_DELAY
from settings.Following changes aimed to enable configure delay, concurency, randomized delay per Downloader.Slot (per domain) to set custom delay, concurrency for individually for each domain.
Related part of downloader code updated to accept per slot settings from
PER_SLOT_SETTINGS
setting:example test .py script