Per slot settings #5328


Merged: 26 commits merged into scrapy:master on Mar 9, 2023

Conversation

@GeorgeA92 (Contributor) commented Nov 22, 2021

Implementation of per-slot (downloader slot) settings.

In the general case, Scrapy creates a new downloader slot for each domain, using the DOWNLOAD_DELAY, CONCURRENT_REQUESTS_PER_DOMAIN (or CONCURRENT_REQUESTS_PER_IP) and RANDOMIZE_DOWNLOAD_DELAY settings.

The following changes make it possible to configure the delay, concurrency and delay randomization per Downloader.Slot (that is, per domain), so that each domain can have its own custom values.

The related downloader code is updated to accept per-slot settings from the new PER_SLOT_SETTINGS setting:

    custom_settings = {
        "DOWNLOAD_DELAY": 3,
        "PER_SLOT_SETTINGS": {
            'quotes.toscrape.com': {
                'concurrency':1,
                'delay': 15,
                'randomize_delay': False
            },
            'books.toscrape.com': {
                'delay': 39,
            }
        }
    }
Example test script:

    import scrapy
    from scrapy.crawler import CrawlerProcess


    class PerSlotTestSpider(scrapy.Spider):
        name = "per_slot_settings_set"
        custom_settings = {
            "DOWNLOAD_DELAY": 3,
            "PER_SLOT_SETTINGS": {
                'quotes.toscrape.com': {
                    'concurrency': 1,
                    'delay': 15,
                    'randomize_delay': False,
                },
                'books.toscrape.com': {
                    'delay': 39,
                },
            },
        }

        def start_requests(self):
            # per-slot settings partially apply (concurrency and randomization
            # fall back to spider/project settings)
            yield scrapy.Request(url="http://books.toscrape.com/", callback=self.not_parse)
            # 100% from per-slot settings
            yield scrapy.Request(url="http://quotes.toscrape.com/page/1/", callback=self.parse)
            # 100% from general settings
            yield scrapy.Request(url="http://example.com/", callback=self.not_parse)

        def parse(self, response):
            next_page = response.css("li.next a::attr(href)").extract_first()
            if next_page:
                yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)
            else:
                yield scrapy.Request(url='http://books.toscrape.com/', callback=self.not_parse, dont_filter=True)
                yield scrapy.Request(url="http://example.com/", callback=self.not_parse, dont_filter=True)

        def not_parse(self, response):
            pass


    process = CrawlerProcess()
    process.crawl(PerSlotTestSpider)
    process.start()

@codecov bot commented Nov 22, 2021

Codecov Report

Merging #5328 (d3d474b) into master (eecc035) will increase coverage by 0.01%.
The diff coverage is 96.92%.

❗ Current head d3d474b differs from pull request most recent head 218829b. Consider uploading reports for the commit 218829b to get more accurate results

@@            Coverage Diff             @@
##           master    #5328      +/-   ##
==========================================
+ Coverage   88.95%   88.96%   +0.01%     
==========================================
  Files         162      162              
  Lines       11007    11021      +14     
  Branches     1798     1796       -2     
==========================================
+ Hits         9791     9805      +14     
- Misses        937      938       +1     
+ Partials      279      278       -1     
Impacted Files Coverage Δ
scrapy/commands/__init__.py 74.71% <ø> (ø)
scrapy/commands/bench.py 100.00% <ø> (ø)
scrapy/commands/crawl.py 60.00% <ø> (ø)
scrapy/commands/edit.py 51.85% <ø> (ø)
scrapy/commands/fetch.py 89.13% <ø> (ø)
scrapy/commands/genspider.py 87.38% <ø> (ø)
scrapy/commands/list.py 77.77% <ø> (ø)
scrapy/commands/runspider.py 93.47% <ø> (ø)
scrapy/commands/settings.py 71.87% <ø> (ø)
scrapy/commands/shell.py 92.85% <ø> (ø)
... and 84 more

@Gallaecio (Member) left a comment:
Looks good overall, nice job! Tests and documentation should go next, before we merge.

@Gallaecio (Member) commented:
Re-running failing jobs…

@GeorgeA92 (Contributor, Author) commented:

This PR is ready for review.
Tests for per slot settings implemented using mockserver.

@Gallaecio (Member) commented:

@GeorgeA92 #5328 (comment) is still unaddressed.

tolerance = 0.3

delays_real = {k: v[1] - v[0] for k, v in times.items()}
error_delta = {k: 1 - delays_real[k] / v.delay for k, v in slots.items()}
@kmike (Member) commented Oct 16, 2022:
So, for one slot in the tests the delay is 2s, for another it's 1.5s, and it's 1s for the default. Let's say the implementation is incorrect, and in all cases a delay of 2s is used. The error_delta values would be 1 - 2/1.5 ≈ -0.33 and 1 - 2/1 = -1. The assertion below would still pass: max(-0.33, -1) = -0.33 < 0.3 is True.

If the 1.5s delay is used instead of the 2s delay, then the error_delta value would be 1 - 1.5/2 = 0.25, and 0.25 < 0.3 is True again. It may catch the default 1s download delay, though.

Maybe there are reasons these tests may detect an issue (e.g. default DOWNLOAD_DELAY), but I think it could make sense to try improving them. Have you encountered issues with a simpler implementation, something like abs(delays_real[k] - v.delay) < tolerance?
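The arithmetic above can be checked numerically; this sketch uses the hypothetical scenario from the comment (configured delays of 1.5s and 1s, with a buggy implementation that always waits 2s — the slot names are made up for illustration):

```python
# Sketch of the failure mode described above: the relative-error check
# max(error_delta) < tolerance cannot catch a delay that is too LARGE,
# because the relative error then becomes negative.
tolerance = 0.3

configured_delays = {"slot_a": 1.5, "slot_b": 1.0}
# Buggy implementation: every slot actually waits 2 seconds.
real_delay = 2.0

error_delta = {k: 1 - real_delay / v for k, v in configured_delays.items()}
# error_delta == {"slot_a": -0.333..., "slot_b": -1.0}

assert max(error_delta.values()) < tolerance  # passes despite the bug

# The simpler absolute check suggested above does catch it:
caught = any(abs(real_delay - v) >= tolerance for v in configured_delays.values())
assert caught
```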

@GeorgeA92 (Contributor, Author) replied:
This test is implemented in the same way as the existing test for the DOWNLOAD_DELAY setting (accounting for the fact that we now need to test multiple downloader slots):

    tolerance = (1 - (0.6 if randomize else 0.2))
    settings = {"DOWNLOAD_DELAY": delay,
                'RANDOMIZE_DOWNLOAD_DELAY': randomize}
    crawler = get_crawler(FollowAllSpider, settings)
    yield crawler.crawl(**crawl_kwargs)
    times = crawler.spider.times
    total_time = times[-1] - times[0]
    average = total_time / (len(times) - 1)
    self.assertTrue(average > delay * tolerance,
                    f"download delay too small: {average}")

In my local tests, delays_real calculated from time.time() calls in the spider is, for some unknown reason, always a bit lower than the delays from the settings: for delays [1, 1.5, 2] I received roughly [0.90, 1.43, 1.90] in delays_real. So under this condition I don't expect issues from the value becoming negative. In any case, the error calculation method has been updated to make sure that cannot happen.

I also thought about increasing the delays in the test from [1, 1.5, 2] to, for example, [3, 5, 7]; with increased delays we could safely reduce the tolerance.

The main question: what is an acceptable total time for this test?

@Gallaecio (Member) replied:

The main question what is acceptable total time for this test?

The minimum value that still makes the test reliable, both in the sense that it does not break randomly, and that it indeed validates what it is meant to validate. If we need 30 seconds for that, so be it. But if it can be done in 5 seconds, that would be better.

@kmike (Member) commented Oct 16, 2022:

Hey! The feature makes sense, and the implementation looks good 👍 I think to merge it, we need to fix a few small issues and add the docs.

@GeorgeA92 (Contributor, Author) commented:

Updated the pull request.
As #3585 is still open and we have no other mention of the downloader slot component in the docs, at this stage it is not clear how much detail the documentation for this setting needs.

@Gallaecio (Member) left a comment:
https://github.com/scrapy/scrapy/pull/5328/files#r933163347 is still not addressed, as far as I can tell. I think we need to apply the per-slot concurrency after the call to _get_concurrency_delay, to make sure the concurrency from the per-slot settings takes precedence.
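A minimal sketch of the ordering being suggested here: compute the defaults first, then apply the per-slot overrides last so they win. The helper names below are hypothetical and simplified (`_get_concurrency_delay` mirrors the name of Scrapy's internal downloader helper, but this stand-in does not reproduce its real logic):

```python
def _get_concurrency_delay(default_concurrency, default_delay):
    # Hypothetical stand-in for Scrapy's internal helper, which derives
    # these values from spider attributes and global settings.
    return default_concurrency, default_delay


def slot_params(per_slot_settings, slot_key, default_concurrency, default_delay):
    # Defaults first...
    conc, delay = _get_concurrency_delay(default_concurrency, default_delay)
    # ...then per-slot overrides are applied last, so they take precedence.
    slot_conf = per_slot_settings.get(slot_key, {})
    conc = slot_conf.get("concurrency", conc)
    delay = slot_conf.get("delay", delay)
    return conc, delay


per_slot = {"quotes.toscrape.com": {"concurrency": 1, "delay": 15}}
print(slot_params(per_slot, "quotes.toscrape.com", 8, 3))  # (1, 15)
print(slot_params(per_slot, "example.com", 8, 3))          # (8, 3)
```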

@wRAR (Member) commented Feb 24, 2023:

How does this solve #3529?

@GeorgeA92 (Contributor, Author) replied:

How does this solve #3529?

This pull request doesn't solve #3529 by itself.
However, applying the functionality from this PR in SitemapSpider (I think) can: by assigning requests to sitemaps to a custom downloader slot ... sitemaps with concurrency 1 defined in DOWNLOAD_SLOTS.

I suppose that @Gallaecio misunderstood my mention of #3529 when we discussed this pull request.

@wRAR wRAR merged commit 3659a8c into scrapy:master Mar 9, 2023
@wRAR (Member) commented Mar 9, 2023:

Thanks!
