
Scheduler: minimal interface, API docs #3559


Merged · 30 commits into scrapy:master from doc_scheduler_api · Apr 26, 2021

Conversation

@elacuesta (Member) commented Dec 31, 2018

Fixes #3537

codecov bot commented Dec 31, 2018

Codecov Report

Merging #3559 (8522f4b) into master (6837919) will decrease coverage by 0.18%.
The diff coverage is 93.67%.

@@            Coverage Diff             @@
##           master    #3559      +/-   ##
==========================================
- Coverage   88.25%   88.07%   -0.18%     
==========================================
  Files         162      162              
  Lines       10430    10467      +37     
  Branches     1514     1517       +3     
==========================================
+ Hits         9205     9219      +14     
- Misses        951      972      +21     
- Partials      274      276       +2     
Impacted Files Coverage Δ
scrapy/core/engine.py 83.96% <84.61%> (-0.23%) ⬇️
scrapy/core/scheduler.py 93.79% <95.23%> (+0.59%) ⬆️
scrapy/utils/job.py 75.00% <100.00%> (+8.33%) ⬆️
scrapy/robotstxt.py 75.30% <0.00%> (-22.23%) ⬇️
scrapy/core/downloader/__init__.py 90.97% <0.00%> (-1.51%) ⬇️

@kmike (Member) commented Jan 16, 2019

Hey @elacuesta! Thanks, I think we should really document this API, though I'd:

  1. wait for [MRG+1] Downloader-aware Priority Queue for Scrapy #3520 to settle, and
  2. move most of this documentation to Scheduler docstrings, using Sphinx autodoc to pull it from them (see the sketch below).
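
For illustration, a minimal sketch of what moving the documentation into docstrings could look like, so Sphinx autodoc can pull it into the reference pages (the wording and the method subset shown here are hypothetical, not the final docs):

class Scheduler:
    """Default Scrapy scheduler.

    Stores pending requests and hands them back to the engine on demand.
    """

    def enqueue_request(self, request):
        """Store ``request`` for later processing.

        Return ``True`` if the request was stored, or ``False`` if it was
        rejected (for instance, filtered out as a duplicate).
        """

    def next_request(self):
        """Return the next ``Request`` to download, or ``None`` if there
        is none pending."""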

@elacuesta (Member, Author) commented

That's good @kmike, thanks! I can wait on this, but I wonder how far that is from being merged. The same use case that motivated me to understand the Scheduler would also benefit greatly from the addition of the request_left_downloader signal. Perhaps the signal could be added independently if the PR is not yet ready as a whole.
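
For context, a rough sketch of how such a signal could be consumed from an extension, assuming it lands as scrapy.signals.request_left_downloader with the usual (request, spider) handler signature (the extension name and handler are illustrative):

from scrapy import signals


class DownloaderWatcher:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # run on_request_left whenever a request leaves the downloader
        crawler.signals.connect(
            ext.on_request_left, signal=signals.request_left_downloader
        )
        return ext

    def on_request_left(self, request, spider):
        spider.logger.info("Request left the downloader: %s", request.url)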

@kmike (Member) commented Jan 16, 2019

Finishing that PR is kind of a priority :)

The same use case that motivated me to understand the Scheduler would also benefit greatly from the addition of the request_left_downloader signal.

What's your use case?

@Gallaecio (Member) commented

Ping! #3520 has been merged 🙂

@elacuesta elacuesta marked this pull request as draft March 18, 2021 13:39
@elacuesta elacuesta changed the title [Doc] Scheduler API [WIP] Scheduler: minimal interface, API docs Mar 24, 2021
@elacuesta (Member, Author) commented

I think we need to make a clear distinction between the API that a scheduler class must provide, and the API of the default scheduler class.

I think the latest changes make it clear that all the queue management the default scheduler does is not technically essential to performing the scheduler's functions.

Now this extremely useful example is possible!
from scrapy import Spider


class FriendlyScheduler:
    """A minimal scheduler: only the methods the engine actually calls."""

    def __init__(self):
        # url -> Request; doubles as a simple duplicate filter
        self.requests = {}

    def has_pending_requests(self):
        return bool(self.requests)

    def open(self, spider):
        # called by the engine when the spider is opened
        print(f"Hello {spider.__class__.__name__}, thanks for using this scheduler")
        return None

    def close(self, reason):
        # called by the engine when the spider is closed
        print("Farewell my friend")
        return None

    def enqueue_request(self, request):
        # returning False tells the engine the request was rejected (duplicate)
        if request.url in self.requests:
            return False
        print("By all means, I will store this request for you")
        self.requests[request.url] = request
        return True

    def next_request(self):
        # return the next request to download, or None if there is none
        if self.has_pending_requests():
            _, request = self.requests.popitem()
            print(f"Enjoy your next request: {request.url}")
            return request
        return None


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = [
        "http://quotes.toscrape.com/tag/friends/",
        "http://quotes.toscrape.com/tag/life/",
        "http://quotes.toscrape.com/tag/humor/",
    ]
    custom_settings = {
        "SCHEDULER": FriendlyScheduler,
        "LOG_LEVEL": "INFO",
    }

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "author": quote.xpath("span/small/text()").get(),
                "text": quote.css("span.text::text").get(),
            }
$ scrapy runspider test-spiders/friendly_scheduler.py
2021-03-24 19:44:50 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-03-24 19:44:50 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.8.6 (v3.8.6:db455296be, Sep 23 2020, 13:31:39) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1j  16 Feb 2021), cryptography 3.4.6, Platform macOS-10.15.7-x86_64-i386-64bit
2021-03-24 19:44:50 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-24 19:44:50 [scrapy.crawler] INFO: Overridden settings:
{'EDITOR': 'nano',
 'LOG_LEVEL': 'INFO',
 'SCHEDULER': <class 'friendly_scheduler.FriendlyScheduler'>,
 'SPIDER_LOADER_WARN_ONLY': True}
2021-03-24 19:44:50 [scrapy.extensions.telnet] INFO: Telnet Password: f376c0e3bc669451
2021-03-24 19:44:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-03-24 19:44:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-03-24 19:44:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-03-24 19:44:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-03-24 19:44:50 [scrapy.core.engine] INFO: Spider opened
Hello QuotesSpider, thanks for using this scheduler
2021-03-24 19:44:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-03-24 19:44:50 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
By all means, I will store this request for you
Enjoy your next request: http://quotes.toscrape.com/tag/friends/
By all means, I will store this request for you
Enjoy your next request: http://quotes.toscrape.com/tag/life/
By all means, I will store this request for you
Enjoy your next request: http://quotes.toscrape.com/tag/humor/
2021-03-24 19:44:51 [scrapy.core.engine] INFO: Closing spider (finished)
Farewell my friend
2021-03-24 19:44:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 688,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 6900,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'elapsed_time_seconds': 0.509905,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 3, 24, 22, 44, 51, 325836),
 'item_scraped_count': 24,
 'log_count/INFO': 10,
 'memusage/max': 51703808,
 'memusage/startup': 51703808,
 'response_received_count': 3,
 'start_time': datetime.datetime(2021, 3, 24, 22, 44, 50, 815931)}
2021-03-24 19:44:51 [scrapy.core.engine] INFO: Spider closed (finished)

Review comment on the following lines of the diff:

state = self.dqs.close()
assert isinstance(self.dqdir, str)
@elacuesta (Member, Author) commented on Mar 25, 2021

This assertion is only there to avoid a typing error on the next line (_write_dqs_state expects a str but would get Optional[str]). At this point, if self.dqs is not None it can only be because self.dqdir is a str.
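
As a standalone illustration of that narrowing (names simplified; _write_dqs_state here is a stand-in for the real helper):

from typing import Optional


def _write_dqs_state(dqdir: str, state: dict) -> None:
    print(f"writing {state!r} under {dqdir}")  # stand-in for the real helper


dqdir: Optional[str] = "/tmp/crawl-state"
# mypy flags _write_dqs_state(dqdir, ...) while dqdir is Optional[str];
# the isinstance assertion narrows the type to str, so the call type-checks.
assert isinstance(dqdir, str)
_write_dqs_state(dqdir, {"active": []})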

@elacuesta elacuesta changed the title [WIP] Scheduler: minimal interface, API docs Scheduler: minimal interface, API docs Apr 20, 2021
@elacuesta elacuesta marked this pull request as ready for review April 20, 2021 17:25
@Gallaecio (Member) left a comment

❤️

elacuesta and others added 2 commits April 22, 2021 12:52
Co-authored-by: Adrián Chaves <adrian@chaves.io>
@elacuesta elacuesta merged commit ddea6b7 into scrapy:master Apr 26, 2021
@elacuesta elacuesta deleted the doc_scheduler_api branch April 26, 2021 19:16
@elacuesta elacuesta added this to the 2.6 milestone Apr 26, 2021