docs: Remove top-level reactor imports from CrawlerProces/CrawlerRunner examples #6374

Laerte · 2024-05-22T10:51:11Z

…er examples

docs/topics/practices.rst

wRAR · 2024-05-22T14:28:29Z

I've just recalled we have testcases for similar things, in tests/Crawler{Process,Runner} (the actual tests are in tests/test_crawler.py), can you please check if there are some cases that make sense to put into docs but are not covered by those, and add them if needed?

Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>

Laerte · 2024-05-23T00:59:30Z

I've just recalled we have testcases for similar things, in tests/Crawler{Process,Runner} (the actual tests are in tests/test_crawler.py), can you please check if there are some cases that make sense to put into docs but are not covered by those, and add them if needed?

@wRAR I've taken a look and seems that sleeping.py is the only one that would be good to add to docs (let me know if you find other interesting), e.g: scenarios where user needs to wait few seconds to something get processed by site but don't want to I/O block other requests using time.sleep, this example is only for asyncio reactor, using https://stackoverflow.com/a/36985249/5314295 as reference I created this snippet for not async:

from time import time
from urllib.parse import parse_qsl

from twisted.internet import reactor
from twisted.internet.defer import Deferred

from scrapy import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.http import Request
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings


class MySpider1(Spider):
    name = "my_spider"

    io_block = False  # change to simulate I/O blocking

    def start_requests(self):
        for x in range(10):
            yield Request(url=f"https://httpbin.org/anything?sleep_for={x}")

    def parse(self, response):
        from twisted.internet import reactor

        d = Deferred()
        start = time()

        def yield_item():
            end = time()
            self.logger.info(f"[finished] zZzZZ {sleep_for}, took: {end - start}")
            yield response.json()

        query_params = dict(parse_qsl(response.url.split("?")[-1]))
        sleep_for = int(query_params["sleep_for"])

        if sleep_for % 2 == 0:
            self.logger.info(f"[start] zZzZ... {sleep_for}")

            if self.io_block:
                from time import sleep

                sleep(sleep_for)
                return yield_item()
            else:
                reactor.callLater(sleep_for, d.callback, yield_item())
                return d
        else:
            return response.json()


settings = get_project_settings()
configure_logging(settings)
runner = CrawlerRunner(settings)

d = runner.crawl(MySpider1)
d.addBoth(lambda _: reactor.stop())
reactor.run()

But it seems too verbose to me, is there a way to simplify this? So we don't need to write yield_item function and use return instead of yield keyword?

codecov · 2024-05-23T01:50:24Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.92%. Comparing base (180bc9b) to head (9d5a0d2).
Report is 13 commits behind head on master.

❗ Current head 9d5a0d2 differs from pull request most recent head 6cd0857

Please upload reports for the commit 6cd0857 to get more accurate results.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6374      +/-   ##
==========================================
- Coverage   85.00%   84.92%   -0.08%     
==========================================
  Files         161      162       +1     
  Lines       11962    12112     +150     
  Branches     1872     1839      -33     
==========================================
+ Hits        10168    10286     +118     
- Misses       1512     1538      +26     
- Partials      282      288       +6

see 9 files with indirect coverage changes

wRAR · 2024-05-23T09:56:36Z

I see I phrased that poorly, I meant that we should add (or modify) test cases so that things we have put into docs are actually tested, can you please check that if you didn't already?

As for sleeping.py, that file is for testing graceful/forced shutdowns so it doesn't need to be copied to docs itself. It indeed may be useful to have a sleeping example in docs (that doesn't use asyncio.sleep()), as a separate PR though. As for how to do it - I'm actually not sure how does your example work and it looks like it has too many functions, but I don't think it's possible to do in a clean way on pure Twisted deferreds, and probably the only sane way is making the callback async def and awaiting on a deferred that gets called via callLater.

docs/topics/practices.rst

wRAR · 2024-05-23T09:59:22Z

(I've forgot that I wanted the test changes if they are needed, please don't merge this without them)

Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>

Laerte · 2024-05-23T10:02:26Z

(I've forgot that I wanted the test changes if they are needed, please don't merge this without them)

Oh yeah I will check later.

Laerte · 2024-05-26T23:20:41Z

@wRAR Things that I noticed while creating the test:

If you define TWISTED_REACTOR but don't install when using CrawlerRunner the script just hang:

2024-05-26 20:00:17 [scrapy.addons] INFO: Enabled addons:
[]

To get the exception we need to define addErrback with call to log then the exception is printed but the reactor is not stoped, so I added another addErrback with sys.exit(1) call but don't seems right, here the snippet:

from scrapy import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class NoRequestsSpider(Spider):
    name = "no_request"

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        return []


configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s", "LOG_LEVEL": "DEBUG"})

INSTALL_REACTOR = False

if INSTALL_REACTOR:
    from scrapy.utils.reactor import install_reactor

    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")

runner = CrawlerRunner()

d = runner.crawl(NoRequestsSpider)

from twisted.internet import reactor
from twisted.python import log
import sys

d.addErrback(log.err)
d.addErrback(sys.exit(1))
d.addBoth(callback=lambda _: reactor.stop())
reactor.run()

Let me know if you think we should add this test or is there a better way to write this script.

wRAR · 2024-05-27T08:48:08Z

Not having the error logging is #6047, or at least directly related to it. Not sure why we need to call sys.exit() to properly finish the process, though (maybe it's not a very interesting question though). In any case it seems out of scope here.

Laerte · 2024-05-27T10:34:41Z

Not having the error logging is #6047, or at least directly related to it. Not sure why we need to call sys.exit() to properly finish the process, though (maybe it's not a very interesting question though). In any case it seems out of scope here.

Nice, so is good to merge then!

wRAR · 2024-05-27T10:36:39Z

Thanks!

docs: Remove top-level reactor imports from CrawlerProces/CrawlerRunn…

e676cd3

…er examples

Laerte force-pushed the master branch from 1222166 to e676cd3 Compare May 22, 2024 10:56

Laerte self-assigned this May 22, 2024

Laerte requested a review from wRAR May 22, 2024 13:53

wRAR reviewed May 22, 2024

View reviewed changes

docs/topics/practices.rst Show resolved Hide resolved

wRAR reviewed May 22, 2024

View reviewed changes

docs/topics/practices.rst Outdated Show resolved Hide resolved

wRAR reviewed May 22, 2024

View reviewed changes

docs/topics/practices.rst Show resolved Hide resolved

wRAR reviewed May 22, 2024

View reviewed changes

docs/topics/practices.rst Show resolved Hide resolved

Laerte and others added 2 commits May 22, 2024 18:50

Update docs/topics/practices.rst

8210fae

Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>

Update docs/topics/practices.rst

dc6a495

Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>

Laerte and others added 3 commits May 22, 2024 22:01

fix: checks

3f66b66

Update tests-macos.yml

e143dc7

Retrigger CI

9d5a0d2

Laerte requested a review from wRAR May 23, 2024 01:32

wRAR reviewed May 23, 2024

View reviewed changes

docs/topics/practices.rst Show resolved Hide resolved

wRAR reviewed May 23, 2024

View reviewed changes

docs/topics/practices.rst Show resolved Hide resolved

wRAR reviewed May 23, 2024

View reviewed changes

docs/topics/practices.rst Show resolved Hide resolved

wRAR approved these changes May 23, 2024

View reviewed changes

Laerte and others added 3 commits May 23, 2024 07:00

Update docs/topics/practices.rst

17e623c

Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>

Update docs/topics/practices.rst

8ec67ca

Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>

Update docs/topics/practices.rst

62c89aa

Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>

Laerte added 2 commits May 26, 2024 19:55

Add change reactor test to CrawlerRunner

2facdd4

Move path

6cd0857

wRAR merged commit f9a9860 into scrapy:master May 27, 2024
26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: Remove top-level reactor imports from CrawlerProces/CrawlerRunner examples #6374

docs: Remove top-level reactor imports from CrawlerProces/CrawlerRunner examples #6374

Laerte commented May 22, 2024

wRAR commented May 22, 2024 •

edited

Loading

Laerte commented May 23, 2024 •

edited

Loading

codecov bot commented May 23, 2024 •

edited

Loading

wRAR commented May 23, 2024

wRAR commented May 23, 2024

Laerte commented May 23, 2024

Laerte commented May 26, 2024 •

edited

Loading

wRAR commented May 27, 2024

Laerte commented May 27, 2024

wRAR commented May 27, 2024

docs: Remove top-level reactor imports from CrawlerProces/CrawlerRunner examples #6374

docs: Remove top-level reactor imports from CrawlerProces/CrawlerRunner examples #6374

Conversation

Laerte commented May 22, 2024

wRAR commented May 22, 2024 • edited Loading

Laerte commented May 23, 2024 • edited Loading

codecov bot commented May 23, 2024 • edited Loading

Codecov Report

wRAR commented May 23, 2024

wRAR commented May 23, 2024

Laerte commented May 23, 2024

Laerte commented May 26, 2024 • edited Loading

wRAR commented May 27, 2024

Laerte commented May 27, 2024

wRAR commented May 27, 2024

wRAR commented May 22, 2024 •

edited

Loading

Laerte commented May 23, 2024 •

edited

Loading

codecov bot commented May 23, 2024 •

edited

Loading

Laerte commented May 26, 2024 •

edited

Loading