
docs: Remove top-level reactor imports from CrawlerProcess/CrawlerRunner examples #6374

Merged — 11 commits merged into scrapy:master on May 27, 2024

Conversation

@Laerte (Member) commented May 22, 2024

fix #6361
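
For context, here's a minimal sketch of the pattern the docs move toward (the spider and its details are illustrative, not the exact docs diff): the reactor is imported only after CrawlerRunner has been set up, so Scrapy gets the chance to install the reactor selected by the TWISTED_REACTOR setting.

from scrapy import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]


configure_logging()
runner = CrawlerRunner()
d = runner.crawl(MySpider)

# Deferred import: by now Scrapy has installed the configured reactor.
from twisted.internet import reactor

d.addBoth(lambda _: reactor.stop())
reactor.run()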

Review comments on docs/topics/practices.rst (outdated, resolved)
@wRAR (Member) commented May 22, 2024

I've just recalled we have test cases for similar things in tests/Crawler{Process,Runner} (the actual tests are in tests/test_crawler.py). Can you please check whether there are cases that make sense to put into the docs but are not covered by those, and add them if needed?

Laerte and others added 2 commits May 22, 2024 18:50
Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>
Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>
@Laerte (Member, Author) commented May 23, 2024

> I've just recalled we have test cases for similar things in tests/Crawler{Process,Runner} (the actual tests are in tests/test_crawler.py). Can you please check whether there are cases that make sense to put into the docs but are not covered by those, and add them if needed?

@wRAR I've taken a look and it seems that sleeping.py is the only one worth adding to the docs (let me know if you find others interesting), e.g. for scenarios where the user needs to wait a few seconds for the site to process something but doesn't want to block other requests on I/O with time.sleep(). That example only covers the asyncio reactor; using https://stackoverflow.com/a/36985249/5314295 as a reference, I wrote this snippet for the non-async case:

from time import sleep, time
from urllib.parse import parse_qsl

from twisted.internet.defer import Deferred

from scrapy import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.http import Request
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings


class MySpider1(Spider):
    name = "my_spider"

    io_block = False  # set to True to simulate I/O blocking with time.sleep()

    def start_requests(self):
        for x in range(10):
            yield Request(url=f"https://httpbin.org/anything?sleep_for={x}")

    def parse(self, response):
        # Import the reactor lazily, after Scrapy has already installed it.
        from twisted.internet import reactor

        d = Deferred()
        start = time()

        def yield_item():
            # Runs when Scrapy iterates the result, i.e. after the delay.
            end = time()
            self.logger.info(f"[finished] zZzZZ {sleep_for}, took: {end - start}")
            yield response.json()

        query_params = dict(parse_qsl(response.url.split("?")[-1]))
        sleep_for = int(query_params["sleep_for"])

        if sleep_for % 2 == 0:
            self.logger.info(f"[start] zZzZ... {sleep_for}")

            if self.io_block:
                sleep(sleep_for)  # blocks the whole reactor
                return yield_item()
            else:
                # Non-blocking: fire the Deferred with the items after the delay.
                reactor.callLater(sleep_for, d.callback, yield_item())
                return d
        else:
            return response.json()


settings = get_project_settings()
configure_logging(settings)
runner = CrawlerRunner(settings)

d = runner.crawl(MySpider1)

# Imported only now, so Scrapy can install the configured reactor first.
from twisted.internet import reactor

d.addBoth(lambda _: reactor.stop())
reactor.run()

But this seems too verbose to me. Is there a way to simplify it, so we don't need a separate yield_item function and can yield items directly instead of returning them?

@Laerte requested a review from wRAR May 23, 2024 01:32
codecov bot commented May 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.92%. Comparing base (180bc9b) to head (9d5a0d2).
Report is 13 commits behind head on master.

Current head 9d5a0d2 differs from pull request most recent head 6cd0857

Please upload reports for the commit 6cd0857 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6374      +/-   ##
==========================================
- Coverage   85.00%   84.92%   -0.08%     
==========================================
  Files         161      162       +1     
  Lines       11962    12112     +150     
  Branches     1872     1839      -33     
==========================================
+ Hits        10168    10286     +118     
- Misses       1512     1538      +26     
- Partials      282      288       +6     

see 9 files with indirect coverage changes

@wRAR (Member) commented May 23, 2024

I see I phrased that poorly. I meant that we should add (or modify) test cases so that the things we have put into the docs are actually tested. Can you please check that, if you haven't already?

As for sleeping.py, that file is for testing graceful/forced shutdowns, so it doesn't need to be copied into the docs itself. It may indeed be useful to have a sleeping example in the docs (one that doesn't use asyncio.sleep()), though as a separate PR. As for how to do it: I'm actually not sure how your example works, and it looks like it has too many functions, but I don't think it's possible to do this cleanly with pure Twisted Deferreds; probably the only sane way is making the callback async def and awaiting a Deferred that gets fired via callLater.
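
A minimal sketch of that suggestion (spider name and URL are illustrative): the callback is an async def coroutine that awaits a Deferred fired via reactor.callLater(), so the delay never blocks the reactor.

from twisted.internet.defer import Deferred

from scrapy import Spider
from scrapy.http import Request


class SleepingSpider(Spider):
    name = "sleeping"

    def start_requests(self):
        yield Request("https://example.com")

    async def parse(self, response):
        # Import the reactor lazily, after Scrapy has installed it.
        from twisted.internet import reactor

        d = Deferred()
        # Fire the Deferred after 3 seconds; callLater never blocks.
        reactor.callLater(3, d.callback, None)
        await d  # Twisted Deferreds are awaitable in coroutine callbacks
        yield {"url": response.url}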

@wRAR (Member) commented May 23, 2024

(I forgot that I wanted the test changes, if they are needed; please don't merge this without them.)

Laerte and others added 3 commits May 23, 2024 07:00
Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>
Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>
Co-authored-by: Andrey Rakhmatullin <wrar@wrar.name>
@Laerte (Member, Author) commented May 23, 2024

> (I forgot that I wanted the test changes, if they are needed; please don't merge this without them.)

Oh yeah, I will check later.

@Laerte (Member, Author) commented May 26, 2024

@wRAR Things that I noticed while creating the test:

  1. If you define TWISTED_REACTOR but don't install the reactor when using CrawlerRunner, the script just hangs:
2024-05-26 20:00:17 [scrapy.addons] INFO: Enabled addons:
[]

To get the exception we need to add an errback that logs it; then the exception is printed, but the reactor is not stopped, so I added another errback with a sys.exit(1) call, which doesn't seem right. Here's the snippet:

import sys

from scrapy import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class NoRequestsSpider(Spider):
    name = "no_request"

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        return []


configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s", "LOG_LEVEL": "DEBUG"})

INSTALL_REACTOR = False  # set to True to install the reactor and avoid the hang

if INSTALL_REACTOR:
    from scrapy.utils.reactor import install_reactor

    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")

runner = CrawlerRunner()

d = runner.crawl(NoRequestsSpider)

# Imported only now, after the crawl has been scheduled.
from twisted.internet import reactor
from twisted.python import log

d.addErrback(log.err)  # without this the reactor-mismatch error is swallowed
d.addErrback(lambda _: sys.exit(1))  # otherwise the reactor keeps running
d.addBoth(lambda _: reactor.stop())
reactor.run()

Let me know if you think we should add this test, or if there is a better way to write this script.

@wRAR (Member) commented May 27, 2024

The missing error logging is #6047, or at least directly related to it. I'm not sure why we need to call sys.exit() to properly finish the process, though (maybe that's not a very interesting question). In any case, it seems out of scope here.

@Laerte (Member, Author) commented May 27, 2024

> The missing error logging is #6047, or at least directly related to it. I'm not sure why we need to call sys.exit() to properly finish the process, though (maybe that's not a very interesting question). In any case, it seems out of scope here.

Nice, so it's good to merge then!

@wRAR merged commit f9a9860 into scrapy:master May 27, 2024
26 checks passed
@wRAR (Member) commented May 27, 2024

Thanks!
