Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove top-level reactor imports from CrawlerProces/CrawlerRunner examples #6361

Closed
wRAR opened this issue May 14, 2024 · 5 comments · Fixed by #6374
Closed

Remove top-level reactor imports from CrawlerProces/CrawlerRunner examples #6361

wRAR opened this issue May 14, 2024 · 5 comments · Fixed by #6374

Comments

@wRAR
Copy link
Member

wRAR commented May 14, 2024

There are several code examples on https://docs.scrapy.org/en/latest/topics/practices.html that have a top-level from twisted.internet import reactor, which is problematic (breaks when the settings specify a non-default reactor) and needs to be fixed.

@Laerte
Copy link
Member

Laerte commented May 14, 2024

For this we should check if we have TWISTED_REACTOR setting defined (get_project_settings) and if is we call install_reactor before importing reactor?

@wRAR
Copy link
Member Author

wRAR commented May 14, 2024

I think it's enough to move the imports inside blocks so that they only run after the setting is applied (i.e. after Crawler.crawl(), so after CrawlerRunner.crawl()). The changed examples should be tested with a non-default reactor setting value in any case.

If/when that's not possible to do it makes sense to add install_reactor() to examples I think.

@Laerte
Copy link
Member

Laerte commented May 21, 2024

@wRAR I was testing this and noticed that if we have TWISTED_REACTOR in custom_settings but we don't call install_reactor we always get an exception when Scrapy runs _apply_settings method (using CrawlerRunner):

Exception: The installed reactor (twisted.internet.selectreactor.SelectReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)

Because init_reactor is False:

if self._init_reactor:

I see init_reactor parameter

self._init_reactor: bool = init_reactor
but when we use runner.crawl from CrawlerRunner theres no way to override this parameter, when is created:

scrapy/scrapy/crawler.py

Lines 330 to 334 in 631fc65

def _create_crawler(self, spidercls: Union[str, Type[Spider]]) -> Crawler:
if isinstance(spidercls, str):
spidercls = self.spider_loader.load(spidercls)
# temporary cast until self.spider_loader is typed
return Crawler(cast(Type[Spider], spidercls), self.settings)

P.S: If I switch to CrawlerProcess works (even without calling install_reactor), just to confirm if this is expected.

Here my snippet:

from scrapy import Spider
from scrapy.http import Request
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings


class MySpider1(Spider):
    name = "my_spider"

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    }

    def start_requests(self):
        yield Request(url="https://httpbin.org/anything")

    def parse(self, response):
        yield response.json()


class MySpider2(Spider):
    name = "my_spider2"

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    }

    def start_requests(self):
        yield Request(url="https://httpbin.org/anything")

    def parse(self, response):
        yield response.json()


configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)
# from scrapy.utils.reactor import install_reactor
# install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
runner.crawl(MySpider1)
runner.crawl(MySpider2)
from twisted.internet import reactor
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()

@wRAR
Copy link
Member Author

wRAR commented May 22, 2024

CrawlerRunner indeed requires you to install (and start) the reactor in your code, so it makes sense that CrawlerRunner examples show installing a non-default reactor manually.

@Laerte
Copy link
Member

Laerte commented May 22, 2024

CrawlerRunner indeed requires you to install (and start) the reactor in your code, so it makes sense that CrawlerRunner examples show installing a non-default reactor manually.

Got it, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants