Remove top-level reactor imports from CrawlerProces/CrawlerRunner examples #6361

wRAR · 2024-05-14T09:52:47Z

There are several code examples on https://docs.scrapy.org/en/latest/topics/practices.html that have a top-level from twisted.internet import reactor, which is problematic (breaks when the settings specify a non-default reactor) and needs to be fixed.


Laerte · 2024-05-14T11:50:21Z

For this, should we check whether the TWISTED_REACTOR setting is defined (via get_project_settings) and, if it is, call install_reactor before importing reactor?

wRAR · 2024-05-14T12:10:49Z

I think it's enough to move the imports inside blocks so that they only run after the setting is applied (i.e. after Crawler.crawl(), so after CrawlerRunner.crawl()). The changed examples should be tested with a non-default reactor setting value in any case.

If/when that's not possible, I think it makes sense to add install_reactor() to the examples.
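A rough sketch of what "moving the import inside a block" could look like for a CrawlerRunner example. The function name run_single_spider is made up for illustration, and the sketch assumes the default reactor is wanted; it is not the final wording of the docs.

```python
def run_single_spider(spider_cls):
    """Run one spider with CrawlerRunner without a top-level reactor import.

    run_single_spider is a hypothetical helper name, not part of Scrapy.
    """
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()
    runner = CrawlerRunner()

    # crawl() applies the settings (including the TWISTED_REACTOR check).
    d = runner.crawl(spider_cls)

    # Import the reactor only now, after the settings have been applied,
    # instead of at the top of the module.
    from twisted.internet import reactor

    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # blocks until the crawl finishes
```

Calling run_single_spider(MySpider) from a script would then start the crawl; merely importing the module no longer installs any reactor.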

Laerte · 2024-05-21T23:09:38Z

@wRAR I was testing this and noticed that if we have TWISTED_REACTOR in custom_settings but don't call install_reactor, we always get an exception when Scrapy runs the _apply_settings method (using CrawlerRunner):

Exception: The installed reactor (twisted.internet.selectreactor.SelectReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)

Because init_reactor is False:

scrapy/scrapy/crawler.py

Line 119 in 631fc65

if self._init_reactor:

I see the init_reactor parameter

scrapy/scrapy/crawler.py

Line 79 in 631fc65

self._init_reactor: bool = init_reactor

but when we use runner.crawl from CrawlerRunner, there's no way to override this parameter when the Crawler is created:

scrapy/scrapy/crawler.py

Lines 330 to 334 in 631fc65

    
    def _create_crawler(self, spidercls: Union[str, Type[Spider]]) -> Crawler:
        if isinstance(spidercls, str):
            spidercls = self.spider_loader.load(spidercls)
        # temporary cast until self.spider_loader is typed
        return Crawler(cast(Type[Spider], spidercls), self.settings)

P.S.: If I switch to CrawlerProcess, it works (even without calling install_reactor). I just want to confirm that this is expected.

Here is my snippet:

from scrapy import Spider
from scrapy.http import Request
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings


class MySpider1(Spider):
    name = "my_spider"

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    }

    def start_requests(self):
        yield Request(url="https://httpbin.org/anything")

    def parse(self, response):
        yield response.json()


class MySpider2(Spider):
    name = "my_spider2"

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    }

    def start_requests(self):
        yield Request(url="https://httpbin.org/anything")

    def parse(self, response):
        yield response.json()


configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)
# from scrapy.utils.reactor import install_reactor
# install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
runner.crawl(MySpider1)
runner.crawl(MySpider2)
from twisted.internet import reactor
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()
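To make the behaviour described above easier to follow, here is a toy model in plain Python. It is not Scrapy's code: resolve_reactor and its parameters are hypothetical names, and the model only covers the case where no reactor has been installed before the settings are applied, or where the caller installed one explicitly.

```python
# The default shown here matches the error message above; the reactor
# Twisted actually picks by default depends on the platform.
DEFAULT_REACTOR = "twisted.internet.selectreactor.SelectReactor"
ASYNCIO_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def resolve_reactor(requested, installed, init_reactor):
    """Return the reactor path in effect after the settings are applied.

    requested    -- value of TWISTED_REACTOR (may be None)
    installed    -- path of an already installed reactor, or None
    init_reactor -- True models CrawlerProcess, False models CrawlerRunner
    """
    if installed is None:
        if init_reactor and requested:
            # CrawlerProcess-like: the requested reactor gets installed.
            installed = requested
        else:
            # Otherwise the first reactor import installs the default one.
            installed = DEFAULT_REACTOR
    if requested and requested != installed:
        raise Exception(
            f"The installed reactor ({installed}) does not match "
            f"the requested one ({requested})"
        )
    return installed
```

In this model, resolve_reactor(ASYNCIO_REACTOR, None, True) succeeds, resolve_reactor(ASYNCIO_REACTOR, None, False) raises the mismatch exception, and resolve_reactor(ASYNCIO_REACTOR, ASYNCIO_REACTOR, False) succeeds because the caller installed the reactor first.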

wRAR · 2024-05-22T06:10:27Z

CrawlerRunner indeed requires you to install (and start) the reactor in your code, so it makes sense that CrawlerRunner examples show installing a non-default reactor manually.
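As a sketch of what such an example might look like (the helper name run_spiders and the REACTOR_PATH constant are illustrative only, and setting TWISTED_REACTOR on the project settings is an assumption made here to keep the setting consistent with the installed reactor):

```python
REACTOR_PATH = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def run_spiders(*spider_classes):
    """Run spiders with CrawlerRunner on a manually installed reactor.

    run_spiders is a hypothetical helper name, not part of Scrapy.
    """
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings
    from scrapy.utils.reactor import install_reactor

    configure_logging()

    # Install the non-default reactor before anything imports
    # twisted.internet.reactor.
    install_reactor(REACTOR_PATH)

    settings = get_project_settings()
    # Assumption: keep the setting in line with the installed reactor.
    settings.set("TWISTED_REACTOR", REACTOR_PATH)

    runner = CrawlerRunner(settings)
    for spider_cls in spider_classes:
        runner.crawl(spider_cls)

    # This import now returns the reactor installed above.
    from twisted.internet import reactor

    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # blocks until all crawls have finished
```

With the two spiders from the snippet above, this would be called as run_spiders(MySpider1, MySpider2).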

Laerte · 2024-05-22T10:53:02Z

> CrawlerRunner indeed requires you to install (and start) the reactor in your code, so it makes sense that CrawlerRunner examples show installing a non-default reactor manually.

Got it, thanks!

wRAR added the bug, good first issue, and docs labels on May 14, 2024

Laerte mentioned this issue May 22, 2024

docs: Remove top-level reactor imports from CrawlerProces/CrawlerRunner examples #6374

Merged

wRAR closed this as completed in #6374 May 27, 2024
