Return Deferred object from open_spider in pipeline blocks spider #4855

synodriver · 2020-10-26T10:40:25Z

Description

"open_spider" method in pipeline can't return Deferred object in scrapy 2.4, otherwise it would block spider. However, in earlier versions(2.3), this do work.

Steps to Reproduce

1.config in settings.py

asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

2.use the following code

import asyncio
import scrapy
from twisted.internet.defer import Deferred

class MyPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    async def _open_spider(self, spider: scrapy.Spider):
        spider.logger.debug("async pipeline opened!")
        self.db = await connect_to_db()

    def open_spider(self, spider):
        loop = asyncio.get_event_loop()
        # asyncio.run_coroutine_threadsafe(self._open_spider(spider),loop)
        return Deferred.fromFuture(loop.create_task(self._open_spider(spider)))

3.enable this pipeline

Expected behavior:
Would execute _open_spider method.

Actual behavior:

It worked good in scrapy 2.3. However, it blocks spider in scrapy2.4.
After output the following, spider get stuck and no output anymore,and did not close

2020-10-26 15:28:50 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapy_unitest)
2020-10-26 15:28:50 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.5 | packaged by conda-forge | (default, Sep 16 2020, 17:19:16) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-10-26 15:28:50 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2020-10-26 15:28:50 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2020-10-26 15:28:50 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_unitest',
 'COOKIES_ENABLED': False,
 'NEWSPIDER_MODULE': 'scrapy_unitest.spiders',
 'SPIDER_MODULES': ['scrapy_unitest.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
 'USER_AGENT': 'scrapy_unitest (+http://www.yourdomain.com)'}
2020-10-26 15:28:50 [scrapy.extensions.telnet] INFO: Telnet Password: 2aec1713f90e2ba2
2020-10-26 15:28:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-10-26 15:28:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-26 15:28:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-26 15:28:51 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy_unitest.pipelines.MyPipeline']
2020-10-26 15:28:51 [scrapy.core.engine] INFO: Spider opened
2020-10-26 15:28:51 [asyncio] DEBUG: Using selector: SelectSelector

Reproduces how often:
Appear when you try to return a Deferred object from open_spider method

Versions

Scrapy       : 2.4.0
lxml         : 4.6.1.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.21.0
Twisted      : 20.3.0
Python       : 3.8.5 | packaged by conda-forge | (default, Sep 16 2020, 17:19:16) [MSC v.1916 64 bit (AMD64)]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020)
cryptography : 3.1.1
Platform     : Windows-10-10.0.18362-SP0

Additional context

After changing back to a normal pipeline, the spider works again.
Obviously, this is pipeline's problem.

I wonder if there is any way to call coroutine functions from "open_spider".

I tried loop.create_task, asyncio.run_coroutine_threadsafe, but none of them works,
they just skip over the coroutine function.

Would this be fixed in new versions?

The text was updated successfully, but these errors were encountered:

wRAR · 2020-10-26T15:58:39Z

May be related to the new_event_loop/get_event_loop changes from #4414

Note that you can use scrapy.utils.defer.deferred_from_coro instead of doing wrapping manually (but it looks like asyncio.ensure_future is mostly the same as asyncio.get_event_loop().create_task()).

synodriver · 2020-10-27T10:20:15Z

May be related to the new_event_loop/get_event_loop changes from #4414

Note that you can use scrapy.utils.defer.deferred_from_coro instead of doing wrapping manually (but it looks like asyncio.ensure_future is mostly the same as asyncio.get_event_loop().create_task()).

Thx.

Although none of them works,I come up with an idea to deal with it,
My code just like the following

import scrapy

class MyPipeline:
    def __init__(self):
        self._opened = False

    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    async def _open_spider(self, spider: scrapy.Spider):
        # not sure if asyncio.Lock is necessary
        if not self._opened:
            spider.logger.debug("async pipeline opened!And you should see this only one time")
            self.db = await connect_to_db()
            self._opened = True

    def open_spider(self, spider):
        spider.logger.debug("ready to open！")

    async def process_item(self, item, spider):
        if not self._opened:
            await self._open_spider(spider)
        await self.db.insert(item)
        pass

It works just like before.~~At least I didn't see anythong wrong.~~
Since the only method which can be defined with async def in pipeline is process_item,
just init the database connection in this method would be ok.
Hope there will be a better way to deal with it.

hakanutku · 2020-10-29T13:32:11Z

I'm not really sure if it's a better way to deal with it but you can subscribe to spider_opened and spider_closed signals and return a deferred there. Scrapy will wait for some signal handlers (spider_opened and spider_closed both support returning deferred from handlers) according to the documentation:

https://docs.scrapy.org/en/latest/topics/signals.html#spider-opened

synodriver · 2020-10-29T15:26:10Z

I'm not really sure if it's a better way to deal with it but you can subscribe to spider_opened and spider_closed signals and return a deferred there. Scrapy will wait for some signal handlers (spider_opened and spider_closed both support returning deferred from handlers) according to the documentation:

https://docs.scrapy.org/en/latest/topics/signals.html#spider-opened

It doesn't works either.The signal handler which return Dederred also blocks spider.The most troublesome is that no exception is raised,making it difficult to debug.

elacuesta · 2020-11-04T14:36:03Z

As @wRAR points out, the problem seems to be related to the handling of the asyncio event loop.

Consider the following snippet:

# asyncio_deferred.py
import asyncio

import scrapy
from twisted.internet.defer import Deferred

class UppercasePipeline:
    async def _open_spider(self, spider: scrapy.Spider):
        spider.logger.debug("async pipeline opened!")
        self.db = await asyncio.sleep(0.5)

    def open_spider(self, spider):
        loop = asyncio.get_event_loop()
        return Deferred.fromFuture(loop.create_task(self._open_spider(spider)))

    def process_item(self, item, spider):
        return {"url": item["url"].upper()}

class UrlSpider(scrapy.Spider):
    name = "url_spider"
    start_urls = ["https://example.org"]
    custom_settings = {
        "ITEM_PIPELINES": {UppercasePipeline: 100}
    }

    def parse(self, response):
        yield {"url": response.url}

Executed on top of the current master branch (a5872a0) with

scrapy runspider asyncio_deferred.py -s TWISTED_REACTOR=twisted.internet.asyncioreactor.AsyncioSelectorReactor

This hangs, however it does work with the following change:

diff --git scrapy/utils/reactor.py scrapy/utils/reactor.py
index 831d2946..6723d9b3 100644
--- scrapy/utils/reactor.py
+++ scrapy/utils/reactor.py
@@ -60,8 +60,9 @@ def install_reactor(reactor_path, event_loop_path=None):
             if event_loop_path is not None:
                 event_loop_class = load_object(event_loop_path)
                 event_loop = event_loop_class()
+                asyncio.set_event_loop(event_loop)
             else:
-                event_loop = asyncio.new_event_loop()
+                event_loop = asyncio.get_event_loop()
             asyncioreactor.install(eventloop=event_loop)
     else:
         *module, _ = reactor_path.split(".")

I'm not entirely sure about the full implications of this change, we probably need to do a little bit more of research. At the time I asked for the reason of this change, here is the relevant conversation: #4414 (comment).

elacuesta · 2020-11-04T20:42:43Z

The Twisted asyncio reactor uses get_event_loop when it's installed. This means that if the user code does not specify a custom loop (via the ASYNCIO_EVENT_LOOP setting) get_event_loop calling set_event_loop happens anyway, and not within the Scrapy realm.

I think it should be safe for us to use get_event_loop as well, considering how not using it breaks things. Perhaps we can add a warning to the docs so users are aware that cleaning might be necessary if they use a custom loop within a Scrapy process (and they intend to keep using the loop in the same Python process). This sounds like an advanced scenario to me, enough to assume that users would understand the implications of setting a non-default loop.

elacuesta · 2020-11-06T16:57:19Z

By the way, this is affecting scrapy-pyppeteer: elacuesta/scrapy-pyppeteer#6

kmike · 2020-11-07T10:47:50Z

It would be nice to support async def open_spider directly - @wRAR was it just not implemented, or is there somethng more fundamental here?

wRAR · 2020-11-07T14:32:33Z

@kmike it was implemented only for process_item but after a very short glance I think it can be implemented for open_spider too (and probably for some other middleware/pipeline methods that are called using the same code).

This was referenced Nov 5, 2020

Call asyncio.get_event_loop when installing the asyncio reactor #4872

Merged

scrapy-pyppeteer not working with Scrapy==2.4 elacuesta/scrapy-pyppeteer#6

Closed

kmike closed this as completed in #4872 Nov 10, 2020

elacuesta mentioned this issue Nov 16, 2020

Not started engine if async open_spider #4887

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Return Deferred object from open_spider in pipeline blocks spider #4855

Return Deferred object from open_spider in pipeline blocks spider #4855

synodriver commented Oct 26, 2020

wRAR commented Oct 26, 2020

synodriver commented Oct 27, 2020

hakanutku commented Oct 29, 2020 •

edited

Loading

synodriver commented Oct 29, 2020

elacuesta commented Nov 4, 2020 •

edited

Loading

elacuesta commented Nov 4, 2020

elacuesta commented Nov 6, 2020

kmike commented Nov 7, 2020 •

edited

Loading

wRAR commented Nov 7, 2020

Return Deferred object from open_spider in pipeline blocks spider #4855

Return Deferred object from open_spider in pipeline blocks spider #4855

Comments

synodriver commented Oct 26, 2020

Description

Steps to Reproduce

Versions

Additional context

wRAR commented Oct 26, 2020

synodriver commented Oct 27, 2020

hakanutku commented Oct 29, 2020 • edited Loading

synodriver commented Oct 29, 2020

elacuesta commented Nov 4, 2020 • edited Loading

elacuesta commented Nov 4, 2020

elacuesta commented Nov 6, 2020

kmike commented Nov 7, 2020 • edited Loading

wRAR commented Nov 7, 2020

hakanutku commented Oct 29, 2020 •

edited

Loading

elacuesta commented Nov 4, 2020 •

edited

Loading

kmike commented Nov 7, 2020 •

edited

Loading