Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return Deferred object from open_spider in pipeline blocks spider #4855

Closed
synodriver opened this issue Oct 26, 2020 · 9 comments · Fixed by #4872
Closed

Return Deferred object from open_spider in pipeline blocks spider #4855

synodriver opened this issue Oct 26, 2020 · 9 comments · Fixed by #4872

Comments

@synodriver
Copy link

Description

"open_spider" method in pipeline can't return Deferred object in scrapy 2.4, otherwise it would block spider. However, in earlier versions(2.3), this do work.

Steps to Reproduce

1.config in settings.py

asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

2.use the following code

import asyncio
import scrapy
from twisted.internet.defer import Deferred

class MyPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    async def _open_spider(self, spider: scrapy.Spider):
        spider.logger.debug("async pipeline opened!")
        self.db = await connect_to_db()

    def open_spider(self, spider):
        loop = asyncio.get_event_loop()
        # asyncio.run_coroutine_threadsafe(self._open_spider(spider),loop)
        return Deferred.fromFuture(loop.create_task(self._open_spider(spider)))

3.enable this pipeline

Expected behavior:
Would execute _open_spider method.

Actual behavior:

It worked good in scrapy 2.3. However, it blocks spider in scrapy2.4.
After output the following, spider get stuck and no output anymore,and did not close

2020-10-26 15:28:50 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapy_unitest)
2020-10-26 15:28:50 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.5 | packaged by conda-forge | (default, Sep 16 2020, 17:19:16) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-10-26 15:28:50 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2020-10-26 15:28:50 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2020-10-26 15:28:50 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_unitest',
 'COOKIES_ENABLED': False,
 'NEWSPIDER_MODULE': 'scrapy_unitest.spiders',
 'SPIDER_MODULES': ['scrapy_unitest.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
 'USER_AGENT': 'scrapy_unitest (+http://www.yourdomain.com)'}
2020-10-26 15:28:50 [scrapy.extensions.telnet] INFO: Telnet Password: 2aec1713f90e2ba2
2020-10-26 15:28:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-10-26 15:28:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-26 15:28:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-26 15:28:51 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy_unitest.pipelines.MyPipeline']
2020-10-26 15:28:51 [scrapy.core.engine] INFO: Spider opened
2020-10-26 15:28:51 [asyncio] DEBUG: Using selector: SelectSelector

Reproduces how often:
Appear when you try to return a Deferred object from open_spider method

Versions

Scrapy       : 2.4.0
lxml         : 4.6.1.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.21.0
Twisted      : 20.3.0
Python       : 3.8.5 | packaged by conda-forge | (default, Sep 16 2020, 17:19:16) [MSC v.1916 64 bit (AMD64)]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020)
cryptography : 3.1.1
Platform     : Windows-10-10.0.18362-SP0

Additional context

After changing back to a normal pipeline, the spider works again.
Obviously, this is pipeline's problem.

I wonder if there is any way to call coroutine functions from "open_spider".

I tried loop.create_task, asyncio.run_coroutine_threadsafe, but none of them works,
they just skip over the coroutine function.

Would this be fixed in new versions?

@wRAR
Copy link
Contributor

wRAR commented Oct 26, 2020

May be related to the new_event_loop/get_event_loop changes from #4414

Note that you can use scrapy.utils.defer.deferred_from_coro instead of doing wrapping manually (but it looks like asyncio.ensure_future is mostly the same as asyncio.get_event_loop().create_task()).

@synodriver
Copy link
Author

May be related to the new_event_loop/get_event_loop changes from #4414

Note that you can use scrapy.utils.defer.deferred_from_coro instead of doing wrapping manually (but it looks like asyncio.ensure_future is mostly the same as asyncio.get_event_loop().create_task()).

Thx.

Although none of them works,I come up with an idea to deal with it,
My code just like the following

import scrapy

class MyPipeline:
    def __init__(self):
        self._opened = False

    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    async def _open_spider(self, spider: scrapy.Spider):
        # not sure if asyncio.Lock is necessary
        if not self._opened:
            spider.logger.debug("async pipeline opened!And you should see this only one time")
            self.db = await connect_to_db()
            self._opened = True

    def open_spider(self, spider):
        spider.logger.debug("ready to open!")

    async def process_item(self, item, spider):
        if not self._opened:
            await self._open_spider(spider)
        await self.db.insert(item)
        pass

It works just like before.At least I didn't see anythong wrong.
Since the only method which can be defined with async def in pipeline is process_item,
just init the database connection in this method would be ok.
Hope there will be a better way to deal with it.

@hakanutku
Copy link

hakanutku commented Oct 29, 2020

I'm not really sure if it's a better way to deal with it but you can subscribe to spider_opened and spider_closed signals and return a deferred there. Scrapy will wait for some signal handlers (spider_opened and spider_closed both support returning deferred from handlers) according to the documentation:

https://docs.scrapy.org/en/latest/topics/signals.html#spider-opened

@synodriver
Copy link
Author

I'm not really sure if it's a better way to deal with it but you can subscribe to spider_opened and spider_closed signals and return a deferred there. Scrapy will wait for some signal handlers (spider_opened and spider_closed both support returning deferred from handlers) according to the documentation:

https://docs.scrapy.org/en/latest/topics/signals.html#spider-opened

It doesn't works either.The signal handler which return Dederred also blocks spider.The most troublesome is that no exception is raised,making it difficult to debug.

@elacuesta
Copy link
Member

elacuesta commented Nov 4, 2020

As @wRAR points out, the problem seems to be related to the handling of the asyncio event loop.

Consider the following snippet:

# asyncio_deferred.py
import asyncio

import scrapy
from twisted.internet.defer import Deferred

class UppercasePipeline:
    async def _open_spider(self, spider: scrapy.Spider):
        spider.logger.debug("async pipeline opened!")
        self.db = await asyncio.sleep(0.5)

    def open_spider(self, spider):
        loop = asyncio.get_event_loop()
        return Deferred.fromFuture(loop.create_task(self._open_spider(spider)))

    def process_item(self, item, spider):
        return {"url": item["url"].upper()}

class UrlSpider(scrapy.Spider):
    name = "url_spider"
    start_urls = ["https://example.org"]
    custom_settings = {
        "ITEM_PIPELINES": {UppercasePipeline: 100}
    }

    def parse(self, response):
        yield {"url": response.url}

Executed on top of the current master branch (a5872a0) with

scrapy runspider asyncio_deferred.py -s TWISTED_REACTOR=twisted.internet.asyncioreactor.AsyncioSelectorReactor

This hangs, however it does work with the following change:

diff --git scrapy/utils/reactor.py scrapy/utils/reactor.py
index 831d2946..6723d9b3 100644
--- scrapy/utils/reactor.py
+++ scrapy/utils/reactor.py
@@ -60,8 +60,9 @@ def install_reactor(reactor_path, event_loop_path=None):
             if event_loop_path is not None:
                 event_loop_class = load_object(event_loop_path)
                 event_loop = event_loop_class()
+                asyncio.set_event_loop(event_loop)
             else:
-                event_loop = asyncio.new_event_loop()
+                event_loop = asyncio.get_event_loop()
             asyncioreactor.install(eventloop=event_loop)
     else:
         *module, _ = reactor_path.split(".")

I'm not entirely sure about the full implications of this change, we probably need to do a little bit more of research. At the time I asked for the reason of this change, here is the relevant conversation: #4414 (comment).

@elacuesta
Copy link
Member

The Twisted asyncio reactor uses get_event_loop when it's installed. This means that if the user code does not specify a custom loop (via the ASYNCIO_EVENT_LOOP setting) get_event_loop calling set_event_loop happens anyway, and not within the Scrapy realm.

I think it should be safe for us to use get_event_loop as well, considering how not using it breaks things. Perhaps we can add a warning to the docs so users are aware that cleaning might be necessary if they use a custom loop within a Scrapy process (and they intend to keep using the loop in the same Python process). This sounds like an advanced scenario to me, enough to assume that users would understand the implications of setting a non-default loop.

@elacuesta
Copy link
Member

By the way, this is affecting scrapy-pyppeteer: elacuesta/scrapy-pyppeteer#6

@kmike
Copy link
Member

kmike commented Nov 7, 2020

It would be nice to support async def open_spider directly - @wRAR was it just not implemented, or is there somethng more fundamental here?

@wRAR
Copy link
Contributor

wRAR commented Nov 7, 2020

@kmike it was implemented only for process_item but after a very short glance I think it can be implemented for open_spider too (and probably for some other middleware/pipeline methods that are called using the same code).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants