Return Deferred object from open_spider in pipeline blocks spider #4855
Comments
May be related to the [...]. Note that you can use [...]
Thx. Although none of them works, I came up with an idea to deal with it:

```python
import scrapy


class MyPipeline:
    def __init__(self):
        self._opened = False

    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    async def _open_spider(self, spider: scrapy.Spider):
        # not sure if asyncio.Lock is necessary
        if not self._opened:
            spider.logger.debug("async pipeline opened! And you should see this only one time")
            self.db = await connect_to_db()
            self._opened = True

    def open_spider(self, spider):
        spider.logger.debug("ready to open!")

    async def process_item(self, item, spider):
        if not self._opened:
            await self._open_spider(spider)
        await self.db.insert(item)
```

It works just like before.
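On the "not sure if asyncio.Lock is necessary" comment: if several `process_item` coroutines can be in flight before the first one finishes initializing, two of them could both see `self._opened` as `False` and connect twice. A hedged, Scrapy-free sketch of the lock-guarded variant (`LazyInitPipeline` and `_connect_to_db` are illustrative placeholders, not Scrapy APIs):

```python
import asyncio


class LazyInitPipeline:
    def __init__(self):
        self._opened = False
        self._lock = asyncio.Lock()
        self.connect_count = 0  # for demonstration only

    async def _connect_to_db(self):
        # Placeholder for a real async DB connection.
        await asyncio.sleep(0.01)
        self.connect_count += 1

    async def _ensure_opened(self):
        # Double-checked locking: cheap test first, then re-check
        # under the lock so concurrent callers connect only once.
        if not self._opened:
            async with self._lock:
                if not self._opened:
                    await self._connect_to_db()
                    self._opened = True

    async def process_item(self, item):
        await self._ensure_opened()
        return item


async def main():
    p = LazyInitPipeline()
    # Simulate several items being processed concurrently.
    await asyncio.gather(*(p.process_item({"n": i}) for i in range(5)))
    return p


pipeline = asyncio.run(main())
print(pipeline.connect_count)  # 1: a single connection despite 5 concurrent items
```

Without the lock, all five `process_item` calls can pass the `if not self._opened` check before the first `await` completes, so the connection code may run more than once.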
I'm not really sure if it's a better way to deal with it, but you can subscribe to the spider_opened signal: https://docs.scrapy.org/en/latest/topics/signals.html#spider-opened
It doesn't work either. A signal handler that returns a Deferred also blocks the spider. The most troublesome part is that no exception is raised, making it difficult to debug.
As @wRAR points out, the problem seems to be related to the handling of the [...]. Consider the following snippet:

```python
# asyncio_deferred.py
import asyncio

import scrapy
from twisted.internet.defer import Deferred


class UppercasePipeline:
    async def _open_spider(self, spider: scrapy.Spider):
        spider.logger.debug("async pipeline opened!")
        self.db = await asyncio.sleep(0.5)

    def open_spider(self, spider):
        loop = asyncio.get_event_loop()
        return Deferred.fromFuture(loop.create_task(self._open_spider(spider)))

    def process_item(self, item, spider):
        return {"url": item["url"].upper()}


class UrlSpider(scrapy.Spider):
    name = "url_spider"
    start_urls = ["https://example.org"]
    custom_settings = {
        "ITEM_PIPELINES": {UppercasePipeline: 100},
    }

    def parse(self, response):
        yield {"url": response.url}
```

Executed on top of the current [...]
This hangs; however, it does work with the following change:

```diff
diff --git scrapy/utils/reactor.py scrapy/utils/reactor.py
index 831d2946..6723d9b3 100644
--- scrapy/utils/reactor.py
+++ scrapy/utils/reactor.py
@@ -60,8 +60,9 @@ def install_reactor(reactor_path, event_loop_path=None):
         if event_loop_path is not None:
             event_loop_class = load_object(event_loop_path)
             event_loop = event_loop_class()
+            asyncio.set_event_loop(event_loop)
         else:
-            event_loop = asyncio.new_event_loop()
+            event_loop = asyncio.get_event_loop()
         asyncioreactor.install(eventloop=event_loop)
     else:
         *module, _ = reactor_path.split(".")
```

I'm not entirely sure about the full implications of this change; we probably need to do a little bit more research. At the time I asked for the reason of this change; here is the relevant conversation: #4414 (comment).
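The hang can be reproduced with plain asyncio, without Twisted or Scrapy: a task created with `loop.create_task` on one event loop never executes if a different loop is the one actually being driven, so anything awaiting that task (such as a Deferred wrapping it) waits forever. A minimal stdlib-only sketch of this mechanism:

```python
import asyncio

ran = []


async def coro():
    # Only runs when the loop that owns the task is actually driven.
    ran.append(True)


# Loop A owns the task; loop B is the one we actually run
# (analogous to the reactor driving a different loop than the
# one open_spider scheduled its task on).
loop_a = asyncio.new_event_loop()
task = loop_a.create_task(coro())

loop_b = asyncio.new_event_loop()
loop_b.run_until_complete(asyncio.sleep(0))
print(ran)  # []: the task scheduled on loop A never executed

# Only once the owning loop runs does the coroutine execute.
loop_a.run_until_complete(task)
print(ran)  # [True]

loop_a.close()
loop_b.close()
```

This is consistent with the patch above: making `install_reactor` reuse (or register) the current loop ensures the loop the reactor drives is the same one `asyncio.get_event_loop()` hands back to user code.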
The Twisted asyncio reactor uses [...]. I think it should be safe for us to use [...]
By the way, this is affecting [...]
It would be nice to support [...]
@kmike it was implemented only for [...]
Description
The "open_spider" method in a pipeline can't return a Deferred object in Scrapy 2.4; doing so blocks the spider. However, in earlier versions (2.3), this did work.
Steps to Reproduce
1. Configure settings.py:

```python
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
```

2. Use the following code.
3. Enable this pipeline.
Expected behavior:
The _open_spider method would be executed.
Actual behavior:
It worked fine in Scrapy 2.3; however, it blocks the spider in Scrapy 2.4.
After producing the following output, the spider gets stuck, prints nothing more, and does not close.
Reproduces how often:
Happens whenever you return a Deferred object from the open_spider method.
Versions
Additional context
After changing back to a normal pipeline, the spider works again.
Obviously, this is the pipeline's problem.
I wonder if there is any way to call coroutine functions from open_spider.
I tried loop.create_task and asyncio.run_coroutine_threadsafe, but neither works;
they just skip over the coroutine function.
Will this be fixed in a new version?
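On the "they just skip over the coroutine function" observation: `asyncio.run_coroutine_threadsafe` only executes the coroutine if the target loop is actually running (typically in another thread). Scheduling onto a loop that nobody is driving silently does nothing until that loop runs, which matches the reported behavior. A small stdlib-only sketch, unrelated to Scrapy internals, showing the case where it does work:

```python
import asyncio
import threading


async def work():
    return 42


# A loop running in a background thread actually executes submitted coroutines.
loop = asyncio.new_event_loop()
thread = threading.Thread(target=loop.run_forever, daemon=True)
thread.start()

# Thread-safe submission from the main thread to the running loop.
future = asyncio.run_coroutine_threadsafe(work(), loop)
result = future.result(timeout=5)
print(result)  # 42

# Clean shutdown of the background loop.
loop.call_soon_threadsafe(loop.stop)
thread.join()
loop.close()
```

In the bug scenario above, the reactor drives one loop while the coroutine is submitted to another, so the submitted future simply never completes.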