Allow start_requests method to run forever #456
Comments
I like the idea, as opposed to catching the spider_idle signal and raising DontCloseSpider. I recall it was this way in the early days but got removed to "simplify" the number of data types spider middlewares deal with.
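For reference, the spider_idle workaround mentioned here looks roughly like this (a sketch; the spider and handler names are illustrative):

```python
from scrapy import Spider, signals
from scrapy.exceptions import DontCloseSpider

class KeepAliveSpider(Spider):
    name = "keep_alive"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Get notified whenever the engine considers closing the spider
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def on_idle(self):
        # Raising DontCloseSpider keeps the spider alive even though the
        # scheduler has no pending requests; new requests can be injected
        # from this handler before the spider is allowed to close.
        raise DontCloseSpider
```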
I like the implementation.
@dangra @kmike @eliasdorneles @lopuhin @pawelmhm, what do you think?
My take on this: #1051 (comment). A related issue is that currently start_requests are not exhausted:
This is inconvenient if you e.g. have 100K websites to crawl and want to crawl their front pages (requests issued in start_requests) and follow some links on them (extracted in the parse method). In this case, even if you set a very high priority for the requests sent in start_requests, they won't be processed. One has to use private APIs like … It seems this is done in order to …, but I think this is implicit and confusing (and not really documented), and explicit …
I also believe …
This fixes the explanation to use Requests instead of URLs, which is what actually happens, and is also consistent with the new tutorial, which already explains how URLs become Request objects. I've also changed the "loop", jumping from step 9 to step 2.
I really like the idea of yielding a special value as @kmike suggested; that's what I'd expect as a user (a name suggestion: …). Would it be hard to implement? Would it break compatibility too much?
@eliasdorneles I like …
Heh, the original plan was to make it a py3-only feature; in Python 2 there is still plain start_requests with the old behavior. But we can add a … decorator? It seems that compared to a plain generator …
By the way, if we're to yield scrapy.Request from async def start_requests, it'd be a Python 3.6+ feature, not only a Python 3 feature. I'm fine with that.
Yes, IMO that will be the case by the time we can ditch PY2.
An example workaround (not very reliable, because the first request might fail) for the issue of start_requests not being consumed, assuming they can all be loaded at the start (e.g. from a file):
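A sketch of what such a workaround can look like (the file name and helper names are assumptions): everything is loaded upfront, the first request is yielded from start_requests, and the remaining ones are scheduled from its callback so they all reach the scheduler eagerly:

```python
import scrapy

class ExhaustingSpider(scrapy.Spider):
    name = "exhausting"  # illustrative

    def start_requests(self):
        # Load all requests at the start (here: one URL per line in a file).
        # This keeps everything in memory, as noted in the edit below.
        with open("urls.txt") as f:
            urls = [line.strip() for line in f if line.strip()]
        self.pending = [scrapy.Request(url, callback=self.parse) for url in urls]
        # Only the first request goes through start_requests; if it fails,
        # the rest are never scheduled (hence "not very reliable").
        first = self.pending.pop(0)
        first.callback = self.schedule_rest
        yield first

    def schedule_rest(self, response):
        # Push all remaining requests into the scheduler at once, then
        # handle the first response like any other.
        for request in self.pending:
            yield request
        self.parse(response)

    def parse(self, response):
        self.logger.info("parsed %s", response.url)
```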
Edit: but this loads all requests into memory, which might be an issue.
Any news on this ticket?
Am I correct that currently …
I think this should be a better solution:

```python
from scrapy import Request, Spider

class MySpider(Spider):
    start_url = 'https://example.com'

    def start_requests(self, /):
        yield Request(
            self.start_url,
            self._yield_requests,
            dont_filter=True,
            errback=self._yield_requests,
        )

    def _yield_requests(self, _, /):
        # yield requests infinitely
        pass
```

With such an approach, requests can be loaded later, after the initial dummy request has succeeded or failed. The second argument of _yield_requests receives the response or failure of that dummy request. The dummy request can also be populated with the necessary metadata to avoid processing by middlewares, and its URL can be changed to one that always fails, producing no response.
@Prometheus3375 you can use a … The callback can be implemented as an asynchronous generator, so you can feed the engine whenever the spider decides it has to, introducing delays before returning new requests if necessary. btw, …
@dangra my concern is whether it is safe to put a Kafka consuming loop inside:

```python
def start_requests(self, /):
    yield Request(
        self.start_url,
        self._yield_requests,
        dont_filter=True,
        errback=self._yield_requests,
    )

def _yield_requests(self, _, /):
    with self._consumer:
        self._consumer.subscribe('topic_name')
        for message in self._consumer:
            if message:
                data = json.loads(message.value())
                yield self.create_request(data['url'], **data['meta'])
            else:
                # Yield None, so the spider does not block any other operation
                # (stats collection and export, for example) if Kafka has failed
                yield None
```

I would like to move …
I'm afraid that won't be possible, … That is why I was suggesting the following:

```python
import scrapy

class FoobarSpider(scrapy.Spider):
    name = "foobar"

    def start_requests(self):
        yield scrapy.Request("data:,start", self._start_requests)

    async def _start_requests(self, response):
        for n in range(10):
            yield scrapy.Request(f"data:,{n}")

    def parse(self, response):
        print("GOT: ", response.body)
```

You can return … If you are using the asyncio reactor, you can await on asyncio.sleep:

```python
async def _start_requests(self, response):
    for n in range(10):
        while busy():
            await asyncio.sleep(1)
        yield scrapy.Request(f"data:,{n}")
```

In case you don't want to use the asyncio reactor, it is still possible to simulate Python's asyncio.sleep with Twisted builtins:

```python
from twisted.internet import reactor
from twisted.internet.task import deferLater

def sleep(delay, result=None):
    return deferLater(reactor, delay, lambda: result)
```
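For illustration, that helper can then stand in for asyncio.sleep in the earlier loop (a sketch; busy() is the same assumed predicate as before):

```python
async def _start_requests(self, response):
    for n in range(10):
        while busy():
            # deferLater returns a Deferred, which Scrapy's coroutine
            # support lets a callback await directly under the default
            # Twisted reactor
            await sleep(1)
        yield scrapy.Request(f"data:,{n}")
```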
return deferLater(reactor, delay, lambda: result) The last thing I must mention is that your kafka consumer code is "blocking" and won't play nice with the cooperative multitasking async model provided by Scrapy. There are some ways to make it work, but I would try aiokafka instead of kafka-python. For a longer explanation read this. This comment has grown larger than a thought!! |
Thank you for a comprehensive answer! In the code above, a wrapper over the confluent-kafka Consumer is used, with context manager and iterator protocols added. The confluent-kafka Consumer at least does commits asynchronously by default. There is still … We will see how the current implementation behaves, and will switch to aiokafka if necessary.
For version 0.18.4.

Situation

A Spider gets one Request from start_requests, and start_requests won't stop because it depends on the MQ. I know the spider is scheduled by "yield", but if the MQ hangs because no message is coming in, start_requests also hangs. That's not what I want.

Solution

So, I have hacked the source scrapy/core/engine.py like below:

…

Then, the method can be like this:
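A hypothetical sketch of such a method, assuming a blocking MQ client exposed as self.mq (names are illustrative):

```python
from scrapy import Request, Spider

class MQSpider(Spider):  # illustrative
    name = "mq"

    def start_requests(self):
        # Runs forever: blocks on the message queue and yields a request
        # for every message. This only works once the engine stops
        # closing the spider while this generator is still alive.
        while True:
            message = self.mq.get()  # hypothetical blocking MQ client
            if message:
                yield Request(message['url'], callback=self.parse)
```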
Thus, the start_requests method can run forever and does not hang any more.