Allow start_requests method running forever #456

Open
ipyleaf opened this issue Nov 7, 2013 · 19 comments · May be fixed by #4467

@ipyleaf

ipyleaf commented Nov 7, 2013

For version 0.18.4

Situation

A spider gets one Request at a time from start_requests, and start_requests won't stop because it depends on the MQ.

I know the spider is scheduled by "yield". But if the MQ hangs because no message is coming in, start_requests also hangs. That's not what I want.

Solution

So I have hacked the source of scrapy/core/engine.py as shown below:

    def _next_request(self, spider):
        try:
            slot = self.slots[spider]
        except KeyError:
            return

        if self.paused:
            slot.nextcall.schedule(5)
            return

        while not self._needs_backout(spider):
            if not self._next_request_from_scheduler(spider):
                break

        if slot.start_requests and not self._needs_backout(spider):
            try:
                request = slot.start_requests.next()
            except StopIteration:
                slot.start_requests = None
            except Exception, exc:
                log.err(None, 'Obtaining request from start requests', \
                        spider=spider)
            else:
                # ------------Start Hacking-----------------------
                if request is None:
                    slot.nextcall.schedule(5)
                    return
                # ------------End Hacking-----------------------
                self.crawl(request, spider)

        if self.spider_is_idle(spider) and slot.close_if_idle:
            self._spider_idle(spider)

Then the start_requests method can look like this:

    def start_requests(self):
        while True:
            data = queue.get()
            if not isinstance(data, (tuple,)):
                yield None
            else:
                yield Request(........)

Thus, the start_requests method can run forever and no longer hangs.

@dangra
Member

dangra commented Nov 19, 2013

I like this idea better than catching the spider_idle signal and raising DontCloseSpider. I recall it was this way in the early days but got removed to "simplify" the number of data types spider middlewares deal with.
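
For reference, a minimal sketch of that spider_idle / DontCloseSpider approach (next_request_from_queue() is a hypothetical helper, and the engine.crawl() signature varies between Scrapy versions):

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class QueueSpider(scrapy.Spider):
    name = "queue_spider"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def on_idle(self, spider):
        # Poll the external queue; the helper is assumed to return None when it is empty.
        request = self.next_request_from_queue()
        if request is not None:
            self.crawler.engine.crawl(request, self)  # older signature; newer Scrapy takes only the request
        # Keep the spider alive even if there was nothing to schedule this time.
        raise DontCloseSpider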

@nramirezuy
Contributor

I like the implementation

@redapple
Contributor

@dangra @kmike @eliasdorneles @lopuhin @pawelmhm , what do you think?
Should we plan on supporting this?

@kmike
Member

kmike commented Sep 16, 2016

My take on this: #1051 (comment). A related issue is that currently start_requests is not exhausted:

  1. Scrapy calls start_requests and gets enough requests to fill the downloader.
  2. When new requests are scheduled (e.g. from responses), Scrapy pauses pulling more requests from start_requests.

This is inconvenient if you e.g. have 100K websites to crawl and want to crawl their front pages (requests issued in start_requests) and follow some links on them (extracted in the parse method). In this case, even if you set a very high priority for the requests sent in start_requests, they won't be processed. One has to use private APIs like self.crawler.engine.schedule(request) to make sure all 100K initial requests are scheduled.
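
A hedged sketch of that private-API workaround (initial_urls is a hypothetical attribute, and engine.schedule() is a private API whose signature differs across Scrapy versions; it is shown here with the spider argument):

    def schedule_initial_requests(self):
        # Push every initial request straight into the scheduler instead of
        # waiting for the engine to pull them from start_requests one by one.
        for url in self.initial_urls:
            self.crawler.engine.schedule(scrapy.Request(url, priority=100), self)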

It seems this is done in order to

  1. support start_requests with infinite generators;
  2. send fewer requests to the scheduler at the same time,

but I think this is implicit and confusing (and not really documented), and an explicit yield scrapy.WaitUntilIdle (as in #1051 (comment)) is better. Currently something like WaitUntilIdle (but not quite the same) is inserted after each Request yielded from start_requests.
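
Purely as an illustration of that proposal, a hypothetical spider using the marker (scrapy.WaitUntilIdle does not exist, and front_page_urls / deeper_urls are made-up attributes):

    def start_requests(self):
        # First wave: all the front pages.
        for url in self.front_page_urls:
            yield scrapy.Request(url, priority=100)
        # Proposed marker from #1051: hold here until the crawl goes idle.
        yield scrapy.WaitUntilIdle
        # Second wave, only pulled once everything above has been processed.
        for url in self.deeper_urls:
            yield scrapy.Request(url)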

@redapple
Contributor

redapple commented Sep 16, 2016

I also believe start_requests should be consumed, if not completely (infinite generator), then at least very regularly: these are, after all, the requests you start with, not requests to leave aside whenever the spider generates new requests along the way in callbacks. Non-start-request Requests that need to go out sooner could simply be given a higher priority.

kmike referenced this issue Sep 29, 2016
This fixes the explanation to use Requests instead of URLs,
which is what actually happens, and is also consistent with the
new tutorial, which already explains how URLs become Request objects.

I've also changed the "loop", jumping from 9 to step 2.
@eliasdorneles
Member

I really like the idea of yielding a special value as @kmike suggested; that's what I'd expect as a user (a name suggestion: WaitUntilQueueEmpty, to avoid the hard-to-read UntilIdle).

Would it be hard to implement? Would it break compatibility too much?

@kmike
Member

kmike commented Dec 9, 2016

@eliasdorneles I like the WaitUntilQueueEmpty name. It will be backwards incompatible indeed. We may introduce a new method with explicit behavior and deprecate start_requests; deprecating start_requests would be a bold move though :) Maybe we can also add support for async def start_requests(self): ... and fix it only for async def start_requests.

@dangra
Member

dangra commented Dec 9, 2016

async def? You're looking far forward into the future. What do you plan to do with PY2 users?

@kmike
Member

kmike commented Dec 9, 2016

Heh, the original plan was to make it a py3-only feature; in Python 2 there is still plain start_requests with old behavior. But we can add a ..decorator?

It seems that, compared to a plain generator, async def could give us the ability to wait for deferreds/futures without special-casing WaitUntilQueueEmpty and implementing our own dispatching code, but I see this is hand-wavy at this point.

@kmike
Member

kmike commented Dec 9, 2016

By the way, if we're to yield scrapy.Request from async def start_requests, it'd be a Python 3.6+ feature, not only a Python 3 feature. I'm fine with that.

@dangra
Member

dangra commented Dec 9, 2016

> By the way, if we're to still yield scrapy.Request from async def start_requests, it'd be a Python 3.6+ feature, not only a Python 3 feature. I'm fine with that.

Yes, IMO that will be the case by the time we can ditch PY2

@lopuhin
Member

lopuhin commented Feb 9, 2018

An example workaround (not very reliable, because the first request might fail) for the issue of start_requests not being consumed, assuming they can be loaded at the start (e.g. from a file):

    def start_requests(self):
        start_urls = ...
        yield Request(start_urls[0], callback=self.start_rest, dont_filter=True)
        self._start_urls = start_urls[1:]

    def start_rest(self, response):
        yield from self.parse(response)
        for url in self._start_urls:
            yield Request(url, dont_filter=True)

Edit: but this causes loading all requests into memory, which might be an issue.

@psdon

psdon commented Mar 24, 2021

any news on this ticket?

@Prometheus3375

Am I correct that currently start_requests does not support infinite generators?

@Prometheus3375

> An example workaround (not very reliable, because the first request might fail) for the issue of start_requests not being consumed, assuming they can be loaded at the start (e.g. from a file):

I think this would be a better solution:

from scrapy import Request, Spider

class MySpider(Spider):
    start_url = 'https://example.com'

    def start_requests(self, /):
        yield Request(
            self.start_url,
            self._yield_requests,
            dont_filter=True,
            errback=self._yield_requests,
            )

    def _yield_requests(self, _, /):
        # yield requests infinitely
        pass

With such an approach, requests can be loaded later, after the initial dummy request has succeeded or failed. The second argument of _yield_requests can be a Failure or a Response, but it does not matter, because the request is a dummy one.

The dummy request can also be populated with the necessary metadata to avoid processing by middlewares. The URL of the dummy request can be changed to one that always fails, producing no response.

@dangra
Member

dangra commented Jul 24, 2023

@Prometheus3375 you can use a data: URL as the start URL. There is no network interaction in that case.

The callback can be implemented as an asynchronous generator, so you can feed the engine whenever the spider decides it has to, introducing delays before returning new requests if necessary.

btw, start_requests can be an infinite generator at the moment; it just can't control when the next request is fed into the engine. It is the engine that decides when to consume the generator (it does so when it thinks it is idling).

@Prometheus3375

Prometheus3375 commented Jul 27, 2023

@dangra my concern is whether it is safe to put the Kafka consuming loop inside start_requests. Currently the code looks like this:

    def start_requests(self, /):
        yield Request(
            self.start_url,
            self._yield_requests,
            dont_filter=True,
            errback=self._yield_requests,
            )

    def _yield_requests(self, _, /):
        with self._consumer:
            self._consumer.subscribe('topic_name')

            for message in self._consumer:
                if message:
                    data = json.loads(message.value())
                    yield self.create_request(data['url'], **data['meta'])
                else:
                    # Yield None, so the spider does not block any other operation
                    # (stats collection and export, for example) if Kafka has failed
                    yield None

I would like to move the _yield_requests code into start_requests. Is the start_requests generator allowed to yield None values? How long can it take for the engine to decide to consume more from the start_requests generator?

@dangra
Member

dangra commented Jul 28, 2023

> I would like to move the _yield_requests code into start_requests.
> Is the start_requests generator allowed to yield None values?

I'm afraid that won't be possible; start_requests can't be an async generator, nor does it accept None values.

That is why I was suggesting the following:

import scrapy

class FoobarSpider(scrapy.Spider):
    name = "foobar"

    def start_requests(self):
        yield scrapy.Request("data:,start", self._start_requests)

    async def _start_requests(self, response):
        for n in range(10):
            yield scrapy.Request(f"data:,{n}")

    def parse(self, response):
        print("GOT: ", response.body)

You can return None from a request's callbacks to tell the engine it has to come back later for more requests, but I don't think this is ideal; the engine can be aggressive when it is idling. Sometimes it is more convenient to delay yielding from the spider.

If you are using the asyncio reactor, you can await on asyncio.sleep, as in:

    async def _start_requests(self, response):
        for n in range(10):
            while busy():
                await asyncio.sleep(1)
            yield scrapy.Request(f"data:,{n}")

In case you don't want to use the asyncio reactor, it is still possible to simulate Python's asyncio.sleep with Twisted builtins:

from twisted.internet import reactor
from twisted.internet.task import deferLater

def sleep(delay, result=None):
    return deferLater(reactor, delay, lambda: result)
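
A hedged sketch of using that Twisted-based sleep() inside the same callback (busy() is the same hypothetical predicate as above, and this assumes Scrapy's coroutine support lets the callback await the returned Deferred):

    async def _start_requests(self, response):
        for n in range(10):
            while busy():
                # The deferLater-based sleep() from above returns a Deferred we can await.
                await sleep(1)
            yield scrapy.Request(f"data:,{n}")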

The last thing I must mention is that your kafka consumer code is "blocking" and won't play nice with the cooperative multitasking async model provided by Scrapy.

There are some ways to make it work, but I would try aiokafka instead of kafka-python. For a longer explanation, read this.
I'm assuming you are using kafka-python or some other non-async-friendly consumer; sorry if not.
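
A minimal sketch of what that aiokafka swap might look like, assuming the asyncio reactor; the topic name, bootstrap servers, and create_request() are placeholders carried over from this thread:

import json

import scrapy
from aiokafka import AIOKafkaConsumer

class KafkaSpider(scrapy.Spider):
    name = "kafka"

    def start_requests(self):
        yield scrapy.Request("data:,start", self._yield_requests)

    async def _yield_requests(self, response):
        # Non-blocking consumer loop: the event loop stays free while waiting
        # for messages, so the rest of the crawl keeps running.
        consumer = AIOKafkaConsumer("topic_name", bootstrap_servers="localhost:9092")
        await consumer.start()
        try:
            async for message in consumer:
                data = json.loads(message.value)
                yield self.create_request(data["url"], **data["meta"])
        finally:
            await consumer.stop()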

This comment has grown larger than a thought!!

@Prometheus3375

Prometheus3375 commented Jul 28, 2023

Thank you for a comprehensive answer! In the code above, a wrapper over the confluent-kafka Consumer is used, with context-manager and iterator protocols added. The confluent-kafka Consumer at least performs commits asynchronously by default, but its poll method is still blocking.
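
For context, a rough sketch of what such a wrapper might look like (the class name, the config dict, and the one-second poll timeout are assumptions; error handling is omitted):

from confluent_kafka import Consumer

class ConsumerWrapper:
    """Hypothetical wrapper adding context-manager and iterator protocols."""

    def __init__(self, config, /):
        self._consumer = Consumer(config)

    def subscribe(self, topic, /):
        self._consumer.subscribe([topic])

    def __enter__(self, /):
        return self

    def __exit__(self, *exc_info):
        # Close the underlying consumer when leaving the `with` block.
        self._consumer.close()

    def __iter__(self, /):
        return self

    def __next__(self, /):
        # poll() blocks for at most the timeout and returns None if nothing arrived,
        # which matches the falsy-message check in the spider code above.
        return self._consumer.poll(1.0)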

We will see how the current implementation behaves and will switch to aiokafka if necessary.
