Allow start_requests method running forever #456

Open
ipyleaf opened this issue Nov 7, 2013 · 19 comments · May be fixed by #4467

@ipyleaf

ipyleaf commented Nov 7, 2013

For version 0.18.4

Situation

A spider gets one Request at a time from start_requests, and start_requests won't stop because it depends on the MQ.

I know the spider is scheduled by "yield". But if the MQ hangs because no message is coming in, start_requests also hangs. That's not what I want.

Solution

So I have hacked the source of scrapy/core/engine.py as shown below:

    def _next_request(self, spider):
        try:
            slot = self.slots[spider]
        except KeyError:
            return

        if self.paused:
            slot.nextcall.schedule(5)
            return

        while not self._needs_backout(spider):
            if not self._next_request_from_scheduler(spider):
                break

        if slot.start_requests and not self._needs_backout(spider):
            try:
                request = slot.start_requests.next()
            except StopIteration:
                slot.start_requests = None
            except Exception, exc:
                log.err(None, 'Obtaining request from start requests', \
                        spider=spider)
            else:
                # ------------Start Hacking-----------------------
                if request is None:
                    slot.nextcall.schedule(5)
                    return
                # ------------End Hacking-----------------------
                self.crawl(request, spider)

        if self.spider_is_idle(spider) and slot.close_if_idle:
            self._spider_idle(spider)

Then the start_requests method can look like this:

    def start_requests(self):
        while True:
            data = queue.get()
            if not isinstance(data, (tuple,)):
                yield None
            else:
                yield Request(........)

Thus, the start_requests method can run forever and no longer hangs.

@dangra
Member

dangra commented Nov 19, 2013

I like this idea better than catching the spider_idle signal and raising DontCloseSpider. I recall it was this way in the early days but got removed to "simplify" the number of data types spider middlewares deal with.
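
For reference, a minimal sketch of that spider_idle / DontCloseSpider approach (next_request_from_queue() is a hypothetical helper, and the engine.crawl() signature varies between Scrapy versions):

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class QueueSpider(scrapy.Spider):
    name = "queue_spider"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def on_idle(self, spider):
        # Poll the external queue; the helper is assumed to return None when it is empty.
        request = self.next_request_from_queue()
        if request is not None:
            self.crawler.engine.crawl(request, self)  # older signature; newer Scrapy takes only the request
        # Keep the spider alive even if there was nothing to schedule this time.
        raise DontCloseSpider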

@nramirezuy
Contributor

I like the implementation

@redapple
Contributor

@dangra @kmike @eliasdorneles @lopuhin @pawelmhm , what do you think?
Should we plan on supporting this?

@kmike
Member

kmike commented Sep 16, 2016

My take on this: #1051 (comment). A related issue is that currently start_requests is not exhausted:

  1. Scrapy calls start_requests and gets enough requests to fill the downloader.
  2. When new requests are scheduled (e.g. from responses), Scrapy pauses pulling more requests from start_requests.

This is inconvenient if you e.g. have 100K websites to crawl and want to crawl their front pages (requests issued in start_requests) and follow some links on them (extracted in the parse method). In this case, even if you set a very high priority for the requests sent in start_requests, they won't be processed. One has to use private APIs like self.crawler.engine.schedule(request) to make sure all 100K initial requests are scheduled.
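
A hedged sketch of that private-API workaround (initial_urls is a hypothetical attribute, and engine.schedule() is a private API whose signature differs across Scrapy versions; it is shown here with the spider argument):

    def schedule_initial_requests(self):
        # Push every initial request straight into the scheduler instead of
        # waiting for the engine to pull them from start_requests one by one.
        for url in self.initial_urls:
            self.crawler.engine.schedule(scrapy.Request(url, priority=100), self)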

It seems this is done in order to

  1. support start_requests with infinite generators;
  2. send fewer requests to the scheduler at the same time,

but I think this is implicit and confusing (and not really documented), and an explicit yield scrapy.WaitUntilIdle (as in #1051 (comment)) is better. Currently something like WaitUntilIdle (but not quite the same) is inserted after each Request yielded from start_requests.
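
Purely as an illustration of that proposal, a hypothetical spider using the marker (scrapy.WaitUntilIdle does not exist, and front_page_urls / deeper_urls are made-up attributes):

    def start_requests(self):
        # First wave: all the front pages.
        for url in self.front_page_urls:
            yield scrapy.Request(url, priority=100)
        # Proposed marker from #1051: hold here until the crawl goes idle.
        yield scrapy.WaitUntilIdle
        # Second wave, only pulled once everything above has been processed.
        for url in self.deeper_urls:
            yield scrapy.Request(url)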

@redapple
Contributor

redapple commented Sep 16, 2016

I also believe start_requests should be consumed, if not completely (infinite generator), then at least very regularly: these are, after all, the requests you start with, not requests to leave aside whenever the spider generates new requests along the way in callbacks. Non-start-request Requests that need to go out sooner could simply be given a higher priority.

kmike referenced this issue Sep 29, 2016
This fixes the explanation to use Requests instead of URLs,
which is what actually happens, and is also consistent with the
new tutorial, which already explains how URLs become Request objects.

I've also changed the "loop", jumping from 9 to step 2.
@eliasdorneles
Member

I really like the idea of yielding a special value as @kmike suggested; that's what I'd expect as a user (a name suggestion: WaitUntilQueueEmpty, to avoid the hard-to-read UntilIdle).

Would it be hard to implement? Would it break compatibility too much?

@kmike
Member

kmike commented Dec 9, 2016

@eliasdorneles I like the WaitUntilQueueEmpty name. It will be backwards incompatible indeed. We may introduce a new method with explicit behavior and deprecate start_requests; deprecating start_requests would be a bold move though :) Maybe we can also add support for async def start_requests(self): ... and fix it only for async def start_requests.

@dangra
Member

dangra commented Dec 9, 2016

async def? You're looking far forward into the future. What do you plan to do with PY2 users?

@kmike
Member

kmike commented Dec 9, 2016

Heh, the original plan was to make it a py3-only feature; in Python 2 there is still plain start_requests with old behavior. But we can add a ..decorator?

It seems that, compared to a plain generator, async def could give us the ability to wait for deferreds/futures without special-casing WaitUntilQueueEmpty and implementing our own dispatching code, but I see this is hand-wavy at this point.

@kmike
Member

kmike commented Dec 9, 2016

By the way, if we're to yield scrapy.Request from async def start_requests, it'd be a Python 3.6+ feature, not only a Python 3 feature. I'm fine with that.

@dangra
Member

dangra commented Dec 9, 2016

> By the way, if we're to still yield scrapy.Request from async def start_requests, it'd be a Python 3.6+ feature, not only a Python 3 feature. I'm fine with that.

Yes, IMO that will be the case by the time we can ditch PY2

@lopuhin
Member

lopuhin commented Feb 9, 2018

An example workaround (not very reliable, because the first request might fail) for the issue of start_requests not being consumed, assuming they can be loaded at the start (e.g. from a file):

    def start_requests(self):
        start_urls = ...
        yield Request(start_urls[0], callback=self.start_rest, dont_filter=True)
        self._start_urls = start_urls[1:]

    def start_rest(self, response):
        yield from self.parse(response)
        for url in self._start_urls:
            yield Request(url, dont_filter=True)

Edit: but this causes loading all requests into memory, which might be an issue.

@psdon

psdon commented Mar 24, 2021

any news on this ticket?

@Prometheus3375

Am I correct that currently start_requests does not support infinite generators?

@Prometheus3375

> An example workaround (not very reliable, because the first request might fail) for the issue of start_requests not being consumed, assuming they can be loaded at the start (e.g. from a file):

I think this would be a better solution:

from scrapy import Request, Spider

class MySpider(Spider):
    start_url = 'https://example.com'

    def start_requests(self, /):
        yield Request(
            self.start_url,
            self._yield_requests,
            dont_filter=True,
            errback=self._yield_requests,
            )

    def _yield_requests(self, _, /):
        # yield requests infinitely
        pass

With such an approach, requests can be loaded later, after the initial dummy request has succeeded or failed. The second argument of _yield_requests can be a Failure or a Response, but it does not matter, because the request is a dummy one.

The dummy request can also be populated with the necessary metadata to avoid processing by middlewares. The URL of the dummy request can be changed to one that always fails, producing no response.

@dangra
Member

dangra commented Jul 24, 2023

@Prometheus3375 you can use a data: URL as the start URL. There is no network interaction in that case.

The callback can be implemented as an asynchronous generator, so you can feed the engine whenever the spider decides it has to, introducing delays before returning new requests if necessary.

btw, start_requests can be an infinite generator at the moment; it just can't control when the next request is fed into the engine. It is the engine that decides when to consume the generator (it does so when it thinks it is idling).

@Prometheus3375

Prometheus3375 commented Jul 27, 2023

@dangra my concern is whether it is safe to put the Kafka consuming loop inside start_requests. Currently the code looks like this:

    def start_requests(self, /):
        yield Request(
            self.start_url,
            self._yield_requests,
            dont_filter=True,
            errback=self._yield_requests,
            )

    def _yield_requests(self, _, /):
        with self._consumer:
            self._consumer.subscribe('topic_name')

            for message in self._consumer:
                if message:
                    data = json.loads(message.value())
                    yield self.create_request(data['url'], **data['meta'])
                else:
                    # Yield None, so the spider does not block any other operation
                    # (stats collection and export, for example) if Kafka has failed
                    yield None

I would like to move the _yield_requests code into start_requests. Is the start_requests generator allowed to yield None values? How long can it take for the engine to decide to consume more from the start_requests generator?

@dangra
Member

dangra commented Jul 28, 2023

> I would like to move the _yield_requests code into start_requests.
> Is the start_requests generator allowed to yield None values?

I'm afraid that won't be possible; start_requests can't be an async generator, nor does it accept None values.

That is why I was suggesting the following:

import scrapy

class FoobarSpider(scrapy.Spider):
    name = "foobar"

    def start_requests(self):
        yield scrapy.Request("data:,start", self._start_requests)

    async def _start_requests(self, response):
        for n in range(10):
            yield scrapy.Request(f"data:,{n}")

    def parse(self, response):
        print("GOT: ", response.body)

You can return None from a request's callbacks to tell the engine it has to come back later for more requests, but I don't think this is ideal; the engine can be aggressive when it is idling. Sometimes it is more convenient to delay yielding from the spider.

If you are using the asyncio reactor, you can await on asyncio.sleep, as in:

    async def _start_requests(self, response):
        for n in range(10):
            while busy():
                await asyncio.sleep(1)
            yield scrapy.Request(f"data:,{n}")

In case you don't want to use the asyncio reactor, it is still possible to simulate Python's asyncio.sleep with Twisted builtins:

from twisted.internet import reactor
from twisted.internet.task import deferLater

def sleep(delay, result=None):
    return deferLater(reactor, delay, lambda: result)
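
A hedged sketch of using that Twisted-based sleep() inside the same callback (busy() is the same hypothetical predicate as above, and this assumes Scrapy's coroutine support lets the callback await the returned Deferred):

    async def _start_requests(self, response):
        for n in range(10):
            while busy():
                # The deferLater-based sleep() from above returns a Deferred we can await.
                await sleep(1)
            yield scrapy.Request(f"data:,{n}")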

The last thing I must mention is that your kafka consumer code is "blocking" and won't play nice with the cooperative multitasking async model provided by Scrapy.

There are some ways to make it work, but I would try aiokafka instead of kafka-python. For a longer explanation, read this.
I'm assuming you are using kafka-python or some other non-async-friendly consumer; sorry if not.
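
A minimal sketch of what that aiokafka swap might look like, assuming the asyncio reactor; the topic name, bootstrap servers, and create_request() are placeholders carried over from this thread:

import json

import scrapy
from aiokafka import AIOKafkaConsumer

class KafkaSpider(scrapy.Spider):
    name = "kafka"

    def start_requests(self):
        yield scrapy.Request("data:,start", self._yield_requests)

    async def _yield_requests(self, response):
        # Non-blocking consumer loop: the event loop stays free while waiting
        # for messages, so the rest of the crawl keeps running.
        consumer = AIOKafkaConsumer("topic_name", bootstrap_servers="localhost:9092")
        await consumer.start()
        try:
            async for message in consumer:
                data = json.loads(message.value)
                yield self.create_request(data["url"], **data["meta"])
        finally:
            await consumer.stop()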

This comment has grown larger than a thought!!

@Prometheus3375

Prometheus3375 commented Jul 28, 2023

Thank you for a comprehensive answer! In the code above, a wrapper over the confluent-kafka Consumer is used, with context-manager and iterator protocols added. The confluent-kafka Consumer at least performs commits asynchronously by default, but its poll method is still blocking.
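
For context, a rough sketch of what such a wrapper might look like (the class name, the config dict, and the one-second poll timeout are assumptions; error handling is omitted):

from confluent_kafka import Consumer

class ConsumerWrapper:
    """Hypothetical wrapper adding context-manager and iterator protocols."""

    def __init__(self, config, /):
        self._consumer = Consumer(config)

    def subscribe(self, topic, /):
        self._consumer.subscribe([topic])

    def __enter__(self, /):
        return self

    def __exit__(self, *exc_info):
        # Close the underlying consumer when leaving the `with` block.
        self._consumer.close()

    def __iter__(self, /):
        return self

    def __next__(self, /):
        # poll() blocks for at most the timeout and returns None if nothing arrived,
        # which matches the falsy-message check in the spider code above.
        return self._consumer.poll(1.0)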

We will see how the current implementation behaves and will switch to aiokafka if necessary.
