
Requests scheduled when idle never go through the spider middlewares #542

Closed
nside opened this issue Jan 17, 2014 · 7 comments

Comments


nside commented Jan 17, 2014

When my spider enters the idle state I schedule a request through the engine like this:

self.crawler.engine.schedule(req, self)

and I expect that request to go through the spider middlewares. But after some investigation it looks like only requests returned by Spider.start_requests and those returned from callbacks are processed by these middlewares.

Maybe I'm scheduling the request the wrong way, but if so, the method shouldn't be public (i.e. I'd prefix schedule with _).

rmax (Contributor) commented Jan 17, 2014

The spider middlewares are meant to process the spider output/input. By using the engine directly you are returning nothing from the spider and thus bypassing the spider middlewares.

I suppose you have a spider middleware; could you make it an extension or a downloader middleware instead?

dangra (Member) commented Jan 17, 2014

@darkrho I disagree: requests scheduled with engine.schedule() will be downloaded, and their responses should be processed by the process_spider_input middleware hook. That is how the engine handles start requests and redirections.

@nside: it's more convenient to call engine.crawl() instead; it triggers the download immediately after scheduling the request.

nside (Author) commented Jan 17, 2014

@dangra crawl() fixed it for me. Still, I'd expect every scheduled request to get the same treatment, whatever its "entry point" into the pipeline. Feel free to close if you disagree.

dangra (Member) commented Jan 17, 2014

@nside: what Scrapy version are you using?

Historically, there were three engine entrypoints for requests: .download(), .schedule() and .crawl().

There used to be a big difference between .schedule() and .crawl(), but now (>0.20) the only difference is that .crawl() triggers the loop that fetches requests from the scheduler immediately after scheduling the request.

.schedule() used to be closer to .download(), skipping the spider middlewares and the request's callback, but now it is very similar to .crawl() and is kept only for backward compatibility.

That said, the engine API is not documented and can't be considered stable.
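The distinction described above can be sketched with a toy model (plain Python, not Scrapy internals): .schedule() only enqueues a request, .crawl() enqueues it and immediately triggers the loop that drains the queue through the spider side, and .download() bypasses the spider side entirely and hands the "response" straight back. All names here are illustrative.

```python
from collections import deque


class ToyEngine:
    """Toy model of the three engine entrypoints (not Scrapy code)."""

    def __init__(self, callback):
        self.queue = deque()      # stands in for the scheduler
        self.callback = callback  # stands in for the spider/middleware side
        self.processed = []

    def _fetch_loop(self):
        # Drain the scheduler: "download" each request and pass the
        # response through the spider side via its callback.
        while self.queue:
            req = self.queue.popleft()
            self.processed.append(self.callback(req))

    def schedule(self, req):
        # Legacy: only enqueues; nothing happens until the loop next runs.
        self.queue.append(req)

    def crawl(self, req):
        # schedule() plus an immediate pass of the fetch loop.
        self.schedule(req)
        self._fetch_loop()

    def download(self, req):
        # Skip the spider side and hand the "response" back directly.
        return f"response:{req}"


engine = ToyEngine(callback=lambda req: f"parsed:{req}")
engine.schedule("a")                 # queued, but not processed yet
assert engine.processed == []
engine.crawl("b")                    # triggers the loop, draining "a" too
assert engine.processed == ["parsed:a", "parsed:b"]
assert engine.download("c") == "response:c"
```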

nside (Author) commented Jan 17, 2014

I'm on 0.21 (dev). These are good subtleties to know!

nside closed this as completed Jan 17, 2014

dangra (Member) commented Jan 17, 2014

TL;DR:

  • Use engine.crawl() if you expect your requests to be injected into the spider<->downloader pipeline.
  • Use engine.download() if you want a response back; only downloader middlewares are applied.
  • Don't use .schedule(); it's legacy and should emit a warning.

rmax (Contributor) commented Jan 17, 2014

Nice TL;DR. I think it should be somewhere in the docs, because there are a few examples in the wild using .schedule:
https://www.google.com/search?q=%22crawler.engine.schedule%22
